Coordinates of Pfam domains are not correctly reported due to splicing #34

dariober · 2023-01-06T12:53:18Z

If I'm not mistaken, there is a problem with the integration of Pfam domains into the final GFF file. It seems that when a transcript contains introns, the domain coordinates are not adjusted for splicing and so end up at the wrong genomic offset.

Below is an example:

There are 3 Pfam matches in a transcript with multiple exons. The first PF02861 match has coordinates 277134-277217: it starts in the first intron and runs into the second CDS (277188-277561). The third match, PF00004 (277572-277847), lies entirely within the intron between the second and third CDS. I think a protein domain should only cover CDS regions (possibly more than one).

These are all the features of this gene:

grep 'Tgondii_000009000.1' scaffold.out.gff3 
TgRH.10_pilon_pilon	AUGUSTUS	mRNA	276645	280440	0.92	+	.	ID=Tgondii_000009000.1;Parent=Tgondii_000009000
TgRH.10_pilon_pilon	AUGUSTUS	CDS	276645	276885	1	+	0	ID=Tgondii_000009000.1:CDS:1;Parent=Tgondii_000009000.1
TgRH.10_pilon_pilon	AUGUSTUS	CDS	277188	277561	0.93	+	2	ID=Tgondii_000009000.1:CDS:2;Parent=Tgondii_000009000.1
TgRH.10_pilon_pilon	AUGUSTUS	CDS	278836	279009	1	+	0	ID=Tgondii_000009000.1:CDS:3;Parent=Tgondii_000009000.1
TgRH.10_pilon_pilon	AUGUSTUS	CDS	279671	279944	1	+	0	ID=Tgondii_000009000.1:CDS:4;Parent=Tgondii_000009000.1
TgRH.10_pilon_pilon	AUGUSTUS	CDS	280220	280440	0.99	+	2	ID=Tgondii_000009000.1:CDS:5;Parent=Tgondii_000009000.1
TgRH.10_pilon_pilon	.	polypeptide	276645	280440	.	+	.	ID=Tgondii_000009000.1:pep;Derives_from=Tgondii_000009000.1;orthologous_to=Tgondii_000248400.1,Tgondii_000874000.1,TGME49_257990_t26_1,TGME49_275690_t26_1,TGME49_268650_t26_1,Tgondii_000009100.1;ortholog_cluster=ORTHOMCL32;product=term%3Dheat shock protein 101%2C putative%3Bevidence%3DIEA%3Bwith%3DGeneDB:TGME49_257990_t26_1%3Bis_preferred%3Dtrue,term%3Dchaperone clpB protein%2C putative%3Bevidence%3DIEA%3Bwith%3DGeneDB:TGME49_268650_t26_1%3Brank%3D1,term%3DClpB%2C putative%3Bevidence%3DIEA%3Bwith%3DGeneDB:TGME49_275690_t26_1%3Brank%3D1;Ontology_term=GO:0005524,GO:0019538
TgRH.10_pilon_pilon	Pfam	protein_match	277134	277217	.	+	.	Parent=Tgondii_000009000.1:pep;Name=PF02861;signature_desc=Clp amino terminal domain%2C pathogenicity island component;Ontology_term=GO:0019538
TgRH.10_pilon_pilon	Pfam	protein_match	277254	277388	.	+	.	Parent=Tgondii_000009000.1:pep;Name=PF02861;signature_desc=Clp amino terminal domain%2C pathogenicity island component;Ontology_term=GO:0019538
TgRH.10_pilon_pilon	Pfam	protein_match	277572	277847	.	+	.	Parent=Tgondii_000009000.1:pep;Name=PF00004;signature_desc=ATPase family associated with various cellular activities (AAA);Ontology_term=GO:0005524
TgRH.10_pilon_pilon	.	polypeptide	280746	285473	.	+	.	ID=Tgondii_000009100.1:pep;Derives_from=Tgondii_000009100.1;orthologous_to=Tgondii_000248400.1,Tgondii_000874000.1,TGME49_257990_t26_1,TGME49_275690_t26_1,TGME49_268650_t26_1,Tgondii_000009000.1;ortholog_cluster=ORTHOMCL32;product=term%3Dheat shock protein 101%2C putative%3Bevidence%3DIEA%3Bwith%3DGeneDB:TGME49_257990_t26_1%3Bis_preferred%3Dtrue,term%3Dchaperone clpB protein%2C putative%3Bevidence%3DIEA%3Bwith%3DGeneDB:TGME49_268650_t26_1%3Brank%3D1,term%3DClpB%2C putative%3Bevidence%3DIEA%3Bwith%3DGeneDB:TGME49_275690_t26_1%3Brank%3D1
TgRH.1_pilon_pilon	.	polypeptide	7742535	7745808	.	+	.	ID=Tgondii_000248400.1:pep;Derives_from=Tgondii_000248400.1;orthologous_to=Tgondii_000874000.1,TGME49_257990_t26_1,TGME49_275690_t26_1,TGME49_268650_t26_1,Tgondii_000009000.1,Tgondii_000009100.1;ortholog_cluster=ORTHOMCL32;product=term%3Dheat shock protein 101%2C putative%3Bevidence%3DIEA%3Bwith%3DGeneDB:TGME49_257990_t26_1%3Bis_preferred%3Dtrue,term%3Dchaperone clpB protein%2C putative%3Bevidence%3DIEA%3Bwith%3DGeneDB:TGME49_268650_t26_1%3Brank%3D1,term%3DClpB%2C putative%3Bevidence%3DIEA%3Bwith%3DGeneDB:TGME49_275690_t26_1%3Brank%3D1;Ontology_term=GO:0019538,GO:0016887,GO:0005524
TgRH.8_pilon_pilon	.	polypeptide	3707700	3716669	.	+	.	ID=Tgondii_000874000.1:pep;Derives_from=Tgondii_000874000.1;orthologous_to=Tgondii_000248400.1,TGME49_257990_t26_1,TGME49_275690_t26_1,TGME49_268650_t26_1,Tgondii_000009000.1,Tgondii_000009100.1;ortholog_cluster=ORTHOMCL32;product=term%3Dheat shock protein 101%2C putative%3Bevidence%3DIEA%3Bwith%3DGeneDB:TGME49_257990_t26_1%3Bis_preferred%3Dtrue,term%3Dchaperone clpB protein%2C putative%3Bevidence%3DIEA%3Bwith%3DGeneDB:TGME49_268650_t26_1%3Brank%3D1,term%3DClpB%2C putative%3Bevidence%3DIEA%3Bwith%3DGeneDB:TGME49_275690_t26_1%3Brank%3D1;Ontology_term=GO:0019538,GO:0005524,GO:0016887,GO:0008134,GO:0006355
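To make the problem concrete, here is a minimal sketch that counts how many bases of each Pfam match actually fall inside a CDS of Tgondii_000009000.1. The intervals are copied from the listing above; the helper name `coding_bases` is only for illustration and is not part of any pipeline.

```python
# CDS and Pfam intervals copied from the GFF listing (1-based, inclusive).
cds = [(276645, 276885), (277188, 277561), (278836, 279009),
       (279671, 279944), (280220, 280440)]
pfam = [("PF02861", 277134, 277217),
        ("PF02861", 277254, 277388),
        ("PF00004", 277572, 277847)]


def coding_bases(start, end, cds):
    """Count the bases of [start, end] that lie inside any CDS interval."""
    return sum(max(0, min(end, c_end) - max(start, c_start) + 1)
               for c_start, c_end in cds)


for name, start, end in pfam:
    length = end - start + 1
    n = coding_bases(start, end, cds)
    print(f"{name}\t{start}-{end}\t{n}/{length} bases in CDS")
```

Only the second match is fully coding; the first is mostly intronic and the third does not touch a CDS at all.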

In case it is useful, this script https://github.com/glaParaBio/genomeAnnotationPipeline/blob/master/scripts/add_hmmsearch_to_gff.py should properly integrate the output of hmmsearch/hmmscan into a GFF (not extensively tested!)
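For reference, the mapping I would expect looks roughly like the sketch below. It is not the linked script, just an illustration under two assumptions: the protein starts at the first base of the first CDS (phase 0, as in this gene), and the protein intervals used in the example (164-191, 204-248, 310-401) are back-calculated by me from the reported coordinates assuming they were computed as mRNA start + 3 * (aa - 1). They are not taken from the actual hmmscan output. The function name `protein_to_genome` is hypothetical.

```python
def protein_to_genome(aa_start, aa_end, cds, strand="+"):
    """Map a 1-based inclusive protein interval to genomic segments.

    cds: (start, end) genomic CDS intervals, 1-based inclusive.
    Assumes translation starts at the first base of the first CDS.
    Returns a list of (start, end) segments sorted by genomic start.
    """
    # Visit CDS parts in transcription order.
    exons = sorted(cds, reverse=(strand == "-"))
    nt_start = 3 * (aa_start - 1) + 1  # first base of first codon (spliced)
    nt_end = 3 * aa_end                # last base of last codon (spliced)
    segments = []
    offset = 0  # spliced bases consumed by the CDS parts already visited
    for ex_start, ex_end in exons:
        ex_len = ex_end - ex_start + 1
        lo = max(nt_start, offset + 1)
        hi = min(nt_end, offset + ex_len)
        if lo <= hi:  # the domain overlaps this CDS part
            if strand == "+":
                segments.append((ex_start + lo - offset - 1,
                                 ex_start + hi - offset - 1))
            else:
                segments.append((ex_end - (hi - offset - 1),
                                 ex_end - (lo - offset - 1)))
        offset += ex_len
    return sorted(segments)


# CDS of Tgondii_000009000.1, copied from the listing above.
cds = [(276645, 276885), (277188, 277561), (278836, 279009),
       (279671, 279944), (280220, 280440)]

# Protein intervals inferred from the reported coordinates (assumption).
for name, aa_start, aa_end in [("PF02861", 164, 191),
                               ("PF02861", 204, 248),
                               ("PF00004", 310, 401)]:
    print(name, protein_to_genome(aa_start, aa_end, cds))
```

Under these assumptions the first match would land entirely inside the second CDS, while the other two would each be split across two CDS parts, so each of those would need either two `protein_match` lines or one multi-segment feature in the GFF.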

haessar pinned this issue Jan 25, 2023