Coordinates of Pfam domains are not correctly reported due to splicing #34

Open
dariober opened this issue Jan 6, 2023 · 0 comments
dariober commented Jan 6, 2023

If I'm not mistaken, there is a problem with how Pfam domains are integrated into the final GFF file. When a transcript contains introns, the domain coordinates do not seem to be spliced correctly: they are not split across CDS segments or offset to account for intron lengths.

Below is an example:

There are three Pfam domain hits in a transcript with multiple exons. The first, PF02861, has coordinates 277134-277217 and straddles the boundary between the first intron and the second CDS (277188-277561). The third, PF00004 (277572-277847), is fully contained in an intron. I think a protein domain should map only to CDS regions (possibly more than one).
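
As a quick check of this claim, here is a minimal sketch that counts how many bases of each Pfam feature fall inside a CDS. The coordinates are copied by hand from the GFF3 records for Tgondii_000009000.1 in this issue; the function name `cds_overlap` is made up for illustration and is not part of any pipeline.

```python
# CDS and Pfam intervals copied from the GFF3 records of Tgondii_000009000.1
# (1-based, inclusive, as in GFF3).
CDS = [
    (276645, 276885),
    (277188, 277561),
    (278836, 279009),
    (279671, 279944),
    (280220, 280440),
]
PFAM = [
    ("PF02861", 277134, 277217),
    ("PF02861", 277254, 277388),
    ("PF00004", 277572, 277847),
]


def cds_overlap(start, end, cds):
    """Return the number of bases of [start, end] that lie in any CDS segment.

    Assumes the CDS segments do not overlap each other.
    """
    return sum(
        max(0, min(end, c_end) - max(start, c_start) + 1)
        for c_start, c_end in cds
    )


for name, start, end in PFAM:
    length = end - start + 1
    inside = cds_overlap(start, end, CDS)
    print(f"{name}\t{start}-{end}\t{inside}/{length} bp in CDS")
```

With these coordinates, only 30 of the 84 bases of the first hit and none of the 276 bases of the third hit overlap a CDS. The second hit lies entirely within the second CDS, but this check cannot tell whether its position is otherwise correct.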

These are all the features of this gene:

grep 'Tgondii_000009000.1' scaffold.out.gff3 
TgRH.10_pilon_pilon	AUGUSTUS	mRNA	276645	280440	0.92	+	.	ID=Tgondii_000009000.1;Parent=Tgondii_000009000
TgRH.10_pilon_pilon	AUGUSTUS	CDS	276645	276885	1	+	0	ID=Tgondii_000009000.1:CDS:1;Parent=Tgondii_000009000.1
TgRH.10_pilon_pilon	AUGUSTUS	CDS	277188	277561	0.93	+	2	ID=Tgondii_000009000.1:CDS:2;Parent=Tgondii_000009000.1
TgRH.10_pilon_pilon	AUGUSTUS	CDS	278836	279009	1	+	0	ID=Tgondii_000009000.1:CDS:3;Parent=Tgondii_000009000.1
TgRH.10_pilon_pilon	AUGUSTUS	CDS	279671	279944	1	+	0	ID=Tgondii_000009000.1:CDS:4;Parent=Tgondii_000009000.1
TgRH.10_pilon_pilon	AUGUSTUS	CDS	280220	280440	0.99	+	2	ID=Tgondii_000009000.1:CDS:5;Parent=Tgondii_000009000.1
TgRH.10_pilon_pilon	.	polypeptide	276645	280440	.	+	.	ID=Tgondii_000009000.1:pep;Derives_from=Tgondii_000009000.1;orthologous_to=Tgondii_000248400.1,Tgondii_000874000.1,TGME49_257990_t26_1,TGME49_275690_t26_1,TGME49_268650_t26_1,Tgondii_000009100.1;ortholog_cluster=ORTHOMCL32;product=term%3Dheat shock protein 101%2C putative%3Bevidence%3DIEA%3Bwith%3DGeneDB:TGME49_257990_t26_1%3Bis_preferred%3Dtrue,term%3Dchaperone clpB protein%2C putative%3Bevidence%3DIEA%3Bwith%3DGeneDB:TGME49_268650_t26_1%3Brank%3D1,term%3DClpB%2C putative%3Bevidence%3DIEA%3Bwith%3DGeneDB:TGME49_275690_t26_1%3Brank%3D1;Ontology_term=GO:0005524,GO:0019538
TgRH.10_pilon_pilon	Pfam	protein_match	277134	277217	.	+	.	Parent=Tgondii_000009000.1:pep;Name=PF02861;signature_desc=Clp amino terminal domain%2C pathogenicity island component;Ontology_term=GO:0019538
TgRH.10_pilon_pilon	Pfam	protein_match	277254	277388	.	+	.	Parent=Tgondii_000009000.1:pep;Name=PF02861;signature_desc=Clp amino terminal domain%2C pathogenicity island component;Ontology_term=GO:0019538
TgRH.10_pilon_pilon	Pfam	protein_match	277572	277847	.	+	.	Parent=Tgondii_000009000.1:pep;Name=PF00004;signature_desc=ATPase family associated with various cellular activities (AAA);Ontology_term=GO:0005524
TgRH.10_pilon_pilon	.	polypeptide	280746	285473	.	+	.	ID=Tgondii_000009100.1:pep;Derives_from=Tgondii_000009100.1;orthologous_to=Tgondii_000248400.1,Tgondii_000874000.1,TGME49_257990_t26_1,TGME49_275690_t26_1,TGME49_268650_t26_1,Tgondii_000009000.1;ortholog_cluster=ORTHOMCL32;product=term%3Dheat shock protein 101%2C putative%3Bevidence%3DIEA%3Bwith%3DGeneDB:TGME49_257990_t26_1%3Bis_preferred%3Dtrue,term%3Dchaperone clpB protein%2C putative%3Bevidence%3DIEA%3Bwith%3DGeneDB:TGME49_268650_t26_1%3Brank%3D1,term%3DClpB%2C putative%3Bevidence%3DIEA%3Bwith%3DGeneDB:TGME49_275690_t26_1%3Brank%3D1
TgRH.1_pilon_pilon	.	polypeptide	7742535	7745808	.	+	.	ID=Tgondii_000248400.1:pep;Derives_from=Tgondii_000248400.1;orthologous_to=Tgondii_000874000.1,TGME49_257990_t26_1,TGME49_275690_t26_1,TGME49_268650_t26_1,Tgondii_000009000.1,Tgondii_000009100.1;ortholog_cluster=ORTHOMCL32;product=term%3Dheat shock protein 101%2C putative%3Bevidence%3DIEA%3Bwith%3DGeneDB:TGME49_257990_t26_1%3Bis_preferred%3Dtrue,term%3Dchaperone clpB protein%2C putative%3Bevidence%3DIEA%3Bwith%3DGeneDB:TGME49_268650_t26_1%3Brank%3D1,term%3DClpB%2C putative%3Bevidence%3DIEA%3Bwith%3DGeneDB:TGME49_275690_t26_1%3Brank%3D1;Ontology_term=GO:0019538,GO:0016887,GO:0005524
TgRH.8_pilon_pilon	.	polypeptide	3707700	3716669	.	+	.	ID=Tgondii_000874000.1:pep;Derives_from=Tgondii_000874000.1;orthologous_to=Tgondii_000248400.1,TGME49_257990_t26_1,TGME49_275690_t26_1,TGME49_268650_t26_1,Tgondii_000009000.1,Tgondii_000009100.1;ortholog_cluster=ORTHOMCL32;product=term%3Dheat shock protein 101%2C putative%3Bevidence%3DIEA%3Bwith%3DGeneDB:TGME49_257990_t26_1%3Bis_preferred%3Dtrue,term%3Dchaperone clpB protein%2C putative%3Bevidence%3DIEA%3Bwith%3DGeneDB:TGME49_268650_t26_1%3Brank%3D1,term%3DClpB%2C putative%3Bevidence%3DIEA%3Bwith%3DGeneDB:TGME49_275690_t26_1%3Brank%3D1;Ontology_term=GO:0019538,GO:0005524,GO:0016887,GO:0008134,GO:0006355

If it is of any use, this script https://github.com/glaParaBio/genomeAnnotationPipeline/blob/master/scripts/add_hmmsearch_to_gff.py should correctly integrate the output of hmmsearch/hmmscan into a GFF file (not extensively tested!)
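
The expected mapping from protein to genome coordinates can be sketched as follows. This is a minimal illustration and not the code of the linked script. It assumes that domain coordinates are 1-based inclusive residue positions, that the protein is the translation of the concatenated CDS segments, and that the first CDS starts in phase 0. The function name `protein_to_genome` is hypothetical.

```python
def protein_to_genome(aa_start, aa_end, cds, strand="+"):
    """Map a 1-based inclusive residue range to genomic intervals.

    cds: list of (start, end) CDS segments, 1-based inclusive (GFF3 style).
    Returns a list of (start, end) genomic intervals sorted by start,
    one per CDS segment that the domain touches.
    Assumption: the first CDS in transcription order starts in phase 0.
    """
    # Domain position within the spliced coding sequence (0-based, half-open).
    nt_start = (aa_start - 1) * 3
    nt_end = aa_end * 3

    # Walk the CDS segments in transcription order.
    segments = sorted(cds, reverse=(strand == "-"))
    intervals = []
    consumed = 0  # coding bases contained in the segments already visited
    for seg_start, seg_end in segments:
        seg_len = seg_end - seg_start + 1
        lo = max(nt_start, consumed)
        hi = min(nt_end, consumed + seg_len)
        if lo < hi:  # the domain overlaps this segment
            if strand == "+":
                intervals.append(
                    (seg_start + lo - consumed, seg_start + hi - consumed - 1)
                )
            else:
                intervals.append(
                    (seg_end - (hi - consumed) + 1, seg_end - (lo - consumed))
                )
        consumed += seg_len
    return sorted(intervals)


# Toy example: two CDS segments of 9 bases (3 codons) each.
toy_cds = [(101, 109), (201, 209)]
print(protein_to_genome(3, 4, toy_cds))  # [(107, 109), (201, 203)]
```

A domain that spans an intron comes back as more than one interval, so it would be written as several `protein_match` lines (or one multi-line feature) rather than as a single range that runs through the intron.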

haessar pinned this issue Jan 25, 2023