Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(Execution workflows / Utils): making RefSeq annotations work for QAPA and PAQR. #457

Open
mrgazzara opened this issue Dec 20, 2022 · 3 comments
Labels
blocked Something prevents progress enhancement New feature or request

Comments

@mrgazzara
Copy link
Collaborator

In trying to run the more conservative RefSeq annotation on certain tools, I've run in to an issue where the tools assume specific attributes (e.g. Ensembl type attributes for QAPA or Ensembl/Gencode for PAQR) to be present in order to build their annotation.

I need to check a bit deeper to make sure I gather all the requirements. Not sure if it is best to make a Util script to transform a RefSeq gtf so it contains theses required Ensembl/Gencode attributes or if we should add an option within each EWF to handle this.

@mrgazzara mrgazzara added enhancement New feature or request blocked Something prevents progress labels Dec 20, 2022
@dominikburri
Copy link
Collaborator

Hi @mrgazzara, I try to look into PAQR, but it's not quite clear yet what the adaptation of refseq to gencode needs to look like.
The thing I find in the config is this:

biotype_key: transcript_type

which is for the gencode annotation and could be replaced by gbkey (=mRNA) or gene_biotype (but this is only on type, col 3 in gff). Those seem to be the replacement keys.

But you mentioned that you would need to change code within. Can you specify what exactly? What is assumed in the gtf that is not present in the refseq annotation?

What I see in tandem pas, is that it looks for gene_id: https://github.com/zavolanlab/tandem-pas/blob/98478da7f0f2cb15e8df777a405840261ed055f0/scripts/mz-select-pas-subset.pl#L130
which is not available in refseq.
It could be replaced by ID=gene- though - assuming gff format. Or maybe by checking column 3 (gene, exon, CDS, mRNA/transcript [refseq,gencode]).
transcript_id is available in both gencode and refseq.

@ninsch3000
Copy link
Collaborator

Hey @mrgazzara,
For PAQR I can confirm what @dominikburri writes above, the config.PAQR.yaml has to be adapted ("transcript_type" for GENCODE, "transcript_biotype" for ensembl). I couldn't find the refseq annotation file you're using in APAeval, but just searching for one myself I came across a gtf here. At a first glance, the fields "gene_id" and "transcript_biotype" seem to be present, so we should theoretically be able to adapt PAQR.

Could you maybe specify in more detail what seems to be the problem? Or provide your refseq annotation somewhere?

@mrgazzara
Copy link
Collaborator Author

Sorry for the delays. Getting back to APAeval now.

The RefSeq annotations Farica and I used for running/re-running the other tools were from UCSC genome browser. Taking another look they were actually the "refGene" annotations, which seem to be RefSeq but just for annotated protein coding and non-coding genes (accessions with NM_* and NR_*, see documentation here. These files can be downloaded through UCSC ftp for human hg38.refGene.gtf.gz or mouse mm10.refGene.gtf.gz.

This file format seemed to be to most closely match the ones DaPars used as their recommended input. I can try the file CJ linked to to see if it works with PAQR / QAPA (I think it might) and we can either use this as is, subset this file, or adapt the refGene gtf to have the necessary fields.

@ninsch3000 ninsch3000 changed the title Execution Workflows / Utils: making RefSeq annotations work for QAPA and PAQR. feat [Execution workflows / Utils]: making RefSeq annotations work for QAPA and PAQR. Aug 9, 2023
@ninsch3000 ninsch3000 changed the title feat [Execution workflows / Utils]: making RefSeq annotations work for QAPA and PAQR. feat(Execution workflows / Utils): making RefSeq annotations work for QAPA and PAQR. Aug 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
blocked Something prevents progress enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants