Accessing transcript data for genes from bacterial artificial chromosomes (BACs) #89
Comments
gene_name is None because there was no symbol associated with the transcript. If you go to the Ensembl site for ENST00000666152, you can see that it has "-" for Name (i.e. none).
Can you please specify the URL of the data file you used to produce things?
For instance, here is the transcript on the REST server: https://cdot.cc/transcript/ENST00000344691.8
It has a URL: ftp://ftp.ensembl.org/pub/release-112/gtf/homo_sapiens/Homo_sapiens.GRCh38.112.gtf.gz
If we download and inspect this file, you can see there is no symbol associated with it:
Transcript:
Gene:
So my questions are:
Sure, you can just loop over the transcripts in the JSON then add them to eg
I made a lookup similar to this for the REST server:
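A minimal sketch of the loop described above, grouping transcripts by gene symbol. It assumes the cdot JSON has a top-level "transcripts" object keyed by transcript accession, with each entry carrying a "gene_name" field (which may be None); the function name build_gene_lookup and the toy IDs are hypothetical, and the layout should be checked against the file you actually load.

```python
from collections import defaultdict


def build_gene_lookup(cdot_data):
    """Map gene symbol -> sorted list of transcript accessions.

    Assumes cdot_data["transcripts"] is a dict keyed by transcript
    accession whose values have a "gene_name" entry (possibly None).
    """
    lookup = defaultdict(list)
    for transcript_id, transcript in cdot_data.get("transcripts", {}).items():
        gene_name = transcript.get("gene_name")
        if gene_name is None:
            # No symbol attached to this transcript, so it cannot be
            # found by gene symbol; skip it.
            continue
        lookup[gene_name].append(transcript_id)
    return {gene: sorted(ids) for gene, ids in lookup.items()}


# Toy data with made-up identifiers, only to show the shape of the result.
example = {
    "transcripts": {
        "TX2.1": {"gene_name": "GENE_A"},
        "TX1.1": {"gene_name": "GENE_A"},
        "TX3.1": {"gene_name": None},
    }
}
print(build_gene_lookup(example))
```

For a real file, the dict could be loaded with json.load(gzip.open(path, "rt")) from the standard library before passing it to the function.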
Hi Dave, thanks for your response. To answer your questions:
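import pyranges as pr

# Load the GENCODE v38 annotation (change the path as needed)
gr = pr.read_gtf("../../gencode.v38.annotation.gtf")
ex = gr.df  # view the annotation as a pandas DataFrame
# Keep only the rows annotated with the BAC-derived gene symbol
ex[ex["gene_name"] == "RP5-899B16.3"]
When filtering this output, I saw the transcript ENST00000666152.1 linked to RP5-899B16.3 (in the transcript_id column).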
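The same filtering can be sketched without pyranges by reading the GTF attribute column directly. This is only an illustration: the function name transcripts_by_gene_name and the sample records are made up, and it relies solely on the standard GTF layout (nine tab-separated fields, with key "value"; pairs in the last one). Transcripts with no gene_name attribute are grouped under None, which makes records without a symbol easy to spot.

```python
import re
from collections import defaultdict

# Matches quoted GTF attributes such as: gene_name "ABC";
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)";')


def transcripts_by_gene_name(gtf_lines):
    """Group transcript IDs by gene_name (None when the attribute is absent)."""
    grouped = defaultdict(set)
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header or comment line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue  # only look at transcript-level records
        attributes = dict(ATTRIBUTE.findall(fields[8]))
        transcript_id = attributes.get("transcript_id")
        if transcript_id is None:
            continue
        grouped[attributes.get("gene_name")].add(transcript_id)
    return grouped


# Made-up records: one transcript with a symbol, one without.
sample = [
    'chr1\tsrc\ttranscript\t1\t10\t.\t+\t.\tgene_id "G1"; transcript_id "T1.1"; gene_name "GENE_A";',
    'chr1\tsrc\ttranscript\t20\t30\t.\t+\t.\tgene_id "G2"; transcript_id "T2.1";',
]
groups = transcripts_by_gene_name(sample)
print(sorted(groups["GENE_A"]))
print(sorted(groups[None]))
```

With a real annotation, the lines could come from an open file handle, and groups["RP5-899B16.3"] would give the transcript IDs recorded for that symbol in that particular file.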
A group of volunteers from Roswell Park Comprehensive Cancer Center, five in all, donated their blood for research purposes at the beginning of the Human Genome Project. Blood from two of these volunteers, a man and a woman known only as RP5 and RP11, was used to create the initial framework for sequencing the human genome. This framework, referred to as bacterial artificial chromosomes, or BACs, are fragments of DNA from 250,000 to 500,000 base pairs in size that researchers used to align the final DNA sequence of the human genome.
We don't use GENCODE. cdot and UTA were originally designed to resolve HGVS, which uses RefSeq and Ensembl transcripts in humans.
I have raised issue #90 to investigate GENCODE and see if it's worth using as a source.
That being said, if you want to use cdot as a JSON annotation file, to save parsing GTFs etc, fair enough - you could try running the
Here's the code to generate the json.gz files - hopefully you can figure out how to run it: https://github.com/SACGF/cdot/blob/main/generate_transcript_data/refseq_transcripts_grch38.sh
@davmlaw
Hi Dave, I am currently working on a project in Alex Wagner's laboratory aimed at standardizing output from different gene fusion detection algorithms. A central component of this work is using a transcript-based model to model the transcript junctions for each of the partners in a fusion (see our specification for further reference).
We are currently using UTA to get this transcript data, but have observed several cases where an outputted fusion may report genes from bacterial artificial chromosomes as a fusion partner (e.g. RP5-899B16.3 and CTD-2055G21.1). We are considering using cdot in addition to UTA to help get transcript data for gene symbols that may not exist in the recent UTA release.
By processing earlier versions of GENCODE GTFs, such as version 38, we were able to extract the transcripts linked to these gene symbols. However, when querying the matched transcripts using cdot, the gene_name attribute was None. For example, for the gene RP5-899B16.3 we observed:
We were wondering why the gene_name attribute returned None? Also, would cdot be appropriate for this use case (getting a list of transcripts associated with a gene symbol)? Thank you for your help.