Accessing transcript data for genes from bacterial artificial chromosomes (BACs) #89

Open
jarbesfeld opened this issue Jan 9, 2025 · 3 comments

Comments


jarbesfeld commented Jan 9, 2025

@davmlaw

Hi Dave, I am currently working on a project in Alex Wagner's laboratory aimed at standardizing output from different gene fusion detection algorithms. A central component of this work is a transcript-based model that represents the transcript junction for each partner in a fusion (see our specification for further reference).

We are currently using UTA to get this transcript data, but have observed several cases where a reported fusion includes a gene from a bacterial artificial chromosome as a fusion partner (e.g. RP5-899B16.3 and CTD-2055G21.1). We are considering using cdot in addition to UTA to retrieve transcript data for gene symbols that may not exist in the recent UTA release.

By processing earlier versions of GENCODE GTFs, such as version 38, we were able to extract the transcripts linked to these gene symbols. However, when querying the matched transcripts using cdot, the gene_name attribute was None. For example, for the gene RP5-899B16.3 we observed:

{'id': 'ENST00000666152.1',
 'chrom': 'NC_000006.12',
 'start': 139938863,
 'end': 139991094,
 'strand': '-',
 'cds_start': 139991094,
 'cds_end': 139991094,
 'gene_name': None,
 'exons': [[139938863, 139939458],
  [139978011, 139978404],
  [139978621, 139978873],
  [139990992, 139991094]]}

We were wondering why the gene_name attribute is None. Also, would cdot be appropriate for this use case (getting a list of transcripts associated with a gene symbol)? Thank you for your help.

jarbesfeld changed the title from "Accessing transcript data for genes from bacteria artificial chromosomes (BACs)" to "Accessing transcript data for genes from bacterial artificial chromosomes (BACs)" on Jan 9, 2025
Contributor

davmlaw commented Jan 13, 2025

gene_name is None because there is no symbol associated with the transcript. If you go to the Ensembl site for ENST00000666152, you can see that it has "-" for Name (i.e. none).

Can you please specify the URL of the data file you used to produce this? For instance, here is a transcript on the REST server:

https://cdot.cc/transcript/ENST00000344691.8

It lists this source URL:

ftp://ftp.ensembl.org/pub/release-112/gtf/homo_sapiens/Homo_sapiens.GRCh38.112.gtf.gz

If we download and inspect the GFF3 version of this release, you can see there is no symbol associated with ENST00000666152:

Transcript:

$ zgrep ENST00000666152 Homo_sapiens.GRCh38.112.gff3.gz 
6	havana_tagene	lnc_RNA	139938864	139991094	.	-	.	ID=transcript:ENST00000666152;Parent=gene:ENSG00000287820;biotype=lncRNA;tag=basic,Ensembl_canonical;transcript_id=ENST00000666152;version=1
6	havana_tagene	exon	139938864	139939458	.	-	.	Parent=transcript:ENST00000666152;Name=ENSE00003855643;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003855643;rank=4;version=1
6	havana_tagene	exon	139978012	139978404	.	-	.	Parent=transcript:ENST00000666152;Name=ENSE00003858874;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003858874;rank=3;version=1
6	havana_tagene	exon	139978622	139978873	.	-	.	Parent=transcript:ENST00000666152;Name=ENSE00003863078;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003863078;rank=2;version=1
6	havana_tagene	exon	139990993	139991094	.	-	.	Parent=transcript:ENST00000666152;Name=ENSE00003870475;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003870475;rank=1;version=1

Gene:

$ zgrep ENSG00000287820 Homo_sapiens.GRCh38.112.gff3.gz 
6	havana_tagene	ncRNA_gene	139938864	139991094	.	-	.	ID=gene:ENSG00000287820;biotype=lncRNA;description=novel transcript;gene_id=ENSG00000287820;logic_name=havana_tagene_homo_sapiens;version=1
6	havana_tagene	lnc_RNA	139938864	139991094	.	-	.	ID=transcript:ENST00000666152;Parent=gene:ENSG00000287820;biotype=lncRNA;tag=basic,Ensembl_canonical;transcript_id=ENST00000666152;version=1
6	havana_tagene	lnc_RNA	139976713	139991063	.	-	.	ID=transcript:ENST00000668850;Parent=gene:ENSG00000287820;biotype=lncRNA;tag=basic;transcript_id=ENST00000668850;version=1
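The same check can be done programmatically. The following is a minimal sketch, assuming that Ensembl GFF3 gene lines carry the symbol in a `Name` attribute when one exists; `parse_gff3_attributes` is a hypothetical helper written for this illustration, not part of cdot. It parses the attributes column of the gene line shown above and looks for a symbol.

```python
def parse_gff3_attributes(column9: str) -> dict:
    """Split a GFF3 attributes column ("key=value;key=value") into a dict."""
    attributes = {}
    for field in column9.strip().split(";"):
        if not field:
            continue
        key, _, value = field.partition("=")
        attributes[key] = value
    return attributes


# Attributes column of the ENSG00000287820 gene line from the zgrep output.
gene_attributes = parse_gff3_attributes(
    "ID=gene:ENSG00000287820;biotype=lncRNA;description=novel transcript;"
    "gene_id=ENSG00000287820;logic_name=havana_tagene_homo_sapiens;version=1"
)

# Assumption: a gene with a symbol would have a "Name" key here.
gene_symbol = gene_attributes.get("Name")
print(gene_symbol)  # None, because this line has no Name attribute
```

With no `Name` attribute on either the gene or the transcript line, there is nothing to populate gene_name from.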

So my questions are:

  • Why do you think "RP5-899B16.3" is associated with ENST00000666152.1?
  • I don't know much about BACs and I don't understand RP5-899B16.3 and CTD-2055G21.1. Is this something added onto the end of a gene, e.g. the old obsolete symbol RP5?

> would cdot be appropriate for this use case (getting a list of transcripts associated with a gene symbol)

Sure, you can just loop over the transcripts in the JSON and add them to, e.g.:

transcripts_per_gene = defaultdict(set)
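A fuller version of that loop might look like the sketch below. It assumes the transcripts are available as a mapping from transcript ID to a record with a `gene_name` key, as in the record quoted earlier in this thread. The function name `group_transcripts_by_gene` and the placeholder IDs `TX_A.1`, `GENE_A`, etc. are hypothetical.

```python
from collections import defaultdict


def group_transcripts_by_gene(transcripts: dict) -> dict:
    """Map each gene symbol to the set of transcript IDs that carry it."""
    transcripts_per_gene = defaultdict(set)
    for transcript_id, record in transcripts.items():
        gene_name = record.get("gene_name")
        if gene_name is None:
            continue  # e.g. ENST00000666152.1, which has no symbol
        transcripts_per_gene[gene_name].add(transcript_id)
    return dict(transcripts_per_gene)


# Hypothetical in-memory stand-in for the transcripts in a cdot JSON file.
example_transcripts = {
    "ENST00000666152.1": {"gene_name": None},
    "TX_A.1": {"gene_name": "GENE_A"},
    "TX_A.2": {"gene_name": "GENE_A"},
    "TX_B.1": {"gene_name": "GENE_B"},
}
lookup = group_transcripts_by_gene(example_transcripts)
print(sorted(lookup["GENE_A"]))  # ['TX_A.1', 'TX_A.2']
```

For a real json.gz file, the mapping would come from `json.load(gzip.open(path, "rt"))`. Which top-level key holds the transcripts is not stated in this thread, so check the file before relying on a specific key.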

I made a lookup similar to this for the REST server:

https://cdot.cc/transcripts/gene/BRCA2
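For scripted access, the two REST URL patterns that appear in this thread can be wrapped in small helpers. This is only a sketch: the helper names are hypothetical, and the shape of the JSON returned by either endpoint is not described here, so `fetch_json` returns the parsed response as-is without assuming any keys.

```python
import json
import urllib.request
from urllib.parse import quote

CDOT_BASE_URL = "https://cdot.cc"


def transcript_url(transcript_id: str) -> str:
    """URL for a single transcript, e.g. /transcript/ENST00000344691.8."""
    return f"{CDOT_BASE_URL}/transcript/{quote(transcript_id, safe='')}"


def gene_transcripts_url(gene_symbol: str) -> str:
    """URL for the per-gene lookup, e.g. /transcripts/gene/BRCA2."""
    return f"{CDOT_BASE_URL}/transcripts/gene/{quote(gene_symbol, safe='')}"


def fetch_json(url: str, timeout: float = 30.0):
    """Fetch a URL and parse the body as JSON (requires network access)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


print(transcript_url("ENST00000344691.8"))
print(gene_transcripts_url("BRCA2"))
```

Calling `fetch_json(gene_transcripts_url("BRCA2"))` would then retrieve the lookup result. Inspect the response before writing code that depends on its structure.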

Author

jarbesfeld commented Jan 13, 2025

Hi Dave, thanks for your response. To answer your questions:

  1. I downloaded the GENCODE v38 comprehensive annotation GTF at this link and unzipped the file (1.46 GB). Using the pyranges library, I filtered for rows in the GTF whose gene_name is RP5-899B16.3:
import pyranges as pr

gr = pr.read_gtf("../../gencode.v38.annotation.gtf") # Change path as needed
ex = gr.df
ex[ex["gene_name"] == "RP5-899B16.3"]

In the filtered output, I saw the transcript ENST00000666152.1 linked to RP5-899B16.3 (in the transcript_id column).
[Screenshot of the filtered output]
Since this gene symbol only exists in the v38 GENCODE GTF, I was wondering whether this data is simply not included in cdot.

  2. RP5-899B16.3 does not reference the obsolete gene symbol RP5 but an artificial chromosome that was generated during the Human Genome Project:

A group of volunteers from Roswell Park Comprehensive Cancer Center, five in all, donated their blood for research purposes at the beginning of the Human Genome Project. Blood from two of these volunteers, a man and a woman known only as RP5 and RP11, was used to create the initial framework for sequencing the human genome. This framework, referred to as bacterial artificial chromosomes, or BACs, are fragments of DNA from 250,000 to 500,000 base pairs in size that researchers used to align the final DNA sequence of the human genome.
Source
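As an alternative to loading the whole GTF into a DataFrame, the gene-symbol filter from item 1 can also be done by streaming the file line by line. This is a minimal standard-library sketch. `transcripts_for_gene` is a hypothetical helper, and the sample line is illustrative only: it is assembled from the IDs and coordinates quoted in this thread, not copied from the GENCODE file.

```python
import re

# GTF attributes look like: key "value"; key "value";
GTF_ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')


def transcripts_for_gene(gtf_lines, gene_name: str) -> set:
    """Collect transcript_id values from 'transcript' rows with this gene_name."""
    matches = set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header/comment lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9 or columns[2] != "transcript":
            continue
        attributes = dict(GTF_ATTRIBUTE.findall(columns[8]))
        transcript_id = attributes.get("transcript_id")
        if attributes.get("gene_name") == gene_name and transcript_id:
            matches.add(transcript_id)
    return matches


# Illustrative line only (not copied from the real file).
sample_lines = [
    "##description: illustrative example\n",
    "chr6\tHAVANA\ttranscript\t139938864\t139991094\t.\t-\t.\t"
    'gene_id "ENSG00000287820.1"; transcript_id "ENST00000666152.1"; '
    'gene_name "RP5-899B16.3";\n',
]
found = transcripts_for_gene(sample_lines, "RP5-899B16.3")
print(found)  # {'ENST00000666152.1'}
```

Any iterable of lines works, so an open file handle (or `gzip.open(path, "rt")` for the compressed download) can be passed in place of `sample_lines`.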

Contributor

davmlaw commented Jan 14, 2025

We don't use GENCODE. cdot and UTA were originally designed to resolve HGVS, which uses RefSeq and Ensembl transcripts in humans.

I have raised issue #90 to investigate whether GENCODE is worth using as a source.

That being said, if you want to use cdot as a JSON annotation file, to save parsing GTFs etc., fair enough - you could try running cdot_json.py gtf_to_json on the GENCODE GTF and maybe that would help you?

Here's the code to generate the json.gz files - hopefully you can figure out how to run it: https://github.com/SACGF/cdot/blob/main/generate_transcript_data/refseq_transcripts_grch38.sh
