Accessing transcript data for genes from bacterial artificial chromosomes (BACs) #89

jarbesfeld · 2025-01-09T18:36:14Z

Hi Dave, I am currently working on a project in Alex Wagner's laboratory aimed at standardizing output from different gene fusion detection algorithms. A central component of this work is a transcript-based model that represents the transcript junctions for each partner in a fusion (see our specification for further reference).

We are currently using UTA to get this transcript data, but have observed several cases where a reported fusion lists a gene from a bacterial artificial chromosome as a fusion partner (e.g. RP5-899B16.3 and CTD-2055G21.1). We are considering using cdot in addition to UTA to get transcript data for gene symbols that may not exist in the recent UTA release.

By processing earlier versions of GENCODE GTFs, such as version 38, we were able to extract the transcripts linked to these gene symbols. However, when querying the matched transcripts using cdot, the gene_name attribute was None. For example, for the gene RP5-899B16.3 we observed:

{'id': 'ENST00000666152.1',
 'chrom': 'NC_000006.12',
 'start': 139938863,
 'end': 139991094,
 'strand': '-',
 'cds_start': 139991094,
 'cds_end': 139991094,
 'gene_name': None,
 'exons': [[139938863, 139939458],
  [139978011, 139978404],
  [139978621, 139978873],
  [139990992, 139991094]]}

Why is the gene_name attribute None here? Also, would cdot be appropriate for this use case (getting a list of transcripts associated with a gene symbol)? Thank you for your help.

davmlaw · 2025-01-13T06:52:58Z

gene_name is None because there is no symbol associated with the transcript. If you go to the Ensembl site for ENST00000666152, you can see that it has "-" for Name (i.e. none).

Can you please specify the URL of the data file you used to produce this result? For instance, here is a transcript on the REST server:

https://cdot.cc/transcript/ENST00000344691.8

It lists this source URL:

ftp://ftp.ensembl.org/pub/release-112/gtf/homo_sapiens/Homo_sapiens.GRCh38.112.gtf.gz

If you download and inspect this file, you can see there is no symbol associated with the transcript or its gene:

Transcript:

$ zgrep ENST00000666152 Homo_sapiens.GRCh38.112.gff3.gz 
6	havana_tagene	lnc_RNA	139938864	139991094	.	-	.	ID=transcript:ENST00000666152;Parent=gene:ENSG00000287820;biotype=lncRNA;tag=basic,Ensembl_canonical;transcript_id=ENST00000666152;version=1
6	havana_tagene	exon	139938864	139939458	.	-	.	Parent=transcript:ENST00000666152;Name=ENSE00003855643;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003855643;rank=4;version=1
6	havana_tagene	exon	139978012	139978404	.	-	.	Parent=transcript:ENST00000666152;Name=ENSE00003858874;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003858874;rank=3;version=1
6	havana_tagene	exon	139978622	139978873	.	-	.	Parent=transcript:ENST00000666152;Name=ENSE00003863078;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003863078;rank=2;version=1
6	havana_tagene	exon	139990993	139991094	.	-	.	Parent=transcript:ENST00000666152;Name=ENSE00003870475;constitutive=0;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00003870475;rank=1;version=1

Gene:

$ zgrep ENSG00000287820 Homo_sapiens.GRCh38.112.gff3.gz 
6	havana_tagene	ncRNA_gene	139938864	139991094	.	-	.	ID=gene:ENSG00000287820;biotype=lncRNA;description=novel transcript;gene_id=ENSG00000287820;logic_name=havana_tagene_homo_sapiens;version=1
6	havana_tagene	lnc_RNA	139938864	139991094	.	-	.	ID=transcript:ENST00000666152;Parent=gene:ENSG00000287820;biotype=lncRNA;tag=basic,Ensembl_canonical;transcript_id=ENST00000666152;version=1
6	havana_tagene	lnc_RNA	139976713	139991063	.	-	.	ID=transcript:ENST00000668850;Parent=gene:ENSG00000287820;biotype=lncRNA;tag=basic;transcript_id=ENST00000668850;version=1
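The same check can be scripted instead of read by eye. This is a minimal sketch, assuming Ensembl GFF3 gene records carry the symbol in a `Name` attribute when one exists; `parse_gff3_attributes` is a hypothetical helper written for illustration and is not part of cdot.

```python
def parse_gff3_attributes(line):
    """Return the ninth (attributes) column of a GFF3 line as a dict."""
    columns = line.rstrip("\n").split("\t")
    attributes = {}
    for field in columns[8].split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key] = value
    return attributes


# Gene record copied from the zgrep output above.
gene_line = (
    "6\thavana_tagene\tncRNA_gene\t139938864\t139991094\t.\t-\t.\t"
    "ID=gene:ENSG00000287820;biotype=lncRNA;description=novel transcript;"
    "gene_id=ENSG00000287820;logic_name=havana_tagene_homo_sapiens;version=1"
)

gene_attributes = parse_gff3_attributes(gene_line)
# Assumption: the symbol, when present, is stored under the "Name" key.
gene_symbol = gene_attributes.get("Name")
print(gene_symbol)  # None, because this record has no Name attribute
```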

So my questions are:

Why do you think "RP5-899B16.3" is associated with ENST00000666152.1?
I don't know much about BACs, and I don't understand the names RP5-899B16.3 and CTD-2055G21.1. Is this something appended to a gene name, e.g. the old, obsolete symbol RP5?

Re: "would cdot be appropriate for this use case (getting a list of transcripts associated with a gene symbol)?"

Sure, you can just loop over the transcripts in the JSON and add each one to a lookup keyed by gene symbol, e.g.

from collections import defaultdict

transcripts_per_gene = defaultdict(set)
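A minimal sketch of that loop follows. It assumes the cdot JSON keeps its records under a top-level "transcripts" key, keyed by transcript accession, with a "gene_name" field on each record (as in the record shown earlier in this thread). The helper names and the sample data are hypothetical, so check the layout of the file you actually load.

```python
import gzip
import json
from collections import defaultdict


def load_cdot_json(path):
    """Load a gzipped cdot JSON file into a plain dict."""
    with gzip.open(path, "rt") as handle:
        return json.load(handle)


def build_transcripts_per_gene(cdot_data):
    """Map each gene symbol to the set of transcript accessions that carry it."""
    transcripts_per_gene = defaultdict(set)
    # Assumption: records live under a top-level "transcripts" key.
    for transcript_id, record in cdot_data.get("transcripts", {}).items():
        gene_name = record.get("gene_name")
        if gene_name is None:
            # Transcripts without a symbol cannot be found by symbol.
            continue
        transcripts_per_gene[gene_name].add(transcript_id)
    return transcripts_per_gene


# Hypothetical in-memory data in the assumed layout; in practice this would
# come from load_cdot_json("path/to/your_cdot_file.json.gz").
example_data = {
    "transcripts": {
        "ENST00000666152.1": {"gene_name": None},
        "EXAMPLE_TX.1": {"gene_name": "EXAMPLE_GENE"},
        "EXAMPLE_TX.2": {"gene_name": "EXAMPLE_GENE"},
    }
}
example_lookup = build_transcripts_per_gene(example_data)
print(sorted(example_lookup["EXAMPLE_GENE"]))
```

Transcripts whose gene_name is None are skipped, so a symbol that is absent from the source annotation will not appear in the lookup at all.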

I made a lookup similar to this for the REST server:

https://cdot.cc/transcripts/gene/BRCA2

jarbesfeld · 2025-01-13T14:14:19Z

Hi Dave, thanks for your response. To answer your questions:

I downloaded the GENCODE v38 comprehensive annotation GTF at this link and unzipped the file (1.46 GB). Using the pyranges library, I filtered for rows in the GTF whose gene_name is RP5-899B16.3:

import pyranges as pr

gr = pr.read_gtf("../../gencode.v38.annotation.gtf") # Change path as needed
ex = gr.df
ex[ex["gene_name"] == "RP5-899B16.3"]

When filtering this output, I saw the transcript ENST00000666152.1 linked to RP5-899B16.3 (in the transcript_id column).
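For anyone who wants to reproduce this without pyranges, here is a rough standard-library sketch of the same filter. It assumes the usual GTF layout (nine tab-separated columns, with quoted key/value pairs in the last one). The function name is hypothetical, and the sample line is hand-written in GTF layout for illustration rather than copied from the GENCODE file.

```python
import re
from collections import defaultdict

# GTF attributes look like: gene_name "RP5-899B16.3"; transcript_id "ENST...";
ATTRIBUTE_PATTERN = re.compile(r'(\S+) "([^"]*)"')


def transcripts_by_gene_name(gtf_lines):
    """Map gene_name to the set of transcript_id values on transcript rows."""
    lookup = defaultdict(set)
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header comments
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9 or columns[2] != "transcript":
            continue  # keep one row per transcript, ignore exons etc.
        attributes = dict(ATTRIBUTE_PATTERN.findall(columns[8]))
        gene_name = attributes.get("gene_name")
        transcript_id = attributes.get("transcript_id")
        if gene_name and transcript_id:
            lookup[gene_name].add(transcript_id)
    return lookup


# Illustrative line only (hand-written, not taken from the real file).
sample_lines = [
    "##description: illustrative sample\n",
    'chr6\tHAVANA\ttranscript\t139938864\t139991094\t.\t-\t.\t'
    'gene_id "ENSG00000287820.1"; transcript_id "ENST00000666152.1"; '
    'gene_type "lncRNA"; gene_name "RP5-899B16.3"; level 2;\n',
]
lookup = transcripts_by_gene_name(sample_lines)
print(lookup["RP5-899B16.3"])
```

With the real file, the same function can be given an open file handle, which is read line by line rather than loaded into a DataFrame.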

Since this gene symbol exists only in the GENCODE v38 GTF, is it possible that cdot does not include this data?

RP5-899B16.3 refers not to the obsolete gene symbol RP5 but to a bacterial artificial chromosome that was generated during the Human Genome Project:

A group of volunteers from Roswell Park Comprehensive Cancer Center, five in all, donated their blood for research purposes at the beginning of the Human Genome Project. Blood from two of these volunteers, a man and a woman known only as RP5 and RP11, was used to create the initial framework for sequencing the human genome. This framework, referred to as bacterial artificial chromosomes, or BACs, are fragments of DNA from 250,000 to 500,000 base pairs in size that researchers used to align the final DNA sequence of the human genome.
Source

davmlaw · 2025-01-14T01:34:08Z

We don't use GENCODE. cdot and UTA were originally designed to resolve HGVS, which uses RefSeq and Ensembl transcripts in humans.

I have raised issue #90 to investigate whether GENCODE is worth using as a source.

That being said, if you want to use cdot as a JSON annotation file to save parsing GTFs etc., fair enough. You could try running the cdot_json.py gtf_to_json command on the GENCODE GTF; maybe that would help you?

Here's the code to generate the json.gz files - hopefully you can figure out how to run it: https://github.com/SACGF/cdot/blob/main/generate_transcript_data/refseq_transcripts_grch38.sh

jarbesfeld changed the title ~~Accessing transcript data for genes from bacteria artificial chromosomes (BACs)~~ Accessing transcript data for genes from bacterial artificial chromosomes (BACs) Jan 9, 2025

davmlaw mentioned this issue Jan 14, 2025

Investigate using GENCODE as a source of transcripts #90

Open
