SGN tomato data description

Arnold Kuzniar edited this page Nov 10, 2016 · 7 revisions
  1. Data set metadata

Dataset title: Tomato gene models

Dataset description: The ITAG2.4 release of the official Solanum lycopersicum (cultivar Heinz 1706) genome annotation with 34,725 gene models, available from the Sol Genomics Network (SGN).

Download URL: ftp://ftp.solgenomics.net/genomes/Solanum_lycopersicum/annotation/ITAG2.4_release/ITAG2.4_gene_models.gff3

License: ?

Release/version: ITAG2.4 genome annotation (based on SL2.50 genome assembly)

Release issue date: 23-02-2014 (DD-MM-YYYY)

Distribution format: GFF3 (according to gff-version pragma)

  • Note: Strictly speaking, the GFF file does not comply with this specification.

MD5 checksum: 4bf947efde8b0f8101e3bcb9746e5986
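
A downloaded copy can be checked against the MD5 checksum listed above. The following is a minimal sketch using only the Python standard library; the function names `md5_of_file` and `verify_download` and the local file path are illustrative assumptions, not part of the SGN distribution.

```python
import hashlib

# Checksum as listed in the data set metadata for ITAG2.4_gene_models.gff3.
EXPECTED_MD5 = "4bf947efde8b0f8101e3bcb9746e5986"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` matches the expected checksum."""
    return md5_of_file(path) == expected
```

For example, `verify_download("ITAG2.4_gene_models.gff3")` would return `True` only if the local file (hypothetical path) is byte-identical to the version described here.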

  1. Data record metadata

Example:

SL2.50ch00	ITAG_eugene	gene	16437	18189	.	+	.	Alias=Solyc00g005000;ID=gene:Solyc00g005000.2;Name=Solyc00g005000.2;from_BOGAS=1;length=1753
SL2.50ch00	ITAG_eugene	mRNA	16437	18189	.	+	.	ID=mRNA:Solyc00g005000.2.1;Name=Solyc00g005000.2.1;Note=Aspartic proteinase nepenthesin I (AHRD V1 **-- A9ZMF9_NEPAL)%3B contains Interpro domain(s)  IPR001461  Peptidase A1 ;Ontology_term=GO:0006508;Parent=gene:Solyc00g005000.2;from_BOGAS=1;interpro2go_term=GO:0006508;length=1753;nb_exon=2
SL2.50ch00	ITAG_eugene	exon	16437	17275	.	+	.	ID=exon:Solyc00g005000.2.1.1;Parent=mRNA:Solyc00g005000.2.1;from_BOGAS=1

GFF3 files are nine-column, tab-delimited, plain text files:

Column 1 "seqid": chromosome numbers (e.g. SL2.50ch00..ch12), mandatory

Column 2 "source": data source (constant: ITAG_eugene, refers to Eugene gene predictor), mandatory

Column 3 "type": feature types (gene, mRNA, CDS, exon, intron, five_prime_UTR, three_prime_UTR), mandatory

Column 4 "start": start coordinate of the feature, mandatory

Column 5 "end": end coordinate of the feature, mandatory

Column 6 "score": not available (.)

Column 7 "strand": DNA strandedness (+/-), mandatory

Column 8 "phase": for CDS features, the number of bases (0, 1 or 2) to skip from the start of the feature to reach the first base of the next codon; '.' is used for other features (e.g. the exon in the example above), mandatory

Column 9 "attributes": contains key=value pairs separated by ;

  • ID: unique feature ID (redundantly) prefixed with feature type (e.g. gene:Solyc00g005000.2 or mRNA:Solyc00g005000.2.1) and (inconsistently) used in the non-prefixed form at the web front-end.

  • Name: the non-prefixed form of feature ID (e.g. Solyc00g005000.2 or Solyc00g005000.2.1)

  • Parent: refers to the parent ID of this (child) feature, indicates part-of relation (e.g. to group transcripts into genes or exons into transcripts)

  • Note: function annotation of transcripts based on homology to (plant) proteins in UniProtKB, domains/motifs in InterPro

  • Ontology_term, interpro2go_term or Sifter_term: cross-references to Gene Ontology term IDs
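
The column layout described above can be read with a few lines of Python. This is a minimal sketch, not a full GFF3 parser: the function name `parse_gff3_line` is an illustrative assumption, and it assumes each attribute key occurs once per record. Attribute values are percent-decoded only after splitting on `;`, so that an escaped `%3B` inside a Note does not break the split.

```python
from urllib.parse import unquote

GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")


def parse_gff3_line(line):
    """Split one tab-delimited GFF3 record into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    attributes = {}
    for pair in record["attributes"].split(";"):
        if not pair.strip():
            continue  # tolerate a trailing ';'
        key, _, value = pair.partition("=")
        attributes[key.strip()] = unquote(value)
    record["attributes"] = attributes
    return record


# The exon record from the example above.
example = ("SL2.50ch00\tITAG_eugene\texon\t16437\t17275\t.\t+\t.\t"
           "ID=exon:Solyc00g005000.2.1.1;Parent=mRNA:Solyc00g005000.2.1;from_BOGAS=1")
record = parse_gff3_line(example)
print(record["type"], record["start"], record["end"])
print(record["attributes"]["Parent"])
```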

  1. Data set metadata

Dataset title: Tomato SGN genetic markers

Dataset description: The original dataset contains alignments to SGN unigenes, SGN marker sequences and SGN locus sequences. Only SGN markers are imported (in total 5077 ITAG_sgn_markers).

Download URL: ftp://ftp.solgenomics.net/genomes/Solanum_lycopersicum/annotation/ITAG2.4_release/ITAG2.4_sgn_data.gff3

License: ?

Release/version: ITAG2.4

Release issue date: 23-02-2014

Distribution format: GFF3 (according to gff-version pragma)

MD5 checksum: 939cf6f468eab5572653b626d5078aaa

  1. Data record metadata

Example:

SL2.50ch00	ITAG_sgn_markers	match	3999461	4000061	0.989	-	.	Alias=SGN-M676;ID=gene1_0-i2;Name=SSR3;Note=marker name(s): SSR3%2C SGN-M676;Target=SGN-M676 1 601 +

Column 1 "seqid": chromosome numbers (e.g. SL2.50ch00..ch12), mandatory

Column 2 "source": data source (constant: ITAG_sgn_markers), mandatory

Column 3 "type": feature type (only match is used, although variant would be more correct)

Column 4 "start": start coordinate of the feature, mandatory

Column 5 "end": end coordinate of the feature, mandatory

Column 6 "score": not relevant

Column 7 "strand": not relevant

Column 8 "phase": not relevant

Column 9 "attributes": contains key=value pairs separated by ;

  • ID: unique marker ID (e.g. gene1_0-i2), mandatory

  • Name: marker name as known in the literature? (e.g. cLER-14-H18), mandatory

  • Alias: alternative name for the marker (e.g. SGN-M2995), mandatory

  • Note: concatenation of both Name and Alias values (redundant)
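
The redundancy of the Note attribute can be checked programmatically. The sketch below assumes that every Note follows the pattern of the example record, i.e. the prefix `marker name(s): ` followed by the Name and Alias values separated by an escaped comma (`%2C`); the helper names are illustrative.

```python
from urllib.parse import unquote

NOTE_PREFIX = "marker name(s): "


def parse_attributes(column9):
    """Turn the GFF3 attributes column into a dict with decoded values."""
    attributes = {}
    for pair in column9.split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attributes[key] = unquote(value)
    return attributes


def note_is_redundant(attributes):
    """Return True if Note only repeats the Name and Alias values."""
    note = attributes.get("Note", "")
    if not note.startswith(NOTE_PREFIX):
        return False
    names = [name.strip() for name in note[len(NOTE_PREFIX):].split(",")]
    return names == [attributes.get("Name"), attributes.get("Alias")]


# Attributes column of the example record above.
attrs = parse_attributes(
    "Alias=SGN-M676;ID=gene1_0-i2;Name=SSR3;"
    "Note=marker name(s): SSR3%2C SGN-M676;Target=SGN-M676 1 601 +"
)
print(note_is_redundant(attrs))
```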

  1. Data set metadata

Dataset title: Tomato SolCAP genetic markers

Dataset description: The dataset contains SolCAP genetic markers (in total 8760 SNPs).

Download URL: ftp://ftp.solgenomics.net/genomes/Solanum_lycopersicum/annotation/ITAG2.4_release/ITAG2.4_solCAP.gff3

License: ?

Release/version: ITAG2.4

Release issue date: 11-07-2014

Distribution format: GFF3 (according to gff-version pragma)

MD5 checksum: 6d1e291acfa8f20cf89438e521315c80

  1. Data record metadata

Example:

SL2.50ch00	ITAG_sgn_markers	match	16728330	16728330	.	+	.	ID=solcap_snp_sl_100476;Name=solcap_snp_sl_100476;Alias=solcap_snp_sl_100476;Note=marker name(s): solcap_snp_sl_100476;Target=solcap_snp_sl_100476 1 1 +

Column 1 "seqid": chromosome numbers (e.g. SL2.50ch00..ch12), mandatory

Column 2 "source": data source (constant: ITAG_sgn_markers), mandatory

Column 3 "type": feature type match (although variant would be more correct)

Column 4 "start": start coordinate of the feature, mandatory

Column 5 "end": end coordinate of the feature, mandatory

Column 6 "score": not relevant

Column 7 "strand": not relevant

Column 8 "phase": not relevant

Column 9 "attributes": contains key=value pairs separated by ;

  • ID: unique marker ID (e.g. solcap_snp_sl_100476), mandatory

  • Name, Alias and Note: same as ID (redundant)
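
Because Name, Alias and Note repeat the ID, a SolCAP record can be reduced to a chromosome, a position and a marker ID. The following is a minimal sketch under the assumption that every SNP spans exactly one base (start equals end), as in the example record; the function name `solcap_to_snp` and the output keys are illustrative.

```python
def solcap_to_snp(line):
    """Reduce a SolCAP GFF3 record to its non-redundant fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, got {len(fields)}")
    start, end = int(fields[3]), int(fields[4])
    if start != end:
        # Assumption: SNP markers are single-base features.
        raise ValueError(f"expected a single-base feature, got {start}..{end}")
    attributes = dict(pair.split("=", 1) for pair in fields[8].split(";") if pair)
    return {"seqid": fields[0], "pos": start, "id": attributes["ID"]}


# The example record above.
example = ("SL2.50ch00\tITAG_sgn_markers\tmatch\t16728330\t16728330\t.\t+\t.\t"
           "ID=solcap_snp_sl_100476;Name=solcap_snp_sl_100476;"
           "Alias=solcap_snp_sl_100476;Note=marker name(s): solcap_snp_sl_100476;"
           "Target=solcap_snp_sl_100476 1 1 +")
print(solcap_to_snp(example))
```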

  1. Data set metadata

Dataset title: Wild tomato genome annotation

Dataset description: The genome of the stress-tolerant wild tomato species Solanum pennellii (Bolger et al. 2014).

Download URL: ftp://ftp.solgenomics.net/genomes/Solanum_pennellii/spenn_v2.0_gene_models_annot.gff

License: ?

Release/version: 2.0 (not official but deduced from the file name)

Release issue date: 27-08-2014

Distribution format: GFF3 (according to gff-version pragma)

MD5 checksum: 71158bb0bf7bb323644c52c0fde37dc8

  1. Data record metadata

Example:


Spenn-ch01	AUGUSTUS	gene	6838142	6841569	0.2	+	.	ID=Sopen01g006750;Name=Sopen01g006750;
Spenn-ch01	AUGUSTUS	mRNA	6838142	6841569	0.2	+	.	ID=Sopen01g006750.1;Name=Sopen01g006750.1;Parent=Sopen01g006750;Note=Member of the R2R3 factor gene family. | myb domain protein 16 (MYB16) | CONTAINS InterPro DOMAIN/s: SANT, DNA-binding , Homeodomain-like , Myb, DNA-binding , Homeodomain-related , Myb transcription factor , HTH transcriptional regulator, Myb-type, DNA-binding | BEST Arabidopsis thaliana protein match is: myb domain protein 106;
Spenn-ch01	AUGUSTUS	exon	6838142	6838355	.	+	.

GFF3 files are nine-column, tab-delimited, plain text files:

Column 1 "seqid": chromosome numbers (e.g. Spenn-ch00..ch12), mandatory

  • Note: The chromosomes still need to be linked to ENA/GenBank. Apparently, there are three S. pennellii genome assemblies in ENA.

Column 2 "source": data source (constant: AUGUSTUS gene predictor), mandatory

Column 3 "type": feature types (gene, mRNA, CDS, exon, intron but no five_prime_UTR or three_prime_UTR), mandatory

Column 4 "start": start coordinate of the feature, mandatory

Column 5 "end": end coordinate of the feature, mandatory

Column 6 "score": values between 0 and 4 for gene, mRNA, CDS and intron features but '.' for exons

Column 7 "strand": DNA strandedness (+/-), mandatory

Column 8 "phase": for CDS features, the number of bases (0, 1 or 2) to skip from the start of the feature to reach the first base of the next codon; '.' is used for other features (e.g. the exon in the example above), mandatory

Column 9 "attributes": contains key=value pairs separated by ;

  • ID: unique feature ID (e.g. Sopen01g006750 or Sopen01g006750.1) but CDS/exon/intron IDs are prefixed with feature type (e.g. cds:Sopen01g006750.1.1)!

  • Name: same as ID (redundant)

  • Parent: refers to the parent ID of this (child) feature, indicates part-of relation (e.g. to group transcripts into genes or exons into transcripts)

  • Note: function annotation of transcripts but without reference to e.g. UniProtKB, InterPro or GO term accessions/IDs
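
Since only some IDs carry a feature-type prefix, it can help to normalize IDs before following Parent links. The sketch below strips an optional `type:` prefix and groups child IDs under their parents. The helper names are illustrative, and the Parent value of the CDS entry in the sample data is a hypothetical illustration rather than a value taken from the file.

```python
from collections import defaultdict


def strip_type_prefix(feature_id):
    """Remove a leading 'type:' prefix (e.g. 'cds:') if present."""
    _prefix, separator, rest = feature_id.partition(":")
    return rest if separator else feature_id


def group_by_parent(attribute_dicts):
    """Map each normalized parent ID to the normalized IDs of its children."""
    children = defaultdict(list)
    for attributes in attribute_dicts:
        parent = attributes.get("Parent")
        if parent is not None:
            children[strip_type_prefix(parent)].append(
                strip_type_prefix(attributes["ID"]))
    return dict(children)


features = [
    {"ID": "Sopen01g006750", "Name": "Sopen01g006750"},
    {"ID": "Sopen01g006750.1", "Name": "Sopen01g006750.1",
     "Parent": "Sopen01g006750"},
    # Hypothetical Parent value, for illustration only.
    {"ID": "cds:Sopen01g006750.1.1", "Parent": "Sopen01g006750.1"},
]
print(group_by_parent(features))
```

The same `strip_type_prefix` helper would also apply to the prefixed ITAG2.4 IDs such as gene:Solyc00g005000.2.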

  1. Data set metadata

Dataset title: Wild tomato SGN genetic markers

Dataset description: The dataset contains SGN genetic markers for S.pennellii (in total 2225).

Download URL: ftp://ftp.solgenomics.net/genomes/Solanum_pennellii/sgnMarkersSpenn.gff3

License: ?

Release/version: ?

Release issue date: 08-10-2014

Distribution format: GFF (non-standard, gff-version pragma missing)

MD5 checksum: 047c9b3b8ed0bd0f813bd897e502529b
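
Whether a file declares its format can be tested by inspecting its first line for the `##gff-version 3` pragma. This is a minimal sketch; the function name `has_gff3_pragma` is an illustrative assumption, and the check says nothing about whether the rest of the file complies with GFF3.

```python
def has_gff3_pragma(path):
    """Return True if the first line of the file is a '##gff-version 3' pragma."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        parts = handle.readline().split()
    return (len(parts) >= 2
            and parts[0] == "##gff-version"
            and parts[1].startswith("3"))
```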

  1. Data record metadata

Example:

Spenn-ch12      sgn_markers     match   2621812 2622049 .       +       .       Alias=SGN-M1347;ID=T0028;Note=marker name(s): T0028 SGN-M1347 |identity=99.58|escore=2e-126

Column 1 "seqid": chromosome numbers (e.g. Spenn-ch00..ch12), mandatory

Column 2 "source": data source (constant: sgn_markers), mandatory

Column 3 "type": feature type match (although variant would be more correct)

Column 4 "start": start coordinate of the feature, mandatory

Column 5 "end": end coordinate of the feature, mandatory

Column 6 "score": not relevant

Column 7 "strand": not relevant

Column 8 "phase": not relevant

Column 9 "attributes": contains key=value pairs separated by ;

  • ID: unique marker ID (e.g. T0028), mandatory

    • Note: 25 duplicate markers with the same ID were found (e.g. P1).

  • Alias: alternative name/ID for the marker (e.g. SGN-M1347)

  • Note: contains both ID and Alias (redundant), followed by ill-formatted, pipe-delimited pairs |identity=...|escore=...
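
The pipe-delimited pairs in the Note attribute and the duplicate IDs can both be handled with the standard library. The sketch below assumes that every Note follows the pattern of the example record (marker names separated by whitespace, then `|key=value` pairs); the function names are illustrative.

```python
from collections import Counter

NOTE_PREFIX = "marker name(s):"


def parse_marker_note(note):
    """Split a Note value into marker names and the trailing key=value pairs."""
    head, *extras = note.split("|")
    names = head.replace(NOTE_PREFIX, "").split()
    metrics = dict(item.split("=", 1) for item in extras if "=" in item)
    return names, metrics


def duplicate_ids(ids):
    """Return a dict of IDs that occur more than once, with their counts."""
    return {marker_id: count
            for marker_id, count in Counter(ids).items() if count > 1}


# Note value of the example record above.
names, metrics = parse_marker_note(
    "marker name(s): T0028 SGN-M1347 |identity=99.58|escore=2e-126")
print(names, float(metrics["identity"]), float(metrics["escore"]))
```

Applying `duplicate_ids` to the ID values of all records in the file would list the repeated marker IDs mentioned above.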