SGN tomato data description
- Data set metadata
Dataset title: Tomato gene models
Dataset description: The ITAG2.4 release of the official Solanum lycopersicum (cultivar Heinz 1706) genome annotation with 34,725 gene models, available from the Sol Genomics Network (SGN).
Download URL: ftp://ftp.solgenomics.net/genomes/Solanum_lycopersicum/annotation/ITAG2.4_release/ITAG2.4_gene_models.gff3
License: ?
Release/version: ITAG2.4 genome annotation (based on SL2.50 genome assembly)
Release issue date: 23-02-2014 (DD-MM-YYYY)
Distribution format: GFF3 (according to the gff-version pragma)
- Note: Strictly speaking, the GFF file does not comply with this specification.
MD5 checksum: 4bf947efde8b0f8101e3bcb9746e5986
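A download can be checked against the MD5 checksum listed above. The following is a minimal sketch, not part of the original description; the function names `md5_of_file` and `matches_expected` and the local file path are assumptions chosen for illustration.

```python
import hashlib

# Checksum as listed in the data set metadata for ITAG2.4_gene_models.gff3.
EXPECTED_MD5 = "4bf947efde8b0f8101e3bcb9746e5986"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_MD5):
    """True when the file's MD5 digest equals the expected value."""
    return md5_of_file(path) == expected.lower()
```

For example, `matches_expected("ITAG2.4_gene_models.gff3")` would be called on a local copy of the file (hypothetical path); the same helper works for the other data sets when their checksum is passed as `expected`.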
- Data record metadata
Example:
SL2.50ch00 ITAG_eugene gene 16437 18189 . + . Alias=Solyc00g005000;ID=gene:Solyc00g005000.2;Name=Solyc00g005000.2;from_BOGAS=1;length=1753
SL2.50ch00 ITAG_eugene mRNA 16437 18189 . + . ID=mRNA:Solyc00g005000.2.1;Name=Solyc00g005000.2.1;Note=Aspartic proteinase nepenthesin I (AHRD V1 **-- A9ZMF9_NEPAL)%3B contains Interpro domain(s) IPR001461 Peptidase A1 ;Ontology_term=GO:0006508;Parent=gene:Solyc00g005000.2;from_BOGAS=1;interpro2go_term=GO:0006508;length=1753;nb_exon=2
SL2.50ch00 ITAG_eugene exon 16437 17275 . + . ID=exon:Solyc00g005000.2.1.1;Parent=mRNA:Solyc00g005000.2.1;from_BOGAS=1
GFF3 files are nine-column, tab-delimited, plain text files:
Column 1 "seqid": chromosome numbers (e.g. SL2.50ch00..ch12), mandatory
- Note: Link the chromosomes to ENA/GenBank accessions (e.g. SL2.50ch01 -> CM001064.2).
Column 2 "source": data source (constant: ITAG_eugene, refers to Eugene gene predictor), mandatory
Column 3 "type": feature types (gene, mRNA, CDS, exon, intron, five_prime_UTR, three_prime_UTR), mandatory
Column 4 "start": start coordinate of the feature, mandatory
Column 5 "end": end coordinate of the feature, mandatory
Column 6 "score": not available (.)
Column 7 "strand": DNA strandedness (+/-), mandatory
Column 8 "phase": for CDS or exon features, indicates where the feature begins relative to the reading frame (0, 1 or 2); '.' is used for all other feature types; mandatory
Column 9 "attributes": contains key=value pairs separated by ;
- ID: unique feature ID (redundantly) prefixed with feature type (e.g. gene:Solyc00g005000.2 or mRNA:Solyc00g005000.2.1) and (inconsistently) used in the non-prefixed form at the web front-end.
- Name: the non-prefixed form of feature ID (e.g. Solyc00g005000.2 or Solyc00g005000.2.1)
- Parent: refers to the parent ID of this (child) feature, indicates part-of relation (e.g. to group transcripts into genes or exons into transcripts)
- Note: function annotation of transcripts based on homology to (plant) proteins in UniProtKB, domains/motifs in InterPro
- Ontology_term, interpro2go_term or Sifter_term: cross-references to Gene Ontology term IDs
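The nine-column layout and the key=value attributes described above can be read with a few lines of Python. This is a minimal sketch under stated assumptions: the names `parse_attributes` and `parse_gff3_line` are hypothetical, the input is assumed to be tab-delimited as in the downloaded file, and multi-valued (comma-separated) attributes are left as a single string.

```python
from urllib.parse import unquote

GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")


def parse_attributes(field):
    """Split column 9 on ';' into a dict and percent-decode each value."""
    attributes = {}
    for pair in field.strip().split(";"):
        if not pair.strip():
            continue  # tolerate empty pairs such as a trailing ';'
        key, _, value = pair.partition("=")
        attributes[key.strip()] = unquote(value.strip())
    return attributes


def parse_gff3_line(line):
    """Parse one GFF3 record; return None for comments, pragmas and blank lines."""
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, found {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    record["attributes"] = parse_attributes(record["attributes"])
    return record


# Exon record from the example above, with tabs restored between columns.
example = ("SL2.50ch00\tITAG_eugene\texon\t16437\t17275\t.\t+\t.\t"
           "ID=exon:Solyc00g005000.2.1.1;Parent=mRNA:Solyc00g005000.2.1;from_BOGAS=1")
exon = parse_gff3_line(example)
print(exon["type"], exon["start"], exon["end"], exon["attributes"]["Parent"])
```

Values are decoded only after splitting on `;`, so an escaped separator such as `%3B` inside a Note stays within its value.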
- Data set metadata
Dataset title: Tomato SGN genetic markers
Dataset description: The original dataset contains alignments to SGN unigenes, SGN marker sequences and SGN locus sequences. Only SGN markers are imported (in total 5077 ITAG_sgn_markers).
Download URL: ftp://ftp.solgenomics.net/genomes/Solanum_lycopersicum/annotation/ITAG2.4_release/ITAG2.4_sgn_data.gff3
License: ?
Release/version: ITAG2.4
Release issue date: 23-02-2014
Distribution format: GFF3 (according to the gff-version pragma)
MD5 checksum: 939cf6f468eab5572653b626d5078aaa
- Data record metadata
Example:
SL2.50ch00 ITAG_sgn_markers match 3999461 4000061 0.989 - . Alias=SGN-M676;ID=gene1_0-i2;Name=SSR3;Note=marker name(s): SSR3%2C SGN-M676;Target=SGN-M676 1 601 +
Column 1 "seqid": chromosome numbers (e.g. SL2.50ch00..ch12), mandatory
Column 2 "source": data source (constant: ITAG_sgn_markers), mandatory
Column 3 "type": feature type, only match is used (although variant would be more correct)
Column 4 "start": start coordinate of the feature, mandatory
Column 5 "end": end coordinate of the feature, mandatory
Column 6 "score": not relevant
Column 7 "strand": not relevant
Column 8 "phase": not relevant
Column 9 "attributes": contains key=value pairs separated by ;
- ID: unique marker ID (e.g. gene1_0-i2), mandatory
- Name: marker name as known in the literature? (e.g. cLER-14-H18), mandatory
- Alias: alternative name for the marker (e.g. SGN-M2995), mandatory
- Note: concatenation of both Name and Alias values (redundant)
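The redundancy of the Note attribute can be checked per record. The sketch below assumes the Note format seen in the example record ("marker name(s): " followed by names separated by a percent-encoded comma); the function names are hypothetical and other records may deviate from this format.

```python
from urllib.parse import unquote

NOTE_PREFIX = "marker name(s):"


def marker_names_from_note(note):
    """Return the marker names listed in a Note value, in order."""
    text = unquote(note).strip()  # '%2C' becomes ','
    if text.startswith(NOTE_PREFIX):
        text = text[len(NOTE_PREFIX):]
    return [name.strip() for name in text.split(",") if name.strip()]


def note_is_redundant(attributes):
    """True when the Note only repeats the Name and Alias values."""
    names = set(marker_names_from_note(attributes.get("Note", "")))
    return names == {attributes.get("Name"), attributes.get("Alias")}


# Attributes of the example record above (Target omitted).
example_attributes = {
    "Alias": "SGN-M676",
    "ID": "gene1_0-i2",
    "Name": "SSR3",
    "Note": "marker name(s): SSR3%2C SGN-M676",
}
print(marker_names_from_note(example_attributes["Note"]))
print(note_is_redundant(example_attributes))
```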
- Data set metadata
Dataset title: Tomato SolCAP genetic markers
Dataset description: The dataset contains SolCAP genetic markers (in total 8760 SNPs).
Download URL: ftp://ftp.solgenomics.net/genomes/Solanum_lycopersicum/annotation/ITAG2.4_release/ITAG2.4_solCAP.gff3
License: ?
Release/version: ITAG2.4
Release issue date: 11-07-2014
Distribution format: GFF3 (according to the gff-version pragma)
MD5 checksum: 6d1e291acfa8f20cf89438e521315c80
- Data record metadata
Example:
SL2.50ch00 ITAG_sgn_markers match 16728330 16728330 . + . ID=solcap_snp_sl_100476;Name=solcap_snp_sl_100476;Alias=solcap_snp_sl_100476;Note=marker name(s): solcap_snp_sl_100476;Target=solcap_snp_sl_100476 1 1 +
Column 1 "seqid": chromosome numbers (e.g. SL2.50ch00..ch12), mandatory
Column 2 "source": data source (constant: ITAG_sgn_markers), mandatory
Column 3 "type": feature type match (although variant would be more correct)
Column 4 "start": start coordinate of the feature, mandatory
Column 5 "end": end coordinate of the feature, mandatory
Column 6 "score": not relevant
Column 7 "strand": not relevant
Column 8 "phase": not relevant
Column 9 "attributes": contains key=value pairs separated by ;
- ID: unique marker ID (e.g. solcap_snp_sl_100476), mandatory
- Name, Alias and Note: same as ID (redundant)
- Data set metadata
Dataset title: Wild tomato genome annotation
Dataset description: The genome of the stress-tolerant wild tomato species Solanum pennellii (Bolger et al. 2014).
Download URL: ftp://ftp.solgenomics.net/genomes/Solanum_pennellii/spenn_v2.0_gene_models_annot.gff
License: ?
Release/version: 2.0 (not official but deduced from the file name)
Release issue date: 27-08-2014
Distribution format: GFF3 (according to the gff-version pragma)
MD5 checksum: 71158bb0bf7bb323644c52c0fde37dc8
- Data record metadata
Example:
Spenn-ch01 AUGUSTUS gene 6838142 6841569 0.2 + . ID=Sopen01g006750;Name=Sopen01g006750;
Spenn-ch01 AUGUSTUS mRNA 6838142 6841569 0.2 + . ID=Sopen01g006750.1;Name=Sopen01g006750.1;Parent=Sopen01g006750;Note=Member of the R2R3 factor gene family. | myb domain protein 16 (MYB16) | CONTAINS InterPro DOMAIN/s: SANT, DNA-binding , Homeodomain-like , Myb, DNA-binding , Homeodomain-related , Myb transcription factor , HTH transcriptional regulator, Myb-type, DNA-binding | BEST Arabidopsis thaliana protein match is: myb domain protein 106;
Spenn-ch01 AUGUSTUS exon 6838142 6838355 . + .
GFF3 files are nine-column, tab-delimited, plain text files:
Column 1 "seqid": chromosome numbers (e.g. Spenn-ch00..ch12), mandatory
- Note: Link the chromosomes to ENA/GenBank. Apparently, there are three S. pennellii genome assemblies in ENA.
Column 2 "source": data source (constant: AUGUSTUS, refers to the AUGUSTUS gene predictor), mandatory
Column 3 "type": feature types (gene, mRNA, CDS, exon, intron but no five_prime_UTR or three_prime_UTR), mandatory
Column 4 "start": start coordinate of the feature, mandatory
Column 5 "end": end coordinate of the feature, mandatory
Column 6 "score": values between 0 and 4 for gene, mRNA, CDS and intron features but '.' for exons
Column 7 "strand": DNA strandedness (+/-), mandatory
Column 8 "phase": for CDS or exon features, indicates where the feature begins relative to the reading frame (0, 1 or 2); '.' is used for all other feature types; mandatory
Column 9 "attributes": contains key=value pairs separated by ;
- ID: unique feature ID (e.g. Sopen01g006750 or Sopen01g006750.1) but CDS/exon/intron IDs are prefixed with feature type (e.g. cds:Sopen01g006750.1.1)!
- Name: same as ID (redundant)
- Parent: refers to the parent ID of this (child) feature, indicates part-of relation (e.g. to group transcripts into genes or exons into transcripts)
- Note: function annotation of transcripts but without reference to e.g. UniProtKB, InterPro or GO term accessions/IDs
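Because the two annotation files prefix feature IDs differently (all ITAG IDs carry a type prefix, while only CDS/exon/intron IDs do in the S. pennellii file), a small normalization step can make the IDs comparable. The following is a sketch only: `strip_type_prefix` is a hypothetical helper, and the prefix list is an assumption that goes beyond the prefixes shown in the examples (gene:, mRNA:, exon:, cds:).

```python
# Assumed set of type prefixes; only gene, mRNA, exon and cds appear in the examples.
KNOWN_PREFIXES = ("gene", "mRNA", "CDS", "cds", "exon", "intron",
                  "five_prime_UTR", "three_prime_UTR")


def strip_type_prefix(feature_id):
    """Return the ID without a leading '<type>:' prefix, if one is present."""
    prefix, separator, rest = feature_id.partition(":")
    if separator and prefix in KNOWN_PREFIXES:
        return rest
    return feature_id


for feature_id in ("gene:Solyc00g005000.2",
                   "cds:Sopen01g006750.1.1",
                   "Sopen01g006750.1"):
    print(feature_id, "->", strip_type_prefix(feature_id))
```

Parent values would need the same treatment so that child features still resolve to their normalized parent IDs.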
- Data set metadata
Dataset title: Wild tomato SGN genetic markers
Dataset description: The dataset contains SGN genetic markers for S. pennellii (in total 2225).
Download URL: ftp://ftp.solgenomics.net/genomes/Solanum_pennellii/sgnMarkersSpenn.gff3
License: ?
Release/version: ?
Release issue date: 08-10-2014
Distribution format: GFF (non-standard, gff-version pragma missing)
MD5 checksum: 047c9b3b8ed0bd0f813bd897e502529b
- Data record metadata
Example:
Spenn-ch12 sgn_markers match 2621812 2622049 . + . Alias=SGN-M1347;ID=T0028;Note=marker name(s): T0028 SGN-M1347 |identity=99.58|escore=2e-126
Column 1 "seqid": chromosome numbers (e.g. Spenn-ch00..ch12), mandatory
Column 2 "source": data source (constant: sgn_markers), mandatory
Column 3 "type": feature type match (although variant would be more correct)
Column 4 "start": start coordinate of the feature, mandatory
Column 5 "end": end coordinate of the feature, mandatory
Column 6 "score": not relevant
Column 7 "strand": not relevant
Column 8 "phase": not relevant
Column 9 "attributes": contains key=value pairs separated by ;
- ID: unique marker ID (e.g. T0028), mandatory
  - Note: There are 25 duplicate markers found with the same ID (e.g. P1).
- Alias: alternative name/ID for the marker (e.g. SGN-M1347)
- Note: contains both ID and Alias (redundant) followed by ill-formatted/delimited pairs |identity=...|escore=...
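The pipe-delimited Note and the duplicate IDs can both be handled before import. This is a minimal sketch based only on the example record above; `split_marker_note` and `duplicated_ids` are hypothetical helper names, and it assumes the marker names are whitespace-separated and precede the first `|`.

```python
from collections import Counter


def split_marker_note(note):
    """Split a Note into (marker names, dict of trailing '|key=value' pairs)."""
    head, *extras = note.split("|")
    names = head.replace("marker name(s):", "", 1).split()
    pairs = {}
    for item in extras:
        key, separator, value = item.partition("=")
        if separator:
            pairs[key.strip()] = value.strip()
    return names, pairs


def duplicated_ids(ids):
    """Return {ID: count} for every ID that occurs more than once."""
    counts = Counter(ids)
    return {marker_id: n for marker_id, n in counts.items() if n > 1}


note = "marker name(s): T0028 SGN-M1347 |identity=99.58|escore=2e-126"
names, pairs = split_marker_note(note)
print(names, pairs)
```

`duplicated_ids` would be fed the ID attribute of every record in the file; the values in `pairs` stay strings, so any numeric conversion is left to the caller.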
ODEX4all