From b581358794c0cab8b6f2a742ddf9738f1ebf60fd Mon Sep 17 00:00:00 2001
From: Adrian Viehweger
Date: Wed, 22 Mar 2017 13:44:59 +0100
Subject: [PATCH] annotation.json schema: added "syn" (synonyms) field

---
 examples/import_ncbi_flu_ftp_dump.md |  3 +++
 zoo/schema/README.md                 | 40 +--------------------------------------
 zoo/schema/annotation.json           |  1 +
 3 files changed, 5 insertions(+), 39 deletions(-)
 create mode 100644 examples/import_ncbi_flu_ftp_dump.md

diff --git a/examples/import_ncbi_flu_ftp_dump.md b/examples/import_ncbi_flu_ftp_dump.md
new file mode 100644
index 0000000..6eb00e3
--- /dev/null
+++ b/examples/import_ncbi_flu_ftp_dump.md
@@ -0,0 +1,3 @@
+## make this an ipython notebook
+
+... that runs the corresponding import script.
\ No newline at end of file
diff --git a/zoo/schema/README.md b/zoo/schema/README.md
index 9b759f2..f1e70a2 100644
--- a/zoo/schema/README.md
+++ b/zoo/schema/README.md
@@ -1,41 +1,3 @@
 ## Schema
 
-zoo is a sequence centric data structure, which influences the database schema design. A sequence record comprises 4 main components or "fields":
-
-- sequence
-- metadata
-- relative
-- derivative
-
-Sequence information is at the center of zoo's functionalities. It is defined as a string of an arbitrary alphabet, typically RNA, DNA or Protein. As a consequence, each genome segment of a segmented virus such as Influenza A receives its own document, and is linked to the other segments of a given sample.
-Metadata describe the way a sequence "came to be known". Where was it sampled from, who by, from which host, through which sample preparation and sequencing methods?
-Relative information includes taxonomy, phylogeny and linked information. It addressed the question of how a given sequence string compares to others. Parts of phylogenetic trees or multiple sequence alignments are archived in this category.
-Derived information summarizes or reexpresses the information contained in the sequence, including annotations, minhashes and alternative encodings. Derived information is usually heavily dependent on the original sequence. For example, the annotation open reading frame (ORF) derives from the sequence's start and a stop codon position. By definition, derived sequence information does not by itself make any sense without the underlying raw sequence information.
-Note that all categories interact, e.g. we could use minhash signatures (derivative) to compare a sequence to other ones in the database, storing the top 5 closest sequence IDs (relative).
-
-## Working with schemas
-
-Since zoo is not restricted to any particular use case, some schemas are presented here to give an idea how data can be structured in zoo for different viruses and sets of sequences.
-
-Note that schemas are composable: The Influenza A virus schema can incorporate the (intentionally generic) annotation schema, if the latter suits the intended purpose.
-
-Note also that the base schema is "assumed" by some of the zoo package functions, so if you tinker with it (especially deleting keys) things might break. Which is ok, you can probably fix them.
-
-Access to the schemas:
-
-```
-import json
-from zoo import get_schema
-
-with open(get_schema('influenza_a_virus.json')) as infile:
-    iav = json.load(infile)
-with open(get_schema('annotation.json')) as infile:
-    anno = json.load(infile)
-
-iav['derivative']['annotation'].append(anno)
-```
-
-## Technicalities
-
-- representing "empty" in JSON, tl;dr: there is no standard (stackoverflow, 21120999)
-- ...
\ No newline at end of file
+For details on how to use and compose schemas, see the corresponding [wiki entry](https://github.com/viehwegerlib/zoo/wiki/Composing-schemas).
\ No newline at end of file
diff --git a/zoo/schema/annotation.json b/zoo/schema/annotation.json
index d04632b..fcaa6b6 100644
--- a/zoo/schema/annotation.json
+++ b/zoo/schema/annotation.json
@@ -2,6 +2,7 @@
     "end": null,
     "fuzzy": null,
     "id": null,
+    "syn": "",
     "name": "",
     "source": "",
     "start": null