Nested fields in tabular data - the VCF case #1116

janxkoci · 2022-10-25T23:05:49Z

Hi again,

I've read in the docs about the problems with converting nested JSON to the other (tabular) formats that Miller supports, and I realized: wait, I work with nested data in a tabular file format (TSV, in fact, with some header) on a daily basis. Of course, the two approaches you implemented are fine, but maybe this is interesting too...

So, I work in genomics, which is a field filled with tabular data (our favourite separator seems to be the Tab 🌠). In population genomics, the most important format is probably the Variant Call Format, or VCF.

VCF is basically a TSV with a header, plus a few neat tricks. For instance, the header has two "levels": most header lines start with ## and contain all kinds of metadata, while the last header line starts with a single # and contains the column names. It's easy to run grep -v '^##' to remove the metadata lines while keeping the column names.
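To make the two-level header concrete, here is a minimal Python sketch of the same idea. The function name `read_vcf_table` and the tiny sample input are made up for illustration; this is not a full VCF parser, just the "skip `##`, keep `#`" logic.

```python
import io

def read_vcf_table(handle):
    """Split a VCF-like text stream into metadata lines, column names, and rows."""
    meta, columns, rows = [], None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("##"):
            # First header level: free-form metadata
            meta.append(line)
        elif line.startswith("#"):
            # Second header level: the column names, minus the leading '#'
            columns = line[1:].split("\t")
        elif line:
            if columns is None:
                raise ValueError("data row found before the column-name line")
            rows.append(dict(zip(columns, line.split("\t"))))
    return meta, columns, rows

# Tiny made-up input, only for illustration
sample = (
    "##fileformat=VCFv4.3\n"
    "#CHROM\tPOS\tID\tREF\tALT\n"
    "20\t14370\trs6054257\tG\tA\n"
)
meta, columns, rows = read_vcf_table(io.StringIO(sample))
print(columns)
print(rows[0]["POS"])
```

Each row comes back as a plain dict keyed by column name, so the nested fields are still unparsed strings at this point.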

Nesting, right...

Another cool feature of the VCF format is nested fields. Basically, some fields have a secondary separator. This is the case for e.g. the ALT, INFO, and FORMAT fields. This means that for every genetic position (row) in the file, the INFO field holds a whole map of metadata, and the metadata included can even vary from position to position:

NS=3;DP=14;AF=0.5;DB;H2
NS=3;DP=11;AF=0.017

Of course, there are special tools to parse VCF and extract these tags, as we call them, but I think it's also easy enough to code something ad hoc in awk or perl by switching field separators. But wait, there is more!
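As a rough illustration of how little code the INFO field needs, here is a Python sketch (rather than awk or perl). The helper name `parse_info` is hypothetical, and storing bare flags such as DB as `True` is just one possible convention, not something the format dictates. Values are left as strings.

```python
def parse_info(info):
    """Split an INFO string into a dict; tags without '=' are treated as flags."""
    result = {}
    if info in ("", "."):
        # '.' is the usual placeholder for a missing value
        return result
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        # If there was no '=', sep is empty and the tag is a bare flag
        result[key] = value if sep else True
    return result

print(parse_info("NS=3;DP=14;AF=0.5;DB;H2"))
print(parse_info("NS=3;DP=11;AF=0.017"))
```

The two rows produce dicts with different key sets, which is exactly the "metadata varies from position to position" situation.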

Is that a - data frame?

I saved the best for last. The FORMAT field is actually a header. This last fixed column is followed by a variable number of sample-specific columns, each holding values for the tags listed in the FORMAT field, in the same order. Here is a FORMAT field followed by two samples, for two genetic positions (rows):

GT:GQ:DP:HQ  0|0:54:7:56,60  0|0:48:4:51,51
GT:GQ:DP     0/1:35:4        0/2:17:2

This basically makes every row of the VCF file a data frame of genotype information, for every sample at every given position!
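A small Python sketch of that "row as a data frame" view, using the two rows above. The function name `parse_genotypes` and the sample names `sample1` and `sample2` are made up for illustration; in a real file the sample names would come from the column-name line.

```python
def parse_genotypes(format_field, sample_fields, sample_names):
    """Pair the FORMAT keys with each sample's colon-separated values."""
    keys = format_field.split(":")
    table = {}
    for name, field in zip(sample_names, sample_fields):
        # zip stops at the shorter sequence, so a sample with fewer
        # values than keys simply ends up with fewer entries
        table[name] = dict(zip(keys, field.split(":")))
    return table

names = ["sample1", "sample2"]  # hypothetical sample names
row1 = parse_genotypes("GT:GQ:DP:HQ", ["0|0:54:7:56,60", "0|0:48:4:51,51"], names)
row2 = parse_genotypes("GT:GQ:DP", ["0/1:35:4", "0/2:17:2"], names)

print(row1["sample1"])
print(row2["sample2"])
```

Note that HQ in the first row is itself a comma-separated pair, so one more split would be needed to unpack it fully.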

Sure, the nesting will probably not get any deeper than that (some of the tags do have nested info in them, but sooner or later you run out of separators). But if we are talking about saving a Miller array into a column of a table, well, this is easy ☺️
