Hi again,

I've read in the docs about the problems with converting nested JSON to the other (tabular) formats that Miller supports, and I realized: wait, I work with nested data in a tabular file format (TSV in fact, with some header) on a daily basis. Of course, the two approaches you implemented are fine, but maybe this can be interesting too...
So, I work in genomics, which is a field filled with tabular data (our favourite separator seems to be the Tab 🌠). In population genomics, the most important format is probably the Variant Call Format, or VCF.
VCF is basically a TSV with a header, plus a few neat tricks. For instance, the header has two "levels": most header lines start with ## and contain all kinds of metadata. The last header line, though, starts with only a single #, and it contains the column names. It's easy to use grep -v '^##' to remove the metadata lines while keeping the column names.
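To make the two-level header concrete, here is a minimal Python sketch that separates the ## metadata lines, the single-# column line, and the data rows. The function name `split_vcf` and the example record are made up for illustration; only the header convention itself comes from the description above.

```python
def split_vcf(lines):
    """Split VCF text lines into (meta_lines, column_names, records)."""
    meta, columns, records = [], [], []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            # Metadata header line: keep it verbatim.
            meta.append(line)
        elif line.startswith("#"):
            # The last header line: drop the leading '#' and split on tabs.
            columns = line[1:].split("\t")
        elif line:
            # Data row: pair each tab-separated value with its column name.
            records.append(dict(zip(columns, line.split("\t"))))
    return meta, columns, records


# Made-up header and record, for illustration only.
example = [
    "##fileformat=VCFv4.3",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tG\tA\t.\tPASS\tNS=3;DP=14;AF=0.5;DB;H2",
]
meta, columns, records = split_vcf(example)
print(columns)
print(records[0]["INFO"])
```

This is the same idea as the grep one-liner, except that the metadata lines are kept around instead of being thrown away.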
Nesting, right...
Another cool feature of the VCF format is nested fields. Basically, some fields have a secondary separator. This is the case for, e.g., the ALT, INFO, and FORMAT fields (which use a comma, a semicolon, and a colon, respectively). This means that for every genetic position (row) in the file, the INFO field holds a whole map of metadata, and the metadata included can even vary from position to position:
NS=3;DP=14;AF=0.5;DB;H2
NS=3;DP=11;AF=0.017
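As a rough sketch of how such an INFO value turns into a map, assuming semicolon-separated items where key=value pairs carry a value and bare keys such as DB are flags (the helper name `parse_info` is hypothetical, and values are left as strings):

```python
def parse_info(info):
    """Parse a VCF INFO string into a dict; flags without '=' map to True."""
    result = {}
    if info in ("", "."):
        # '.' marks a missing value, so there is nothing to parse.
        return result
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        # partition() returns an empty separator when there is no '='.
        result[key] = value if sep else True
    return result


for info in ["NS=3;DP=14;AF=0.5;DB;H2", "NS=3;DP=11;AF=0.017"]:
    print(parse_info(info))
```

The two rows end up with different key sets, which is exactly the "varies from position to position" part.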
Of course, there are special tools to parse the VCF and extract these tags, as we call them, but I think it's also easy enough to code something ad-hoc in awk or perl, using variable separators. But wait, there is more!
Is that a - data frame?
Saving the best for last: the FORMAT field is actually a header. This last fixed column is then followed by a variable number of sample-specific columns, each repeating the values for the tags listed in the FORMAT field. Here is a format field followed by two samples, for two genetic positions (rows):
This basically makes every row of the VCF file a data frame of genotype information, for every sample at every given position!
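A minimal sketch of that per-row "data frame", assuming colon-separated tags in FORMAT and colon-separated values in each sample column. The function name `parse_genotypes`, the sample names, and the values are all hypothetical and are not the example from the original post:

```python
def parse_genotypes(format_field, sample_fields, sample_names):
    """Map each sample name to a dict of FORMAT tag -> value for one row."""
    keys = format_field.split(":")
    return {
        # zip() stops at the shorter side, so a sample column with fewer
        # values than FORMAT tags simply yields fewer entries.
        name: dict(zip(keys, field.split(":")))
        for name, field in zip(sample_names, sample_fields)
    }


# Hypothetical FORMAT column and two sample columns for a single position.
row = parse_genotypes(
    "GT:GQ:DP",
    ["0|0:48:1", "1|0:48:8"],
    ["sample1", "sample2"],
)
for name, tags in row.items():
    print(name, tags)
```

Each row gives one small table with samples as rows and FORMAT tags as columns.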
Sure, the nesting will probably not get any deeper than that (some of the tags do have nested info in them, but sooner or later you run out of separators). But if we are talking about saving a Miller array into a column of a table - well, this is easy ☺️