Info - zipped files #4

KristinaGagalova · 2020-07-20T19:49:42Z

Hi Alex,
Does sample work with zipped files? Large files are usually compressed.
The command is producing a binary file for me, and if I try <(zcat file.fq.gz) it gives me an error.
What command should I use in this case?
Thank you in advance for the reply.


alexpreynolds · 2020-07-20T20:28:08Z

Because sample requires two passes through the input file being sampled from, it is not possible to use a file stream (such as what comes out of <(zcat foo.fq.gz)). A file stream can only be read once.
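To illustrate why a second pass rules out pipes, here is a rough two-pass sampler in plain shell and awk. It is not how sample is implemented; the function name `two_pass_sample` and the selection-sampling approach are assumptions made purely for illustration. The first pass counts records and the second picks k of them, so the input has to be a regular file that can be opened twice.

```shell
# Hypothetical illustration only, not sample's actual algorithm.
# Usage: two_pass_sample FILE K   (FILE is an uncompressed 4-line-per-record FASTQ)
two_pass_sample() {
  file=$1
  k=$2

  # Pass 1: count the records (4 lines each).
  n=$(( $(wc -l < "$file") / 4 ))

  # Pass 2: reopen the file and keep each record with probability
  # (records still needed) / (records still remaining).
  awk -v n="$n" -v k="$k" '
    BEGIN { srand() }
    (NR - 1) % 4 == 0 {
      keep = (rand() * (n - seen) < k - taken)
      seen++
      if (keep) taken++
    }
    keep
  ' "$file"
}
```

Called as `two_pass_sample foo.tmp 1000`, it prints the selected records in their original order. Given a stream such as `<(zcat foo.fq.gz)` instead, the first pass would consume the input and the second would typically find nothing left to read.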

Native gzip itself has no ability to seek into the archive to get positions of newline characters. But adding support for file offsets into a block-gzipped file (like a bgzip-compressed file that gets used with tabix and other tools in the htslib family) might be a possibility. I'll take a look into that format some time and see if there is potential there to integrate support for bgzip-compressed files.

In the meantime, if you're running into memory errors using GNU shuf and you have sufficient disk space, you could do something like this:

$ LINES_PER_FQ_RECORD=4
$ gunzip -c foo.fq.gz > foo.tmp
$ sample -k ${NUM_SAMPLES} -l ${LINES_PER_FQ_RECORD} foo.tmp > foo.sample
$ rm foo.tmp
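A minimal sketch of the same workflow wrapped in a function, so the temporary file is removed even when decompression or sampling fails. The function name `sample_fastq_gz` and its argument order are made up for illustration; only the `-k` and `-l` options shown above are used.

```shell
# Hypothetical wrapper around the commands above.
# Usage: sample_fastq_gz IN.fq.gz NUM_SAMPLES OUT
sample_fastq_gz() {
  in_gz=$1
  num_samples=$2
  out=$3

  # Create a uniquely named temporary file for the decompressed reads.
  tmp=$(mktemp) || return 1

  if gunzip -c "$in_gz" > "$tmp"; then
    # 4 lines per FASTQ record, as in LINES_PER_FQ_RECORD above.
    sample -k "$num_samples" -l 4 "$tmp" > "$out"
    status=$?
  else
    status=1
  fi

  # Clean up regardless of whether the steps above succeeded.
  rm -f "$tmp"
  return "$status"
}
```

For example, `sample_fastq_gz foo.fq.gz "${NUM_SAMPLES}" foo.sample`. The temporary file holds the full uncompressed FASTQ, so wherever `mktemp` places it needs enough free space.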

There is a time and disk space cost in extracting your gzipped archive to a temporary file. So if you don't think you will run into memory issues with GNU shuf, you could use that tool instead by linearizing each FASTQ record onto a single line, sampling some number of records, and then "un"-linearizing the sample, i.e.:

$ gunzip -c foo.fq.gz | awk '{printf("%s%s",$0,((NR+1)%4==1?"\n":"\t"));}' | shuf | head -n ${NUM_SAMPLES} | tr '\t' '\n' > foo.sample

This will use system memory (possibly more than your host may have), but it avoids creating an intermediate temporary file.
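Before running the pipeline on real data, it can be sanity-checked on a toy input. The sketch below builds a three-record gzipped FASTQ in a scratch directory (the read names, sequences, and file names are invented for illustration), runs the same linearize, shuffle, and un-linearize steps, and prints the line count of the result.

```shell
# Build a tiny gzipped FASTQ (three made-up records) in a scratch directory.
workdir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nJJJJ\n@r3\nGGCC\n+\nKKKK\n' \
  | gzip -c > "$workdir/toy.fq.gz"

NUM_SAMPLES=2

# Join each 4-line record into one tab-separated line, shuffle the lines,
# keep NUM_SAMPLES of them, then split each back into four lines.
gunzip -c "$workdir/toy.fq.gz" \
  | awk '{ printf("%s%s", $0, (NR % 4 == 0 ? "\n" : "\t")) }' \
  | shuf \
  | head -n "$NUM_SAMPLES" \
  | tr '\t' '\n' > "$workdir/toy.sample"

# The sample should contain 4 lines per record, each record starting with "@".
wc -l < "$workdir/toy.sample"
awk 'NR % 4 == 1' "$workdir/toy.sample"
```

The line count should be four times NUM_SAMPLES. One caveat of using tabs as the join character: if any line in the FASTQ itself contains a tab, the final `tr` step would split that line and corrupt the record.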

Please let me know if you have any other questions about sample and I'll try to help.

alexpreynolds self-assigned this Jul 20, 2020

alexpreynolds added the enhancement label Jul 20, 2020

KristinaGagalova changed the title ~~Infor zipped files~~ Info - zipped files Jul 20, 2020
