Info - zipped files #4

Open
KristinaGagalova opened this issue Jul 20, 2020 · 1 comment

@KristinaGagalova

Hi Alex,
Does sample work with zipped files? Large files are usually compressed.
The command produces a binary file for me, and if I try <(zcat file.fq.gz) instead, it gives me an error.
What command should I use in this case?
Thank you in advance for the reply.

@alexpreynolds
Owner

Because sample requires two passes through the input file being sampled from, it is not possible to use a file stream (such as what comes out of <(zcat foo.fq.gz)). A file stream can only be read once.
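
To make the "read once" point concrete, here is a small sketch using only standard shell tools. It does not show how sample works internally; it only illustrates that a second pass over a pipe finds nothing left to read, while a regular file can simply be reopened. The variable names are made up for the illustration.

```shell
# A pipe delivers its bytes once: the first reader drains it,
# so the second reader hits end-of-file immediately.
stream_counts=$(printf 'a\nb\nc\n' | {
    first=$(wc -l)
    second=$(wc -l)
    echo "$((first)) $((second))"
})

# A regular file can be reopened, so two passes see the same data.
demo_file=$(mktemp)
printf 'a\nb\nc\n' > "$demo_file"
first=$(wc -l < "$demo_file")
second=$(wc -l < "$demo_file")
file_counts="$((first)) $((second))"
rm -f "$demo_file"

echo "stream: $stream_counts"
echo "file:   $file_counts"
```

The stream case reports 3 lines on the first pass and 0 on the second; the file case reports 3 both times.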

Native gzip itself has no ability to seek into the archive to get positions of newline characters. But adding support for file offsets into a block-gzipped file (like a bgzip-compressed file that gets used with tabix and other tools in the htslib family) might be a possibility. I'll take a look into that format some time and see if there is potential there to integrate support for bgzip-compressed files.

In the meantime, if you're running into memory errors using GNU shuf and you have sufficient disk space, you could do something like this:

$ NUM_SAMPLES=1000   # example value; set this to the number of records you want
$ LINES_PER_FQ_RECORD=4
$ gunzip -c foo.fq.gz > foo.tmp
$ sample -k "${NUM_SAMPLES}" -l "${LINES_PER_FQ_RECORD}" foo.tmp > foo.sample
$ rm foo.tmp
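
If you do this often, the extract, run, and clean-up steps can be wrapped in a small function. The following is only a sketch: run_on_decompressed is a hypothetical helper name, not part of sample, and it assumes the command you pass takes the input file as its last argument (as the sample call above does).

```shell
# Hypothetical helper (not part of sample):
#   run_on_decompressed GZ_FILE CMD [ARGS...]
# Decompresses GZ_FILE to a temporary regular file, runs CMD ARGS... with the
# temporary file appended as the last argument, then removes the file.
run_on_decompressed() {
    gz_file=$1
    shift
    tmp_file=$(mktemp) || return 1
    # Only run the command if decompression succeeded.
    gunzip -c "$gz_file" > "$tmp_file" && "$@" "$tmp_file"
    status=$?
    # Clean up whether or not the command succeeded.
    rm -f "$tmp_file"
    return "$status"
}

# Example call, assuming sample is installed and NUM_SAMPLES is set:
#   run_on_decompressed foo.fq.gz sample -k "${NUM_SAMPLES}" -l 4 > foo.sample
```

Because the command sees an ordinary file on disk, it can make as many passes over it as it needs.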

Extracting your gzipped archive to a temporary file costs time and disk space. So if you don't think you will run into memory issues with GNU shuf, you could use that tool instead by linearizing each FASTQ record onto a single tab-separated line, sampling some number of records, and then "un"-linearizing the sample, i.e.:

$ gunzip -c foo.fq.gz | awk '{printf("%s%s",$0,(NR%4==0?"\n":"\t"));}' | shuf | head -n "${NUM_SAMPLES}" | tr '\t' '\n' > foo.sample

This holds the whole linearized file in system memory (possibly more than your host has), but it avoids creating an intermediate temporary file.
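
The linearize, shuffle, and un-linearize steps can also be packaged as a function. This is a sketch under a few assumptions: fastq_shuf_sample is a hypothetical name, every record is exactly four lines, no line contains a tab character, and GNU shuf is available. It uses shuf -n to pick the records directly rather than piping through head.

```shell
# Hypothetical helper: print NUM_RECORDS randomly chosen FASTQ records
# from a gzipped FASTQ file.
#   fastq_shuf_sample GZ_FILE NUM_RECORDS
# Steps: decompress, join each 4-line record into one tab-separated line,
# pick NUM_RECORDS of those lines at random, then turn tabs back into newlines.
fastq_shuf_sample() {
    gz_file=$1
    num_records=$2
    gunzip -c "$gz_file" |
        awk '{ printf("%s%s", $0, (NR % 4 == 0 ? "\n" : "\t")) }' |
        shuf -n "$num_records" |
        tr '\t' '\n'
}

# Example call:
#   fastq_shuf_sample foo.fq.gz "${NUM_SAMPLES}" > foo.sample
```

Each record travels through shuf as a single line, so its header, sequence, separator, and quality lines stay together in the output.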

Please let me know if you have any other questions about sample and I'll try to help.

@alexpreynolds self-assigned this Jul 20, 2020
@KristinaGagalova changed the title from "Infor zipped files" to "Info - zipped files" Jul 20, 2020