This distribution is a parallel wrapper around the Arrow consensus framework within the SMRT Analysis Software. It consists of bash scripts, an example input fofn that shows how to list your bax.h5 files (give paths without the .1.bax.h5 suffix), and an example of how to launch the pipeline. The input can be either BAX.h5 or BAM files (only P6-C4 chemistry or newer), and the pipeline requires SMRTportal 3.1+. It can also run the older Quiver algorithm on P6-C4 chemistry data if requested in the CONFIG file.
The current pipeline is designed to run on the SGE or SLURM scheduling systems and has hard-coded grid resource request parameters, so you must edit arrow.sh to match your grid options. Running on other grid engines is possible in principle, but requires editing all shell scripts to replace SGE_TASK_ID with the appropriate variable for your grid environment, and replacing the qsub commands in arrow.sh with the equivalent submission commands.
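As a rough illustration of the SGE_TASK_ID substitution described above, the sketch below looks up the array-task index from whichever scheduler variable is set. The helper name get_task_id is hypothetical and not part of this pipeline. The variable names are the ones the respective schedulers normally export (SGE_TASK_ID for SGE, SLURM_ARRAY_TASK_ID for SLURM, PBS_ARRAY_INDEX or PBS_ARRAYID for PBS variants, LSB_JOBINDEX for LSF), but check your site's documentation before relying on them.

```shell
#!/usr/bin/env bash
# Sketch only: get_task_id is a hypothetical helper, not part of the pipeline.
# It prints the array-task index of the current job, or fails if none is set.
get_task_id() {
  # SGE sets SGE_TASK_ID to the literal string "undefined" for non-array jobs.
  if [ -n "${SGE_TASK_ID:-}" ] && [ "${SGE_TASK_ID}" != "undefined" ]; then
    echo "$SGE_TASK_ID"
  elif [ -n "${SLURM_ARRAY_TASK_ID:-}" ]; then
    echo "$SLURM_ARRAY_TASK_ID"
  elif [ -n "${PBS_ARRAY_INDEX:-}" ]; then
    echo "$PBS_ARRAY_INDEX"
  elif [ -n "${PBS_ARRAYID:-}" ]; then
    echo "$PBS_ARRAYID"
  # LSF reports index 0 for jobs that are not part of an array.
  elif [ -n "${LSB_JOBINDEX:-}" ] && [ "${LSB_JOBINDEX}" != "0" ]; then
    echo "$LSB_JOBINDEX"
  else
    echo "no array task index found in the environment" >&2
    return 1
  fi
}

# Possible use inside a task script (assumption, adapt to the real scripts):
#   jobid=$(get_task_id) || exit 1
```

With a helper like this, the task scripts would read the index from one place, so only the submission commands in arrow.sh would still need scheduler-specific edits.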
This branch doesn't run BLASR but instead uses minimap2 + pbbamify to make Arrow-compatible BAM files. This allows much faster alignment and use of references >4gb. However, we've done limited testing to date, so the final consensus quality may not be as high as with BLASR.
To run the pipeline you need to:
- Have a working SMRT Analysis Software installation, configured so the tools are in your path.
- Have minimap2, pbbamify (from the pbbam package), and samtools in your path.
- Create the input.fofn file, which lists the SMRTcells you want to use for Arrow. For h5 files, specify the full path excluding the .[1-3].bax.h5 suffix; the three files are treated as a single SMRTcell. For BAM files, specify the full path (including subreads.bam).
- Run the pipeline, specifying the input file, a prefix for the outputs, and the path to the reference fasta. Optionally, you can also specify the path to a Canu seqStore readNames.txt file if you used trio binning and want to use only classified reads for polishing.
sh arrow.sh input.fofn trio3 trio3.contigs.fasta
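The input.fofn rules above can be sketched as a small shell function. This is an illustration only: make_fofn is a hypothetical helper that does not ship with the pipeline, and it assumes you pass it full (absolute) paths, since it does not resolve relative ones. It strips the .[1-3].bax.h5 suffix so the three parts of a SMRTcell collapse into one entry, and passes subreads.bam paths through unchanged.

```shell
#!/usr/bin/env bash
# Sketch only: make_fofn is a hypothetical helper, not part of the pipeline.
# Arguments: full paths to *.bax.h5 and/or *subreads.bam files.
# Output: one fofn entry per SMRTcell on stdout.
make_fofn() {
  local f
  for f in "$@"; do
    case "$f" in
      *.[1-3].bax.h5)
        # Drop the .N.bax.h5 suffix; the three parts share one prefix.
        echo "${f%.[1-3].bax.h5}"
        ;;
      *subreads.bam)
        # BAM entries keep the full file name, including subreads.bam.
        echo "$f"
        ;;
      *)
        echo "skipping unrecognized file: $f" >&2
        ;;
    esac
  done | awk '!seen[$0]++'   # de-duplicate while keeping the input order
}

# Possible use (the directory layout here is an assumption):
#   make_fofn /full/path/to/cells/*/*.bax.h5 > input.fofn
```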
The pipeline is very rough and has undergone limited testing, so user beware.
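Given the limited testing, it may be worth confirming the prerequisites before submitting anything to the grid. The sketch below is one way to do that; check_tools is a hypothetical helper, not something the pipeline provides, and it relies only on the standard `command -v` lookup.

```shell
#!/usr/bin/env bash
# Sketch only: check_tools is a hypothetical helper, not part of the pipeline.
# Reports every tool that is missing from PATH on stderr and
# returns non-zero if at least one is missing.
check_tools() {
  local tool missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing from PATH: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Possible use before launching arrow.sh:
#   check_tools minimap2 pbbamify samtools || exit 1
```

The SMRT Analysis executables can be added to the same call; their names depend on your installation, so they are left out here.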
If you find this pipeline useful, please cite the original Quiver paper:
Chin et al. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nature Methods. (2013).
and the Canu paper:
Koren S et al. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Research. (2017).