Skip to content

Simulating long reads

Aaron McKenna edited this page Aug 1, 2024 · 8 revisions

Simulating long reads

Simulation of Nanopore reads

Simlulate the underlying sequences

The underlying sequence and barcodes are combined in: simulate_reads_2024_07_31.ipynb

Barcodes from freebarcodes, using the 15 nucleotide barcodes with two-base error correction.

Creating nanopore errors

We generally want full-length sequences, as this is what we get from amplicon sequencing:

badread simulate --length 15000,100 --reference test_basic_long_read_no_error.fa --quantity 50x | gzip > reads.fastq.gz

results:

Generating fragment lengths from a gamma distribution:
  mean  =  15000 bp      parameters:
  stdev =  13000 bp        k (shape)     = 1.3314e+00
  N50   =  22622 bp        theta (scale) = 1.1267e+04
  │  ▖▌▌▌▌▌▖▖▖
  │ ▖▌▌▌▌▌▌▌▌▌▌▌▖▖
f │ ▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▖▖
r │▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▖▖
a │▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▖▖▖
g │▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▖▖▖▖
s │▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▖▖▖▖▖▖
  │▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▖▖▖▖▖▖▖▖▖▖▖▖▖▖▖▖▖▖▖
  ├─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
  │               ▖▖▖▌▌▌▌▌▌▌▌▌▖▖▖▖▖
  │           ▖▖▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▖▖▖
b │         ▖▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▖▖▖▖
a │       ▖▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▖▖▖
s │     ▖▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▖▖▖▖
e │    ▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▖▖▖▖▖
s │  ▖▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▖▖▖▖▖▖▖▖
  │ ▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▖▖▖▖▖▖▖
  ├─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
  0         6800      13600     20400     27200     34000     40800     47600     54400     61200     68000

Generating read identities from a beta distribution:
  mean  =  95%      shape parameters:
  max   =  99%        alpha = 5.7384e+01
  stdev = 2.5%        beta  = 2.4162e+00
  │                                                                                           ▖▌▌▖
  │                                                                                          ▌▌▌▌▌
  │                                                                                        ▖▌▌▌▌▌▌▌
  │                                                                                       ▖▌▌▌▌▌▌▌▌
  │                                                                                      ▌▌▌▌▌▌▌▌▌▌▌
  │                                                                                    ▖▌▌▌▌▌▌▌▌▌▌▌▌
  │                                                                                 ▖▖▌▌▌▌▌▌▌▌▌▌▌▌▌▌
  │                                                                           ▖▖▖▖▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌▌
  ├─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
  50        55        60        65        70        75        80        85        90        95        100

Align sequences to the genome

minimap2 -a minimap_alignment_reference_2024_07_31.fa reads.fastq.gz > 2024_07_31_aligned_simulated_reads.sam

Convert to BAM, sort, index

samtools view -b -h 2024_07_31_aligned_simulated_reads.sam > 2024_07_31_aligned_simulated_reads.bam
samtools sort 2024_07_31_aligned_simulated_reads.bam > 2024_07_31_aligned_simulated_reads_sorted.bam
samtools index 2024_07_31_aligned_simulated_reads_sorted.bam

Simulate non-error prone reads

After running ipython notebook:

gzip simulation/test_basic_long_read_no_error.fq

Align sequences to the genome

minimap2 -a minimap_alignment_reference_2024_07_31.fa test_basic_long_read_no_error.fq.gz > 2024_07_31_aligned_perfect_reads.sam

Convert to BAM, sort, index

samtools view -b -h 2024_07_31_aligned_perfect_reads.sam > 2024_07_31_aligned_perfect_reads.bam
samtools sort 2024_07_31_aligned_perfect_reads.bam > 2024_07_31_aligned_perfect_reads_sorted.bam
samtools index 2024_07_31_aligned_perfect_reads_sorted.bam

Simulate non-error prone reads with three barcodes

After running ipython notebook:

gzip simulation/test_basic_long_read_no_error_3code.fq

Align sequences to the genome

minimap2 -a minimap_alignment_reference_2024_07_31.fa test_basic_long_read_no_error_3code.fq.gz > 2024_07_31_aligned_perfect_reads_tripple.sam

Convert to BAM, sort, index

samtools view -b -h 2024_07_31_aligned_perfect_reads_tripple.sam > 2024_07_31_aligned_perfect_reads_tripple.bam
samtools sort 2024_07_31_aligned_perfect_reads_tripple.bam > 2024_07_31_aligned_perfect_reads_tripple_sorted.bam
samtools index 2024_07_31_aligned_perfect_reads_tripple_sorted.bam