Skip to content

Binary database format

Aaron McKenna edited this page Apr 9, 2019 · 1 revision

Binary file format

The tool uses a custom binary format that compresses the genome hits for a target sequences into binary values. Values are stored as using Java's default big-endian format. The database itself contains all the packed long values in a compressed file, using htslib's block-compressed file readers and writers. The header file sits alongside the database and provides a lookup table for the target database.

The block format in the database is:

  • block type: currently 0 for linear, 1 for sub-indexed (long)
  • if indexed, for the set index size, a series of long values that describe the size and location of sub-indexes
  • After any header, a series of targets, with X position encodings, set by the count number in the target encoding

The header format:

  • 64 bit magic value (long)

  • 64 bit version number (long)

  • 64 bit internal enzyme number (long, see StandardScanParameters.scala for the enumerated enzymes)

  • 64 bit number of bins (long)

  • for the number of bins in the header:

    • 64 bit offset of this block in the file (long)
    • 64 bit size of this block in the file (in uncompressed bytes, long)
    • 64 bit number of targets contained within the bin (int)
  • for each contig in the input reference file:

    • the contig name, followed by a '=' and it's index position in the reference

For lookup we perform something like the following: FlashFrySearch