Skip to content

mklarqvist/StormBitmaps

Repository files navigation

Build Status Build Status Github Releases License

Storm bitmaps

These algorithms and bitmaps are used to compute XXT for a binary input matrix with dimensions (N,M) using specialized CPU instructions i.e. POPCNT, SSE4.2, AVX2, AVX512BW, NEON. This is equivalent to computing the all-vs-all set intersection cardinality (|Xi ∩ XTj|) for pairs of symmetric integer sets. These algorithms are fast in the worst case and extremely fast when the input matrix is sparse.

screenshot

Using large registers (AVX-512BW), contiguous and aligned memory, and cache-aware blocking, we can achieve ~114 GB/s (~0.2 CPU cycles / 64-bit word) of sustained throughput (~14 billion 64-bit bitmaps / second or up to ~912 billion implicit integers / second) using STORM_contiguous_t when the input data is small (N < 256,000). When input data is large, we can achieve around 0.4-0.6 CPU cycles / 64-bit word using STORM_t while using considerably less memory. Both of these models make use of scalar-bitmap or scalar-scalar comparisons when the data density is small. Storm selects the optimal memory alignment and subroutines given the available SIMD instruction at run-time by using libalgebra.

The core algorithms are described in the papers:

Performance

All performance tests were run on a host machine with a 10 nm Cannon Lake Core i3-8121U with gcc (GCC) 8.2.1 20180905 (Red Hat 8.2.1-3). Detailed benchmarking requires the Linux perf subsystem. In all the examples below we do not output the result matrix but instead report its total sum. This was done to restrict our measurements to algorithmic performance while being unaffacted by disk I/O.

Small input matrix

Sample performance metrics (practical upper limit) for dense matrices using a host machine with AVX512BW available. We simulate many data arrays in aligned memory and compute the upper triangular or XXT using the command benchmark 65536 10000

Set bits CPU cycles / 64-bit word MB/s
32768 0.209 109915
16384 0.21 113591
6553 0.21 114524
2621 0.21 114256
1310 0.21 114625
655 0.21 114709
262 0.21 114659
65 0.21 114390
13 0.21 114726
5 0.21 114574
1 0.21 114457

Large input matrix

Next we simulated 10,000 vectors with [0,262144] random bits set for each to form a (10000,524288)-dimension matrix with different data densities. The following (misleading) figures represent CPU cycles / 64-bit word equivalent if the analysis was run in uncompressed bitmap space. We compare our simple approach to Roaring bitmaps demonstrating we can achieve performance parity on large bitmaps:

Set bits Storm-CPU-cycles Roaring-CPU-cycles
262144 0.483 0.567
131072 0.477 0.567
52428 0.474 0.567
20971 3.431 3.342
10485 1.765 1.756
5242 0.935 0.945
2097 0.43 0.451
524 0.19 0.196
104 0.117 0.133
5 0.017 0.029
1 0.003 0.004

API

The interface is found in the file storm.h.

#include "storm.h"

int intcmp(const void* aa, const void* bb) {
    const uint32_t* a = aa, *b = bb;
    return (*a < *b) ? 0 : (*a > *b);
}

int main() {
    uint32_t n_vectors = 10000; // number of rows
    uint32_t n_width = 1000; // number of columns
    uint32_t n_bits_set = 128; // number of bits set per row

    STORM_t* storm = STORM_new(); // STORM model
    STORM_contiguous_t* storm_cont = STORM_contig_new(n_width); // STORM contiguous memory
    uint32_t* vals = malloc(n_bits_set*sizeof(uint32_t)); // array for random values

    // Foreach vector (row)
    for (int i = 0; i < n_vectors; ++i) {
        // Generate random bits for each row
        for (int j = 0; j < n_bits_set; ++j) {
            vals[j] = rand() % n_width; // draw random number
        }
        // Input data must be sorted.
        qsort(vals, n_bits_set, sizeof(uint32_t), intcmp);
        
        // Add data to either model.
        STORM_add(storm, &vals[0], n_bits_set);
        STORM_contig_add(storm_cont, &vals[0], n_bits_set);
    }

    uint64_t cstorm      = STORM_pairw_intersect_cardinality(storm);
    uint64_t cstorm_cont = STORM_contig_pairw_intersect_cardinality(storm_cont);

    printf("contig=%llu storm=%llu\n",cstorm_cont,cstorm);

    free(vals);
    STORM_free(storm);
    STORM_contig_free(storm_cont);
    return 1;
}

Compilation

StormBitmaps can be compiled using the standard cmake workflow:

cmake .
make

As with all cmake projects, you can specify the compilers you wish to use by adding (for example) -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ to the cmake command line. We tell the compiler to target the architecture of the build machine by using the -march=native flag. This can be disabled by passing the -DSTORM_DISABLE_NATIVE=ON argument to cmake.

On Linux and MacOSX, and when running with the native compilation flag, we do not need to specify the target hardware instructions set. This is not the case on Windows where we need to set these flags:

CMake Flag Description
STORM_ENABLE_SIMD_AVX512 Enable AVX512 instructions
STORM_ENABLE_SIMD_AVX2 Enable AVX256 instructions
STORM_ENABLE_SIMD_SSE4_2 Enable SSE4.2 instructions

For example, we can run cmake -DSTORM_ENABLE_SIMD_SSE4_2="ON" . to enable SSE4.2 instructions.

and run ./benchmark.

Note

This is a collaborative effort between Marcus D. R. Klarqvist (@mklarqvist) and Daniel Lemire (@lemire).

History

These functions were originally developed for Tomahawk for computing genome-wide linkage-disequilibrium but can be applied to any intersect-count problem.