Sshash module superkmer anno #497

Closed
wants to merge 56 commits into from
Commits
eca1e79
Initial DBGSSHash commit
adamant-pwn Nov 6, 2023
102bf31
Use SSHash submodule instead of PTHash
adamant-pwn Nov 8, 2023
5fbbc3a
trying to use sshash as a submodule
mmarzett Nov 23, 2023
7ff3382
fixed index conversion between ssh and metagraph
mmarzett Nov 30, 2023
52bb957
some changes to sshash, first try at avoiding the compile error in ss…
hmusta Dec 6, 2023
65c5d07
changes for unit_test usage
mmarzett Dec 7, 2023
e6800be
new sshash version
mmarzett Dec 7, 2023
885ccd3
ignore strict aliasing warning
mmarzett Jan 11, 2024
127ef00
using sshash from cli
mmarzett Jan 11, 2024
f21eace
general path
mmarzett Jan 11, 2024
49e68cd
fixed serialization and loading
mmarzett Jan 18, 2024
f95b74c
no longer need k.txt file + clean-up
mmarzett Jan 23, 2024
7b242cf
disabled tests that are incompatible with sshash
mmarzett Jan 23, 2024
955129e
stats changes
mmarzett Apr 2, 2024
d65be03
stats file
mmarzett Apr 2, 2024
3245236
added plotting scripts and improved input and output for stats
mmarzett Apr 3, 2024
6c3df19
annotate superkmers
mmarzett Apr 10, 2024
49829df
sshash with functions for superkmer mapping
mmarzett Apr 10, 2024
8ad8894
fixed minimizer bug
mmarzett Apr 11, 2024
54a8a8e
fixed bug in skew index case
mmarzett Apr 12, 2024
2373fb8
added superkmer mask for lossless annotation
mmarzett Apr 18, 2024
88c9fc8
loading/serializing bit vector
mmarzett Apr 22, 2024
953e3f1
small fixes
mmarzett Apr 22, 2024
4c9455e
mapping to kmers
mmarzett Apr 22, 2024
b522bdc
removed superkmer stats
mmarzett Apr 22, 2024
bf74651
changes to decrease superkmer mask runtime
mmarzett Apr 24, 2024
05ff9ab
removed unnecessary line
mmarzett Apr 24, 2024
7eec588
getting superkmer label batches
mmarzett Apr 26, 2024
e205690
sshash version
mmarzett Apr 26, 2024
d9b7f0b
tiny bug fix
mmarzett Apr 27, 2024
4b3b84f
paralellized bit vector construction
mmarzett Apr 28, 2024
b973597
splitting and merging of superkmer vector for parallelization
mmarzett Apr 29, 2024
a6a0f7f
pure superkmer annotation -> with information loss
mmarzett Apr 29, 2024
361408e
removed 2 extra lookup calls added during debugging
mmarzett May 1, 2024
d6b40ae
more efficient query for kmers in non-monochromatic superkmers
mmarzett May 1, 2024
69a408d
Merge branch 'sshash_module_superkmer_anno' of github.com:ratschlab/m…
mmarzett May 1, 2024
46e280d
removed 2 extra lookup calls added during debugging
mmarzett May 2, 2024
37fbec4
changed annotation type for superkmer mask construction
mmarzett May 3, 2024
5622357
fixed annotation type
mmarzett May 3, 2024
a2997de
sshash both for superkmer bit vec construction and mapping
mmarzett May 6, 2024
c97f396
sshash both for superkmer bit vec construction and mapping
mmarzett May 6, 2024
ce691b2
sshash both for superkmer bit vec construction and mapping
mmarzett May 6, 2024
3e3b885
reverse complement bug fix
mmarzett May 7, 2024
e60ab9f
returning kmer index during superkmer lookup
mmarzett May 12, 2024
6f39143
use npos
mmarzett May 12, 2024
e399df9
Merge branch 'sshash_module_kmer_anno' into sshash_module_superkmer_anno
mmarzett May 16, 2024
afca9e7
Merge branch 'sshash_module_lossy_superkmer_anno' into sshash_module_…
mmarzett May 16, 2024
3da03ad
annotation modus given by flag
mmarzett May 16, 2024
8a23844
added stats
mmarzett May 16, 2024
7195300
clean-up
mmarzett May 24, 2024
5b251ad
plotting and stats scripts updated
mmarzett May 24, 2024
c2c8030
Merge branch 'sshash_module_superkmer_anno'
adamant-pwn Jul 25, 2024
efce431
Don't disable tests for sshash
adamant-pwn Jul 25, 2024
4f86936
Don't disable tests for sshash
adamant-pwn Jul 25, 2024
ef4bcf6
Update sshash
adamant-pwn Jul 25, 2024
f25f2ea
Update sshash
adamant-pwn Jul 25, 2024
15 changes: 15 additions & 0 deletions metagraph/CMakeLists.txt
@@ -545,6 +545,21 @@ target_link_libraries(unit_tests gtest_main gtest gmock metagraph-core metagraph

target_compile_options(unit_tests PRIVATE -Wno-uninitialized ${DEATH_TEST_FLAG})

#-------------------
# superkmer bit vector
#-------------------
add_executable(superkmer_stats scripts/run_superkmer_stats.cpp)
target_include_directories(superkmer_stats PRIVATE src/graph/representation/hash
src/graph/
src/annotation/representation/base/
src/cli/config
src/cli/load
src/common/utils/
)

target_link_libraries(superkmer_stats PRIVATE metagraph-core metagraph-cli ${METALIBS})
target_compile_options(superkmer_stats PRIVATE -Wno-unused-parameter)

#-------------------
# Benchmarks
#-------------------
162 changes: 162 additions & 0 deletions metagraph/scripts/plot_file_sizes.ipynb

Large diffs are not rendered by default.

252 changes: 252 additions & 0 deletions metagraph/scripts/plot_query_bench.ipynb

Large diffs are not rendered by default.

191 changes: 191 additions & 0 deletions metagraph/scripts/plot_query_bench_by_anno_mode.ipynb

Large diffs are not rendered by default.

45 changes: 45 additions & 0 deletions metagraph/scripts/run_superkmer_stats.cpp
@@ -0,0 +1,45 @@
#include "graph/representation/hash/dbg_sshash.hpp"
#include "graph/annotated_dbg.hpp"
#include "annotation/representation/base/annotation.hpp"

#include "cli/load/load_annotation.hpp"
#include "annotation/representation/column_compressed/annotate_column_compressed.hpp"
#include "annotation/representation/row_compressed/annotate_row_compressed.hpp"

#include "string_utils.hpp"

#include "annotation/representation/annotation_matrix/static_annotators_def.hpp"
#include "annotation/binary_matrix/base/binary_matrix.hpp"

int main (int argc, char *argv[]){

using namespace mtg::graph;
using namespace mtg::annot;
using namespace mtg::annot::matrix;
using namespace mtg::cli;

if(argc <3){
std::cerr<<"missing input files!\n" ;//<< "graph path, annotation path\n";
return EXIT_FAILURE;
}
std::string graph_path = argv[1];
std::string anno_path = argv[2];

std::shared_ptr<mtg::graph::DBGSSHash> graph_ptr = std::make_shared<mtg::graph::DBGSSHash>(31);
graph_ptr->load(graph_path);

std::string sk_mask_path = utils::remove_suffix(graph_path, graph_ptr->kExtension) + "_sk_mask";

// Warning: Unused variable...
//Config::AnnotationType config = parse_annotation_type(anno_path);
//assert(config == Config::AnnotationType::RowFlat);

std::unique_ptr<RowFlatAnnotator> anno_ptr = std::make_unique<RowFlatAnnotator>();
anno_ptr->load(anno_path);
std::unique_ptr<AnnotatedDBG> anno_graph = std::make_unique<AnnotatedDBG>(graph_ptr, std::move(anno_ptr), false);

//graph_ptr->superkmer_stats(anno_graph);
graph_ptr->superkmer_bv(anno_graph, sk_mask_path);

return 0;
}
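The driver above derives the mask path by stripping the graph extension and appending `_sk_mask`. A minimal standalone sketch of that step follows. It assumes a single-suffix helper: `strip_suffix` and `mask_path_for` are hypothetical stand-ins, not metagraph's `utils::remove_suffix`, and the extension is passed in because its actual value is not shown in this diff.

```cpp
#include <string>
#include <string_view>

// Hypothetical stand-in for utils::remove_suffix: drops `suffix` from the
// end of `path` if present, otherwise returns `path` unchanged.
std::string strip_suffix(std::string_view path, std::string_view suffix) {
    if (path.size() >= suffix.size()
            && path.substr(path.size() - suffix.size()) == suffix) {
        path.remove_suffix(suffix.size());
    }
    return std::string(path);
}

// Mirrors the path construction in run_superkmer_stats.cpp:
// <graph path without extension> + "_sk_mask".
std::string mask_path_for(std::string_view graph_path, std::string_view extension) {
    return strip_suffix(graph_path, extension) + "_sk_mask";
}
```

Keeping the derivation in one function would let the driver and `DBGSSHash::load` agree on the file name, which the TODO in `load(std::istream &)` suggests is still unresolved.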
204 changes: 204 additions & 0 deletions metagraph/scripts/superkmer_stats.ipynb

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions metagraph/src/cli/config/config.cpp
@@ -819,6 +819,7 @@ Config::GraphType Config::string_to_graphtype(const std::string &string) {

} else if (string == "sshash") {
return GraphType::SSHASH;

} else {
std::cerr << "Error: unknown graph representation" << std::endl;
exit(1);
77 changes: 77 additions & 0 deletions metagraph/src/graph/representation/hash/dbg_sshash.cpp
@@ -324,6 +324,24 @@ template
std::pair<DBGSSHash::node_index, bool>
DBGSSHash::kmer_to_node_with_rc<false>(std::string_view) const;

DBGSSHash::node_index DBGSSHash::kmer_to_node_from_superkmer(std::string_view kmer, uint64_t s_id, bool check_reverse_complement) const {
uint64_t ssh_idx = std::visit([&](auto const& dict) {
return dict.look_up_from_superkmer_id(s_id, kmer.begin(), check_reverse_complement);
}, dict_);
return ssh_idx + 1;
}

std::tuple<uint64_t, uint64_t, uint64_t> DBGSSHash::kmer_to_superkmer_node(std::string_view kmer) const {
auto [kmer_idx, superkmer_idx, superkmer_id] = std::visit([&](auto const& dict) {
return dict.kmer_to_superkmer_idx(kmer.begin(), true);
}, dict_);
if(kmer_idx == sshash::constants::invalid_uint64){
return {npos, npos, sshash::constants::invalid_uint64};
}
// switch to DBG index
return {kmer_idx + 1, superkmer_idx + 1, superkmer_id};
}

DBGSSHash::node_index DBGSSHash::kmer_to_node(std::string_view kmer) const {
if (mode_ == CANONICAL) {
auto res = kmer_to_node_with_rc<true>(kmer);
@@ -378,6 +396,12 @@ bool DBGSSHash::load(std::istream &in) {
if (num_nodes_)
std::visit([&](auto &d) { d.visit(loader); }, dict_);

if(annotation_mode == 2) {
// TODO: How to load annotations from istream?
std::string s_mask_name = /*utils::remove_suffix(filename, kExtension) +*/ "_sk_mask";
load_superkmer_mask(s_mask_name);
}

return true;
}

@@ -387,5 +411,58 @@ bool DBGSSHash::load(const std::string& filename) {
return load(fin);
}

sdsl::bit_vector mask_into_bit_vec(const std::vector<bool>& mask){
sdsl::bit_vector bv (mask.size());
for(size_t idx = 0; idx < mask.size(); idx++){
bv[idx] = mask[idx];
}
return bv;
}

void DBGSSHash::load_superkmer_mask(std::string file){
loaded_mask = load_from_file(superkmer_mask, file);
std::cout<< " successfully loaded " << file<<"?: " <<loaded_mask << std::endl;
}

void DBGSSHash::superkmer_stats(const std::unique_ptr<AnnotatedDBG>& anno_graph) const{
if(annotation_mode != 0){
throw std::runtime_error("Computing super-k-mer stats in wrong annotation mode!");
}
std::cout<< "Computing super-k-mer statistics ... \n";
std::visit([&](auto const& dict) {
dict.make_superkmer_stats(([&anno_graph](std::string_view str){return anno_graph->get_labels(str);}));
}, dict_);
}
void DBGSSHash::superkmer_bv(const std::unique_ptr<AnnotatedDBG>& anno_graph, std::string file_sk_mask) const{
if(annotation_mode != 0){
throw std::runtime_error("Building super-k-mer bit vector in wrong annotation mode!");
}
std::cout<< "Building super-k-mer bit vector... \n";
//getting labels in batches
size_t num_labels = anno_graph->get_annotator().num_labels();
std::vector<uint64_t> superkmer_idxs = std::visit([&](auto const& dict) {
return dict.build_superkmer_bv([&anno_graph, num_labels](std::string_view sequence) {
auto labels = anno_graph->get_top_label_signatures(sequence, num_labels);
// get_top_label_signatures returns only the labels that were found, so if any
// signature contains a zero, the k-mers do not all share the same labels -> return false
for(auto pair : labels){
sdsl::rank_support_v rs;
sdsl::util::init_support(rs,&pair.second);
size_t bit_vec_size = pair.second.size();
if(rs(bit_vec_size) != bit_vec_size) return false;
}
return true;
});
}, dict_);

// elias fano encoding
sdsl::sd_vector<> ef_bv (superkmer_idxs.begin(), superkmer_idxs.end());

std::cout << "serializing bit vector..." << std::endl;
bool check = store_to_file(ef_bv, file_sk_mask);
std::cout<< " successfully stored " << file_sk_mask<<"?: " <<check<<std::endl;

}

} // namespace graph
} // namespace mtg
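The callback passed to `build_superkmer_bv` treats a super-k-mer as monochromatic when every returned label signature is all ones. A sketch of that predicate follows, using `std::vector<bool>` in place of `sdsl::bit_vector` and `std::all_of` in place of the rank-support count. `LabelSignature` and `is_monochromatic` are hypothetical names introduced only for illustration.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for one entry returned by get_top_label_signatures:
// a label name and one presence bit per k-mer of the queried sequence.
using LabelSignature = std::pair<std::string, std::vector<bool>>;

// True if every label is present on every k-mer, i.e. all k-mers of the
// super-k-mer share the same label set. An empty list is vacuously true,
// matching the loop in the diff.
bool is_monochromatic(const std::vector<LabelSignature>& signatures) {
    for (const auto& signature : signatures) {
        const std::vector<bool>& presence = signature.second;
        if (!std::all_of(presence.begin(), presence.end(),
                         [](bool bit) { return bit; })) {
            return false;
        }
    }
    return true;
}
```

Scanning for the first zero avoids building a rank structure per label, which may matter if this callback runs once per super-k-mer; whether it is faster in practice would need measuring.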
12 changes: 12 additions & 0 deletions metagraph/src/graph/representation/hash/dbg_sshash.hpp
@@ -7,7 +7,9 @@
#include <dictionary.hpp>
#include <sdsl/uint256_t.hpp>

#include "graph/annotated_dbg.hpp"
#include "graph/representation/base/sequence_graph.hpp"
#include "sdsl/bit_vectors.hpp"

namespace mtg::graph {

@@ -83,6 +85,7 @@ class DBGSSHash : public DeBruijnGraph {
size_t indegree(node_index) const override;

node_index kmer_to_node(std::string_view kmer) const override;
node_index kmer_to_node_from_superkmer(std::string_view kmer, uint64_t s_id, bool check_reverse_complement) const;

template <bool with_rc = true>
std::pair<node_index, bool> kmer_to_node_with_rc(std::string_view kmer) const;
@@ -110,12 +113,21 @@ class DBGSSHash : public DeBruijnGraph {

node_index reverse_complement(node_index node) const;

std::tuple<uint64_t, uint64_t, uint64_t> kmer_to_superkmer_node(std::string_view kmer) const ;

void load_superkmer_mask(std::string file);
void superkmer_bv(const std::unique_ptr<AnnotatedDBG>& anno_graph, std::string file_sk_mask) const;
void superkmer_stats(const std::unique_ptr<AnnotatedDBG>& anno_graph) const;

private:
static const std::string alphabet_;
dict_t dict_;
size_t k_;
size_t num_nodes_;
Mode mode_;
int annotation_mode = 0; // 0: kmer annotation, 1: lossy superkmer annotation, 2: superkmer annotation with superkmer mask
bool loaded_mask = false;
sdsl::sd_vector<> superkmer_mask;

size_t dict_size() const;
};
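`kmer_to_superkmer_node` shifts SSHash's 0-based indices to 1-based graph indices and maps a failed lookup to `npos`. A sketch of that conversion follows. It assumes `npos` is 0 and that the SSHash sentinel is the maximum `uint64_t`; neither value is shown in this diff, and `kInvalid`, `kNpos`, and `to_dbg_index` are hypothetical names.

```cpp
#include <cstdint>
#include <limits>

// Assumed sentinel values; stand-ins for sshash::constants::invalid_uint64
// and DeBruijnGraph::npos, whose definitions are not part of this diff.
constexpr uint64_t kInvalid = std::numeric_limits<uint64_t>::max();
constexpr uint64_t kNpos = 0;

// Converts a 0-based SSHash index to a 1-based graph index,
// mapping a failed lookup to kNpos explicitly.
constexpr uint64_t to_dbg_index(uint64_t ssh_idx) {
    return ssh_idx == kInvalid ? kNpos : ssh_idx + 1;
}
```

Under these assumptions, the unchecked `ssh_idx + 1` in `kmer_to_node_from_superkmer` would wrap to 0 on a miss and so coincide with `npos`; an explicit check makes that intent visible instead of relying on unsigned overflow.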
3 changes: 2 additions & 1 deletion metagraph/tests/annotation/test_annotated_dbg_helpers.cpp
@@ -2,7 +2,6 @@

#include "test_matrix_helpers.hpp"

#include "../graph/all/test_dbg_helpers.hpp"

#include "graph/representation/hash/dbg_hash_fast.hpp"
#include "graph/representation/bitmap/dbg_bitmap.hpp"
@@ -14,6 +13,8 @@
// this next #include includes AnnotatedDBG. we need access to its protected
// members to modify the underlying annotator
#define protected public
#include "../graph/all/test_dbg_helpers.hpp"

#include "annotation/annotation_converters.hpp"
#include "annotation/representation/annotation_matrix/static_annotators_def.hpp"

Binary file not shown.
2 changes: 0 additions & 2 deletions metagraph/tests/graph/all/test_dbg_contigs.cpp
@@ -450,7 +450,6 @@ TYPED_TEST(DeBruijnGraphTest, CallPaths) {
std::vector<std::string>({ "AAACTCGTAGC", "AAATGCGTAGC" }),
std::vector<std::string>({ "AAACT", "AAATG" }),
std::vector<std::string>({ "ATGCAGTACTCAG", "ATGCAGTAGTCAG", "GGGGGGGGGGGGG" }) }) {

auto graph = build_graph_batch<TypeParam>(k, sequences);

// in stable graphs the order of input sequences
@@ -485,7 +484,6 @@ TYPED_TEST(DeBruijnGraphTest, CallUnitigs) {
std::vector<std::string>({ "AAACTCGTAGC", "AAATGCGTAGC" }),
std::vector<std::string>({ "AAACT", "AAATG" }),
std::vector<std::string>({ "ATGCAGTACTCAG", "ATGCAGTAGTCAG", "GGGGGGGGGGGGG" }) }) {

auto graph = build_graph_batch<TypeParam>(k, sequences);

// in stable graphs the order of input sequences
3 changes: 1 addition & 2 deletions metagraph/tests/graph/all/test_dbg_helpers.cpp
@@ -172,8 +172,7 @@ build_graph<DBGSSHash>(uint64_t k,
std::string dump_path = "../tests/data/sshash_sequences/contigs.fa";
writeFastaFile(contigs, dump_path);

std::shared_ptr<DBGSSHash> graph;
graph = std::make_shared<DBGSSHash>(dump_path, k, mode, num_chars);
auto graph = std::make_shared<DBGSSHash>(dump_path, k, mode, num_chars);

if (mode == DeBruijnGraph::PRIMARY)
return std::make_shared<CanonicalDBG>(