Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use RowDiff with DBGHashFast in test_aligner_labeled #505

Open
wants to merge 22 commits into
base: rowdiff
Choose a base branch
from

Conversation

adamant-pwn
Copy link
Contributor

Also uses in_graph(node) checks more extensively throughout the code base and serializes get_last(graph) instead of rd_succ for non-BOSS graphs.

if(dynamic_cast<graph::DBGSuccinct const*>(&graph)) {
rd_succ_bv_type().serialize(f);
} else {
rd_succ_bv_type(*rd_succ).serialize(f);
Copy link
Contributor Author

@adamant-pwn adamant-pwn Oct 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what's the best way to avoid copy-construction here, and if it's generally possible for typical results of get_last. Should I try to do dynamic_cast to rd_succ_bv_type, and only if it fails use the constructor?

} else { // Only happens in DBGSuccinct
auto *dbg_succ = dynamic_cast<graph::DBGSuccinct const*>(graph_);
assert(dbg_succ);
node = row_diff_successor(*graph_, node, dbg_succ->get_boss().get_last());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be problematic to, instead of doing this, assign get_last() to fork_succ during its load, if it was detected that it is empty? I assume in this case we would also need to change fork_succ_ itself into a pointer to bit_vector?

} else {
logger->warn("No count vectors could be found in {}. The last outgoing"
" edges will be selected for assigning RowDiff successors",
count_vectors_dir);
rd_succ = last;
if(dynamic_cast<graph::DBGSuccinct const*>(&graph)) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really don't like doing this. If we try to serialize last, what exactly is the issue with conversion? How much of a slowdown would it be? Alternatively, would it be an option to make some general bit_vector load and add class tags to serialized versions?

@@ -57,33 +57,33 @@ class DBGHashFastImpl : public DBGHashFast::DBGHashFastInterface {
}

void add_sequence(std::string_view sequence,
const std::function<void(node_index)> &on_insertion);
const std::function<void(node_index)> &on_insertion) override final;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did it compile without them? I couldn't find the answer online...

if (valid_edges_) {
serialize_valid_edges(valid_edges_);
} else {
serialize_valid_edges(generate_dummy_kmers(1, false));
Copy link
Contributor Author

@adamant-pwn adamant-pwn Oct 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the new indexing, I really don't want to allow these graphs to walk around with invalid in_graph function. Would lead (and leads if this is removed) to all kinds of hard to debug behaviors. This is not ideal, but at least guarantees it will be valid when the graph is loaded and used somewhere else. But I'm also not sure what's the best place to propagate this information, and how much overhead it may add?

@@ -366,7 +363,7 @@ def test_simple_align_edit_distance_all_graphs(self, representation):

params = self._get_stats(self.tempdir.name + '/genome.MT' + graph_file_extension[representation])
self.assertEqual('11', params['k'])
self.assertEqual('16461', params['nodes (k)'])
self.assertEqual('16438', params['nodes (k)'])
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This number decreases if valid_edges is generated and serialized for DBGSuccinct. Probably because dummy nodes are excluded this way. Same with the change above.

@adamant-pwn adamant-pwn requested a review from hmusta October 16, 2024 22:48
Copy link
Collaborator

@hmusta hmusta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't looked through everything yet, but here's my first batch of comments

metagraph/tests/annotation/test_converters.cpp Outdated Show resolved Hide resolved
metagraph/tests/annotation/test_converters.cpp Outdated Show resolved Hide resolved
metagraph/src/graph/representation/base/sequence_graph.cpp Outdated Show resolved Hide resolved
metagraph/integration_tests/main.py Outdated Show resolved Hide resolved
metagraph/integration_tests/main.py Outdated Show resolved Hide resolved
metagraph/src/graph/representation/hash/dbg_hash_fast.hpp Outdated Show resolved Hide resolved
@adamant-pwn adamant-pwn requested a review from hmusta November 4, 2024 20:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants