Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ramblings about our (lack of) query threading model #6310

Open
teh-cmc opened this issue May 13, 2024 · 6 comments
Open

Ramblings about our (lack of) query threading model #6310

teh-cmc opened this issue May 13, 2024 · 6 comments
Labels
🧑‍💻 dev experience developer experience (excluding CI) 💬 discussion 🚀 performance Optimization, memory use, etc 🔍 re_query affects re_query itself 📺 re_viewer affects re_viewer itself

Comments

@teh-cmc
Copy link
Member

teh-cmc commented May 13, 2024

Context

A Rerun frame is divided into a bunch of non-overlapping phases.
By "phase" I mean any logical unit of work. Individual phases might be parallelized internally, but as a whole, phases are purely sequential: there is never 2 phases running concurrently.

Here's a (non-exhaustive) list of the phases that make up a frame today (focusing on the data side of things):

  1. Handle I/O and writes to datastore and related maintenance (GC, changelogs, etc).
  2. Blueprint handling and query planning.
  3. Query and space view execution.

In the before times, phase 3 ("Query and space view execution") was single-threaded, and everything was fine.

Then, Rayon was introduced into the codebase in order to provide parallelism at the space view level, using its work-stealing scheduler.
From that point on, space views were running in parallel, in a non-deterministic and potentially reentrant order.
And everything was fine, because space views were read-only back then.

Then, query caching was introduced, and more specifically it was introduced at the edge, in each individual space view, since we don't have a centralized query engine (yet).
Suddenly, space views were not read-only anymore: each space view would potentially write to a central, multi-tenant cache as it was running.
Parallel reentrant writes (both deletions and insertions) coming in a non-deterministic order into a central cache with multiple layers of granular locks is as bad news as it gets... And still, everything was fine, because space views had zero data overlap in practice (there was no easy way to specify custom queries and no blueprint from code, meaning you effectively never had any data overlap between your views in 99.5% of practical cases).

Finally, configurable queries and blueprint from code were introduced.
Suddenly, it became very easy to end up with multiple space views sharing the exact same component paths, and those previously benign concurrent/reentrant/multi-layered writes started to become a real problem (e.g. this or that).

These problems can always be fixed by adding more and more complexity, fixing one edge case after the other, but at some point you have to wonder whether that's worth the trouble at all: if these writes were problematic to begin with, then by definition it's because they were contending on the same resource -- all this added complexity is merely bringing more contention.
Rather, the issue lies in our threading-model during phase 3 -- or lack thereof: we've reached a point where our parallelism partitioning scheme (space views) has very little relation whatsoever to how the data actually flows through the app, and pain ensues.

Rayon knows nothing about how the data is supposed to flow through our pipes. We do, though.
At the end of phase 2 ("Blueprint handling and query planning"), we have all the information we need to schedule things in an optimal way that actually reflects the data flow [1].
There is still a lot of value to be had in using work-stealing schedulers though, especially in read-only scenarios, if used appropriately.

[1] That's actually not entirely true today -- we should make it so, as discussed below!

Proposal

The proposal is two-fold:

  1. We should make it so that all queries are known and planned for at the beginning of the frame, no exception.
  2. We should have appropriate threading models for the read-heavy and write-heavy parts of the query flow.

1. We should make it so that all queries are known and planned for at the beginning of the frame, no exception.

A pre-requisite to optimizing the data flow is to know at the start of the frame, before any view starts rendering (i.e. after phase 2 but before phase 3), all the queries that will need to be executed for the entire viewport.
The corollary is that views shouldn't be allowed to run queries in their execute() implementation anymore.
Rather, each view would implement a new method as part of the SpaceView trait where they would have to return all the queries they might possibly need for the upcoming frame:

trait SpaceView {
    fn queries(&self) -> Vec<Query> {
        let query1 = /* ... */;
        let query2 = /* ... */;
        vec![query1, query2]
    }

    fn execute(&self) {
        // Doesn't have access to the store and query caches anymore, but does have
        // access to the results of `query1` and `query2`.

        // ...
    }

    // ...
}

Which of these query results the view actually ends up using in practice is up to the view.
This allows the view to implement data-driven behaviors, at the cost of running some superfluous queries. This is fine: queries are cheap to run and to cache, as opposed to deserialization of results.

(This is only possible because queries have recently become untyped!)

2. We should have appropriate threading models for the read-heavy and write-heavy parts of the query flow.

Running a query is a 5-step process today:

  1. checking for and applying pending invalidations (write lock)
  2. executing the query to accumulate raw promises in the query cache (write lock)
  3. resolving the raw promises (read lock on the query cache, write lock on the promise cache)
  4. de-queueing and deserializing all the accumulated promises from the query cache to fill the deserialization cache (write and/or read lock)
  5. iterate over the deserialization cache to render the view (read lock)

Note: Step 4 & 5 are typed and therefore must run as part of the space view execution phase.

As of today, all 5 of these steps happen during the space view execution phase.
With the full query plan known ahead of the start of the frame (described just above), we now have an opportunity to split this phase into two more fine-grained phases, and implement appropriated threading strategies into each of these.

The first on these two query phases is the write-heavy phase: it encompasses steps 1 to 3 (included) described above, is executed just after the query planning phase, and just before the space view execution phase.
Since both queries and the query cache work at the entity-component level, the logical thing to do is to group all queries that share the same ComponentPath in the same task.
All tasks can then be executed through a work-stealing scheduler, without contention nor nasty edge cases.

The second query phase is the ready-heavy one and covers steps 4 & 5, and still runs as part of the space view execution phase, like it does today.
Since it is mostly reads, blind work-stealing semantics are fine and even encouraged.

So, all in all we now have:

  1. Handle I/O and writes to datastore and related maintenance (GC, changelogs, etc).
  2. Blueprint handling and exhaustive query planning.
  3. Write-heavy steps of query handling.
  4. Read-heavy steps of query handling and space view execution.
@teh-cmc teh-cmc added 💬 discussion 🧑‍💻 dev experience developer experience (excluding CI) 📺 re_viewer affects re_viewer itself 🔍 re_query affects re_query itself 🚀 performance Optimization, memory use, etc labels May 13, 2024
@jleibs
Copy link
Member

jleibs commented May 13, 2024

Nice writeup. This proposal makes a lot of sense. We should do a passthrough of the code and see how many places we currently violate the proposed pre-determined query constraint. The sooner we can lock ourselves into this mental model the better.

The other thing I'm still curious about is the future path to doing some kind of filtering or transformation. For many cases this probably pushes us in the direction of over-querying + doing more work in-visualizer. But is there a path to declaring that you're interested in a filtered view of the data?

@teh-cmc
Copy link
Member Author

teh-cmc commented May 14, 2024

The other thing I'm still curious about is the future path to doing some kind of filtering or transformation. For many cases this probably pushes us in the direction of over-querying + doing more work in-visualizer. But is there a path to declaring that you're interested in a filtered view of the data?

Yeah I think that can happen naturally over time with incremental improvements to the query DSL.
Every little feature you add to the DSL is that much logic you can get out of the view, and therefore that much data you don't need to over-query.
That's the theory, at least 🙄

@Artxiom
Copy link
Contributor

Artxiom commented May 15, 2024

I'm new here and aware that this is a very complicated topic, but I solved a related issue in a previous project I was working on. Have you ever considered using persistent data types?
There is a really great crate for that:
https://docs.rs/im

It's all based on hash array mapped tries, B-trees and RRB-tree vectors - they have some really cool properties and in my experience this solves 99.9% of all concurrency related issues. You can basically get rid of all locking and everything just becomes simple and nice again 😄

In practice it works a lot like RCU in the Linux kernel: everyone (thread) has a mutable copy of the data so mutating it doesn't affect anyone else, it's internally mutable but externally read-only. So whenever you make a change and want it to be visible to everyone else you just replace an atomic pointer with the new state, whenever you read from it again you get a new updated state. And it's all super cheap to copy (basically just an Arc<T>), quite efficient overall (see chart in the im crate), thread-safe and thus easily parallelizable, doesn't use any locks, has implicit state sharing for free, etc. It's really smart.
There is some overhead though of course (mostly on the memory side), but it's pretty negligible in the grand scheme of things in my experience.

So really worth looking into imo!

@teh-cmc
Copy link
Member Author

teh-cmc commented May 15, 2024

Great thing that you're bringing up copy-on-write datastructures, they always slip my mind 👍

The specific caching datastructures mentioned in this issue exist pretty much solely for cache locality purposes though, so any tree-based datastructure is out of the picture (think e.g. crunching through millions of scalars for plot rendering). You could of course amortize things by making each entry its own contiguous bucket of data etc, but then you pretty much end up back where we are today (bucket splits & merges require synchronization across several keys, not just pointer swaps, and that's before you get into the whole in-place bucket modifications business...).

Similarly, our lack of a data-driven threading model still needs to be addressed in any case: low-level workarounds won't fix a flawed design, and contending on atomics isn't really any better than contending on locks anyhow.

That being said, there are a lot of places in the app that already deal exclusively with buckets/chunks of data, and those places could potentially benefit from CoW datastructures: if only to not have to deal with these nasty mutex guards that always turn into a usability and maintenance nightmare in any non-trivial scenario (and we have a lot of those).
With the upcoming data model changes, there will likely be even more of these places around. In fact it's possible that at some point every piece of the app will be like that.

There might be some maintenance/usability wins to be had here.
Anyways, I'll keep that in mind, thanks!

@Artxiom
Copy link
Contributor

Artxiom commented May 15, 2024

Thanks for clarifying, that's definitely a very interesting challenge 🤓
I wouldn't dismiss RRB-trees though, the name is very misleading - they are more like chunked vectors wrapped in a very flat(!) tree and are very cache-friendly.
Seems with the upcoming model changes you want do do anyway something similar? 😄
RRB-trees were mainly made for efficient crunching of huge datasets in functional languages and with concurrency in mind. It's really just a "concurrent-vector". Sequential iteration, for example, is almost identical to it's underlying data structure - plain vectors. Don't you have anyway mostly sequential log data?
There is really just one scenario where performance would suffer: rapid and frequent modifications, because of all the allocations ...but also only if the modifications are concurrent. On a single thread you can use something called transients and reduce allocations to a minimum, comparable to a standard vector.

Here is the original Clojure paper with benchmarks:
https://core.ac.uk/download/pdf/147975401.pdf

Hope I didn't misunderstand...there is also not really much contention or synchronization going on with persistent data structures - besides the usual Arc<T> atomic reference count, which you will probably have anyway somewhere. With the right design you don't even need to store the "main state" in a shared pointer - you basically just copy around Arc<T>s between threads.

Also one huge advantage is that you can concurrently access data without any cache coherence issues - because it's read only the CPU caches don't get invalidated through writes. With mutable state this is impossible to achieve. If you think about it it also makes sense: the only way for two threads to safely access the same data without any synchronization(!) is with read-only data.

@Artxiom
Copy link
Contributor

Artxiom commented May 15, 2024

@teh-cmc I'm sorry - I did some benchmarks out of curiosity and im::Vector is factor two slower than VecDeque for sequential iteration on my machine. I'm not sure why though - it may be just an implementation issue, I have used RRB-trees in the past (in C++ though) and it was definitely faster. Factor two is also not huge but yeah, for some applications it may be not fast enough.

add.
This didn't let me rest 😄 So I dug deeper. It seems it's just slower because they hard-coded a branching factor for the tree and the M1 Max has huge CPU caches - using bigger chunks of data for the im-vector, which would be anyway the normal case in a real-world scenario, results in almost identical performance to a std::vec::Vec for sequential iteration. There seems to be only a small performance hit sometimes with very small vectors for some strange reason.

It really shines in a concurrent setting though: std::vec::Vec seems to be a tiny little bit faster but when you look at the worst case, it's really terrible (contention with increased thread size). It just doesn't scale. The immutable vector on the other hand doesn't care, it's super stable across the board.
Push at the end is also faster but that's probably only because of the internal pre-allocated memory pool of the im-vector.

vec.bench.txt

Anyway, this is just to show that immutable data structures can be efficient - their properties make them auto-scalable. I was actually always wondering how an immutable ECS could look like - maybe I will try to implement it one day :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🧑‍💻 dev experience developer experience (excluding CI) 💬 discussion 🚀 performance Optimization, memory use, etc 🔍 re_query affects re_query itself 📺 re_viewer affects re_viewer itself
Projects
None yet
Development

No branches or pull requests

3 participants