Replies: 8 comments
-
It might be better to read the secondary key build data files in a separate thread (or threads) and wait for them before starting to build the secondary keys (if we use the approach with separate index recovery data files).
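For instance, a rough sketch of that shape (all names here are made up for illustration, not Tarantool internals; a plain file read stands in for the real reader):

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

/* Illustrative only: the loader thread reads a recovery data file while
 * the main thread recovers the snapshot; the join happens right before
 * the secondary keys are built. */
struct sk_data_loader {
	pthread_t thread;
	const char *path;	/* e.g. <memtx_dir>/<space_id>/<index_id>/... */
	char *data;		/* file contents, filled by the loader thread */
	long size;
};

static void *
sk_data_load_f(void *arg)
{
	struct sk_data_loader *l = arg;
	FILE *f = fopen(l->path, "rb");
	if (f == NULL)
		return NULL;
	fseek(f, 0, SEEK_END);
	l->size = ftell(f);
	rewind(f);
	l->data = malloc(l->size);
	if (l->data != NULL &&
	    fread(l->data, 1, l->size, f) != (size_t)l->size) {
		free(l->data);
		l->data = NULL;
	}
	fclose(f);
	return NULL;
}

/* Kick off the read as soon as recovery starts... */
static void
sk_data_load_start(struct sk_data_loader *l)
{
	pthread_create(&l->thread, NULL, sk_data_load_f, l);
}

/* ...and block on it only right before building the secondary keys. */
static char *
sk_data_load_wait(struct sk_data_loader *l)
{
	pthread_join(l->thread, NULL);
	return l->data;
}
```

The point is that the file I/O overlaps with reading the `.snap` file, so by the time the secondary keys are built the order data is hopefully already in memory.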
-
Addresses or offsets? The example suggests offsets. I'd use addresses different from 1, 2, 3 to avoid confusion.
-
What's the point of wrapping the chunks in MP_STRING? Anyway, it'd be nice to estimate the size of such a file for a typical index and the amount of time it'll take to load it into memory.
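As a back-of-the-envelope estimate (my numbers, purely illustrative): with 8 bytes per tuple pointer, a secondary index over 10 million tuples yields about 8 B * 10^7 = 80 MB of order data, plus a few bytes of MP_STRING header per batch. At a sequential read speed of around 500 MB/s that is roughly 0.16 s per index to load, so the total time would likely be dominated by the O(n) index insertions rather than by the I/O.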
-
It'd be good to do a small experiment proving the viability of the new approach (AFAIU, using a hash table isn't really cache-friendly; I'm curious what the performance impact of random accesses would be).
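One possible shape for such an experiment (entirely a sketch: a toy open-addressing hash table stands in for whatever pointer-translation map the implementation would use; compile with `cc -O2 bench.c -o bench`):

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N     (1u << 22)	/* 4M entries */
#define SLOTS (2u * N)		/* load factor 0.5, ~128 MB of table */
#define MASK  (SLOTS - 1)

struct slot { uint64_t key, val; };

static inline uint64_t hash(uint64_t k) { return k * 0x9E3779B97F4A7C15ull; }

int main(void)
{
	struct slot *tab = calloc(SLOTS, sizeof(*tab));
	if (tab == NULL)
		return 1;
	/* Insert fake old->new address pairs (keys must be non-zero). */
	for (uint64_t i = 1; i <= N; i++) {
		uint64_t h = hash(i) & MASK;
		while (tab[h].key != 0)
			h = (h + 1) & MASK;
		tab[h].key = i;
		tab[h].val = i * 16;	/* pretend "new address" */
	}
	/* Look the keys up in a pseudo-random order, imitating a secondary
	 * key order that is unrelated to the PK order. */
	struct timespec t0, t1;
	clock_gettime(CLOCK_MONOTONIC, &t0);
	uint64_t sum = 0, r = 1;
	for (uint32_t i = 0; i < N; i++) {
		r = r * 6364136223846793005ull + 1;	/* cheap LCG */
		uint64_t key = (r % N) + 1;
		uint64_t h = hash(key) & MASK;
		while (tab[h].key != key)
			h = (h + 1) & MASK;
		sum += tab[h].val;
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
	printf("%u random lookups: %.3f s (%.1f ns/lookup), sum=%llu\n",
	       N, sec, sec * 1e9 / N, (unsigned long long)sum);
	free(tab);
	return 0;
}
```

Comparing the ns/lookup here against a sequential pass over the same table would give a first idea of the cache penalty.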
-
Before-replace triggers don't break the snapshot creation process: they break only recovery, since they can modify tuples and hence reorder tuples in the secondary indexes. So we shouldn't use … Also, please mention …
-
Could you please provide more information about how and when you will use these files? To be more specific, let's say you have two spaces: …
By the way, does building secondary indexes only after WAL recovery really provide any sensible performance gains? It would be simpler to always build secondary indexes after snapshot recovery.
-
I would elaborate a bit. The idea is to have a "fake" space (let's call it …). Such a space shouldn't actually store anything in memory, only process the tuples in a system … So, basically, the flow is: …
-
As I understand, there are variants that are possible to implement without any options. For this we need to ensure that the secondary key order data doesn't break old Tarantool versions in any way (including wasted memory). I think that we can consider the following variants: …
In my opinion, the blackhole engine/NOP operations are tricks, while holding the extra data in separate files looks like a direct solution, so it seems better to me. OTOH, we should verify how the snapshotting time changes in the case of several secondary indexes. If it becomes much longer, maybe the option to refuse to write the data is useful (but I'm not sure). It is also important to take into account that a modern SSD is able to serve several parallel write requests at a speed close to that of a single write, so writing several files in parallel may be very profitable here.
Technically speaking, shouldn't we check the presence of such triggers while reading the data, not while writing it? I guess that if we have the trigger when writing a snapshot, we likely have it after restart too, but there is no such guarantee.
-
The problem
The issue is described in #10847: since we save tuples in the snapshot in PK order, we have to sort them to build each secondary key (and the sorting process has n*log(n) complexity). Let's save the order of the secondary keys in the snapshot to reduce the build complexity to O(n) (this, in theory, should speed up the recovery of secondary keys).
The algorithm
As the issue suggests, let's write the order of secondary indexes to the snapshot. The algorithm is: …
For example, say the tuples had addresses 0x01, 0x02, 0x03 in PK order when the snapshot was taken, and after recovery the same tuples (in the same PK order) got new addresses [0xa3, 0xb2, 0xc1]. Then we know that the first tuple in the primary index had address 0x01 and its new address is 0xa3, the second one had 0x02 and its new address is 0xb2, and so on. So we can easily build the mapping: if the stored secondary key order is [0x01, 0x03, 0x02], after remapping it becomes [0xa3, 0xc1, 0xb2], an array of the actual tuples in the required order.
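A toy version of this remapping step (a linear scan stands in for the hash table a real implementation would use to keep the whole pass O(n)):

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* Tuple addresses in PK order at snapshot time... */
	uintptr_t old_pk[] = {0x01, 0x02, 0x03};
	/* ...and the new addresses the same tuples got on recovery. */
	uintptr_t new_pk[] = {0xa3, 0xb2, 0xc1};
	/* The secondary key order saved in the snapshot (old addresses). */
	uintptr_t sk_old[] = {0x01, 0x03, 0x02};
	uintptr_t sk_new[3];

	for (int i = 0; i < 3; i++)
		for (int j = 0; j < 3; j++)
			if (old_pk[j] == sk_old[i])
				sk_new[i] = new_pk[j];

	for (int i = 0; i < 3; i++)
		printf("0x%02lx ", (unsigned long)sk_new[i]);
	printf("\n");	/* prints: 0xa3 0xc1 0xb2 */
	return 0;
}
```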
The storage
One could think of inserting the information described above into the .snap file itself, but there are a few options for that: … Another option is to store the information in separate file(s). It's already done this way in Vinyl, so the proposal is to store the snapshot files in a similar way in memtx_dir. So, for each secondary key of a space, the following structure is to be created: <space_id>/<index_id>/<recovery_data_file>.
Option 0: separate files
This one is simple: just store a number of MP_STRING entries specifying the index order data (a binary sequence of 8-byte pointers wrapped into a string header) in <memtx_dir>/<space_id>/<index_id>/<vclock_signature>.index files (a user may be afraid to delete an .index file, whereas they shouldn't be, so better naming suggestions are appreciated). These files are created along with the regular snapshot file in memtx_engine_begin_checkpoint, but only for TREE indexes and only if the space has no before_replace triggers.
The entries to store are MP_STRING instead of arrays of MP_UINT in order to minimize the data validation time: a single string header is validated per batch instead of one MP_UINT per pointer.
The optimal amount of tuple data to store in a single MP_STRING sorting data batch is to be investigated.
The data starts loading at the initial recovery start (unless force_recovery is specified), it is consumed at the secondary index build step, and we continue to read the next chunks after every data_batch_size replaces.
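For illustration, wrapping one batch of pointers could look roughly like this with msgpuck (Tarantool's MessagePack library); the function name and the batch constant here are my own sketch, not the proposed implementation:

```c
#include <stdint.h>
#include <string.h>
#include "msgpuck.h"

struct tuple;	/* only pointers are stored, the definition isn't needed */

/* Hypothetical batch size; the optimal value is the open question above. */
enum { SORT_DATA_BATCH = 8192 };

/*
 * Wrap one batch of tuple pointers into a single MP_STRING: one string
 * header followed by the raw 8-byte pointers. On load, the reader then
 * validates one header per batch instead of decoding one MP_UINT per
 * pointer. Returns the new write position.
 */
static char *
encode_sort_data_batch(char *pos, struct tuple **tuples, uint32_t count)
{
	uint32_t len = count * sizeof(struct tuple *);
	pos = mp_encode_strl(pos, len);
	memcpy(pos, tuples, len);
	return pos + len;
}
```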
Option 1: a system space
The idea is to save the information in a system space: create the space on startup and remove it after recovery. … box.snapshot() after the downgrade (for backward compatibility).
Haven't investigated this one thoroughly, so no technical details for now.
The configuration
A new configuration variable is proposed: memtx_use_sk_recovery_data. If the variable is disabled, we sort the secondary key tuples as before and don't write the recovery data on snapshot.
Changelog
2. The data is stored and loaded with batches of a size yet to be specified.