Replies: 8 comments
-
It might be better to read the secondary key build data files in a separate thread (or threads) and wait for them before starting to build the secondary keys (if we use the approach with separate index recovery data files).
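For instance, a rough sketch of that shape (all names here are made up for illustration, not Tarantool internals; a plain file read stands in for the real reader):

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

/* Illustrative only: the loader thread reads a recovery data file while
 * the main thread recovers the snapshot; the join happens right before
 * the secondary keys are built. */
struct sk_data_loader {
	pthread_t thread;
	const char *path;	/* e.g. <memtx_dir>/<space_id>/<index_id>/... */
	char *data;		/* file contents, filled by the loader thread */
	long size;
};

static void *
sk_data_load_f(void *arg)
{
	struct sk_data_loader *l = arg;
	FILE *f = fopen(l->path, "rb");
	if (f == NULL)
		return NULL;
	fseek(f, 0, SEEK_END);
	l->size = ftell(f);
	rewind(f);
	l->data = malloc(l->size);
	if (l->data != NULL &&
	    fread(l->data, 1, l->size, f) != (size_t)l->size) {
		free(l->data);
		l->data = NULL;
	}
	fclose(f);
	return NULL;
}

/* Kick off the read as soon as recovery starts... */
static void
sk_data_load_start(struct sk_data_loader *l)
{
	pthread_create(&l->thread, NULL, sk_data_load_f, l);
}

/* ...and block on it only right before building the secondary keys. */
static char *
sk_data_load_wait(struct sk_data_loader *l)
{
	pthread_join(l->thread, NULL);
	return l->data;
}
```

The point is that the file I/O overlaps with reading the `.snap` file, so by the time the secondary keys are built the order data is hopefully already in memory.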
-
Addresses or offsets? The example suggests offsets. I'd use addresses different from 1, 2, 3 to avoid confusion.
-
What's the point of wrapping the chunks in MP_STRING? Anyway, it'd be nice to estimate the size of such a file for a typical index and the amount of time it'll take to load it into memory.
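As a back-of-the-envelope estimate (my numbers, purely illustrative): with 8 bytes per tuple pointer, a secondary index over 10 million tuples yields about 8 B * 10^7 = 80 MB of order data, plus a few bytes of MP_STRING header per batch. At a sequential read speed of around 500 MB/s that is roughly 0.16 s per index to load, so the total time would likely be dominated by the O(n) index insertions rather than by the I/O.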
-
It'd be good to do a small experiment proving the viability of the new approach (AFAIU, using a hash table isn't really cache-friendly; I'm curious what the performance impact of random accesses would be).
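One possible shape for such an experiment (entirely a sketch: a toy open-addressing hash table stands in for whatever pointer-translation map the implementation would use; compile with `cc -O2 bench.c -o bench`):

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N     (1u << 22)	/* 4M entries */
#define SLOTS (2u * N)		/* load factor 0.5, ~128 MB of table */
#define MASK  (SLOTS - 1)

struct slot { uint64_t key, val; };

static inline uint64_t hash(uint64_t k) { return k * 0x9E3779B97F4A7C15ull; }

int main(void)
{
	struct slot *tab = calloc(SLOTS, sizeof(*tab));
	if (tab == NULL)
		return 1;
	/* Insert fake old->new address pairs (keys must be non-zero). */
	for (uint64_t i = 1; i <= N; i++) {
		uint64_t h = hash(i) & MASK;
		while (tab[h].key != 0)
			h = (h + 1) & MASK;
		tab[h].key = i;
		tab[h].val = i * 16;	/* pretend "new address" */
	}
	/* Look the keys up in a pseudo-random order, imitating a secondary
	 * key order that is unrelated to the PK order. */
	struct timespec t0, t1;
	clock_gettime(CLOCK_MONOTONIC, &t0);
	uint64_t sum = 0, r = 1;
	for (uint32_t i = 0; i < N; i++) {
		r = r * 6364136223846793005ull + 1;	/* cheap LCG */
		uint64_t key = (r % N) + 1;
		uint64_t h = hash(key) & MASK;
		while (tab[h].key != key)
			h = (h + 1) & MASK;
		sum += tab[h].val;
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
	printf("%u random lookups: %.3f s (%.1f ns/lookup), sum=%llu\n",
	       N, sec, sec * 1e9 / N, (unsigned long long)sum);
	free(tab);
	return 0;
}
```

Comparing the ns/lookup here against a sequential pass over the same table would give a first idea of the cache penalty.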
-
Before-replace triggers don't break the snapshot creation process: they break only recovery, since they can modify tuples and hence reorder tuples in the secondary indexes. So we shouldn't use … Also, please mention …
-
Could you please provide more information about how and when you will use these files? To be more specific, let's say you have two spaces: …
By the way, does building secondary indexes only after WAL recovery really provide any sensible performance gains? It would be simpler to always build secondary indexes after snapshot recovery.
-
I would elaborate a bit. The idea is to have a "fake" space (let's call it …). Such a space shouldn't actually store anything in memory, only process the tuples in a system … So, basically, the flow is: …
-
As I understand, there are variants that are possible to implement without any options. For this we need to ensure that the secondary key order data doesn't break old Tarantool versions in any way (including wasted memory). I think that we can consider the following variants: …
In my opinion, the blackhole engine/NOP operations are tricks, while holding the extra data in separate files looks like a direct solution, so it seems better to me. OTOH, we should verify how the snapshotting time changes in the case of several secondary indexes. If it becomes much longer, maybe the option to refuse to write the data is useful (but I'm not sure). It is also important to take into account that a modern SSD is able to serve several parallel write requests at a speed close to that of a single write, so writing several files in parallel may be very profitable here.
Technically speaking, shouldn't we check the presence of such triggers while reading the data, not while writing it? I guess that if we have the trigger when writing a snapshot, we likely have it after restart too, but there is no such guarantee.
-
The problem
The issue is described in #10847: since we save tuples in the snapshot in PK order, we have to sort them to build each secondary key (and the sorting process has n*log(n) complexity). Let's save the order of the secondary keys in the snapshot to reduce the build complexity to O(n) (this, in theory, should speed up the recovery of secondary keys).
The algorithm
As the issue suggests, let's write the order of secondary indexes to the snapshot. The algorithm is: …
For example, say the tuples had addresses 0x01, 0x02, 0x03 in PK order when the snapshot was taken, and after recovery the same tuples (in the same PK order) got new addresses [0xa3, 0xb2, 0xc1]. Then we know that the first tuple in the primary index had address 0x01 and its new address is 0xa3, the second one had 0x02 and its new address is 0xb2, and so on. So we can easily build the mapping: if the stored secondary key order is [0x01, 0x03, 0x02], after remapping it becomes [0xa3, 0xc1, 0xb2], an array of the actual tuples in the required order.
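A toy version of this remapping step (a linear scan stands in for the hash table a real implementation would use to keep the whole pass O(n)):

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* Tuple addresses in PK order at snapshot time... */
	uintptr_t old_pk[] = {0x01, 0x02, 0x03};
	/* ...and the new addresses the same tuples got on recovery. */
	uintptr_t new_pk[] = {0xa3, 0xb2, 0xc1};
	/* The secondary key order saved in the snapshot (old addresses). */
	uintptr_t sk_old[] = {0x01, 0x03, 0x02};
	uintptr_t sk_new[3];

	for (int i = 0; i < 3; i++)
		for (int j = 0; j < 3; j++)
			if (old_pk[j] == sk_old[i])
				sk_new[i] = new_pk[j];

	for (int i = 0; i < 3; i++)
		printf("0x%02lx ", (unsigned long)sk_new[i]);
	printf("\n");	/* prints: 0xa3 0xc1 0xb2 */
	return 0;
}
```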
The storage
One could think of inserting the information described above into the .snap file itself, but there are a few options for that: … Another option is to store the information in separate file(s). It's already done this way in Vinyl, so the proposal is to store the snapshot files in a similar way in memtx_dir. So, for each secondary key of a space, the following structure is to be created: <space_id>/<index_id>/<recovery_data_file>.
Option 0: separate files
This one is simple: just store a number of MP_STRING entries specifying the index order data (a binary sequence of 8-byte pointers wrapped into a string header) in <memtx_dir>/<space_id>/<index_id>/<vclock_signature>.index files (a user may be afraid to delete an .index file, whereas they shouldn't be, so better naming suggestions are appreciated). These files are created along with the regular snapshot file in memtx_engine_begin_checkpoint, but only for TREE indexes and only if the space has no before_replace triggers.
The entries to store are MP_STRING instead of arrays of MP_UINT in order to minimize the data validation time: a single string header is validated per batch instead of one MP_UINT per pointer.
The optimal amount of tuple data to store in a single MP_STRING sorting data batch is to be investigated.
The data starts loading at the initial recovery start (unless force_recovery is specified), it is consumed at the secondary index build step, and we continue to read the next chunks after every data_batch_size replaces.
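For illustration, wrapping one batch of pointers could look roughly like this with msgpuck (Tarantool's MessagePack library); the function name and the batch constant here are my own sketch, not the proposed implementation:

```c
#include <stdint.h>
#include <string.h>
#include "msgpuck.h"

struct tuple;	/* only pointers are stored, the definition isn't needed */

/* Hypothetical batch size; the optimal value is the open question above. */
enum { SORT_DATA_BATCH = 8192 };

/*
 * Wrap one batch of tuple pointers into a single MP_STRING: one string
 * header followed by the raw 8-byte pointers. On load, the reader then
 * validates one header per batch instead of decoding one MP_UINT per
 * pointer. Returns the new write position.
 */
static char *
encode_sort_data_batch(char *pos, struct tuple **tuples, uint32_t count)
{
	uint32_t len = count * sizeof(struct tuple *);
	pos = mp_encode_strl(pos, len);
	memcpy(pos, tuples, len);
	return pos + len;
}
```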
Option 1: a system space
The idea is to save the information in a system space: create the space on startup and remove it after recovery. … box.snapshot() after the downgrade (for backward compatibility).
Haven't investigated this one thoroughly, so no technical details for now.
The configuration
A new configuration variable is proposed: memtx_use_sk_recovery_data. If the variable is disabled, we sort the secondary key tuples as before and don't write the recovery data on snapshot.
Changelog
2. The data is stored and loaded with batches of a size yet to be specified.