Sorting of numbers of different types in one index #11017

Gerold103 · 2025-01-08T17:39:50Z

Gerold103
Jan 8, 2025
Collaborator

Reviewers

Main Reviewer: @locker
Second Reviewer: @Totktonada
Team Lead: @sergepetrenko
CTO: @sergos

Tickets

Changelog

v2:

Added another section with hash index templates problem.
The double index field type has a new preferable solution.

Summary

This document explains the solution to a problem Tarantool faces at the time of writing: incorrect sorting and hashing of numbers in indexes. It results in inaccessible values in unique and non-unique indexes and in duplicates in unique indexes. Both memtx and vinyl are affected, as are both hash and tree index types.

Initially the issue was thought to be small and limited to suboptimal MessagePack handling, such as a number not stored in the most compact way, or a positive integer stored with an MP_INT header.

But later the problem turned out to be larger. All the known non-trivial issues are listed here with their solutions. Some of them have multiple solutions, and it is unclear which to choose. Hence the RFC: to reach a consensus on those uncertainties.

The issues

✅ Suboptimal MessagePack

Some unsigned numbers can be stored in 9 different ways in MessagePack, only one of which is optimal and takes a single byte. All numbers in the [int32_min, uint32_max] range can be stored in more than one way.

The most compact encoding is obviously preferable, because it takes less space in memory and on disk. However:

- Some encodings take equal space. For example, many numbers in the range [128, int32_max] (such as everything from 256 to int16_max) take the same number of bytes as MP_INT and as MP_UINT. The compactness argument doesn't work here.
- There might be valid MessagePack codecs not maintained by the Tarantool team that do not produce the most compact encodings, even though they pack the numbers correctly. Not supporting them would mean that Tarantool doesn't support MessagePack, but has its own codec.

The decision was to support the suboptimal MessagePack. All the comparators and hash functions must provide the same results regardless of how numbers are encoded.
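As a small illustration of why comparators and hash functions cannot look at the raw bytes, the sketch below hand-encodes the number 100 in every integer format from the MessagePack specification and decodes each one back. The helper names (`encodings_of`, `decode_int`) are made up for this example and are not Tarantool functions.

```python
import struct

# (format byte, struct code) for the fixed-width MessagePack integer formats.
UINT_FORMATS = [(0xCC, ">B"), (0xCD, ">H"), (0xCE, ">I"), (0xCF, ">Q")]
INT_FORMATS = [(0xD0, ">b"), (0xD1, ">h"), (0xD2, ">i"), (0xD3, ">q")]


def encodings_of(value):
    """Return every valid MessagePack integer encoding of `value`."""
    result = []
    if 0 <= value <= 0x7F:
        result.append(bytes([value]))            # positive fixint
    if -32 <= value < 0:
        result.append(struct.pack(">b", value))  # negative fixint
    for marker, code in UINT_FORMATS + INT_FORMATS:
        try:
            result.append(bytes([marker]) + struct.pack(code, value))
        except struct.error:
            pass  # the value does not fit into this format
    return result


def decode_int(data):
    """Decode any MessagePack integer encoding into a plain integer."""
    marker = data[0]
    if marker <= 0x7F:
        return marker            # positive fixint
    if marker >= 0xE0:
        return marker - 0x100    # negative fixint
    for fmt_marker, code in UINT_FORMATS + INT_FORMATS:
        if marker == fmt_marker:
            return struct.unpack(code, data[1:])[0]
    raise ValueError("not a MessagePack integer")


encoded = encodings_of(100)
decoded = {decode_int(e) for e in encoded}
print(len(encoded), "encodings,", len(decoded), "distinct value")
```

Comparing or hashing `decode_int(...)` results instead of the encoded bytes gives the same answer for all of these encodings, which is the property the decision above asks of comparators and hash functions.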

✅ Float values in hash index

The same value stored as double and as float had different hashes. One could argue that this is intentional, but there is virtually no case when a value stored as float wouldn't also be representable as double. The difference is the same as storing a number in a 5-byte MP_UINT vs a 9-byte MP_UINT.

The decision is to hash all floating-point numbers as double.
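A minimal sketch of that idea follows. `zlib.crc32` is only a stand-in for the real hash function, and `hash_raw` / `hash_as_double` are hypothetical helpers. The MP_FLOAT and MP_DOUBLE encodings of 1.5 are different byte strings, so hashing the bytes as they are depends on the encoding, while widening to double first makes the hash input identical.

```python
import struct
import zlib

value = 1.5  # exactly representable as both float32 and float64

as_float = b"\xca" + struct.pack(">f", value)   # MP_FLOAT, 5 bytes
as_double = b"\xcb" + struct.pack(">d", value)  # MP_DOUBLE, 9 bytes


def hash_raw(data):
    """Hash the encoded bytes as they are (depends on the encoding)."""
    return zlib.crc32(data)


def hash_as_double(data):
    """Decode, widen to double, then hash the 8-byte double image."""
    if data[0] == 0xCA:
        # float32 is unpacked into a Python float, which is a C double.
        number = struct.unpack(">f", data[1:])[0]
    else:
        number = struct.unpack(">d", data[1:])[0]
    return zlib.crc32(struct.pack(">d", number))


print(len(as_float), len(as_double))
print(hash_as_double(as_float) == hash_as_double(as_double))
```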

✅ Hash index uses different hashing algorithm for some field types

Example:

box.cfg{}
s = box.schema.space.create('test')
s:create_index('pk')
sk = s:create_index('sk', {parts = {2, 'unsigned'}, type = 'hash'})
s:replace{2, 16}
sk:get{16} -- found
sk:alter({parts = {2, 'number'}})
sk:get{16} -- not found

This is broken because a single unsigned field uses one hashing algorithm, while a single field of any other type uses another.

One solution could be that the hash index always uses the same algorithm for a single field, regardless of its type. But that would complicate the code and, according to the benchmarks, isn't particularly beneficial.

The decision is to remove the templated optimization for hash indexes having up to 3 string/unsigned fields.
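The failure mode can be reproduced with a toy model. Everything in the sketch below is hypothetical: `ToyHashIndex` is not how Tarantool stores hash indexes, and both hash functions are arbitrary stand-ins. It only shows that if the hash function depends on the declared field type and an alter does not rehash the stored keys, a lookup lands in a different slot, whereas one algorithm for all types keeps the key reachable.

```python
def hash_unsigned(value):
    """Toy stand-in for a hash specialised for 'unsigned' fields."""
    return value & 0xFFFFFFFF


def hash_generic(value):
    """Toy stand-in for the generic hash used for other field types."""
    return (value * 31 + 7) & 0xFFFFFFFF


class ToyHashIndex:
    def __init__(self, field_type, hash_by_type):
        self.field_type = field_type
        self.hash_by_type = hash_by_type
        self.slots = {}  # hash value -> stored key

    def _hash(self, key):
        return self.hash_by_type[self.field_type](key)

    def insert(self, key):
        self.slots[self._hash(key)] = key

    def get(self, key):
        return self.slots.get(self._hash(key))

    def alter(self, field_type):
        # Only the declared type changes; stored entries are not rehashed.
        self.field_type = field_type


def lookup_after_alter(hash_by_type):
    index = ToyHashIndex("unsigned", hash_by_type)
    index.insert(16)
    index.alter("number")
    return index.get(16)


per_type = {"unsigned": hash_unsigned, "number": hash_generic}
uniform = {"unsigned": hash_generic, "number": hash_generic}

print(lookup_after_alter(per_type))  # None: the lookup probes another slot
print(lookup_after_alter(uniform))   # 16: one algorithm for every type
```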

❓Index behaviour depending on field type

There is a rather non-trivial bug affecting both hash and tree indexes, in memtx at least. The problem is that the behaviour of some indexes depends on their field types.

For example, if an indexed field type is double, then all its values are forcibly cast to double before comparison/hashing. This was done so that people having a field of type double could also insert integers there, literally as MP_INT and MP_UINT, and they would "behave" like doubles. From the user's point of view it would look as if the numbers were cast to double before being saved.

That in turn breaks other assumptions. For instance, there is an assumption that double is completely included in the number type. One would expect that converting a double field into a number field doesn't need any checks. But it can break the index, because in a number index the comparison/hashing is based entirely on field values, without any lossy casting. A specific buggy example:

box.cfg{}
s = box.schema.space.create('s')
s:create_index('s_idx', {parts = {1, 'unsigned'}})
uint64_max = 0xffffffffffffffffULL
s:insert{uint64_max}
s:insert{uint64_max - 1}

s.index.s_idx:alter({parts = {1, 'double'}})
s:get{uint64_max}
s:get{uint64_max - 1} -- same value
-- uint64_max is unreachable via gets.

Initially the unique index contains 2 unique integer numbers. But then the field type is changed to double. If Tarantool's double is treated as a C/C++ double, then:

- The unsigned -> double field type change shouldn't have worked, because the different unsigned values currently stored in the index are the same double value.
- Assuming the conversion is allowed, the index gets broken: it contains "duplicates", one of which is inaccessible (not found via get, only visible via select/pairs).
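The collision itself is plain IEEE 754 arithmetic and can be checked outside Tarantool. The Python sketch below relies on Python's `float` being a 64-bit double and on Python comparing `int` with `float` exactly, which is an assumption about the illustration language, not about Tarantool.

```python
UINT64_MAX = 2**64 - 1

a = UINT64_MAX
b = UINT64_MAX - 1

# Distinct as integers...
print(a == b)                # False
# ...but identical once cast to a double, which has a 53-bit mantissa.
print(float(a) == float(b))  # True
print(float(a) == 2.0**64)   # True: both round up to 2^64

# An exact (lossless) comparison does not consider the integer equal
# to its own double image.
print(a == float(a))         # False
```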

Another example:

box.cfg{}
s = box.schema.space.create('s')
pk = s:create_index('pk', {parts = {1, 'unsigned'}})
sk = s:create_index('sk', {parts = {2, 'double'}, unique = false})

uint64_max = 0xffffffffffffffffULL
s:insert{0, uint64_max}
s:insert{1, uint64_max - 1}
sk:select{uint64_max} -- returns 2 values
sk:select{uint64_max - 1} -- returns 2 values

sk:alter({parts = {2, 'number'}})
sk:select{uint64_max} -- returns 0 values???
sk:select{uint64_max - 1} -- returns 2 values???

This is complete nonsense that is hard to explain. But it seems to be related to the same problem.

Another example:

box.cfg{}
s = box.schema.space.create('test')
s:create_index('pk')
sk = s:create_index('sk', {parts = {2, 'double'}, type = 'hash'})
s:replace{2, 16}
sk:get{16} -- found
sk:alter({parts = {2, 'number'}})
sk:get{16} -- not found

The same value is found or not found depending only on the field type. Nothing else changes.

The bug has 3 possible solutions:

1. Make Tarantool's double type work exactly like a C/C++ double when it comes to indexes. That is, an index field type alter unsigned -> double would cause an index rebuild. Besides, if the index is unique, this alter might fail if some integers get converted to the same double. The conversion double <-> number would do the same, because it changes the ordering and the definition of duplicate values. This solution wouldn't break anything that isn't already broken.
2. Make Tarantool's double type work the same as number without decimal. This solution can break gets in unique indexes for people who expect integers not fitting the double type to be cast to double, and who therefore intentionally look up values by non-matching keys. But that is highly unlikely.
3. Make Tarantool's double type an alias for number. Same as (2) but even simpler.

The decision is to follow option 1, i.e. make the index compare fields using the index's field type, not the space format or the value type, and make sure the alter respects that (i.e. rebuilds the index when needed and checks for duplicates).
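What option 1 implies for a unique index can be sketched as a duplicate check that runs with the comparison key of the new field type. This is only an illustration under stated assumptions: `index_key` and `find_duplicates` are hypothetical names, and the real check would live in the index rebuild code.

```python
def index_key(field_type, value):
    """Map a stored value to the key the index compares by.

    'double' casts to a C-like double; other numeric types keep the
    exact value.
    """
    if field_type == "double":
        return float(value)
    return value


def find_duplicates(values, new_field_type):
    """Return groups of stored values that collapse into one key."""
    groups = {}
    for value in values:
        groups.setdefault(index_key(new_field_type, value), []).append(value)
    return [group for group in groups.values() if len(group) > 1]


UINT64_MAX = 2**64 - 1
stored = [UINT64_MAX, UINT64_MAX - 1, 42]

print(find_duplicates(stored, "double"))    # one group: the two huge integers
print(find_duplicates(stored, "unsigned"))  # no groups
```

With such a check, the unsigned -> double alter of a unique index holding these values would be rejected instead of silently producing an unreachable tuple.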

unera · 2025-01-09T07:53:02Z

unera
Jan 9, 2025
Collaborator

The comparator MUST have the type that is defined in the index.
So :select{key} MUST convert the key to that type before any operations.
Number is not a type, it is a metatype (it includes int, double, uint, and, in the future, decimal), so its comparator must:

- compare different types by converting one to the other
- compare the same types without any conversion

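-- Pseudocode, not real Lua: type() stands for the stored numeric type
-- (int, uint, double) and double() for a cast to double.
function number_comparator(a, b)
   if type(a) == type(b) then
      -- Same type: compare directly, no conversion.
      return a > b
   end
   if type(a) == 'double' or type(b) == 'double' then
      -- Mixed types: convert both to double first.
      return double(a) > double(b)
   end
   -- etc
end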

Also, we should define rules for comparing positive ints vs uints.

0 replies

unera · 2025-01-09T07:57:33Z

unera
Jan 9, 2025
Collaborator

Some ways of encoding take equal space. For example, all the numbers in the range [128, int32_max] can be stored in equal number of bytes as MP_INT and MP_UINT. The argument of compactness doesn't work here.

On this point, we CAN choose between the signed and unsigned encoding based on space:format.field.type, or always prefer unsigned.

0 replies

sergepetrenko · 2025-01-09T15:24:06Z

sergepetrenko
Jan 9, 2025
Maintainer

I lean towards the first option. It's better not to introduce any behavior changes. Besides, it seems logical that we have an index type matching each possible field type.

AFAICS double field type was added for the sake of SQL DOUBLE. Is there any place where SQL relies on the fact that numbers are compared exactly like doubles?

Make Tarantool's double type work entirely the same as C/C++ double when it comes to indexes. That is, index field type alter unsigned -> double would cause index rebuild. Besides, if the index is unique, this field alter might fail if some integers get converted to the same double. The conversion double <-> number will do the same. Because it changes the ordering and changes the definition of duplicate values. This solution wouldn't break anything that isn't already broken.

1 reply

Gerold103 Jan 9, 2025
Collaborator Author

Is there any place where SQL relies on the fact that numbers are compared exactly like doubles?

No. SQL always compares values using their native types. Not index or field types/meta-types.

locker · 2025-01-10T10:05:36Z

locker
Jan 10, 2025
Maintainer

Make Tarantool's double type work entirely the same as C/C++ double when it comes to indexes. That is, index field type alter unsigned -> double would cause index rebuild. Besides, if the index is unique, this field alter might fail if some integers get converted to the same double. The conversion double <-> number will do the same. Because it changes the ordering and changes the definition of duplicate values. This solution wouldn't break anything that isn't already broken.

Does it mean that 51af059 is going to be reverted? If not, I don't understand why we can't convert 'double' to 'number' without index rebuild: 'double' stores MP_INT and MP_DOUBLE values, which are included by 'number'.

1 reply

Gerold103 Jan 10, 2025
Collaborator Author

Nope, it won't be reverted. We have to rebuild the index, because number has a different concept of duplicates and might have a different sorting order.

For example, consider this:

box.cfg{}
s = box.schema.space.create('test')
s:create_index('pk')
uint64_max = 0xffffffffffffffffULL
s:replace{1, uint64_max - 5}
s:replace{2, uint64_max - 2}
s:replace{3, uint64_max - 4}
s:replace{4, uint64_max - 3}
s:replace{5, uint64_max - 1}
s:replace{6, uint64_max}
idx = s:create_index('sk', {parts = {2, 'double'}, unique = false})

-- Returns all those tuples for any key from uint64_max down to uint64_max - 1000.
idx:select{uint64_max}

-- Let's alter it without a rebuild.
idx:alter({parts = {2, 'number'}})
idx:select{uint64_max-5} -- Ok
idx:select{uint64_max-4} -- NOT FOUND!!!!
idx:select{uint64_max-3} -- Ok
idx:select{uint64_max-2} -- NOT FOUND!!!

-- Let's rebuild it.
idx:alter({unique = true})
idx:select{uint64_max-4} -- Ok

For those tuples the comparator changes, which means the sort order can become different. That requires an index rebuild, to sort the tuples differently.
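The same effect shows up in a much simpler model than a real tree index. The sketch below assumes a sorted array with binary search in place of Tarantool's tree, so the specific keys that go missing differ from the session above. The mechanism is the same: the data is ordered by the old comparator but searched with the new one.

```python
from bisect import bisect_left

UINT64_MAX = 2**64 - 1
# Secondary-key values in primary-key order, as inserted in the example.
keys = [UINT64_MAX - 5, UINT64_MAX - 2, UINT64_MAX - 4,
        UINT64_MAX - 3, UINT64_MAX - 1, UINT64_MAX]

# Under the 'double' comparator all six keys are equal, so a stable sort
# leaves them in primary-key order.
double_order = sorted(keys, key=float)


def exact_lookup(ordered, key):
    """Binary search comparing exact integer values ('number' semantics)."""
    pos = bisect_left(ordered, key)
    return pos < len(ordered) and ordered[pos] == key


# Search the double-ordered data with the exact comparator, then repeat
# after re-sorting ("rebuilding") with the exact comparator.
found_without_rebuild = [exact_lookup(double_order, k) for k in sorted(keys)]
found_after_rebuild = [exact_lookup(sorted(keys), k) for k in sorted(keys)]

print(found_without_rebuild)
print(found_after_rebuild)
```

Some lookups fail on the array that was ordered by the double comparator, and all of them succeed once the array is re-sorted with the exact one.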
