Choose data structure(s) to use for symbol tables #13
Comments
My lack of attention to the text format is showing here. I had the impression that symbol tables in the text format were just weird-looking values, but I suppose I'm missing something. I've been thinking about this topic for the binary encoding, since I'm getting around to working on serialization. A first pass over the Ion stream to encode can accumulate the symbol values into a counting map. Sorting the keys by their counts greatest to least gives the order that they should appear in the symbol table. The local symbol table can then be serialized, and the counting map can be reused by updating the count values to now store the index of each key. Then a second pass over the Ion stream can serialize the values, looking up each symbol's text in the map to find the symbol ID.
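A rough sketch of that two-pass bookkeeping (the function shape and the flattened `&[&str]` input are illustrative assumptions, not a planned API):

```rust
use std::collections::HashMap;

/// First pass: count occurrences of each symbol's text, then order the
/// table most-frequent-first and reuse the map to store symbol IDs.
fn plan_symbol_table(symbols_in_stream: &[&str]) -> (Vec<String>, HashMap<String, usize>) {
    let mut counts: HashMap<String, usize> = HashMap::new();
    for text in symbols_in_stream {
        *counts.entry((*text).to_string()).or_insert(0) += 1;
    }

    // Greatest count first: the hottest symbols get the smallest IDs.
    let mut ordered: Vec<String> = counts.keys().cloned().collect();
    ordered.sort_by(|a, b| counts[b].cmp(&counts[a]));

    // Reuse the counting map: overwrite each count with the symbol's ID.
    // Local IDs start after the Ion 1.0 system table (max_id 9).
    for (i, text) in ordered.iter().enumerate() {
        counts.insert(text.clone(), 10 + i);
    }
    (ordered, counts)
}
```

The second pass then serializes the values, consulting the map for each symbol's ID.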
I agree that this is unrealistic. This is actually something irritating about Ion: the spec lays out support for all sorts of arbitrary-sized things, and then implementations frequently don't actually support them due to implementation overhead. I don't like to artificially limit sizes that the spec says should be unbounded, but I think we should for things that are clearly not useful. As an example, throughout the binary parser there's an assumption that values fit within a bounded number of bytes. Side note: I've been pondering a macro for generating binary parsing combinators for a sequence of maximum byte lengths.
I don't know what the text encoding needs (perhaps you have a reference), and I think the binary encoding should use different data structures for serializing and deserializing. To support serializing, a (list, map) pair is needed, but for deserializing we only need the full (list, map) pair if we want to support absurdly large symbol IDs. It might make sense to split the difference and have a hybrid of the two, as sketched below.
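A hedged sketch of such a hybrid (the type and field names are my own, not from the thread): dense storage for the contiguous range of IDs, with a sparse spillover map for outliers declared via a huge max_id.

```rust
use std::collections::HashMap;

/// Dense storage for the common, contiguous range of symbol IDs plus a
/// sparse spillover map for absurdly large IDs.
struct HybridTable {
    dense: Vec<String>,             // SIDs first..first + dense.len()
    sparse: HashMap<usize, String>, // anything beyond the dense range
    first: usize,                   // first local SID (10 in Ion 1.0)
}

impl HybridTable {
    fn text(&self, sid: usize) -> Option<&str> {
        sid.checked_sub(self.first)
            .and_then(|i| self.dense.get(i))
            .map(String::as_str)
            .or_else(|| self.sparse.get(&sid).map(String::as_str))
    }
}
```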
The binary format shouldn't ever require locking. Between one IVM and the next the local symbol table doesn't update, and the symbol table is replaced after the next IVM is encountered.
In text they can show up as symbol IDs written like `$10`.
Out of curiosity, why is the order of items in the encoded symbol table important?

This is an interesting part of Ion. Depending on what your transmission patterns are, or whether you want to be able to work on chunks, the requirements for when to write a symbol table are different. It feels like a latency vs. size problem: the more symbol tables you write, the more space you're wasting over the wire; the less often you write a table, the longer you have to wait buffering before you send data. That, coupled with the capability for appending to the current table via the imports field, makes the timing question even more situational.
Agreed, I had that same thought w/r/t addressability when working through part of the text parser.
I'm getting a compile-time functional lens vibe from this and it's great, though I don't understand it well enough to know why that optional behavior is necessary.
I've also been thinking about some kind of interval-based data structure that would allow delegation to symbol tables. The current symbol table, for example, would always have a first interval of 0-9 mapping with a 0 offset to the system symbol table. The next interval would map to a shared table, or to the current table if there were none. That adds a layer of indirection and is potentially a little slower, but avoids many allocations that would otherwise occur. It would also probably be more complicated with Rust's memory model, though slightly simplified by the fact that shared symbol tables can be read-only. I'm currently leaning towards "too complicated" for this, but depending on shared symbol table usage it might be advantageous.
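A minimal sketch of the interval idea (types and names here are hypothetical):

```rust
/// Each interval says "IDs start..=end resolve against `source`, shifted
/// by `offset`", so a local table can delegate to the system table (and
/// to shared tables) without copying their entries.
struct Interval<'a> {
    start: usize,          // first symbol ID covered (inclusive)
    end: usize,            // last symbol ID covered (inclusive)
    offset: usize,         // subtract from the SID to index into `source`
    source: &'a [&'a str], // backing table's text entries (read-only)
}

struct IntervalTable<'a> {
    intervals: Vec<Interval<'a>>,
}

impl<'a> IntervalTable<'a> {
    fn text(&self, sid: usize) -> Option<&'a str> {
        // Linear scan for clarity; a binary search on `start` also works.
        self.intervals
            .iter()
            .find(|iv| iv.start <= sid && sid <= iv.end)
            .and_then(|iv| iv.source.get(sid - iv.offset).copied())
    }
}
```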
Thanks for clearing that up for me.
Compactness of the binary encoding. Depending on where they show up, symbols can be encoded as VarUInts. The first 2^7 symbol IDs can be represented with only 1 byte, but the next (2^14 - 2^7) need 2 bytes, etc. You lose the first 10 slots ($0-$9) to the system symbol table, and having more than 2^14 symbols seems very unlikely, so it's basically a matter of ensuring that if you have more than 118 symbols, the most frequently used ones are first. There are also plausible cache performance benefits from keeping all of the most used symbols together at the beginning of the table. Side note: this is why, if we do end up going with a max size for the symbol table, it should line up with one of these VarUInt byte boundaries.
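For concreteness, the byte cost of a symbol ID under Ion's VarUInt rules (7 payload bits per byte, high bit of the final byte as the stop signal); the function name is mine:

```rust
/// Number of bytes needed to encode `value` as an Ion VarUInt.
fn varuint_len(mut value: u64) -> usize {
    let mut len = 1;
    value >>= 7;
    while value != 0 {
        len += 1;
        value >>= 7;
    }
    len
}

fn main() {
    assert_eq!(varuint_len(127), 1);    // $0..=$127: one byte
    assert_eq!(varuint_len(128), 2);    // $128..=$16383: two bytes
    assert_eq!(varuint_len(16_383), 2);
    assert_eq!(varuint_len(16_384), 3); // three bytes from here on
}
```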
I think a persistent encoder is definitely desirable. At some point I plan to model an encoder API along those lines.
Not necessary, but likely a significant performance boost. Binary encoded Ion frequently employs a "keep reading until you find a stop signal" pattern which, because variably sized numerics can be greater than the maximum size of any primitive, requires heap allocation in the general case. This is something I'm probably going to leave off doing at least until const generics are stabilized. Moreover, it would require buy-in from the Ion maintainers, which is hard to get.
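The stop-signal pattern in question, sketched as a fixed-width reader that bails out rather than allocating when a value exceeds a u64 (a simplification; a spec-complete reader would fall back to a big integer here):

```rust
/// Read an Ion VarUInt: accumulate 7 bits per byte until a byte with the
/// high (stop) bit set. Returns (value, bytes consumed), or None on
/// overflow or truncated input.
fn read_varuint(bytes: &[u8]) -> Option<(u64, usize)> {
    let mut value: u64 = 0;
    for (i, b) in bytes.iter().enumerate() {
        value = value.checked_mul(128)?.checked_add(u64::from(b & 0x7F))?;
        if b & 0x80 != 0 {
            return Some((value, i + 1)); // stop signal found
        }
    }
    None // input ended before the stop signal
}
```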
Regarding "an interval map that maps to offsets": leaving aside shared symbol tables, another option would be to have separate data structures for the system and local symbols.
I ran into some rough edges of using a Vec-backed symbol table when running some of the tests. The tests in question deal with large values for max_id (a subset of good/subfield*.ion). As an example, one of these declares a max_id vastly larger than the number of symbols it actually defines; this currently causes the tests to halt (though I am not particularly patient with what I consider unit tests) on my system.
The Rules
The rules for symbol lookup are outlined in http://amzn.github.io/ion-docs/guides/symbols-guide.html
In addition to that, the rules for symbol table construction are of interest: http://amzn.github.io/ion-docs/docs/symbols.html#local-symbol-tables
The spec describes:
SymbolTokens can be looked up via their text or their symbol ID. The symbol ID is a monotonically increasing key. When looked up by text, the key with the lowest value should be returned; that is, the symbol that was registered first.
Usage
This means our insertion pattern is essentially append-only (as a bonus in most cases we should know the size of large tables before we need to allocate them, not counting symbols encountered in the token stream).
For reads, it's more complicated since we essentially need a bi-directional index which "summarizes" in one direction.
If we focus on the high-performance use cases then we should favor the index lookup use case, which is what we currently do. The reverse lookup is unfortunately O(n).
In addition to this, the aforementioned tests essentially require a sparse data structure. I'm not sure if this is a realistic requirement for other use cases as any situation where this could arise would surely be an error (right?).
A simple solution is a pair of data structures (list, map). Is there a use case that requires locking? Text parsing has to be synchronous as encountered symbols must be added to the symbol table in order of occurrence. The binary format supplies all symbols up front and relies on the symbol IDs for lookup, so the symbol table would not be modified if skip-searched from multiple threads.
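A sketch of that (list, map) pair under the rules above (the field and method names are mine, not settled API):

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Append-only (list, map) pair: the list gives O(1) SID -> text, the
/// map gives O(1) text -> lowest SID (first registration wins).
struct SymbolTable {
    by_id: Vec<Arc<str>>,              // index = SID - first_local_id
    by_text: HashMap<Arc<str>, usize>, // text -> lowest SID
    first_local_id: usize,             // 10 after the Ion 1.0 system table
}

impl SymbolTable {
    fn intern(&mut self, text: &str) -> usize {
        if let Some(&sid) = self.by_text.get(text) {
            return sid; // already registered; keep the earliest SID
        }
        let sid = self.first_local_id + self.by_id.len();
        let shared: Arc<str> = Arc::from(text);
        self.by_id.push(shared.clone());
        self.by_text.insert(shared, sid);
        sid
    }

    fn text(&self, sid: usize) -> Option<&str> {
        let idx = sid.checked_sub(self.first_local_id)?;
        self.by_id.get(idx).map(|s| &**s)
    }
}
```

With Arc<str> keys, each symbol's text is stored once and shared between the list and the map, which also plays nicely with read-only access from multiple threads.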