[Tracking] Breaking changes in V2 #54

Closed
8 tasks done
marvin-j97 opened this issue May 22, 2024 · 2 comments

marvin-j97 (Collaborator) commented May 22, 2024

API

  • Remove FlushMode alias
  • Enable bloom feature by default

Data format

i18nsite commented May 23, 2024

I hope that fixed-length keys and values can be considered when designing the format. In many cases, keys and values have a fixed length (such as a u64 id mapped to a file hash). I believe fixed-length fields can be optimized a lot.

I think you can look at DuckDB and consider periodically writing data to the log and compressing it into the Parquet format.
https://duckdb.org/docs/data/parquet/overview.html
https://parquet.apache.org

I believe this format applies a lot of optimizations to the data.

You can use this library to read and write it: https://docs.rs/parquet/latest/parquet/
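
For example, a minimal sketch of writing such id → hash pairs with the low-level writer API of the parquet crate (the schema, file name, and values here are made up for illustration):

```rust
use std::{fs::File, sync::Arc};

use parquet::{
    data_type::{ByteArray, ByteArrayType, Int64Type},
    file::{properties::WriterProperties, writer::SerializedFileWriter},
    schema::parser::parse_message_type,
};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical schema: a 64-bit id mapped to a binary file hash.
    let schema = Arc::new(parse_message_type(
        "message kv { REQUIRED INT64 id; REQUIRED BYTE_ARRAY hash; }",
    )?);
    let file = File::create("kv.parquet")?;
    let props = Arc::new(WriterProperties::builder().build());
    let mut writer = SerializedFileWriter::new(file, schema, props)?;

    // One row group; Parquet stores each column's values contiguously.
    let mut row_group = writer.next_row_group()?;

    let mut id_col = row_group.next_column()?.expect("id column");
    id_col
        .typed::<Int64Type>()
        .write_batch(&[1, 2, 3], None, None)?;
    id_col.close()?;

    let mut hash_col = row_group.next_column()?.expect("hash column");
    hash_col.typed::<ByteArrayType>().write_batch(
        &[
            ByteArray::from("hash-1"),
            ByteArray::from("hash-2"),
            ByteArray::from("hash-3"),
        ],
        None,
        None,
    )?;
    hash_col.close()?;

    row_group.close()?;
    writer.close()?;
    Ok(())
}
```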

marvin-j97 (Collaborator, Author) commented May 23, 2024

I hope that fixed-length keys and values can be considered when designing the format. In many cases, keys and values have a fixed length (such as a u64 id mapped to a file hash). I believe fixed-length fields can be optimized a lot.

I'm not sure if fixed lengths can really be optimized in block-based tables. You would save at most 3 bytes per K-V pair for a lot of added complexity. It could save you some decent space for huge data sets, but not in block-based tables, and right now I don't plan on adding other types of tables.
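
To illustrate (a made-up sketch, not the actual lsm-tree encoding): the only per-pair cost of variable-length entries over a fixed-length layout is the stored length prefixes, and with varint lengths that is only a couple of bytes:

```rust
/// LEB128-style varint; lengths below 128 take a single byte.
fn write_varint(out: &mut Vec<u8>, mut n: u64) {
    loop {
        let byte = (n & 0x7f) as u8;
        n >>= 7;
        if n == 0 {
            out.push(byte);
            return;
        }
        out.push(byte | 0x80);
    }
}

/// Variable-length entry: varint key length + key + varint value length + value.
fn encode_var(key: &[u8], value: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    write_varint(&mut out, key.len() as u64);
    out.extend_from_slice(key);
    write_varint(&mut out, value.len() as u64);
    out.extend_from_slice(value);
    out
}

/// Fixed-length entry: lengths are implied by the schema, nothing extra stored.
fn encode_fixed(key: &[u8; 8], value: &[u8; 32]) -> Vec<u8> {
    let mut out = Vec::with_capacity(8 + 32);
    out.extend_from_slice(key);
    out.extend_from_slice(value);
    out
}

fn main() {
    let key = 42u64.to_be_bytes(); // u64 id
    let value = [0u8; 32];         // e.g. a 32-byte file hash
    let var = encode_var(&key, &value).len();
    let fixed = encode_fixed(&key, &value).len();
    // 42 vs 40 bytes here: 2 bytes of prefix overhead per K-V pair.
    println!("variable: {var} B, fixed: {fixed} B, overhead: {} B", var - fixed);
}
```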

compressing it into the Parquet format.

Parquet is a column-based format with row groups. There is no notion of columns or rows here, so I'm not sure there is an advantage over packed K-V blocks. I have some interest in implementing an alternative block format that is row-group based: the current blocks are KVKVKVKV, but an alternative Parquet-esque format could be KKKKVVVV, which would allow for better compression, depending on the values.
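
Roughly, as a made-up sketch (not the actual block format), the difference between the two layouts looks like this:

```rust
/// Length-prefix an item into the block buffer.
fn write_item(block: &mut Vec<u8>, data: &[u8]) {
    block.extend_from_slice(&(data.len() as u16).to_le_bytes());
    block.extend_from_slice(data);
}

/// KVKVKVKV: each key is immediately followed by its value.
fn build_interleaved(entries: &[(&[u8], &[u8])]) -> Vec<u8> {
    let mut block = Vec::new();
    for (key, value) in entries {
        write_item(&mut block, key);
        write_item(&mut block, value);
    }
    block
}

/// KKKKVVVV: all keys first, then all values (Parquet-esque row group).
fn build_grouped(entries: &[(&[u8], &[u8])]) -> Vec<u8> {
    let mut block = Vec::new();
    for (key, _) in entries {
        write_item(&mut block, key);
    }
    for (_, value) in entries {
        write_item(&mut block, value);
    }
    block
}

fn main() {
    let entries: Vec<(&[u8], &[u8])> = vec![
        (b"user#0001".as_slice(), b"alice".as_slice()),
        (b"user#0002".as_slice(), b"bob".as_slice()),
    ];
    // Both blocks hold the same bytes overall; grouping similar data next to
    // each other is what can help a general-purpose compressor.
    assert_eq!(build_interleaved(&entries).len(), build_grouped(&entries).len());
}
```

Whether the grouped layout actually compresses better depends on the values, as noted above.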

marvin-j97 pinned this issue May 25, 2024
marvin-j97 added the enhancement and api labels May 25, 2024
marvin-j97 unpinned this issue Aug 11, 2024