Commit 051443a, 1 parent 143496d
Showing 2 changed files with 124 additions and 72 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,136 +1,190 @@ | ||
# rutf8-toolkit | ||
|
||
**rutf8-toolkit** is a library designed for serialization and compression of small cyrillic payloads. This toolkit offers a variety of algorithms to encode, compress, and decode text data, supporting multiple encoding and compression methods such as RUTF-8, Huffman, LZSS, LZ77, BWT, and RLE. | ||
A library for serialization and compression of small Cyrillic texts. It includes algorithms for encoding, compression, and decoding, such as RUTF8, Huffman, LZSS, LZ77, BWT, and RLE. | ||
|
||
## Installation | ||
|
||
To install **rutf8-toolkit**, use npm: | ||
|
||
```bash | ||
npm install rutf8-toolkit | ||
``` | ||
|
||
ESM & CommonJS support: | ||
|
||
``` | ||
// ESM | ||
import { rutf8BinaryEncode } from 'rutf8-toolkit'; | ||
// CommonJS | ||
const { rutf8BinaryEncode } = require('rutf8-toolkit'); | ||
``` | ||
|
||
## Features | ||
|
||
- **rutf8:** Store and transmit cyrillic letters as uint8. | ||
- **Huffman:** Apply variable-length encoding to text using the Huffman tree algorithm. | ||
- **LZSS & LZ77:** Utilize sliding window-based algorithms to compress text data. | ||
- **BWT:** Reversibly sort text data to improve compression efficiency with the Burrows-Wheeler Transform. | ||
- **RLE:** Implement Run-Length Encoding, especially effective with repetitive data. | ||
- **Optimal Compression:** Calculate the theoretical optimal compression size for a string based on Shannon entropy. | ||
- **rutf8:** Store and transmit Cyrillic letters in the uint8 format with Unicode support. | ||
- **Huffman:** Variable-length encoding using Huffman trees. | ||
- **LZSS and LZ77:** Text compression using sliding window algorithms. | ||
- **BWT:** Reversible Burrows-Wheeler transformations for improved compression. | ||
- **RLE:** Simple run-length encoding for repeated symbols. | ||
- **Optimal compression:** Calculate the theoretically minimal\* possible string size using Shannon entropy. | ||
|
||
## Combining Algorithms | ||
|
||
To achieve the best results with Cyrillic text, it is recommended to apply these algorithms sequentially. Combining multiple compression and encoding techniques can significantly enhance the overall efficiency of data storage and transmission. For example, preprocessing text with the RUTF8 or BWT before applying RLE or Huffman encoding often yields more effective compression. | ||
For interesting results, it is recommended to use several algorithms in sequence. Combining different compression and encoding methods can significantly improve the efficiency of data storage and transmission. For example, you can pre-process the text using `RUTF8`, then apply `BWT`, followed by `RLE` and `Huffman` encoding. | ||
|
||
## Modules | ||
|
||
### 1) rutf8 | ||
|
||
RUTF-8 is a custom encoding system that maps Russian Unicode characters to predefined ASCII symbols and vice versa. This allows Russian text to be encoded as single-byte ASCII characters, while maintaining full support for all other characters. | ||
A custom encoding method that swaps Cyrillic characters and ASCII characters in the Unicode table. This allows you to represent Russian text with single-byte ASCII characters while preserving full support for all other Unicode symbols. | ||
|
||
#### How does it work: | ||
#### How it works: | ||
|
||
- The algorithm performs a two-way mapping: | ||
- Russian characters are replaced with corresponding ASCII characters when encoding. | ||
- ASCII characters are replaced with their corresponding Russian characters when decoding. | ||
- All symbols are supported, and no characters need to be removed or replaced during encoding/decoding. | ||
- The algorithm performs a bidirectional transformation: | ||
- Russian characters are replaced with corresponding ASCII during encoding. | ||
- ASCII characters are replaced with corresponding Russian characters during decoding. | ||
- All symbols are supported, so nothing is lost during encoding/decoding. | ||
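The bidirectional swap described above can be sketched as follows. This is a minimal illustration and not the library's implementation: the `PAIRS` table is hypothetical and holds only the seven pairs visible in this README's `'Карл-Франц'` to `'Larm-Vraox'` example, and `swapChars` is an illustrative name.

```javascript
// Hypothetical partial table: only the pairs seen in the README example.
const PAIRS = [
  ['К', 'L'], ['Ф', 'V'],
  ['а', 'a'], ['р', 'r'], ['л', 'm'], ['н', 'o'], ['ц', 'x'],
];

// Register each pair in both directions so the mapping is a true swap.
const swapTable = new Map();
for (const [cyrillic, ascii] of PAIRS) {
  swapTable.set(cyrillic, ascii);
  swapTable.set(ascii, cyrillic);
}

// Characters outside the table pass through unchanged, so nothing is lost.
function swapChars(text) {
  return Array.from(text, (ch) => swapTable.get(ch) ?? ch).join('');
}

const encoded = swapChars('Карл-Франц'); // 'Larm-Vraox'
const decoded = swapChars(encoded);      // 'Карл-Франц'
console.log(encoded, decoded);
```

Because every pair is stored in both directions, the same function serves as encoder and decoder in this sketch.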
|
||
#### Functions: | ||
|
||
- `rutf8Encoder`: Encodes Russian characters to ASCII. | ||
- `rutf8Decoder`: Decodes ASCII back to Russian characters. | ||
- `binaryEncoder`: Encodes binary data using RUTF-8. | ||
- `binaryDecoder`: Decodes binary data encoded with RUTF-8. | ||
- `rutf8Encoder`: Encodes Russian characters into ASCII. | ||
- `rutf8Decoder`: Decodes ASCII back into Russian characters. | ||
- `rutf8BinaryEncode`: Encodes binary data using `RUTF-8`. | ||
- `rutf8BinaryDecode`: Decodes binary data encoded with `RUTF-8`. | ||
|
||
#### Examples: | ||
|
||
``` | ||
const string = 'Карл-Франц' | ||
const rutfEncoded = rutf8Encoder(string) // 'Larm-Vraox' | ||
``` | ||
|
||
``` | ||
const string = 'Карл-Франц' | ||
const rutfBinaryEncoded = rutf8BinaryEncode(string) // ArrayBuffer | ||
``` | ||
|
||
### 2) Huffman | ||
|
||
The Huffman module applies variable-length encoding to text content using the Huffman tree algorithm. This technique helps compress data by assigning shorter codes to more frequent characters. | ||
The Huffman coding module uses the Huffman tree algorithm for variable-length encoding, allowing data compression by assigning shorter codes to frequent symbols. | ||
|
||
#### How does it work: | ||
#### How it works: | ||
|
||
- Analyzes character frequencies to determine optimal encodings. | ||
- Creates efficient variable-length codes for each character using a Huffman Tree. | ||
- Encodes input text with these variable-length codes. | ||
- Packs the Huffman tree and encoded data into a binary buffer for storage or transmission. | ||
- Decodes text without needing an external schema (schemaless decoding). | ||
- Analyzes the frequency of symbols. | ||
- Creates variable-length codes for each symbol. | ||
- Encodes the text using these codes and packs the Huffman tree together with the encoded data into a binary buffer. | ||
- Decodes the text without the need for additional schema (schema-less decoding). | ||
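The first steps above (frequency analysis, code construction, encoding) can be sketched in plain JavaScript. This is an illustrative sketch under simple assumptions, not the toolkit's code: `buildHuffmanCodes` and `huffmanDecodeBits` are hypothetical names, codes are kept as strings of `'0'`/`'1'` for readability, and the binary packing of the tree is omitted.

```javascript
// Build a symbol -> bit-string table from symbol frequencies.
function buildHuffmanCodes(text) {
  const freq = new Map();
  for (const ch of text) freq.set(ch, (freq.get(ch) || 0) + 1);

  const nodes = [...freq].map(([symbol, weight]) => ({ symbol, weight }));
  if (nodes.length === 0) return new Map();
  if (nodes.length === 1) return new Map([[nodes[0].symbol, '0']]);

  // Repeatedly merge the two lightest nodes until one root remains.
  while (nodes.length > 1) {
    nodes.sort((a, b) => a.weight - b.weight);
    const [left, right] = nodes.splice(0, 2);
    nodes.push({ left, right, weight: left.weight + right.weight });
  }

  // Walk the tree: left edge appends '0', right edge appends '1'.
  const codes = new Map();
  const walk = (node, prefix) => {
    if (node.symbol !== undefined) {
      codes.set(node.symbol, prefix);
      return;
    }
    walk(node.left, prefix + '0');
    walk(node.right, prefix + '1');
  };
  walk(nodes[0], '');
  return codes;
}

// Decode by accumulating bits until they match a code (codes are prefix-free).
function huffmanDecodeBits(bits, codes) {
  const symbolByCode = new Map([...codes].map(([symbol, code]) => [code, symbol]));
  let current = '';
  let out = '';
  for (const bit of bits) {
    current += bit;
    if (symbolByCode.has(current)) {
      out += symbolByCode.get(current);
      current = '';
    }
  }
  return out;
}

const sample = 'aaaabbc';
const codes = buildHuffmanCodes(sample);
const bits = Array.from(sample, (ch) => codes.get(ch)).join('');
console.log(codes, bits.length, huffmanDecodeBits(bits, codes));
```

In this sample the most frequent symbol `a` receives a 1-bit code while `b` and `c` receive 2-bit codes, so the seven characters fit in 10 bits.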
|
||
#### Functions: | ||
|
||
- `binaryEncoder`: Encodes binary data using Huffman encoding. | ||
- `binaryDecoder`: Decodes binary data using Huffman decoding. | ||
- `createHuffmanTree`: Generates a Huffman tree from text input (lets you inspect the literal tree). | ||
- `huffmanBinaryEncode`: Encodes binary data using Huffman. | ||
- `huffmanBinaryDecode`: Decodes binary data using Huffman. | ||
- `createHuffmanTree`: Creates a Huffman tree from text input (can be visualized). | ||
|
||
#### Binary Encoding | ||
#### Binary Encoding Schema | ||
|
||
![Huffman Binary Schema](https://i.imgur.com/XtOWnG0.jpeg) | ||
|
||
### 3) LZSS | ||
|
||
LZSS is an optimized version of the LZ77 algorithm, offering text data compression using a sliding window technique. | ||
`LZSS` is an optimized version of `LZ77` that compresses text data using a sliding window. | ||
|
||
#### How does it work: | ||
#### How it works: | ||
|
||
- `Lookahead Buffer` holds the upcoming characters that the algorithm will attempt to match against the search buffer. It allows the algorithm to anticipate and find repeating patterns in the data. | ||
- `Search Buffer` contains previously processed characters and is used to find matches for the data in the lookahead buffer. | ||
- LZSS encodes repeated patterns with tuples [offset, length] | ||
- Encoding happens only if repeated pattern length > 2 | ||
- In other cases chars are coded as uint8 (Ascii) | ||
- If a match is found at the end of the Search Buffer, the algorithm will test the matched pattern against the remaining Lookahead Buffer and apply Run-Length Encoding (RLE) if possible. | ||
- `binaryEncoder(input, options)` takes an options object as its second parameter. | ||
- The options set the lengths of the Search Buffer and the Lookahead Buffer. | ||
- default values: { searchBufferLength = 255, lookaheadLength = 15 } | ||
- max values: { searchBufferLength = 4095, lookaheadLength = 15 } | ||
- `Lookahead Buffer` contains the symbols that the algorithm tries to match against the `Search Buffer`. This helps the algorithm find repeating patterns. | ||
- `Search Buffer` contains already processed symbols and is used to find matches for the `Lookahead Buffer`. | ||
- LZSS encodes repeating patterns as `[offset, length]` tuples: | ||
- Encoding happens only if the repeating pattern is longer than 2 characters. | ||
- Otherwise, characters are encoded as uint8 (ASCII). | ||
- If a match is found at the end of the Search Buffer, the algorithm checks whether RLE can be applied. | ||
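The rules above can be illustrated with a small tokenizer. This is a sketch under stated assumptions rather than the library's encoder: `lzssTokens` and `lzssRestore` are hypothetical names, `offset` is taken to mean the distance back from the current position, and the defaults of 255 and 15 mirror the buffer lengths quoted in the previous version of this README. Matches may run past the end of the search buffer into the lookahead, which gives the RLE-like behaviour for periodic input.

```javascript
// Produce literals and [offset, length] tuples (matches longer than 2 only).
function lzssTokens(text, searchLength = 255, lookaheadLength = 15) {
  const chars = Array.from(text);
  const tokens = [];
  let pos = 0;
  while (pos < chars.length) {
    let bestLength = 0;
    let bestOffset = 0;
    // Try every start position inside the search buffer.
    for (let cand = Math.max(0, pos - searchLength); cand < pos; cand++) {
      let length = 0;
      while (
        length < lookaheadLength &&
        pos + length < chars.length &&
        chars[cand + length] === chars[pos + length]
      ) {
        length++;
      }
      if (length > bestLength) {
        bestLength = length;
        bestOffset = pos - cand;
      }
    }
    if (bestLength > 2) {
      tokens.push([bestOffset, bestLength]);
      pos += bestLength;
    } else {
      tokens.push(chars[pos]); // short matches stay literal
      pos += 1;
    }
  }
  return tokens;
}

// Rebuild the text by copying `length` symbols from `offset` positions back.
function lzssRestore(tokens) {
  const out = [];
  for (const token of tokens) {
    if (Array.isArray(token)) {
      const [offset, length] = token;
      for (let i = 0; i < length; i++) out.push(out[out.length - offset]);
    } else {
      out.push(token);
    }
  }
  return out.join('');
}

const lzssSample = lzssTokens('abcabcabcabc'); // ['a', 'b', 'c', [3, 9]]
console.log(lzssSample, lzssRestore(lzssSample));
```

Copying one symbol at a time in `lzssRestore` is what makes overlapping matches such as `[3, 9]` decodable.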
|
||
#### Functions: | ||
|
||
- `binaryEncoder`: Encodes text data using LZSS compression. | ||
- `binaryDecoder`: Decodes text data compressed with LZSS. | ||
- `encoder`: Encodes text data using LZSS compression (lets you inspect the literal encoding). | ||
- `lzssBinaryEncode`: Encodes data using LZSS. | ||
- `lzssBinaryDecode`: Decodes data compressed with LZSS. | ||
- `lzssEncode`: Allows you to see the LZSS encoding schema. | ||
|
||
#### Examples: | ||
|
||
``` | ||
const string = | ||
"Император Карл-Франц обычно одет в полный доспех. Император Карл-Франц. обычно одет. в полный доспех."; | ||
#### Binary Encoding | ||
const lzssEncoded = lzssEncode(string) | ||
lzssEncoded.length // 63 | ||
lzssEncoded.schema // [ 0, 0, 0, 0, 0, 0, 53, 0 ] | ||
lzssEncoded.data // [ 'И','м','п','е','р','а','т','о','р',' ','К','а','р','л','-','Ф','р','а','н','ц',' ','о','б','ы','ч','н','о',' ','о','д','е','т',' ','в',' ','п','о','л','н','ы','й',' ','д','о','с','п','е','х','.',' ',[ 50, 14 ],[ 50, 6 ],'.',[ 51, 12 ],'.',[ 52, 14 ],'е','х','.' ], | ||
``` | ||
|
||
#### Binary Encoding Schema | ||
|
||
![LZSS Schema](https://i.imgur.com/aqZbYui.jpeg) | ||
|
||
### 4) LZ77 | ||
|
||
LZ77 is one of the fundamental algorithms in text compression, invented by Abraham Lempel and Jacob Ziv in 1977. It uses a sliding window to find repeating sequences. | ||
LZ77 is one of the foundational text compression algorithms proposed by Lempel and Ziv in 1977. This algorithm also uses a sliding window to find repeating sequences of characters. | ||
|
||
- LZ77 encodes symbols and repeated patterns with tuples [offset, length, next unmatched char] | ||
- If a match is found at the end of the Search Buffer, the algorithm will test the matched pattern against the remaining Lookahead Buffer and apply Run-Length Encoding (RLE) if possible. | ||
- `binaryEncoder(input, options)` takes an options object as its second parameter. | ||
- The options set the lengths of the Search Buffer and the Lookahead Buffer. | ||
- default values: { searchBufferLength = 255, lookaheadLength = 15 } | ||
- max values: { searchBufferLength = 4095, lookaheadLength = 15 } | ||
- LZ77 encodes symbols and repeating patterns using `[offset, length, next symbol]` tuples. | ||
- If a match is found at the end of the `Search Buffer`, the algorithm checks and applies `RLE` if possible. | ||
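A compact sketch of the triple format may help here. It is an illustration only: `lz77Triples` and `lz77Restore` are hypothetical names, `offset` is assumed to be the distance back from the current position, and this sketch always reserves one real character for the `next symbol` slot, so its output will not match the library's output token for token.

```javascript
// Emit [offset, length, nextSymbol] triples for the whole input.
function lz77Triples(text, searchLength = 255, lookaheadLength = 15) {
  const chars = Array.from(text);
  const triples = [];
  let pos = 0;
  while (pos < chars.length) {
    let bestLength = 0;
    let bestOffset = 0;
    for (let cand = Math.max(0, pos - searchLength); cand < pos; cand++) {
      let length = 0;
      // Stop one symbol early so a "next symbol" always exists.
      while (
        length < lookaheadLength &&
        pos + length < chars.length - 1 &&
        chars[cand + length] === chars[pos + length]
      ) {
        length++;
      }
      if (length > bestLength) {
        bestLength = length;
        bestOffset = pos - cand;
      }
    }
    triples.push([bestOffset, bestLength, chars[pos + bestLength]]);
    pos += bestLength + 1;
  }
  return triples;
}

// Copy the matched run, then append the explicit next symbol.
function lz77Restore(triples) {
  const out = [];
  for (const [offset, length, next] of triples) {
    for (let i = 0; i < length; i++) out.push(out[out.length - offset]);
    out.push(next);
  }
  return out.join('');
}

const lz77Sample = lz77Triples('abababa'); // [[0,0,'a'], [0,0,'b'], [2,4,'a']]
console.log(lz77Sample, lz77Restore(lz77Sample));
```

Unlike the LZSS variant, every token here carries a literal, so even unmatched symbols cost a full triple.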
|
||
#### Functions: | ||
|
||
- `binaryEncoder`: Encodes text data using LZ77 compression. | ||
- `binaryDecoder`: Decodes text data compressed with LZ77. | ||
- `encoder`: Encodes text data using LZ77 compression (lets you inspect the literal encoding). | ||
- `lz77BinaryEncode`: Encodes data using LZ77. | ||
- `lz77BinaryDecode`: Decodes data compressed with LZ77. | ||
- `lz77Encode`: Allows you to see the LZ77 encoding schema. | ||
|
||
#### Examples: | ||
|
||
``` | ||
const string = | ||
"Император Карл-Франц обычно одет в полный доспех. Император Карл-Франц. обычно одет. в полный доспех."; | ||
#### Binary Encoding | ||
const lz77Encoded = lz77Encode(string) | ||
lz77Encoded // [ [ 0, 0, 'И' ],[ 0, 0, 'м' ],[ 0, 0, 'п' ],[ 0, 0, 'е' ],[ 0, 0, 'р' ],[ 0, 0, 'а' ],[ 0, 0, 'т' ],[ 0, 0, 'о' ],[ 4, 1, ' ' ],[ 0, 0, 'К' ],[ 6, 1, 'р' ],[ 0, 0, 'л' ],[ 0, 0, '-' ],[ 0, 0, 'Ф' ],[ 12, 2, 'н' ],[ 0, 0, 'ц' ],[ 11, 1, 'о' ],[ 0, 0, 'б' ],[ 0, 0, 'ы' ],[ 0, 0, 'ч' ],[ 7, 1, 'о' ],[ 7, 2, 'д' ],[ 27, 1, 'т' ],[ 5, 1, 'в' ],[ 2, 1, 'п' ],[ 8, 1, 'л' ],[ 13, 1, 'ы' ],[ 0, 0, 'й' ],[ 7, 1, 'д' ],[ 7, 1, 'с' ],[ 43, 2, 'х' ],[ 0, 0, '.' ],[ 8, 1, 'И' ],[ 50, 15, 'р' ],[ 50, 3, '.' ],[ 51, 12, '.' ],[ 52, 15, 'х' ],[ 17, 1, '\u0000' ] ] | ||
``` | ||
|
||
#### Binary Encoding Schema | ||
|
||
![LZ77 Schema](https://i.imgur.com/j25RpK8.jpeg) | ||
|
||
### 5) BWT | ||
|
||
The Burrows-Wheeler Transform (BWT) is a data preprocessing algorithm that rearranges text data in a reversible way, making it more suitable for compression. | ||
The Burrows-Wheeler Transform (BWT) is a reversible data pre-processing algorithm that rearranges the symbols of a text to improve subsequent compression. One advantage of this method is the simplicity of restoring the original text—just one number needs to be saved. | ||
|
||
#### Functions: | ||
|
||
- `bwtEncode`: Encodes text data using BWT. | ||
- `bwtDecode`: Decodes BWT-encoded data back to the original form. | ||
- `bwtEncode`: Encodes text using BWT. | ||
- `bwtDecode`: Restores the original text from BWT-encoded data. | ||
|
||
#### Examples: | ||
|
||
``` | ||
const string = 'banana' | ||
const bwtEncoded = bwtEncode(string) // { bwt: 'annb$aa', index: 4 } | ||
const bwtDecoded = bwtDecode(bwtEncoded.bwt, bwtEncoded.index) // 'banana' | ||
``` | ||
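To show where the `bwt` string and `index` in the example come from, here is a naive textbook sketch of the transform. It is not the toolkit's implementation: `bwtForward` and `bwtInverse` are hypothetical names, the `$` sentinel is assumed not to occur in the input, and the quadratic rotation table is only suitable for short strings.

```javascript
// Sort all rotations of text + sentinel and keep the last column.
function bwtForward(text, sentinel = '$') {
  const source = text + sentinel;
  const rotations = [];
  for (let i = 0; i < source.length; i++) {
    rotations.push(source.slice(i) + source.slice(0, i));
  }
  rotations.sort(); // default sort compares UTF-16 code units
  return {
    bwt: rotations.map((row) => row[row.length - 1]).join(''),
    index: rotations.indexOf(source), // row holding the original string
  };
}

// Rebuild the sorted rotation table column by column, then pick row `index`.
function bwtInverse(bwt, index) {
  let table = new Array(bwt.length).fill('');
  for (let step = 0; step < bwt.length; step++) {
    table = table.map((row, i) => bwt[i] + row).sort();
  }
  return table[index].slice(0, -1); // drop the sentinel
}

const transformed = bwtForward('banana'); // { bwt: 'annb$aa', index: 4 }
const restored = bwtInverse(transformed.bwt, transformed.index); // 'banana'
console.log(transformed, restored);
```

The saved `index` is the single number needed to select the original row from the rebuilt table.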
|
||
### 6) RLE (Run-Length Encoding) | ||
|
||
Run-Length Encoding (RLE) is a simple and effective compression algorithm, especially useful for data with long sequences of repeated characters. | ||
Run-Length Encoding (RLE) is a simple compression technique that replaces sequences of the same character with a single character and a number indicating the length of the sequence. | ||
|
||
#### Functions: | ||
|
||
- `rleEncoder`: Encodes data using RLE compression. | ||
- `rleDecoder`: Decodes data compressed with RLE. | ||
- `rleEncode`: Compresses text using RLE. | ||
- `rleDecode`: Restores text compressed with RLE. | ||
|
||
### 7) Miscellaneous | ||
#### Examples: | ||
|
||
- `calculateOptimalBytesCompression`: This function calculates the theoretically optimal compression size for a given string based on Shannon entropy. Note: This metric is not definitive and can often be surpassed by other methods. | ||
``` | ||
const string = 'aaab4bbbbbcc' | ||
const rleEncoded = rleEncode(string) // "a3b4b5c2" | ||
``` | ||
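The idea behind run-length encoding can also be sketched independently of the library. The version below is an assumption-laden illustration: `runLengthPairs` and `expandPairs` are hypothetical names, and runs are kept as `[symbol, count]` pairs instead of a flat string so that digits in the input cannot be confused with counts.

```javascript
// Collapse consecutive equal symbols into [symbol, count] pairs.
function runLengthPairs(text) {
  const pairs = [];
  for (const ch of text) {
    const last = pairs[pairs.length - 1];
    if (last && last[0] === ch) {
      last[1] += 1; // extend the current run
    } else {
      pairs.push([ch, 1]); // start a new run
    }
  }
  return pairs;
}

// Expand the pairs back into the original text.
function expandPairs(pairs) {
  return pairs.map(([ch, count]) => ch.repeat(count)).join('');
}

const pairs = runLengthPairs('aaabbbbbcc'); // [['a', 3], ['b', 5], ['c', 2]]
console.log(pairs, expandPairs(pairs));
```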
|
||
## Installation | ||
### 7) Miscellaneous | ||
|
||
To install the **rutf8-toolkit** library, use npm: | ||
- `getByteLength`: This function returns the size of a string in bytes. | ||
|
||
```bash | ||
npm install rutf8-toolkit | ||
``` | ||
- `calculateOptimalBytesCompression`: This function calculates the theoretically possible minimal number of bytes for compressing a string based on Shannon entropy. Note that this estimate may not match the actual result from other compression methods. | ||
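Both helpers can be approximated in a few lines. The sketch below rests on assumptions and is not the library's code: `utf8ByteLength` and `entropyLowerBoundBytes` are hypothetical names, the byte length is assumed to mean the UTF-8 size, and the entropy estimate uses a zero-order model (independent symbol frequencies), which is why methods that exploit repeated patterns can produce a smaller result.

```javascript
// UTF-8 size of a string, using the built-in TextEncoder.
function utf8ByteLength(text) {
  return new TextEncoder().encode(text).length;
}

// Zero-order Shannon bound: sum over symbols of count * log2(total / count),
// converted from bits to whole bytes.
function entropyLowerBoundBytes(text) {
  const counts = new Map();
  let total = 0;
  for (const ch of text) {
    counts.set(ch, (counts.get(ch) || 0) + 1);
    total += 1;
  }
  let bits = 0;
  for (const count of counts.values()) {
    bits += count * Math.log2(total / count);
  }
  return Math.ceil(bits / 8);
}

console.log(utf8ByteLength('Карл'));          // 8: two bytes per Cyrillic letter
console.log(entropyLowerBoundBytes('aabb'));  // 1: four symbols at 1 bit each
console.log(entropyLowerBoundBytes('Император Карл-Франц'));
```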