chore: README files updated
petrlipatov committed Sep 4, 2024
1 parent 143496d commit b992935
Showing 2 changed files with 135 additions and 83 deletions.
26 changes: 12 additions & 14 deletions README-RU.md
@@ -49,8 +49,8 @@ const { rutf8BinaryEncode } = require('rutf8-toolkit');

- `rutf8Encoder`: Encodes Russian characters into ASCII.
- `rutf8Decoder`: Decodes ASCII back into Russian characters.
- `rutf8BinaryEncode`: Encodes binary data using `RUTF-8`.
- `rutf8BinaryDecode`: Decodes binary data encoded with `RUTF-8`.
- `rutf8BinaryEncode`: Encodes into a binary buffer of `RUTF-8` data.
- `rutf8BinaryDecode`: Decodes from a binary buffer of `RUTF-8` data.

#### Examples:

@@ -77,9 +77,9 @@ const rutfBinaryEncoded = binaryEncoder(string) // ArrayBuffer

#### Main functions:

- `huffmanBinaryEncode`: Encodes binary data using Huffman coding.
- `huffmanBinaryDecode`: Decodes binary data using Huffman coding.
- `createHuffmanTree`: Creates a Huffman tree from text input (can be visualized).
- `huffmanBinaryEncode`: Encodes Huffman-processed data into a binary buffer.
- `huffmanBinaryDecode`: Decodes Huffman-processed data from a binary buffer.
- `createHuffmanTree`: Shows the literal Huffman coding tree.

#### Binary Encoding Schema

@@ -100,9 +100,9 @@ const rutfBinaryEncoded = binaryEncoder(string) // ArrayBuffer

#### Main functions:

- `lzssBinaryEncode`: Encodes data using LZSS.
- `lzssBinaryDecode`: Decodes data compressed with LZSS.
- `lzssEncode`: Shows the LZSS encoding schema.
- `lzssBinaryEncode`: Encodes LZSS-processed data into a binary buffer.
- `lzssBinaryDecode`: Decodes LZSS-compressed data from a binary buffer.
- `lzssEncode`: Shows the literal LZSS encoding schema.

#### Examples:

@@ -130,9 +130,9 @@ lzssEncoded.data // [ 'И','м','п','е','р','а','т','о','р',' ','К','а'

#### Main functions:

- `lz77BinaryEncode`: Encodes data using LZ77.
- `lz77BinaryDecode`: Decodes data compressed with LZ77.
- `lz77Encode`: Shows the LZSS encoding schema.
- `lz77BinaryEncode`: Encodes data into a binary buffer using LZ77.
- `lz77BinaryDecode`: Decodes LZ77-compressed data from a binary buffer.
- `lz77Encode`: Shows the literal LZ77 encoding schema.

#### Examples:

@@ -142,9 +142,7 @@ const string =
const lz77Encoded = lz77Encode(string)
lz77Encoded.length // 63
lz77Encoded.schema // [ 0, 0, 0, 0, 0, 0, 53, 0 ]
lz77Encoded.data // [ [ 0, 0, 'И' ],[ 0, 0, 'м' ],[ 0, 0, 'п' ],[ 0, 0, 'е' ],[ 0, 0, 'р' ],[ 0, 0, 'а' ],[ 0, 0, 'т' ],[ 0, 0, 'о' ],[ 4, 1, ' ' ],[ 0, 0, 'К' ],[ 6, 1, 'р' ],[ 0, 0, 'л' ],[ 0, 0, '-' ],[ 0, 0, 'Ф' ],[ 12, 2, 'н' ],[ 0, 0, 'ц' ],[ 11, 1, 'о' ],[ 0, 0, 'б' ],[ 0, 0, 'ы' ],[ 0, 0, 'ч' ],[ 7, 1, 'о' ],[ 7, 2, 'д' ],[ 27, 1, 'т' ],[ 5, 1, 'в' ],[ 2, 1, 'п' ],[ 8, 1, 'л' ],[ 13, 1, 'ы' ],[ 0, 0, 'й' ],[ 7, 1, 'д' ],[ 7, 1, 'с' ],[ 43, 2, 'х' ],[ 0, 0, '.' ],[ 8, 1, 'И' ],[ 50, 15, 'р' ],[ 50, 3, '.' ],[ 51, 12, '.' ],[ 52, 15, 'х' ],[ 17, 1, '\u0000' ] ]
lz77Encoded // [ [ 0, 0, 'И' ],[ 0, 0, 'м' ],[ 0, 0, 'п' ],[ 0, 0, 'е' ],[ 0, 0, 'р' ],[ 0, 0, 'а' ],[ 0, 0, 'т' ],[ 0, 0, 'о' ],[ 4, 1, ' ' ],[ 0, 0, 'К' ],[ 6, 1, 'р' ],[ 0, 0, 'л' ],[ 0, 0, '-' ],[ 0, 0, 'Ф' ],[ 12, 2, 'н' ],[ 0, 0, 'ц' ],[ 11, 1, 'о' ],[ 0, 0, 'б' ],[ 0, 0, 'ы' ],[ 0, 0, 'ч' ],[ 7, 1, 'о' ],[ 7, 2, 'д' ],[ 27, 1, 'т' ],[ 5, 1, 'в' ],[ 2, 1, 'п' ],[ 8, 1, 'л' ],[ 13, 1, 'ы' ],[ 0, 0, 'й' ],[ 7, 1, 'д' ],[ 7, 1, 'с' ],[ 43, 2, 'х' ],[ 0, 0, '.' ],[ 8, 1, 'И' ],[ 50, 15, 'р' ],[ 50, 3, '.' ],[ 51, 12, '.' ],[ 52, 15, 'х' ],[ 17, 1, '\u0000' ] ]
```

#### Binary Encoding Schema
192 changes: 123 additions & 69 deletions README.md
@@ -1,136 +1,190 @@
# rutf8-toolkit

**rutf8-toolkit** is a library designed for serialization and compression of small cyrillic payloads. This toolkit offers a variety of algorithms to encode, compress, and decode text data, supporting multiple encoding and compression methods such as RUTF-8, Huffman, LZSS, LZ77, BWT, and RLE.
A library for serialization and compression of small Cyrillic texts. It includes algorithms for encoding, compression, and decoding, such as RUTF8, Huffman, LZSS, LZ77, BWT, and RLE.

## Installation

To install **rutf8-toolkit**, use npm:

```bash
npm install rutf8-toolkit
```

ESM & CommonJS support:

```
import { rutf8BinaryEncode } from 'rutf8-toolkit';
const { rutf8BinaryEncode } = require('rutf8-toolkit');
```

## Features

- **rutf8:** Store and transmit cyrillic letters as uint8.
- **Huffman:** Apply variable-length encoding to text using the Huffman tree algorithm.
- **LZSS & LZ77:** Utilize sliding window-based algorithms to compress text data.
- **BWT:** Reversibly sort text data to improve compression efficiency with the Burrows-Wheeler Transform.
- **RLE:** Implement Run-Length Encoding, especially effective with repetitive data.
- **Optimal Compression:** Calculate the theoretical optimal compression size for a string based on Shannon entropy.
- **rutf8:** Store and transmit Cyrillic letters in the uint8 format with Unicode support.
- **Huffman:** Variable-length encoding using Huffman trees.
- **LZSS and LZ77:** Text compression using sliding window algorithms.
- **BWT:** Reversible Burrows-Wheeler transformations for improved compression.
- **RLE:** Simple run-length encoding for repeated symbols.
- **Optimal compression:** Calculate the theoretical minimum\* string size using Shannon entropy.

## Combining Algorithms

To achieve the best results with Cyrillic text, it is recommended to apply these algorithms sequentially. Combining multiple compression and encoding techniques can significantly enhance the overall efficiency of data storage and transmission. For example, preprocessing text with the RUTF8 or BWT before applying RLE or Huffman encoding often yields more effective compression.
For interesting results, it is recommended to use several algorithms in sequence. Combining different compression and encoding methods can significantly improve the efficiency of data storage and transmission. For example, you can pre-process the text using `RUTF8`, then apply `BWT`, followed by `RLE` and `Huffman` encoding.

## Modules

### 1) rutf8

RUTF-8 is a custom encoding system that maps Russian Unicode characters to predefined ASCII symbols and vice versa. This allows Russian text to be encoded as single-byte ASCII characters, while maintaining full support for all other characters.
A custom encoding method that swaps Cyrillic characters and ASCII characters in the Unicode table. This allows you to represent Russian text with single-byte ASCII characters while preserving full support for all other Unicode symbols.

#### How does it work:
#### How it works:

- The algorithm performs a two-way mapping:
- Russian characters are replaced with corresponding ASCII characters when encoding.
- ASCII characters are replaced with their corresponding Russian characters when decoding.
- All symbols are supported, and no characters need to be removed or replaced during encoding/decoding.
- The algorithm performs a bidirectional transformation:
- Russian characters are replaced with corresponding ASCII during encoding.
- ASCII characters are replaced with corresponding Russian characters during decoding.
- All symbols are supported, so nothing is lost during encoding/decoding.
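
The bidirectional swap described above can be sketched as follows. This is only an illustration: the `PAIRS` table is hypothetical, inferred from the `'Карл-Франц'` → `'Larm-Vraox'` example in this README, and `swapChars` is not a function of the library, whose full table is not shown here.

```javascript
// Hypothetical swap pairs inferred from the README example;
// the library's real table is not shown in this document.
const PAIRS = [
  ['К', 'L'], ['а', 'a'], ['р', 'r'], ['л', 'm'],
  ['Ф', 'V'], ['н', 'o'], ['ц', 'x'],
];

// Build one map that sends each member of a pair to its partner.
const SWAP = new Map();
for (const [cyrillic, ascii] of PAIRS) {
  SWAP.set(cyrillic, ascii);
  SWAP.set(ascii, cyrillic);
}

// Because the mapping is a swap, the same function encodes and decodes.
// Characters outside the table pass through unchanged.
function swapChars(text) {
  return Array.from(text, (ch) => SWAP.get(ch) ?? ch).join('');
}

const encoded = swapChars('Карл-Франц'); // 'Larm-Vraox'
const decoded = swapChars(encoded);      // 'Карл-Франц'
```

Since every pair is swapped rather than mapped one way, an ASCII letter in the input ends up as its Cyrillic partner, which is why no symbol has to be dropped.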

#### Functions:

- `rutf8Encoder`: Encodes Russian characters to ASCII.
- `rutf8Decoder`: Decodes ASCII back to Russian characters.
- `binaryEncoder`: Encodes binary data using RUTF-8.
- `binaryDecoder`: Decodes binary data encoded with RUTF-8.
- `rutf8Encoder`: Encodes Russian characters into ASCII.
- `rutf8Decoder`: Decodes ASCII back into Russian characters.
- `rutf8BinaryEncode`: Encodes data with `RUTF-8` coding and stores it in a binary buffer.
- `rutf8BinaryDecode`: Decodes data from a binary buffer that was encoded with `RUTF-8` coding.

#### Examples:

```
const string = 'Карл-Франц'
const rutfEncoded = rutf8Encoder(string) // 'Larm-Vraox'
```

```
const string = 'Карл-Франц'
const rutfBinaryEncoded = binaryEncoder(string) // ArrayBuffer
```

### 2) Huffman

The Huffman module applies variable-length encoding to text content using the Huffman tree algorithm. This technique helps compress data by assigning shorter codes to more frequent characters.
The Huffman coding module uses the Huffman tree algorithm for variable-length encoding, allowing data compression by assigning shorter codes to frequent symbols.

#### How does it work:
#### How it works:

- Analyzes character frequencies to determine optimal encodings.
- Creates efficient variable-length codes for each character using a Huffman Tree.
- Encodes input text with these variable-length codes.
- Packs the Huffman tree and encoded data into a binary buffer for storage or transmission.
- Decodes text without needing an external schema (schemaless decoding).
- Analyzes the frequency of symbols.
- Creates variable-length codes for each symbol.
- Encodes the text using these codes and packs the Huffman tree together with the encoded data into a binary buffer.
- Decodes the text without the need for additional schema (schema-less decoding).
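
The first three steps above (frequency analysis, code construction, encoding) can be sketched as below. This is a minimal illustration, not the library's implementation: the helper names are hypothetical, the output is a string of `'0'`/`'1'` characters rather than a packed binary buffer, and the tree is not serialized.

```javascript
// Count how often each symbol occurs.
function countFrequencies(text) {
  const freq = new Map();
  for (const ch of text) freq.set(ch, (freq.get(ch) || 0) + 1);
  return freq;
}

// Build the tree by repeatedly merging the two least frequent nodes.
function buildTree(freq) {
  let nodes = [...freq].map(([symbol, weight]) => ({ symbol, weight }));
  while (nodes.length > 1) {
    nodes.sort((a, b) => a.weight - b.weight);
    const [left, right, ...rest] = nodes;
    nodes = [...rest, { weight: left.weight + right.weight, left, right }];
  }
  return nodes[0];
}

// Walk the tree: a left edge appends '0', a right edge appends '1'.
function buildCodes(node, prefix = '', codes = {}) {
  if (!node) return codes;
  if (node.symbol !== undefined) {
    codes[node.symbol] = prefix || '0'; // single-symbol input still needs one bit
    return codes;
  }
  buildCodes(node.left, prefix + '0', codes);
  buildCodes(node.right, prefix + '1', codes);
  return codes;
}

// Replace every symbol with its variable-length code.
function huffmanBits(text) {
  const codes = buildCodes(buildTree(countFrequencies(text)));
  return Array.from(text, (ch) => codes[ch]).join('');
}

const bits = huffmanBits('aaaabbc'); // 10 bits instead of 7 * 8
```

The most frequent symbol (`a`) receives a one-bit code, while the rarer `b` and `c` receive two bits each.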

#### Functions:

- `binaryEncoder`: Encodes binary data using Huffman encoding.
- `binaryDecoder`: Decodes binary data using Huffman decoding.
- `createHuffmanTree`: Generates a Huffman tree based on text input (allows to see literal tree).
- `huffmanBinaryEncode`: Encodes data with Huffman coding and stores it in a binary buffer.
- `huffmanBinaryDecode`: Decodes data from a binary buffer that was encoded with Huffman coding.
- `createHuffmanTree`: Creates a Huffman tree from text input (can be visualized).

#### Binary Encoding
#### Binary Encoding Schema

![Huffman Binary Schema](https://i.imgur.com/XtOWnG0.jpeg)

### 3) LZSS

LZSS is an optimized version of the LZ77 algorithm, offering text data compression using a sliding window technique.
`LZSS` is an optimized version of `LZ77` that compresses text data using a sliding window.

#### How does it work:
#### How it works:

- `Lookahead Buffer` holds the upcoming characters that the algorithm will attempt to match against the search buffer. It allows the algorithm to anticipate and find repeating patterns in the data.
- `Search Buffer` contains previously processed characters and is used to find matches for the data in the lookahead buffer.
- LZSS encodes repeated patterns with tuples [offset, length]
- Encoding happens only if repeated pattern length > 2
- In other cases chars are coded as uint8 (Ascii)
- If a match is found at the end of the Search Buffer, the algorithm will test the matched pattern against the remaining Lookahead Buffer and apply Run-Length Encoding (RLE) if possible.
- `binaryEncoder(input, options)` takes an options object as its second parameter
- the options set the lengths of the Search Buffer and the Lookahead Buffer
- default values: { searchBufferLength = 255, lookaheadLength = 15 }
- max values: { searchBufferLength = 4095, lookaheadLength = 15 }
- `Lookahead Buffer`: contains the symbols that the algorithm will try to match against the `Search Buffer`. This helps the algorithm find repeating patterns.
- `Search Buffer`: contains already processed symbols and is used to find matches for the `Lookahead Buffer`.
- LZSS encodes repeating patterns as `[offset, length]` tuples:
- Encoding happens only if the repeating pattern length > 2 characters.
- Otherwise, characters are encoded as uint8 (ASCII).
- If a match is found at the end of the Search Buffer, the algorithm checks if RLE can be applied.
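
A minimal sketch of the scheme above, under stated assumptions: `lzssSketchEncode` and `lzssSketchDecode` are hypothetical names, `offset` is taken to be the distance back from the current position, and the buffer sizes are illustrative defaults. The output of the library's `lzssEncode` may differ in detail.

```javascript
const MIN_MATCH = 3; // encode a match only when the pattern is longer than 2

function lzssSketchEncode(text, searchLen = 255, lookaheadLen = 15) {
  const chars = Array.from(text);
  const out = [];
  let pos = 0;
  while (pos < chars.length) {
    let bestLen = 0;
    let bestOffset = 0;
    // Try every start position inside the search buffer.
    for (let candidate = Math.max(0, pos - searchLen); candidate < pos; candidate++) {
      let len = 0;
      // The match may run past `pos` into the lookahead, which covers
      // repeated runs (the RLE-like case).
      while (
        len < lookaheadLen &&
        pos + len < chars.length &&
        chars[candidate + len] === chars[pos + len]
      ) len++;
      if (len > bestLen) {
        bestLen = len;
        bestOffset = pos - candidate;
      }
    }
    if (bestLen >= MIN_MATCH) {
      out.push([bestOffset, bestLen]); // [offset, length] tuple
      pos += bestLen;
    } else {
      out.push(chars[pos]); // literal symbol
      pos += 1;
    }
  }
  return out;
}

function lzssSketchDecode(tokens) {
  const chars = [];
  for (const token of tokens) {
    if (Array.isArray(token)) {
      const [offset, length] = token;
      const from = chars.length - offset;
      // Copy one symbol at a time so overlapping matches work.
      for (let i = 0; i < length; i++) chars.push(chars[from + i]);
    } else {
      chars.push(token);
    }
  }
  return chars.join('');
}

const tokens = lzssSketchEncode('abcabcabc'); // [ 'a', 'b', 'c', [ 3, 6 ] ]
```

The `[3, 6]` tuple copies six symbols from three positions back, so the copy overlaps the text it is producing; this is how a repeated run is covered by a single tuple.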

#### Functions:

- `binaryEncoder`: Encodes text data using LZSS compression.
- `binaryDecoder`: Decodes text data compressed with LZSS.
- `encoder`: Encodes text data using LZSS compression (allows to see literal encoding).
- `lzssBinaryEncode`: Encodes data with LZSS coding and stores it in a binary buffer.
- `lzssBinaryDecode`: Decodes data from a binary buffer that was encoded with LZSS coding.
- `lzssEncode`: Allows you to see the literal LZSS encoding schema.

#### Examples:

```
const string =
"Император Карл-Франц обычно одет в полный доспех. Император Карл-Франц. обычно одет. в полный доспех.";
#### Binary Encoding
const lzssEncoded = lzssEncode(string)
lzssEncoded.length // 63
lzssEncoded.schema // [ 0, 0, 0, 0, 0, 0, 53, 0 ]
lzssEncoded.data // [ 'И','м','п','е','р','а','т','о','р',' ','К','а','р','л','-','Ф','р','а','н','ц',' ','о','б','ы','ч','н','о',' ','о','д','е','т',' ','в',' ','п','о','л','н','ы','й',' ','д','о','с','п','е','х','.',' ',[ 50, 14 ],[ 50, 6 ],'.',[ 51, 12 ],'.',[ 52, 14 ],'е','х','.' ],
```

#### Binary Encoding Schema

![LZSS Schema](https://i.imgur.com/aqZbYui.jpeg)

### 4) LZ77

LZ77 is one of the fundamental algorithms in text compression, invented by Abraham Lempel and Jacob Ziv in 1977. It uses a sliding window to find repeating sequences.
LZ77 is one of the foundational text compression algorithms proposed by Lempel and Ziv in 1977. This algorithm also uses a sliding window to find repeating sequences of characters.

- LZ77 encodes symbols and repeated patterns with tuples [offset, length, next unmatched char]
- If a match is found at the end of the Search Buffer, the algorithm will test the matched pattern against the remaining Lookahead Buffer and apply Run-Length Encoding (RLE) if possible.
- `binaryEncoder(input, options)` takes an options object as its second parameter
- the options set the lengths of the Search Buffer and the Lookahead Buffer
- default values: { searchBufferLength = 255, lookaheadLength = 15 }
- max values: { searchBufferLength = 4095, lookaheadLength = 15 }
- LZ77 encodes symbols and repeating patterns using `[offset, length, next symbol]` tuples.
- If a match is found at the end of the `Search Buffer`, the algorithm checks and applies `RLE` if possible.
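
The triple format can be sketched as below. The function names are hypothetical and `offset` is assumed to be the distance back from the current position. Unlike the library output shown in the example, which ends with a `'\u0000'` symbol, this sketch always keeps one real symbol for the last slot of each triple, so it needs no end marker.

```javascript
function lz77SketchEncode(text, searchLen = 255, lookaheadLen = 15) {
  const chars = Array.from(text);
  const out = [];
  let pos = 0;
  while (pos < chars.length) {
    let bestLen = 0;
    let bestOffset = 0;
    for (let candidate = Math.max(0, pos - searchLen); candidate < pos; candidate++) {
      let len = 0;
      while (
        len < lookaheadLen &&
        pos + len < chars.length - 1 && // keep one symbol for the "next" slot
        chars[candidate + len] === chars[pos + len]
      ) len++;
      if (len > bestLen) {
        bestLen = len;
        bestOffset = pos - candidate;
      }
    }
    // Every step emits a triple, even when nothing matched ([0, 0, symbol]).
    out.push([bestOffset, bestLen, chars[pos + bestLen]]);
    pos += bestLen + 1;
  }
  return out;
}

function lz77SketchDecode(triples) {
  const chars = [];
  for (const [offset, length, next] of triples) {
    const from = chars.length - offset;
    // Copy one symbol at a time so overlapping matches work.
    for (let i = 0; i < length; i++) chars.push(chars[from + i]);
    chars.push(next);
  }
  return chars.join('');
}

const triples = lz77SketchEncode('aaaa'); // [ [ 0, 0, 'a' ], [ 1, 2, 'a' ] ]
```

Compared with LZSS, every literal costs a full triple here, which is the overhead LZSS removes by emitting plain symbols for short matches.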

#### Functions:

- `binaryEncoder`: Encodes text data using LZ77 compression.
- `binaryDecoder`: Decodes text data compressed with LZ77.
- `encoder`: Encodes text data using LZ77 compression (allows to see literal encoding).
- `lz77BinaryEncode`: Encodes data with LZ77 coding and stores it in a binary buffer.
- `lz77BinaryDecode`: Decodes data from a binary buffer that was encoded with LZ77 coding.
- `lz77Encode`: Allows you to see the LZ77 encoding schema.

#### Examples:

```
const string =
"Император Карл-Франц обычно одет в полный доспех. Император Карл-Франц. обычно одет. в полный доспех.";
#### Binary Encoding
const lz77Encoded = lz77Encode(string)
lz77Encoded // [ [ 0, 0, 'И' ],[ 0, 0, 'м' ],[ 0, 0, 'п' ],[ 0, 0, 'е' ],[ 0, 0, 'р' ],[ 0, 0, 'а' ],[ 0, 0, 'т' ],[ 0, 0, 'о' ],[ 4, 1, ' ' ],[ 0, 0, 'К' ],[ 6, 1, 'р' ],[ 0, 0, 'л' ],[ 0, 0, '-' ],[ 0, 0, 'Ф' ],[ 12, 2, 'н' ],[ 0, 0, 'ц' ],[ 11, 1, 'о' ],[ 0, 0, 'б' ],[ 0, 0, 'ы' ],[ 0, 0, 'ч' ],[ 7, 1, 'о' ],[ 7, 2, 'д' ],[ 27, 1, 'т' ],[ 5, 1, 'в' ],[ 2, 1, 'п' ],[ 8, 1, 'л' ],[ 13, 1, 'ы' ],[ 0, 0, 'й' ],[ 7, 1, 'д' ],[ 7, 1, 'с' ],[ 43, 2, 'х' ],[ 0, 0, '.' ],[ 8, 1, 'И' ],[ 50, 15, 'р' ],[ 50, 3, '.' ],[ 51, 12, '.' ],[ 52, 15, 'х' ],[ 17, 1, '\u0000' ] ]
```

#### Binary Encoding Schema

![LZ77 Schema](https://i.imgur.com/j25RpK8.jpeg)

### 5) BWT

The Burrows-Wheeler Transform (BWT) is a data preprocessing algorithm that rearranges text data in a reversible way, making it more suitable for compression.
The Burrows-Wheeler Transform (BWT) is a reversible data pre-processing algorithm that rearranges the symbols of a text to improve subsequent compression. One advantage of this method is how little is needed to restore the original text: only one number (the index of the original row) has to be saved alongside the transformed string.
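
A naive sketch of the transform, using sorted rotations. The function names are hypothetical and the `$` end marker is an assumption taken from the `'annb$aa'` example in this README. The approach is quadratic in memory and intended only to illustrate the idea.

```javascript
const SENTINEL = '$'; // assumed end marker; the input must not contain it

function bwtSketchEncode(text) {
  const s = text + SENTINEL;
  const rotations = [];
  for (let i = 0; i < s.length; i++) rotations.push(s.slice(i) + s.slice(0, i));
  rotations.sort(); // default sort compares UTF-16 code units
  return {
    // The transform is the last column of the sorted rotation table.
    bwt: rotations.map((row) => row[row.length - 1]).join(''),
    // The row holding the original string is the one number to save.
    index: rotations.indexOf(s),
  };
}

function bwtSketchDecode(bwt, index) {
  // Rebuild the sorted table column by column: prepend the BWT column, re-sort.
  let table = new Array(bwt.length).fill('');
  for (let step = 0; step < bwt.length; step++) {
    table = table.map((row, i) => bwt[i] + row).sort();
  }
  return table[index].slice(0, -1); // drop the sentinel
}

const transformed = bwtSketchEncode('banana'); // { bwt: 'annb$aa', index: 4 }
const restored = bwtSketchDecode(transformed.bwt, transformed.index); // 'banana'
```

Identical symbols tend to end up next to each other in the output (`nn`, `aa`), which is what makes a following RLE or Huffman pass more effective.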

#### Functions:

- `bwtEncode`: Encodes text data using BWT.
- `bwtDecode`: Decodes BWT-encoded data back to the original form.
- `bwtEncode`: Encodes text using BWT.
- `bwtDecode`: Restores the original text from BWT-encoded data.

#### Examples:

```
const string = 'banana'
const bwtEncoded = bwtEncode(string) // { bwt: 'annb$aa', index: 4 }
const bwtDecoded = bwtDecode(bwtEncoded.bwt, bwtEncoded.index) // 'banana'
```

### 6) RLE (Run-Length Encoding)

Run-Length Encoding (RLE) is a simple and effective compression algorithm, especially useful for data with long sequences of repeated characters.
Run-Length Encoding (RLE) is a simple compression technique that replaces sequences of the same character with a single character and a number indicating the length of the sequence.
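
A minimal sketch of the encoding step, with a hypothetical function name. It assumes, based on the `'aaab4bbbbbcc'` → `"a3b4b5c2"` example in this README, that a character occurring once is written without a count; the library's `rleEncode` may handle other cases differently.

```javascript
function rleSketchEncode(text) {
  const chars = Array.from(text);
  let out = '';
  let i = 0;
  while (i < chars.length) {
    // Measure the run of identical symbols starting at position i.
    let run = 1;
    while (i + run < chars.length && chars[i + run] === chars[i]) run++;
    // Runs longer than one get a count; single symbols are copied as-is.
    out += run > 1 ? chars[i] + run : chars[i];
    i += run;
  }
  return out;
}

const packed = rleSketchEncode('aaab4bbbbbcc'); // 'a3b4b5c2'
```

In this plain-text form a literal digit is indistinguishable from a count (`b4` above is the symbol `b` followed by the symbol `4`), so input that contains digits cannot be decoded unambiguously without extra escaping.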

#### Functions:

- `rleEncoder`: Encodes data using RLE compression.
- `rleDecoder`: Decodes data compressed with RLE.
- `rleEncode`: Compresses text using RLE.
- `rleDecode`: Restores text compressed with RLE.

### 7) Miscellaneous
#### Examples:

- `calculateOptimalBytesCompression`: This function calculates the theoretically optimal compression size for a given string based on Shannon entropy. Note: This metric is not definitive and can often be surpassed by other methods.
```
const string = 'aaab4bbbbbcc'
const rleEncoded = rleEncode(string) // "a3b4b5c2"
```

## Installation
### 7) Miscellaneous

To install the **rutf8-toolkit** library, use npm:
- `getByteLength`: This function returns the size of a string in bytes.

```bash
npm install rutf8-toolkit
```
- `calculateOptimalBytesCompression`: This function calculates the theoretical minimum number of bytes needed to compress a string, based on Shannon entropy. Note that this estimate may not match the actual result of other compression methods.
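
The entropy-based estimate can be sketched as below. The function name is hypothetical, and rounding up to whole bytes is an assumption; the library may round or account for overhead differently. The estimate is the sum over all symbols of `-log2(p)`, where `p` is the symbol's relative frequency in the string.

```javascript
function entropyBytesSketch(text) {
  const chars = Array.from(text);
  const freq = new Map();
  for (const ch of chars) freq.set(ch, (freq.get(ch) || 0) + 1);

  // Each occurrence of a symbol with probability p costs -log2(p) bits.
  let bits = 0;
  for (const count of freq.values()) {
    bits += count * -Math.log2(count / chars.length);
  }
  return Math.ceil(bits / 8);
}

const oneByte = entropyBytesSketch('abababab'); // 1 (8 symbols at 1 bit each)
const twoBytes = entropyBytesSketch('abcdabcd'); // 2 (8 symbols at 2 bits each)
```

This bound treats symbols as independent, so methods that exploit repeated sequences (such as LZSS) can produce results below it.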
