A tool for compressing English language text while maintaining streamability

Archonic944/ShrinkENG

ShrinkENG Natural Language Compression

ShrinkENG is a tool for compressing English text by storing each word as its index in a dictionary. Capitalization, punctuation, and whitespace other than a single space are handled by static "operators", declared in a byte before the word when necessary. ShrinkENG automatically falls back to UTF-8 encoding for words that are not in the dictionary.
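To make the idea concrete, here is a minimal sketch of dictionary-index encoding with a capitalization operator and a UTF-8 fallback. Everything in it is an assumption made for illustration: the six-word dictionary, the `OP_CAPITALIZE` and `OP_FALLBACK` markers, and the function names are hypothetical, and the tokens are kept symbolic rather than packed into ShrinkENG's actual byte layout. Punctuation and whitespace operators are left out.

```python
# Toy illustration only: the dictionary, operator markers, and token layout
# are invented for this sketch and are not ShrinkENG's real format.
DICTIONARY = ["the", "of", "and", "to", "cat", "sat"]  # hypothetical, most frequent first
WORD_TO_INDEX = {word: i for i, word in enumerate(DICTIONARY)}

OP_CAPITALIZE = "CAP"  # hypothetical operator: the next word starts with a capital
OP_FALLBACK = "UTF8"   # hypothetical marker: the next token is raw UTF-8 bytes


def encode_words(text):
    """Turn space-separated words into a list of symbolic tokens."""
    tokens = []
    for word in text.split(" "):
        lowered = word.lower()
        if word == lowered and lowered in WORD_TO_INDEX:
            # Plain lowercase dictionary word: just its index.
            tokens.append(WORD_TO_INDEX[lowered])
        elif word == lowered.capitalize() and lowered in WORD_TO_INDEX:
            # Capitalized dictionary word: operator first, then the index.
            tokens.append(OP_CAPITALIZE)
            tokens.append(WORD_TO_INDEX[lowered])
        else:
            # Anything else is stored verbatim as UTF-8.
            tokens.append(OP_FALLBACK)
            tokens.append(word.encode("utf-8"))
    return tokens


def decode_tokens(tokens):
    """Invert encode_words."""
    words = []
    stream = iter(tokens)
    for token in stream:
        if token == OP_CAPITALIZE:
            words.append(DICTIONARY[next(stream)].capitalize())
        elif token == OP_FALLBACK:
            words.append(next(stream).decode("utf-8"))
        else:
            words.append(DICTIONARY[token])
    return " ".join(words)


tokens = encode_words("The cat sat on the mat")
print(tokens)  # ['CAP', 0, 4, 5, 'UTF8', b'on', 0, 'UTF8', b'mat']
print(decode_tokens(tokens))  # The cat sat on the mat
```

In this toy version, "on" and "mat" fall back to raw bytes only because the sample dictionary is tiny.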

Theoretically, ShrinkENG could be used to compress languages other than English, but it would require a new dictionary and new operator code.

I think of ShrinkENG as less of a compression algorithm and more of a "bytecode generator" for English, since it's not traditionally compressing by mathematical means, but rather translating English into a more compact representation.

For maximum compression, use ShrinkENG in conjunction with a traditional compression algorithm such as zlib or LZMA: first compress with ShrinkENG, then compress the output with zlib or LZMA.
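The second stage of that pipeline could look roughly like this, using Python's standard `zlib` and `lzma` modules. The `shrunk` bytes below are only a placeholder for ShrinkENG output, since the encoder itself is not shown here, and the function names `second_stage` and `undo_second_stage` are hypothetical.

```python
import lzma
import zlib


def second_stage(shrunk, method="zlib"):
    """Compress ShrinkENG output again with a general-purpose compressor."""
    if method == "zlib":
        return zlib.compress(shrunk, 9)  # 9 = highest zlib compression level
    if method == "lzma":
        return lzma.compress(shrunk)
    raise ValueError(f"unknown method: {method}")


def undo_second_stage(data, method="zlib"):
    """Recover the ShrinkENG bytes so they can be decoded as usual."""
    if method == "zlib":
        return zlib.decompress(data)
    if method == "lzma":
        return lzma.decompress(data)
    raise ValueError(f"unknown method: {method}")


# Placeholder bytes standing in for real ShrinkENG output (assumption).
shrunk = bytes([0, 4, 5, 0, 4, 5] * 100)

packed = second_stage(shrunk, "zlib")
assert undo_second_stage(packed, "zlib") == shrunk
print(len(shrunk), len(packed))
```

To read the text back, reverse the order: undo the zlib or LZMA stage first, then decode the result with ShrinkENG.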

Usage

Visit https://shrinkeng.vercel.app/ and upload a file to compress.

Please use a plain text file (.txt) rather than a PDF, DOCX, or RTF file.

Compression Ratio

ShrinkENG reduces file size by about 50% on a large English text corpus. If the text contains many words that are not in the dictionary, a large amount of punctuation, or an uncommon structure that causes a lot of UTF-8 fallbacks, the compression ratio will be lower.

The above image shows War and Peace compressed with ShrinkENG only, which results in a 52.9% file size reduction.

Compressing the same document (War and Peace.txt) with ZIP produces a 1.20MB file, and ShrinkENG followed by ZIP produces a 1.06MB file. Here is a table:

| Compression Method | File Size | Compression Ratio |
|--------------------|-----------|-------------------|
| ShrinkENG          | 1.42MB    | 52.9%             |
| ZIP                | 1.20MB    | 61.2%             |
| ShrinkENG + ZIP    | 1.06MB    | 67.1%             |

Advantages Over ZIP

Using ShrinkENG to compress English text instead of ZIP compression has several advantages:

  • Compression Ratio: ShrinkENG usually compresses English text to around the same size as ZIP. Combining it with ZIP results in a smaller size than just ZIP.
  • Streamable: ShrinkENG is designed to be streamable, meaning that you can start from either the front or back of the stream and decompress an arbitrary number of words. Note: starting from the middle of the stream is technically possible but introduces challenges with keeping track of operators and UTF-8 fallbacks.
  • Lightweight Decompression: ShrinkENG is designed to be lightweight, and could probably be used to compress and decompress text (such as user messages) on the fly.

Everything's better when we work as a team...
Using ShrinkENG with ZIP results in the lowest file size.

Dictionary

The dictionary is a line-separated text file. It contains the 25,000 most common English words, dumped from wordfreq.
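As a rough sketch, a line-separated dictionary like this can be loaded into two lookup structures: a list for index-to-word (decompression) and a dict for word-to-index (compression). The function name `load_dictionary` and the four-word sample are assumptions made for illustration; the real file has 25,000 entries.

```python
import io


def load_dictionary(lines):
    """Build index->word and word->index lookups from a line-separated word list."""
    index_to_word = [line.strip() for line in lines if line.strip()]
    word_to_index = {word: i for i, word in enumerate(index_to_word)}
    return index_to_word, word_to_index


# A small in-memory stand-in for the dictionary file (assumption).
sample_file = io.StringIO("the\nof\nand\nto\n")
index_to_word, word_to_index = load_dictionary(sample_file)

print(word_to_index["and"])  # 2
print(index_to_word[0])      # the
```

With a real file, the same function accepts an open file object, for example `open(path, encoding="utf-8")`.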

The words are sorted by frequency. This saves space because word indices are stored as variable-length integers, so the smaller the index, the fewer bytes it takes to store.
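The effect of frequency ordering can be seen with a common variable-length integer scheme. The sketch below uses LEB128-style varints (7 payload bits per byte, with the high bit marking "more bytes follow"). This is an assumption chosen for illustration, and ShrinkENG's actual integer format may differ.

```python
def encode_varint(n):
    """Encode a non-negative integer, 7 bits per byte, low-order group first."""
    if n < 0:
        raise ValueError("varints here are non-negative")
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)         # high bit clear: last byte
            return bytes(out)


def decode_varint(data, pos=0):
    """Decode one varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


for index in (0, 127, 128, 16383, 16384, 24999):
    print(index, len(encode_varint(index)))
```

Under this scheme, the 128 most frequent words would take one byte each, indices 128 to 16,383 two bytes, and the remainder of a 25,000-word dictionary three bytes.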
