Compression Algorithms

Introduction to Compression Algorithms

What Is Compression?

Compression is the process of reducing the size of data by removing redundancy.

Compression algorithms fall mainly into:
- Lossless compression
- Lossy compression

Lossless Compression

Lossless algorithms preserve all data exactly.
Used for text, source code, executables, database dumps, configuration files, etc.
Common techniques:
- Dictionary-based compression (e.g., LZ77, LZ78, LZW)
- Entropy coding (Huffman, Arithmetic Coding)
- Context-based models (PPM, bzip2 BWT)

Example formats:
ZIP, GZIP, PNG, FLAC, Zstandard, Brotli

Lossy Compression

Lossy algorithms remove data that is less noticeable to human senses.
Used for images, audio, and video.
Common techniques:
- Transform coding (e.g., DCT in JPEG)
- Frequency-domain reduction (e.g., MP3, AAC)
- Motion compensation + frame prediction (e.g., H.264, H.265)

Example formats:
JPEG, MP3, AAC, WebM, H.264 / H.265, AV1

Dictionary-Based Algorithms

Basic idea: Replace repeated patterns with references to earlier occurrences.

LZ77: Sliding window + references to earlier substrings.
LZ78: Builds a dictionary of repeated patterns.
LZW: Popular variant used in GIF and old compression tools.

Example:
"abcabcabc" →
(abc)(3 repeats) →
much smaller representation

Entropy Coding

Entropy coding assigns shorter bit patterns to more frequent symbols.

Huffman Coding:
- Tree-based coding method
- Guaranteed prefix-free bit patterns
Arithmetic Coding:
- Represents whole messages as a fraction
- Achieves better compression than Huffman in many cases

Example:
Frequent 'A' → 0
Less frequent 'Z' → 11011

Burrows–Wheeler Transform (BWT)

BWT rearranges text so similar characters stay close together. This makes the text more compressible with RLE or Huffman.
Used in bzip2 and modern high-compression tools.
Reversible transform—no data is lost.

Modern High-Performance Algorithms

Zstandard (zstd):
- Very fast
- Excellent compression ratio
- Widely used in Linux, Docker, Facebook systems
Brotli:
- Used in web browsers
- Designed for HTTP compression
- Better than gzip for many workloads
LZ4:
- Extremely fast
- Great for real-time logging or database compression

Speed vs Ratio:
LZ4 → fastest
Zstandard → balanced
Brotli → best ratio for web

Applications of Compression

Web assets (CSS, JS, HTML)
Data transmission (HTTP, gRPC, WebSocket)
File formats (ZIP, PNG, MP3)
Databases (PostgreSQL TOAST, RocksDB block compression)
Backups and archives
Embedded systems

Summary of Compression Types

Category	Description	Examples
Lossless	Preserves exact data	`gzip`, `PNG`, `ZIP`
Lossy	Removes perceptually unnecessary data	`JPEG`, `MP3`, `H.264`
Dictionary-based	Reuses repeated fragments	`LZ77`, `LZW`
Entropy coding	Shorter codes for frequent symbols	`Huffman`, `Arithmetic`
Modern fast	Optimized for speed + ratio	`Zstd`, `LZ4`, `Brotli`

Compression is essential for performance, storage, networking, and modern computing efficiency.
Choosing the right algorithm depends on your trade-off: speed vs ratio vs quality.

LZ77 Compression Algorithm

What Is LZ77?

LZ77, invented by Abraham Lempel and Jacob Ziv in 1977, is one of the most important and foundational lossless compression algorithms.

It is the ancestor of many modern compressors:
- gzip (DEFLATE)
- ZIP files
- PNG images
- LZMA (7zip)
- LZW (GIF)

LZ77 works by replacing repeated text with references to earlier occurrences of the same data.

It uses the concept of a sliding window over the data stream.

Core Idea: Sliding Window + Lookahead Buffer

LZ77 views the input through two windows:

search buffer which is the previously processed data
lookahead buffer which is the data we are about to encode

The encoder attempts to match the longest substring in the lookahead buffer that also appears in the search buffer.

LZ77 represents matches using a triple:

(offset, length, nextSymbol)

offset: how far back the match begins
length: how many characters match
nextSymbol: what character follows the matched block

A Minimal Example

Given input:

A B C A B C A

1. First characters (A B C) have no prior match → output literal triples:

(0, 0, A)
(0, 0, B)
(0, 0, C)

2. Remaining input: A B C A
The longest match is A B C appearing at offset 3.

(3, 3, A)

This means:
- Go back 3 characters,
- Copy 3 characters,
- Then append A.

Advantages of LZ77

Simple algorithm is easy to implement.
Effective for text and repetitive data.
Fast decoding (copying previously seen blocks).
Forms the basis of many industrial-strength compressors.

Limitations of LZ77

Compression ratio decreases on random data and non-repetitive files
Large window sizes require more memory.
The simple triple format is sometimes inefficient → improved by formats like LZSS, LZMA, and DEFLATE.

A Micro Implementation Example (Pseudo-code)

for each position i in input:
    best_length = 0
    best_offset = 0

    for each j in search_buffer:
        length = number of characters matching input[j...] and input[i...]

        if length > best_length:
            best_length = length
            best_offset = i - j

    if best_length > 0:
        emit (best_offset, best_length, input[i + best_length])
        advance i by best_length + 1
    else:
        emit (0, 0, input[i])
        advance i by 1

This matches the original LZ77 description.

LZ78 Compression Algorithm

What Is LZ78?

LZ78 is a dictionary-based compression algorithm invented by Abraham Lempel and Jacob Ziv in 1978.

It works by building a dictionary of previously seen substrings and replacing repeated substrings with:
- an index (a reference to a dictionary entry), and
- a next character.

Each compressed token has the form:

(index, nextSymbol)

index is 0 if no previous substring is matched.
nextSymbol is the character following the matched substring.

Unlike LZ77, which looks backward into a sliding window, LZ78 grows an explicit dictionary of phrases.

How the Algorithm Works (Intuition)

Start with an empty dictionary.
Read the input character-by-character, building the longest substring that already exists in the dictionary.
When the next character causes the substring to not exist in the dictionary:
- emit a token (index, nextSymbol),
- add the new substring to the dictionary.
Repeat until the input is consumed.

A Step-by-Step Example

Input:

A B C A B B C C A

Dictionary starts empty and will grow like:

1: "A"
2: "B"
3: "C"
4: "AB"
5: "BC"
6: "CA"

Compression process:

Step	Matched Substring	Next Symbol	Output	New Dictionary Entry
1	"" (none)	A	(0, A)	1: "A"
2	""	B	(0, B)	2: "B"
3	""	C	(0, C)	3: "C"
4	"A"	B	(1, B)	4: "AB"
5	"B"	C	(2, C)	5: "BC"
6	"C"	A	(3, A)	6: "CA"

Final encoded output:

(0, A)
(0, B)
(0, C)
(1, B)
(2, C)
(3, A)

And the final dictionary consists of 6 entries.

Decoding LZ78

Decoding is straightforward:
- Start with an empty dictionary.
- For each (index, symbol):
  - If index = 0, output symbol.
  - Else:
    - Fetch the dictionary entry at index,
    - output that string + symbol.
  - Add the output string as a new dictionary entry.

Token (1, B)
→ dictionary[1] = "A"
→ output "AB"
→ add "AB" as new dictionary entry

Key Differences: LZ78 vs LZ77

LZ77:
- Uses a sliding window of past characters.
- Matches substrings by offset and length.
- Dictionary is implicit (part of the data window).

LZ78:
- Builds an explicit dictionary of substrings.
- Dictionary persists and grows as the input is read.
- Each token is an index + next character.

LZ77 tends to handle long repeated patterns better without huge dictionaries.

LZ78 provides simpler decoding and is foundational for LZW (used in GIF).

Performance Notes

Compresses well when the input has many repeated substrings.
Dictionary growth can be memory-intensive.
Many real-world compressors (like LZW) refined LZ78 by:
- using variable-width codes,
- limiting dictionary size,
- handling dictionary resets.