Base64

Base64 encodes arbitrary bytes using an alphabet of 64 ASCII symbols so that binary data can travel through text-only systems (emails, JSON, URLs, HTML, etc.).
It is NOT encryption; it is only a different representation of the same bytes.

What is Base64 and why do we need it?

Problem: Some channels are text-only or not 8-bit clean, so raw bytes may be mangled in transit. Base64 maps every 3 bytes to 4 printable ASCII characters.

Alphabet: A–Z a–z 0–9 + / (URL-safe variant uses - and _ instead of + and /).

Padding: Encoded output is padded with = to a length that is a multiple of 4 (some variants, notably base64url, allow the padding to be omitted).

Standard alphabet (index → char)
0–25  : A–Z
26–51 : a–z
52–61 : 0–9
62    : +
63    : /

How encoding works (bit view)

Group input bytes into 24-bit blocks (3×8 bits). If fewer than 3 bytes remain in the last block, pad it with zero bits.

Split 24 bits into 4 groups of 6 bits. Each 6-bit value indexes the alphabet (0–63).

Pad output with = if input length mod 3 is 1 (add ==) or 2 (add =).

Example: "Man" (ASCII: 0x4D 0x61 0x6E)

0x4D     0x61     0x6E
01001101 01100001 01101110  → 24 bits

Split 6-bit groups:
010011 010110 000101 101110 → 19, 22, 5, 46

Map to alphabet:
19=T, 22=W, 5=F, 46=u  → "TWFu"

// Minimal JS example
const encode = (s) => btoa(unescape(encodeURIComponent(s))); // UTF-8 safe
const decode = (b) => decodeURIComponent(escape(atob(b)));

console.log(encode("Man"));     // "TWFu"
console.log(decode("TWFu"));    // "Man"
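
The bit-level steps above can also be sketched by hand in Python. This is an illustration only, not how the standard library does it; the name b64encode_manual is made up for this example, and real code should call base64.b64encode.

```python
import base64

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def b64encode_manual(data: bytes) -> str:
    """Encode by hand: 3 bytes -> 24 bits -> four 6-bit alphabet indexes."""
    out = []
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        missing = 3 - len(chunk)  # 0, 1, or 2 bytes short in the last block
        # Zero-pad the block to 24 bits and read it as one integer.
        block = int.from_bytes(chunk + b"\x00" * missing, "big")
        # Extract the four 6-bit groups, most significant first.
        chars = [ALPHABET[(block >> shift) & 0x3F] for shift in (18, 12, 6, 0)]
        # Characters that carry only padding bits are replaced with "=".
        if missing:
            chars[-missing:] = "=" * missing
        out.append("".join(chars))
    return "".join(out)

print(b64encode_manual(b"Man"))  # TWFu
print(b64encode_manual(b"Man") == base64.b64encode(b"Man").decode("ascii"))  # True
```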

Variants you’ll see in the wild

URL-safe Base64: replace + → -, / → _; padding = often omitted.

MIME Base64 (RFC 2045): standard alphabet, with line breaks inserted so that no encoded line exceeds 76 characters (for email transport).

Base64url (RFC 4648 §5): the formal specification of the URL-safe variant above; used by JWT, JWS, etc., usually without padding.

Variant	Index 62	Index 63	Padding	Typical uses
Standard	+	/	`=` required	General-purpose
URL-safe (base64url)	-	_	`=` often omitted	JWT, URLs, web APIs
MIME	+	/	`=` required	Email, attachments (lines wrapped at 76 chars)
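
To see MIME-style line wrapping in practice, here is a small Python sketch using the standard base64 module. Note that base64.encodebytes separates lines with "\n" rather than the CRLF used on the wire in email; the 256-byte sample input is arbitrary.

```python
import base64

data = bytes(range(256))  # 256 arbitrary sample bytes

# encodebytes emits a newline after every 76 output characters.
wrapped = base64.encodebytes(data)
lines = wrapped.splitlines()
print(len(lines), max(len(line) for line in lines))

# By default b64decode discards characters outside the alphabet,
# so the embedded newlines do not have to be stripped first.
assert base64.b64decode(wrapped) == data

# With validate=True the newlines would be rejected, so remove them first.
unwrapped = wrapped.replace(b"\n", b"")
assert base64.b64decode(unwrapped, validate=True) == data
```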

// Base64url helpers (JS)
const toUrl = (b64) => b64.replace(/\+/g, "-").replace(/\//g, "_").replace(/=+$/,"");
const fromUrl = (u) => (u + "===".slice((u.length + 3) % 4)).replace(/-/g,"+").replace(/_/g,"/");
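
For comparison, a Python sketch of the same two helpers built on the standard library's urlsafe functions. The names b64url_encode and b64url_decode are illustrative, not part of any library.

```python
import base64

def b64url_encode(data: bytes) -> str:
    """URL-safe alphabet (- and _) with the trailing '=' stripped."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")

def b64url_decode(text: str) -> bytes:
    """Restore padding to a multiple of 4, then decode."""
    padded = text + "=" * (-len(text) % 4)
    return base64.urlsafe_b64decode(padded)

print(b64url_encode(b"\xfb\xff"))  # -_8   (standard Base64 would give +/8=)
print(b64url_decode("-_8"))        # b'\xfb\xff'
```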

Code snippets in common languages

Python (base64 standard lib)

import base64

s = "你好, world"
b = s.encode("utf-8")
e = base64.b64encode(b).decode("ascii")
print(e)  # 5L2g5aW9LCB3b3JsZA==

# decode
raw = base64.b64decode(e)
print(raw.decode("utf-8"))

Node.js / Browser

// Node.js
const buf = Buffer.from("hello", "utf8");
console.log(buf.toString("base64"));          // aGVsbG8=
console.log(Buffer.from("aGVsbG8=", "base64").toString("utf8")); // hello

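// Browser (UTF-8 safe, via TextEncoder/TextDecoder instead of the deprecated escape/unescape)
// The spread into String.fromCharCode is fine for short strings; chunk the bytes for very large inputs.
const enc = (s) => btoa(String.fromCharCode(...new TextEncoder().encode(s)));
const dec = (b) => new TextDecoder().decode(Uint8Array.from(atob(b), (c) => c.charCodeAt(0)));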

Common pitfalls & tips

UTF-8 vs bytes: Base64 encodes bytes. If you start with text, encode to UTF-8 bytes first.

Padding: If decoding fails, you may be missing = padding (esp. base64url). Re-pad to multiple of 4.

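Line wraps: MIME inserts newlines. Some decoders require removing them first, and some encoders need an explicit no-newline option (e.g. OpenSSL's BIO_FLAGS_BASE64_NO_NL flag, or openssl base64 -A on the command line).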

Security: Base64 is not encryption. Do not rely on it for confidentiality; anyone can reverse it.

Size overhead: Encoded size ≈ 4/3 of the input (about 33% larger). For large payloads, consider compressing before encoding.
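
A rough Python sketch of the compress-then-encode idea from the last tip. The payload is made-up, highly repetitive sample data; real savings depend entirely on how compressible the input is, and already-compressed data (JPEG, ZIP, etc.) will not shrink.

```python
import base64
import zlib

payload = b"status=ok;" * 500  # 5000 bytes of repetitive sample data

plain = base64.b64encode(payload)
compressed = base64.b64encode(zlib.compress(payload))

# Compare raw size, Base64 size, and compressed-then-Base64 size.
print(len(payload), len(plain), len(compressed))

# The receiver reverses the steps: Base64-decode, then decompress.
assert zlib.decompress(base64.b64decode(compressed)) == payload
```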

# Quick CLI (Linux/macOS)
printf "hello" | base64           # aGVsbG8=
echo "aGVsbG8=" | base64 --decode # hello

# OpenSSL (no newlines)
printf "hello" | openssl base64 -A

Cheat sheet

Length math:
- Encoded length = 4 * ceil(n_bytes / 3)
- Decoded max length = floor(n_b64_chars * 3 / 4)
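
The two formulas can be checked against the standard library with a few lines of Python. The function names are illustrative; the exact decoded length is the maximum minus the number of = characters.

```python
import base64
import math

def encoded_len(n_bytes: int) -> int:
    """Length of padded Base64 output for n_bytes of input."""
    return 4 * math.ceil(n_bytes / 3)

def decoded_max_len(n_b64_chars: int) -> int:
    """Upper bound on decoded length (exact when there is no padding)."""
    return n_b64_chars * 3 // 4

for n in range(0, 50):
    encoded = base64.b64encode(bytes(n))  # n zero bytes
    assert len(encoded) == encoded_len(n)
    # Each "=" stands for one byte that was not actually present.
    assert n == decoded_max_len(len(encoded)) - encoded.count(b"=")

print(encoded_len(5), decoded_max_len(8))  # 8 6
```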

Data URI:

data:<mime>;base64,<payload>

<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA...">
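
A minimal Python sketch for building such a URI from raw bytes. The helper name to_data_uri is hypothetical, and the sample input is only the 8-byte PNG file signature, not a complete image.

```python
import base64

def to_data_uri(data: bytes, mime: str) -> str:
    """Build data:<mime>;base64,<payload> from raw bytes."""
    payload = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{payload}"

png_signature = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG file
print(to_data_uri(png_signature, "image/png"))  # data:image/png;base64,iVBORw0KGgo=
```

This is why Base64-encoded PNGs always start with iVBORw0KGgo.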

Validate quickly (regex):

^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$
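
The same pattern applied in Python, as a sketch. It uses re.fullmatch rather than ^...$ because $ in Python also matches just before a trailing newline. The check is purely syntactic, covers only the standard padded alphabet, and accepts the empty string.

```python
import re

B64_RE = re.compile(
    r"(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?"
)

def looks_like_base64(text: str) -> bool:
    """True if text is syntactically valid padded standard Base64."""
    return B64_RE.fullmatch(text) is not None

print(looks_like_base64("TWFu"))  # True
print(looks_like_base64("TQ=="))  # True
print(looks_like_base64("TQ"))    # False (padding missing)
print(looks_like_base64("TW-u"))  # False (URL-safe alphabet)
```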

Why do we need `=` padding characters?

Decoder needs a hint: Zero bits added during encoding are indistinguishable from real data after transmission. The number of = tells the decoder how many bytes to keep:
- == → last block had 1 input byte
- = → last block had 2 input bytes
- no = → last block had 3 input bytes

Concrete example: 1 input byte ("M" → 0x4D)

Input bytes (1): 4D
Pad with 16 zero bits to reach 24 bits:
01001101 00000000 00000000

Split to 6-bit groups:
010011 010000 000000 000000
    19     16      0      0   → naive alphabet: T Q A A

Naive Base64 would be "TQAA", which is ambiguous: it is also the valid encoding of the three bytes 4D 00 00.
Correct Base64 replaces the last two chars with "=" to mark missing bytes:
"TQ=="

Concrete example: 2 input bytes ("Ma" → 0x4D 0x61)

Input bytes (2): 4D 61
Pad with 8 zero bits:
01001101 01100001 00000000

Split to 6-bit groups:
010011 010110 000100 000000
    19     22      4      0   → naive alphabet: T W E A

Naive Base64 would be "TWEA".
Correct Base64 uses one "=" to mark one missing byte:
"TWE="

Why not rely on the zero bits alone?

After encoding, the decoder cannot know whether those zeros were padding or real data.

The = characters are explicit metadata that makes decoding unambiguous.
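
The ambiguity is easy to see by decoding the naive and the padded forms side by side with Python's standard base64 module:

```python
import base64

# Padded forms decode to the original 1- and 2-byte inputs.
print(base64.b64decode("TQ=="))  # b'M'
print(base64.b64decode("TWE="))  # b'Ma'

# The naive forms are valid Base64 too, but for different, longer inputs.
print(base64.b64decode("TQAA"))  # b'M\x00\x00'
print(base64.b64decode("TWEA"))  # b'Ma\x00'
```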

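URL-safe and omitted padding: In Base64url (JWT, etc.) the = is often omitted for compactness. Nothing is lost, because the filler characters are dropped as well, so the string length mod 4 carries the same information. To decode, restore padding so length % 4 == 0 (append = as needed).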

// Re-pad a Base64url string u back to standard Base64 for decoding
const repad = (u) => (u + "===".slice((u.length + 3) % 4))
                      .replace(/-/g, "+")
                      .replace(/_/g, "/");

// Examples:
repad("TQ")     // "TQ=="  (1 input byte)
repad("TWE")    // "TWE="  (2 input bytes)
repad("TWFu")   // "TWFu"  (3 input bytes, no padding)