NOTE:
I'm not a professional computer scientist, so everything here is based on my own limited knowledge.
It may therefore contain inefficient or even incorrect reasoning, and I take no responsibility for any errors.
That's it, thanks for your understanding.
What Is a Compiler?
A compiler is a program that transforms source code written in one language
(e.g., C, Rust, TypeScript) into another form (usually machine code, bytecode, or another high-level language).
Examples of compilers:
GCC / Clang
Rustc
The Big Picture: Stages of a Compiler
Lexical Analysis (tokenization)
Breaks raw text into tokens: keywords, identifiers, literals, symbols.
Example: a = 3 + b → [ID("a"), '=', INT(3), '+', ID("b")]
Syntax Analysis (parsing)
Builds a tree structure called an AST (Abstract Syntax Tree).
Example: 3 + b becomes a binary operation node.
Semantic Analysis
Checks correctness:
types
scope
variable declarations
function signatures
Example: "abc" + 5 is invalid in many languages.
Intermediate Representation (IR)
The AST is transformed into a lower-level language closer to machine operations.
Common IRs:
LLVM IR
bytecode
three-address code (TAC)
Optimization
The compiler improves your code:
dead code elimination
constant folding
loop optimizations
register allocation
Code Generation
Final stage: output machine code, bytecode, JS, etc.
Compiler vs Interpreter
Compilers: translate entire program before running.
Interpreters: execute code step-by-step.
Compiler                    | Interpreter
----------------------------|-------------------------
Produces binary or bytecode | Runs line-by-line
Faster execution            | More flexible debugging
Example: C, Rust            | Example: Python, Ruby
The AST (Abstract Syntax Tree)
The AST is the core internal structure produced by the parser.
Example AST for 1 + 2 * 3:
Add
├── Int(1)
└── Mul
    ├── Int(2)
    └── Int(3)
The tree respects precedence: multiplication evaluated before addition.
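Here's a minimal sketch of how such a tree might be represented and evaluated in Python; the Int/BinOp class names are made up for illustration, not any real compiler's API:

from dataclasses import dataclass

# Illustrative AST node types (not a real compiler's classes).
@dataclass
class Int:
    value: int

@dataclass
class BinOp:
    op: str        # "+" or "*"
    left: object
    right: object

# 1 + 2 * 3: the multiplication sits deeper in the tree than the addition
tree = BinOp("+", Int(1), BinOp("*", Int(2), Int(3)))

def evaluate(node):
    # Children are evaluated before their parent, so precedence falls out of the tree shape.
    if isinstance(node, Int):
        return node.value
    lhs, rhs = evaluate(node.left), evaluate(node.right)
    return lhs + rhs if node.op == "+" else lhs * rhs

print(evaluate(tree))  # 7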
Intermediate Representation (IR)
An IR is a lower-level representation used for:
optimization
analysis
machine-code generation
A simple “three-address code” example:
t1 = 2 * 3
t2 = 1 + t1
LLVM IR is a real-world example:
%t1 = mul i32 2, 3
%t2 = add i32 1, %t1
Optimizations
Compilers apply many optimizations automatically:
constant folding: 2 * 3 → 6
dead code elimination: remove unused variables
inlining: replace function calls with their bodies
loop unrolling
common subexpression elimination
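As a toy illustration of the constant-folding item above, here's a sketch that rewrites tuple-shaped expressions such as ("+", 1, ("*", 2, 3)); the tuple encoding is invented for this example, not any real IR:

# Fold every subexpression whose operands are both known integers.
def fold(expr):
    if not isinstance(expr, tuple):
        return expr                        # literal or variable name: nothing to fold
    op, lhs, rhs = expr
    lhs, rhs = fold(lhs), fold(rhs)
    if isinstance(lhs, int) and isinstance(rhs, int):
        return lhs + rhs if op == "+" else lhs * rhs
    return (op, lhs, rhs)                  # an operand is unknown: keep the operation

print(fold(("+", 1, ("*", 2, 3))))    # 7             (fully folded)
print(fold(("*", ("*", 2, 3), "x"))) # ('*', 6, 'x')  (partially folded)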
Back Ends and Code Generation
The back end turns IR into:
machine code (x86, ARM)
bytecode (Python, Java)
another language (TS → JS)
Example: LLVM back end produces machine instructions:
mov eax, 2
imul eax, 3
add eax, 1
Front-End         | Back-End
------------------|-----------------------
Lexing            | IR to machine code
Parsing           | Register allocation
Semantic analysis | Optimization passes
AST generation    | Instruction scheduling
Different Phases of a Compiler
Prior Knowledge
A compiler transforms source code written in a high-level language (like C, Rust, or Haskell) into machine code or an intermediate form.
This transformation happens in several well-structured phases.
Each phase handles one part of the job:
understanding the program,
checking for correctness,
optimizing,
and generating low-level code.
Although implementations vary between compilers (e.g., GCC, LLVM Clang, Rustc), the overall architecture remains similar.
Lexical Analysis (Scanning)
The first phase of compilation.
Converts a raw source code string into a sequence of tokens.
A token is a meaningful atomic unit:
keywords (if, while)
identifiers (x, count)
literals (10, "hello")
operators (+, *, ==)
punctuation (;, (, ))
A lexer removes:
spaces,
comments,
line breaks,
unnecessary formatting.
Example:
Source: x = 3 + 5;
Tokens: IDENTIFIER("x"), EQUALS, NUMBER(3), PLUS, NUMBER(5), SEMICOLON
Or, in more detail:
Tokens: IDENTIFIER("x"), OPERATOR("="), NUMBER(3), OPERATOR("+"), NUMBER(5), PUNCTUATION(";")
Syntax Analysis (Parsing)
Given the sequence of tokens, the parser checks whether the code follows the grammatical structure of the language.
It produces an AST (Abstract Syntax Tree), a hierarchical representation of the program.
Example: x = 3 + 5 becomes:
      Assignment
      /        \
  Variable     Add
      |       /   \
     "x"     3     5
The parser enforces:
operator precedence,
parentheses grouping,
valid statements and expressions.
Parser types:
Top-down: LL(k), recursive-descent (easy to implement)
Bottom-up: LR(k), LALR, GLR (used by many production compilers)
Semantic Analysis
The semantic analyzer checks for meaningful correctness of the AST.
Examples of semantic checks:
Type checking
Variable declarations and scopes
Function arity (argument count)
Operator/type compatibility
Control-flow correctness
The compiler builds:
a symbol table mapping names to types/locations,
a type environment,
a scope tree.
Example error caught here:
x = "hello" + 5 -- type error: cannot add string and integer
Semantic analysis transforms the AST into a typed AST (or annotates nodes with type info).
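A toy sketch of the type-checking idea, using made-up tuple-shaped expression nodes and only two types; a real semantic analyzer would also consult a symbol table and a full type environment:

# Hypothetical expression nodes: Python ints stand for integer literals,
# Python strs for string literals.
def type_of(expr):
    if isinstance(expr, int):
        return "int"
    if isinstance(expr, str):
        return "string"
    op, lhs, rhs = expr
    lt, rt = type_of(lhs), type_of(rhs)
    if op == "+" and lt == rt == "int":
        return "int"
    raise TypeError(f"cannot apply '{op}' to {lt} and {rt}")

print(type_of(("+", 3, 5)))   # int
type_of(("+", "hello", 5))    # raises TypeError: cannot apply '+' to string and int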
Intermediate Representation (IR) Generation
Once semantics are validated, the compiler transforms the AST into an IR (Intermediate Representation).
IR is a lower-level, more regular representation than the AST.
Most modern compilers use a form of SSA (Static Single Assignment) IR.
Purpose:
enable optimizations,
provide target-independence,
simplify code generation.
Example (LLVM IR):
%1 = add i32 3, 5
store i32 %1, i32* @x
The IR removes high-level syntax and resolves many abstractions.
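Here's a sketch of lowering an expression tree into three-address code; the tuple encoding and the t1, t2, ... temporary names are invented for illustration and are much simpler than real SSA construction:

from itertools import count

def lower(expr):
    code, temps = [], count(1)

    def walk(node):
        if isinstance(node, (int, str)):
            return str(node)                 # literals and variable names are used directly
        op, lhs, rhs = node
        l, r = walk(lhs), walk(rhs)
        tmp = f"t{next(temps)}"              # allocate a fresh temporary
        code.append(f"{tmp} = {l} {op} {r}")
        return tmp

    walk(expr)
    return code

print("\n".join(lower(("+", 1, ("*", 2, 3)))))
# t1 = 2 * 3
# t2 = 1 + t1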
Optimization
This is often the largest and most complex phase.
The compiler performs optimizations on the IR to:
minimize runtime,
minimize memory,
increase CPU instruction-level parallelism,
reduce branches and allocations.
Typical optimizations include:
Constant folding
Dead code elimination
Loop unrolling
Function inlining
Common subexpression elimination
Escape analysis (eliminate heap allocations)
Copy propagation
Optimizations may occur at:
high-level AST,
mid-level IR (LLVM),
low-level machine code.
Code Generation
Transforms the optimized IR into:
target machine code,
assembly, or
bytecode (for JVM, CLR, BEAM, etc.).
This phase handles:
register allocation,
calling conventions,
instruction selection,
stack frame setup,
function prologues/epilogues.
Example (x86-64 assembly):
mov eax, 3
add eax, 5
mov [x], eax
Linking
Final phase: combining all compiled units and libraries into an executable.
The linker resolves:
symbol references,
external functions (e.g., printf),
imported modules,
static libraries,
shared libraries.
For languages like C/C++, linking is separate.
For languages like Go, Rust, and Haskell, the compiler internally orchestrates linking.
The Big Picture Diagram
Source Code
|
v
[Lexical Analysis]
|
v
[Syntax Parsing]
|
v
[Semantic Analysis]
|
v
[Intermediate Representation]
|
v
[Optimization]
|
v
[Code Generation]
|
v
[Linking]
|
v
Executable
Introduction to the Lexical Analyzer
What Is a Lexical Analyzer?
A lexical analyzer, also called a lexer or scanner, is the first phase of a compiler.
Its job is to transform raw source code (a stream of characters) into a stream of tokens.
A token is a structured, meaningful element such as:
Keywords: if, let, while
Identifiers: x, totalAmount
Literals: 42, "hello", true
Operators: +, -, ==, =
Punctuation: ;, (, ), {, }
The lexer removes:
whitespace (spaces, tabs, newlines),
comments,
formatting irrelevant to syntax.
The output of a lexer is a clean token stream ready for parsing.
Why Is Lexical Analysis Needed?
Languages are easier to reason about when reduced to meaningful units.
A parser should process tokens, not raw characters.
Lexers simplify parsing by:
grouping multi-character operators (==, >=)
handling keywords vs identifiers
normalizing whitespace
detecting early lexical errors (invalid characters)
Without a lexer, parsers become extremely complicated.
The lexer is also responsible for catching invalid characters:
abc @ def
    ^
error: unexpected character '@'
If an invalid lexeme is encountered, compilation stops at the lexical phase.
There are several common ways to build a lexer:
Regex-based lexers
DFA-based scanners
Hand-written lexers — common in simpler compilers
Longest Match Rule
Lexers usually apply the maximal match principle:
Input: >=
Lexer picks: ">=" (not just ">")
This ensures ambiguous prefixes are handled properly.
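A minimal regex-based lexer sketch; the token names and patterns are made up for this example. Listing ">=" before ">" inside the operator pattern is what gives the maximal-match behaviour described above:

import re

TOKEN_SPEC = [
    ("NUMBER",     r"\d+"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("OPERATOR",   r">=|<=|==|[+\-*/=<>]"),  # two-character operators listed first
    ("PUNCT",      r"[();{}]"),
    ("SKIP",       r"\s+"),                  # whitespace is dropped, like comments would be
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(src):
    tokens, pos = [], 0
    while pos < len(src):
        m = MASTER.match(src, pos)
        if not m:
            raise SyntaxError(f"unexpected character {src[pos]!r} at position {pos}")
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

print(tokenize("x >= 3 + 5;"))
# [('IDENTIFIER', 'x'), ('OPERATOR', '>='), ('NUMBER', '3'),
#  ('OPERATOR', '+'), ('NUMBER', '5'), ('PUNCT', ';')]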
Introduction to Deterministic Finite Automata (DFA)
What Is a DFA? (In Simple Words; for Strict Definitions, Please Consult a Textbook)
A Deterministic Finite Automaton (DFA) is a simple mathematical machine used to recognize and process patterns in strings.
A DFA reads an input string character by character and uses states to decide what to do next.
If the machine ends in an accept state, then the string matches the pattern.
Formal Definition of a DFA
A DFA consists of 5 components:
DFA = (Q, Σ, δ, q₀, F)
Q = set of states
Σ = input alphabet (characters allowed)
δ = transition function: Q × Σ → Q
q₀ = start state
F = set of accepting states
Example 1: DFA for Recognizing Strings That End With "ab"
Alphabet: {a, b}
Language: All strings that end with "ab"
Accepted examples: "ab", "aab", "bbab", "aaab"
Rejected examples: "a", "ba", "abb", "bbaa"
Example 2: DFA for Recognizing Strings with an Even Number of 0s
This DFA checks whether the number of 0 characters in the input is even.
Alphabet: {0, 1}
Accept: strings with an even number of 0s
State                | On 0 | On 1
---------------------|------|-----
Even (start, accept) | Odd  | Even
Odd                  | Even | Odd
State meaning:
Even: seen even number of 0s (accept state)
Odd: seen odd number of 0s
Example: 10100
Even --1--> Even --0--> Odd --1--> Odd --0--> Even --0--> Odd (reject)
Transition Table Representation
A DFA can also be represented using a table.
For the "ends with ab" automaton:
State | a  | b
------|----|----
q0    | q1 | q0
q1    | q1 | q2
q2    | q1 | q0
The table precisely defines the behaviour of the machine.
DFA Simulation (Pseudocode)
Every DFA can be simulated by repeatedly applying the transition function.
def run_dfa(trans, start, accepts, s):
state = start
for c in s:
state = trans[state][c]
return state in accepts
transition = {
"q0": {"a": "q1", "b": "q0"},
"q1": {"a": "q1", "b": "q2"},
"q2": {"a": "q1", "b": "q0"},
}
print(run_dfa(transition, "q0", {"q2"}, "aab"))  # True
DFA vs NFA (Briefly)
A DFA has:
exactly one transition per input symbol,
exactly one active state at any time.
An NFA (nondeterministic finite automaton) may have:
multiple possible transitions,
ε-transitions (move without input),
multiple active states.
But both recognize exactly the same class of languages (regular languages).
DFA: Strings Ending with "ab"
State       | On a | On b
------------|------|-----
q0 (start)  | q1   | q0
q1          | q1   | q2
q2 (accept) | q1   | q0
NFA: Strings Ending with "ab"
State       | On a     | On b
------------|----------|------
q0 (start)  | {q0, q1} | {q0}
q1          | {q1}     | {q2}
q2 (accept) | {q1}     | {q0}
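To make the "multiple active states" idea concrete, here's a tiny simulation sketch of the NFA table above; it tracks a set of states instead of a single one:

# Transition table copied from the NFA above.
NFA = {
    "q0": {"a": {"q0", "q1"}, "b": {"q0"}},
    "q1": {"a": {"q1"},       "b": {"q2"}},
    "q2": {"a": {"q1"},       "b": {"q0"}},
}

def run_nfa(trans, start, accepts, s):
    states = {start}
    for c in s:
        # Follow every possible transition from every active state.
        states = set().union(*(trans[q].get(c, set()) for q in states))
    return bool(states & accepts)

print(run_nfa(NFA, "q0", {"q2"}, "bbab"))  # True  (ends with "ab")
print(run_nfa(NFA, "q0", {"q2"}, "abb"))   # False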
Left Factoring
What Is Left Factoring?
Left factoring is a grammar transformation technique used in compiler design.
It's essential for building predictive parsers (like LL(1) parsers) that need to decide which production to use by looking at only one token ahead.
When two or more productions for the same non-terminal start with the same symbol(s), the parser cannot decide which rule to apply.
Left factoring extracts the common prefix into a single production, deferring the decision until after the prefix is consumed.
Why Do We Need It?
Consider a predictive parser looking at input and trying to choose a production:
Grammar:
A -> α β₁
A -> α β₂
Problem: when the parser sees the start of α, it cannot decide between the two rules!
The parser would need to "look ahead" beyond α to make a decision — but LL(1) parsers only look at one token.
Left factoring solves this by restructuring the grammar:
After left factoring:
A -> α A'
A' -> β₁
A' -> β₂
Now: Parser consumes α first, THEN decides between β₁ and β₂.
The General Algorithm
For each non-terminal A with productions that share a common prefix α:
Before:
A -> α β₁ | α β₂ | ... | α βₙ | γ
After:
A -> α A' | γ
A' -> β₁ | β₂ | ... | βₙ
Where γ represents any alternatives that don't share the prefix α.
If βᵢ is empty for some i, then A' -> ε (epsilon) is one of the alternatives.
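Here's a rough one-pass sketch of the transformation in Python. It only groups alternatives by their first symbol, so longer common prefixes need repeated passes (as in Example 4 below), and the helper-naming scheme (adding apostrophes) is simplistic:

from collections import defaultdict

# productions: the alternatives of one non-terminal, each a list of symbols.
def left_factor(nonterminal, productions):
    groups = defaultdict(list)
    for prod in productions:
        groups[prod[0] if prod else "ε"].append(prod)

    new_rules = {nonterminal: []}
    helpers = 0
    for first_sym, prods in groups.items():
        if len(prods) == 1:
            new_rules[nonterminal].append(prods[0])        # no shared prefix: keep as-is
            continue
        helpers += 1
        fresh = nonterminal + "'" * helpers                # e.g. expr'
        new_rules[nonterminal].append([first_sym, fresh])
        new_rules[fresh] = [p[1:] if len(p) > 1 else ["ε"] for p in prods]
    return new_rules

# expr -> term + expr | term   becomes   expr -> term expr'  and  expr' -> + expr | ε
print(left_factor("expr", [["term", "+", "expr"], ["term"]]))
# {'expr': [['term', "expr'"]], "expr'": [['+', 'expr'], ['ε']]}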
Example 1: Simple Arithmetic Expression
A classic example from expression parsing:
Before (problematic):
expr -> term + expr
expr -> term
Problem: Both start with "term" — parser can't decide!
After left factoring:
After (left-factored):
expr -> term expr'
expr' -> + expr
expr' -> ε
Example 2: The Dangling Else Problem
This is a famous ambiguity in programming language grammars:
Before (problematic):
stmt -> if ( expr ) stmt else stmt
stmt -> if ( expr ) stmt
stmt -> while ( expr ) stmt
stmt -> id = expr ;
Problem: Both "if" rules share a long common prefix!
After left factoring:
After (left-factored):
stmt -> if ( expr ) stmt stmt'
stmt -> while ( expr ) stmt
stmt -> id = expr ;
stmt' -> else stmt
stmt' -> ε
The common prefix if ( expr ) stmt is factored out.
The decision between "else" and "nothing" is deferred to stmt'.
Input After Parsing if ( expr ) stmt | Action
-------------------------------------|-----------------------
Next token is else                   | Use stmt' -> else stmt
Next token is something else         | Use stmt' -> ε
Example 3: Multiple Groups of Common Prefixes
Sometimes you have multiple different prefixes that each need factoring:
Before (problematic):
decl -> int id ;
decl -> int id [ num ] ;
decl -> float id ;
decl -> float id [ num ] ;
Here we have TWO groups of common prefixes: int id and float id.
First pass — factor out within each group:
After (left-factored):
decl -> int id decl'
decl -> float id decl'
decl' -> ;
decl' -> [ num ] ;
We can go even further by factoring out the type:
Even more factored (optional):
decl -> type id decl'
type -> int
type -> float
decl' -> ;
decl' -> [ num ] ;
Example 4: Three or More Alternatives
When more than two alternatives share a prefix, fully factoring the grammar may require multiple passes. Here's a more complex example:
Before (problematic):
S -> a b c d
S -> a b c e
S -> a b f
S -> a g
S -> h
This requires THREE passes to fully factor. Here's the progression:
Pass | Common Prefix Found | Result
-----|---------------------|--------------------------------
1    | a                   | Extract S' for what follows a
2    | b (in S')           | Extract S'' for what follows b
3    | c (in S'')          | Extract S''' for what follows c
After (pass 1):
S -> a S'
S -> h
S' -> b c d
S' -> b c e
S' -> b f
S' -> g
After (pass 2):
S -> a S'
S -> h
S' -> b S''
S' -> g
S'' -> c d
S'' -> c e
S'' -> f
After (pass 3 — final):
S -> a S'
S -> h
S' -> b S''
S' -> g
S'' -> c S'''
S'' -> f
S''' -> d
S''' -> e
Example 5: A Realistic Function Definition Grammar
Real programming languages have complex grammars. Here's a function definition:
Before (problematic):
func -> id ( ) { stmts }
func -> id ( ) : type { stmts }
func -> id ( params ) { stmts }
func -> id ( params ) : type { stmts }
All four start with id ( — a parser seeing foo( has no idea which rule to use!
After careful factoring:
After (left-factored):
func -> id ( func'
func' -> ) func''
func' -> params ) func''
func'' -> { stmts }
func'' -> : type { stmts }
Decision points:
After Seeing | Look At    | Decision
-------------|------------|------------------------------------------
id (         | Next token | ) → no params; otherwise → has params
id ( ... )   | Next token | { → no return type; : → has return type
When Left Factoring Is Not Enough
Left factoring only helps with common prefixes.
It does NOT solve:
Left recursion: A -> A α | β (needs different transformation)
Ambiguity: when the grammar genuinely allows multiple parse trees
Non-LL(1) grammars: some grammars simply cannot be made LL(1)
Problem              | Solution
---------------------|------------------------------------------
Common prefixes      | Left factoring ✓
Left recursion       | Left recursion elimination
Ambiguity            | Rewrite grammar or use precedence rules
Inherently non-LL(1) | Use more powerful parser (LR, GLR, etc.)
Summary
Left factoring transforms: A -> αβ₁ | αβ₂ into A -> αA' and A' -> β₁ | β₂
It enables predictive (LL) parsing by deferring decisions until after common prefixes.
May require multiple passes for deeply nested common prefixes.
Often used together with left recursion elimination when building LL(1) parsers.
Predictive Parsing
What Is Predictive Parsing?
Predictive parsing is what you get when you've successfully transformed your grammar
(using left factoring, left recursion elimination, etc.) so that one token of lookahead
is always sufficient to make the right choice—no guessing, no backtracking, no regrets.
A predictive parser is a top-down parser that never backtracks.
At each step, it looks at the current non-terminal and the next input token, then deterministically chooses which production to use.
This is only possible when the grammar has been carefully prepared (no left recursion, no common prefixes).
The Promise of Predictive Parsing:
Current state: I'm trying to expand non-terminal A
I see token t in the input
↓
I know EXACTLY which production to use (or that there's an error)
↓
No guessing. No trying one rule and backtracking if it fails.
The Key Insight: FIRST Sets
For predictive parsing to work, we need to know: "What tokens can a production start with?"
This is captured by the FIRST set.
Definition:
FIRST(α) = the set of all terminals (tokens) that can appear
as the first symbol of any string derived from α
If α can derive ε (empty string), then ε ∈ FIRST(α) as well.
Simple examples:
FIRST(if ( expr ) stmt) = { if } — always starts with "if"
FIRST(while ( expr ) stmt) = { while } — always starts with "while"
FIRST(id = expr ;) = { id } — always starts with an identifier
FIRST(+ term) = { + } — always starts with "+"
FIRST(ε) = { ε } — the empty string "starts with" nothing
The Golden Rule for Predictive Parsing
If A -> α and A -> β are two different productions for the same non-terminal:
FIRST(α) ∩ FIRST(β) = ∅ (must be empty!)
If they overlap, seeing a token in the overlap wouldn't tell you which production to use.
When this condition holds for all non-terminals, the grammar is suitable for predictive parsing.
Example: Good Grammar (Predictive Parsing Works)
Grammar:
stmt -> if ( expr ) stmt
stmt -> while ( expr ) stmt
stmt -> id = expr ;
FIRST sets:
FIRST(if ( expr ) stmt) = { if }
FIRST(while ( expr ) stmt) = { while }
FIRST(id = expr ;) = { id }
Checking for overlap:
Pair               | Intersection | Result
-------------------|--------------|-------
{ if } ∩ { while } | ∅            | ✓
{ if } ∩ { id }    | ∅            | ✓
{ while } ∩ { id } | ∅            | ✓
No overlap! The parser can always decide:
See "if" → use stmt -> if ( expr ) stmt
See "while" → use stmt -> while ( expr ) stmt
See "id" → use stmt -> id = expr ;
See anything else → syntax error!
Example: Bad Grammar (Predictive Parsing Fails)
Grammar:
expr -> term + expr
expr -> term
Assume term can start with digits (0-9):
FIRST(term + expr) = { 0, 1, 2, ..., 9 }
FIRST(term) = { 0, 1, 2, ..., 9 }
If the parser sees 5, it has no idea whether to use expr -> term + expr or expr -> term.
This is exactly why we need left factoring!
Computing FIRST Sets: The Algorithm
For a terminal a: FIRST(a) = { a }
For epsilon: FIRST(ε) = { ε }
For a non-terminal A, look at all its productions and take the union.
For a sequence X₁ X₂ ... Xₙ:
FIRST(X₁ X₂ ... Xₙ):
1. Start with FIRST(X₁) (excluding ε)
2. If ε ∈ FIRST(X₁), also add FIRST(X₂) (excluding ε)
3. If ε ∈ FIRST(X₁) and ε ∈ FIRST(X₂), also add FIRST(X₃) ...
4. Continue until you hit an Xᵢ where ε ∉ FIRST(Xᵢ)
5. If ALL of X₁...Xₙ can derive ε, then add ε to the result
Example calculation:
Grammar:
E -> T E'
E' -> + T E' | ε
T -> F T'
T' -> * F T' | ε
F -> ( E ) | id
Computing FIRST sets (iteratively):
FIRST(F) = { (, id }
FIRST(T') = { *, ε }
FIRST(T) = FIRST(F) = { (, id }
FIRST(E') = { +, ε }
FIRST(E) = FIRST(T) = { (, id }
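Here's a small fixed-point sketch that computes these FIRST sets; the grammar encoding (a dict from non-terminals to lists of symbol lists) is just one possible representation:

GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], ["ε"]],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], ["ε"]],
    "F":  [["(", "E", ")"], ["id"]],
}

def compute_first(grammar):
    first = {nt: set() for nt in grammar}

    def first_of(symbol):
        # Anything that isn't a non-terminal (including ε) is its own FIRST set.
        return first[symbol] if symbol in grammar else {symbol}

    changed = True
    while changed:                      # iterate until nothing new can be added
        changed = False
        for nt, productions in grammar.items():
            for prod in productions:
                acc = set()
                for sym in prod:
                    sym_first = first_of(sym)
                    acc |= sym_first - {"ε"}
                    if "ε" not in sym_first:
                        break
                else:                   # every symbol in the production can derive ε
                    acc.add("ε")
                if not acc <= first[nt]:
                    first[nt] |= acc
                    changed = True
    return first

for nt, f in compute_first(GRAMMAR).items():
    print(nt, f)
# Matches the sets above: E, T, F start with { (, id }; E' is { +, ε }; T' is { *, ε }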
FOLLOW Sets: Handling Epsilon
When a production can derive ε, we also need to know what can follow that non-terminal.
FOLLOW(A) = the set of terminals that can appear immediately after A in some derivation.
Why we need FOLLOW:
E' -> + T E'
E' -> ε
If we see something NOT in FIRST(+ T E'), should we use E' -> ε?
Only if that something is in FOLLOW(E')!
FOLLOW set rules:
1. $ ∈ FOLLOW(S) where S is the start symbol ($ = end of input)
2. If A -> α B β, then:
FIRST(β) - {ε} ⊆ FOLLOW(B)
3. If A -> α B, or A -> α B β where ε ∈ FIRST(β), then:
FOLLOW(A) ⊆ FOLLOW(B)
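And a matching sketch for FOLLOW. The FIRST sets from the previous example are hard-coded here so the snippet stands alone:

GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], ["ε"]],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], ["ε"]],
    "F":  [["(", "E", ")"], ["id"]],
}
FIRST = {"E": {"(", "id"}, "E'": {"+", "ε"}, "T": {"(", "id"},
         "T'": {"*", "ε"}, "F": {"(", "id"}}

def first_of_seq(symbols):
    # FIRST of a whole sequence of symbols (same sequencing rule as above).
    result = set()
    for sym in symbols:
        sym_first = FIRST.get(sym, {sym})
        result |= sym_first - {"ε"}
        if "ε" not in sym_first:
            return result
    result.add("ε")                     # empty (or all-nullable) sequence derives ε
    return result

def compute_follow(grammar, start):
    follow = {nt: set() for nt in grammar}
    follow[start].add("$")                               # rule 1
    changed = True
    while changed:
        changed = False
        for a, productions in grammar.items():
            for prod in productions:
                for i, b in enumerate(prod):
                    if b not in grammar:
                        continue                          # FOLLOW only applies to non-terminals
                    beta_first = first_of_seq(prod[i + 1:])
                    new = beta_first - {"ε"}              # rule 2
                    if "ε" in beta_first:
                        new |= follow[a]                  # rule 3
                    if not new <= follow[b]:
                        follow[b] |= new
                        changed = True
    return follow

print(compute_follow(GRAMMAR, "E"))
# FOLLOW(E) = FOLLOW(E') = { ), $ }, FOLLOW(T) = FOLLOW(T') = { +, ), $ },
# FOLLOW(F) = { +, *, ), $ }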
The Parsing Table
Once we have FIRST and FOLLOW sets, we can build a parsing table.
The table tells us: "Given non-terminal A and lookahead token t, which production to use?"
Table construction:
For each production A -> α:
1. For each terminal a ∈ FIRST(α), add A -> α to Table[A, a]
2. If ε ∈ FIRST(α), for each terminal b ∈ FOLLOW(A), add A -> α to Table[A, b]
Example parsing table for arithmetic expressions:
     | id        | +            | *            | (          | )       | $
-----|-----------|--------------|--------------|------------|---------|--------
E    | E -> T E' |              |              | E -> T E'  |         |
E'   |           | E' -> + T E' |              |            | E' -> ε | E' -> ε
T    | T -> F T' |              |              | T -> F T'  |         |
T'   |           | T' -> ε      | T' -> * F T' |            | T' -> ε | T' -> ε
F    | F -> id   |              |              | F -> ( E ) |         |
Empty cells = syntax error. Multiple entries in one cell = grammar is not LL(1).
The Predictive Parsing Algorithm
Uses a stack to track what we're expecting to see.
Matches terminals, expands non-terminals using the table.
def predictive_parse(input_tokens, table, start_symbol):
    # table maps (non-terminal, lookahead token) -> production right-hand side (a list)
    non_terminals = {nt for (nt, _) in table}
    stack = ['$', start_symbol]  # $ marks bottom
    tokens = input_tokens + ['$']
    pos = 0

    while stack:
        top = stack.pop()
        current = tokens[pos]

        if top == '$':
            if current == '$':
                return True   # Success!
            else:
                return False  # Error: extra input

        elif top not in non_terminals:  # top is a terminal
            if top == current:
                pos += 1      # Match! Consume token
            else:
                return False  # Error: mismatch

        else:  # top is a non-terminal
            if (top, current) in table:
                rhs = table[(top, current)]
                # Push RHS in reverse order so its first symbol ends up on top
                for symbol in reversed(rhs):
                    if symbol != 'ε':
                        stack.append(symbol)
            else:
                return False  # Error: no rule for (top, current)

    return False  # Unreachable: '$' at the bottom always triggers a return
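As a quick sanity check, here's a hypothetical driver that encodes the LL(1) table from the previous section as a dict from (non-terminal, lookahead) pairs to right-hand sides (ε-productions become empty lists) and feeds it to predictive_parse:

table = {
    ("E", "id"): ["T", "E'"],   ("E", "("): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"],
    ("E'", ")"): [],            ("E'", "$"): [],             # E' -> ε
    ("T", "id"): ["F", "T'"],   ("T", "("): ["F", "T'"],
    ("T'", "+"): [],            ("T'", "*"): ["*", "F", "T'"],
    ("T'", ")"): [],            ("T'", "$"): [],             # T' -> ε
    ("F", "id"): ["id"],        ("F", "("): ["(", "E", ")"],
}

print(predictive_parse(["id", "+", "id", "*", "id"], table, "E"))  # True
print(predictive_parse(["id", "+", "*"], table, "E"))              # False (syntax error)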
Trace: Parsing id + id * id
Let's trace through parsing the expression id + id * id: