Introduction to Compiler Design

  1. NOTE:
    I'm not a professional computer scientist, so everything here is based on my own limited knowledge.
    It may therefore contain inefficient or even incorrect reasoning, and I take no responsibility for any errors it causes.
    That's it; thanks for your understanding.

  2. What Is a Compiler?


  3. The Big Picture: Stages of a Compiler

    1. Lexical Analysis (tokenization)

    2. Syntax Analysis (parsing)

    3. Semantic Analysis

    4. Intermediate Representation (IR)

    5. Optimization

    6. Code Generation


  4. Compiler vs Interpreter


  5. Compiler                        Interpreter
    Produces binary or bytecode      Runs line-by-line
    Faster execution                 More flexible debugging
    Example: C, Rust                 Example: Python, Ruby


  6. The AST (Abstract Syntax Tree)

  7. Add
     ├── Int(1)
     └── Mul
          ├── Int(2)
          └── Int(3)
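For illustration, the tree above can be modeled with a few Python classes and a recursive evaluator; the names Int, Add, and Mul are just placeholders matching the diagram, not part of any real compiler API:

```python
from dataclasses import dataclass

# Minimal AST node types matching the tree above (illustrative only)
@dataclass
class Int:
    value: int

@dataclass
class Add:
    left: object
    right: object

@dataclass
class Mul:
    left: object
    right: object

def evaluate(node):
    """Walk the tree bottom-up and compute its value."""
    if isinstance(node, Int):
        return node.value
    if isinstance(node, Add):
        return evaluate(node.left) + evaluate(node.right)
    if isinstance(node, Mul):
        return evaluate(node.left) * evaluate(node.right)
    raise TypeError(f"unknown node: {node!r}")

tree = Add(Int(1), Mul(Int(2), Int(3)))
print(evaluate(tree))  # 7
```

An interpreter walks this tree directly; a compiler instead lowers it to IR, as the next sections show.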
    


  8. Intermediate Representation (IR)

  9. t1 = 2 * 3
    t2 = 1 + t1
    
    %t1 = mul i32 2, 3
    %t2 = add i32 1, %t1
    


  10. Optimizations



  11. Back Ends and Code Generation

  12. mov eax, 2
    imul eax, 3
    add eax, 1
    


  13. Front-End vs Back-End

  14. Front-End            Back-End
    Lexing                 IR to machine code
    Parsing                Register allocation
    Semantic analysis      Optimization passes
    AST generation         Instruction scheduling



Different Phases of a Compiler

  1. Prior Knowledge



  2. Lexical Analysis (Scanning)

  3. Source:     x = 3 + 5;
    Tokens:     IDENTIFIER("x"), EQUALS, NUMBER(3), PLUS, NUMBER(5), SEMICOLON
    
    Or, in more detail:
    Tokens:     IDENTIFIER("x"), OPERATOR("="), NUMBER(3), OPERATOR("+"), NUMBER(5), PUNCTUATION(";")
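As a sketch, a minimal regex-based tokenizer can produce the token stream above; the token names follow the text, while the regular expressions themselves are assumptions for illustration:

```python
import re

# A minimal regex-based tokenizer for the example "x = 3 + 5;"
# (illustrative sketch; token names follow the text above)
TOKEN_SPEC = [
    ("NUMBER",      r"\d+"),
    ("IDENTIFIER",  r"[A-Za-z_]\w*"),
    ("OPERATOR",    r"[=+]"),
    ("PUNCTUATION", r";"),
    ("SKIP",        r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(source):
    """Scan source left to right, emitting (type, lexeme) pairs."""
    tokens = []
    for m in TOKEN_RE.finditer(source):
        if m.lastgroup != "SKIP":       # drop whitespace
            tokens.append((m.lastgroup, m.group()))
    return tokens

print(tokenize("x = 3 + 5;"))
# [('IDENTIFIER', 'x'), ('OPERATOR', '='), ('NUMBER', '3'),
#  ('OPERATOR', '+'), ('NUMBER', '5'), ('PUNCTUATION', ';')]
```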
    


  4. Syntax Analysis (Parsing)

  5.       Assignment
         /          \
       Variable     Add
        |          /   \
       "x"        3     5
    



  6. Semantic Analysis

  7. x = "hello" + 5  -- type error: cannot add string and integer
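A semantic analyzer can catch this with a small type-checking helper. The sketch below is hypothetical and only covers '+' on two operand types:

```python
# Hypothetical type-checking helper for '+' (names are illustrative only)
def check_add(left_type, right_type):
    """Return the result type of left + right, or raise a type error."""
    if left_type == "int" and right_type == "int":
        return "int"
    if left_type == "string" and right_type == "string":
        return "string"   # many languages allow string concatenation
    raise TypeError(f"cannot add {left_type} and {right_type}")

print(check_add("int", "int"))       # int
# check_add("string", "int") raises: cannot add string and int
```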
    



  8. Intermediate Representation (IR) Generation

  9. %1 = add i32 3, 5
    store i32 %1, i32* @x
    



  10. Optimization



  11. Code Generation

  12. mov eax, 3
    add eax, 5
    mov [x], eax
    


  13. Linking



  14. The Big Picture Diagram
  15.     Source Code
             |
             v
      [Lexical Analysis]
             |
             v
       [Syntax Parsing]
             |
             v
     [Semantic Analysis]
             |
             v
     [Intermediate Representation]
             |
             v
         [Optimization]
             |
             v
      [Code Generation]
             |
             v
          [Linking]
             |
             v
        Executable
    



Introduction to the Lexical Analyzer

  1. What Is a Lexical Analyzer?



  2. Why Is Lexical Analysis Needed?



  3. The Lexical Unit: Tokens

  4. Token {
        type: IDENTIFIER,
        lexeme: "total",
        literal: null,
        line: 3
    }
    



  5. How the Lexer Works

  6. Example:
    Input:   x = 12 + 7
    Output:  IDENT("x"), EQUALS, NUMBER(12), PLUS, NUMBER(7)
    


  7. Token Categories (Example)

  8. myVar, count1, is_valid
    if, else, while, return
    42, 3.14, "abc", true
    +, -, *, /, ==, !=, >=, <=, =
    ;, ,, (, ), {, }
    // single-line
    /* multi-line */
    


  9. Lexical Errors

  10. abc @ def
        ^
        error: unexpected character '@'
    



  11. There are several common ways to build a lexer: a hand-written state machine, regular-expression-driven tables, or a generator tool such as lex/flex.


  12. Longest Match Rule

  13. Input:           >=
    Lexer picks:    ">="   (not just ">")
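A sketch of the rule: try candidate operators from longest to shortest and take the first that matches. The operator list here is an assumption for illustration:

```python
# Maximal-munch sketch: longer operators are tried before shorter ones
OPERATORS = [">=", "<=", "==", "!=", ">", "<", "="]  # sorted longest-first

def match_operator(source, pos):
    """Return the longest operator starting at pos, or None if none matches."""
    for op in OPERATORS:
        if source.startswith(op, pos):
            return op
    return None

print(match_operator("x >= 1", 2))  # '>='  (not just '>')
```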
    




Introduction to Deterministic Finite Automata (DFA)

  1. What Is a DFA? (In Simple Words; for Strict Definitions, Please See a Professional Book)



  2. Formal Definition of a DFA

  3. DFA = (Q, Σ, δ, q₀, F)
    
    Q   = set of states
    Σ   = input alphabet (characters allowed)
    δ   = transition function: Q × Σ → Q
    q₀  = start state
    F   = set of accepting states
    


  4. Example 1: DFA for Recognizing Strings That End With "ab"

  5. Alphabet: {a, b}
    Language: All strings that end with "ab"
    Examples accept:  "ab", "aab", "bbab", "aaab"
    Examples reject:  "a", "ba", "abb", "bbaa"
    





  6. Example 2: DFA for Numbers with Even Number of 0s

  7. Alphabet: {0, 1}
    Accept: strings with even number of 0s
    

    State                  On 0    On 1
    Even (start, accept)   Odd     Even
    Odd                    Even    Odd
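The table can be simulated directly; here is a minimal Python sketch, assuming the state names from the table:

```python
# Simulate the even-number-of-0s DFA from the table above
def even_zeros(s):
    table = {"Even": {"0": "Odd",  "1": "Even"},
             "Odd":  {"0": "Even", "1": "Odd"}}
    state = "Even"              # start state, which is also the accepting one
    for c in s:
        state = table[state][c]
    return state == "Even"

print(even_zeros("1001"))  # True  (two 0s)
print(even_zeros("0"))     # False (one 0)
```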



  8. Transition Table Representation

  9.        |   a   |   b
    --------------------------------
    q0     |  q1   |  q0
    q1     |  q1   |  q2
    q2     |  q1   |  q0
    



  10. DFA Simulation (Pseudocode)

  11. def run_dfa(trans, start, accepts, s):
        state = start
        for c in s:
            state = trans[state][c]
        return state in accepts
    
    transition = {
        "q0": {"a": "q1", "b": "q0"},
        "q1": {"a": "q1", "b": "q2"},
        "q2": {"a": "q1", "b": "q0"},
    }
    
    print(run_dfa(transition, "q0", {"q2"}, "aab"))   # True
    


  12. DFA vs NFA (Briefly)




  13. DFA: Strings Ending with "ab"
    State        On a   On b
    q0 (start)   q1     q0
    q1           q1     q2
    q2 (accept)  q1     q0


    NFA: Strings Ending with "ab"
    State        On a       On b
    q0 (start)   {q0, q1}   {q0}
    q1           {q1}       {q2}
    q2 (accept)  {q1}       {q0}


Left Factoring

  1. What Is Left Factoring?



  2. Why Do We Need It?

  3. Grammar:
        A -> α β₁
        A -> α β₂
    
    Problem: When the parser sees the start of α, it cannot decide between the two rules!
    
    After left factoring:
        A  -> α A'
        A' -> β₁
        A' -> β₂
    
    Now: Parser consumes α first, THEN decides between β₁ and β₂.
    


  4. The General Algorithm

  5. Before:
        A -> α β₁ | α β₂ | ... | α βₙ | γ
    
    After:
        A  -> α A' | γ
        A' -> β₁ | β₂ | ... | βₙ
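One pass of this algorithm can be sketched in Python. This is a simplified version: it groups alternatives by their first symbol, appends a single prime to form the new non-terminal's name (so a rule with several factorable groups would need distinct names in a full implementation), and assumes every alternative is non-empty:

```python
def common_prefix(seqs):
    """Longest common prefix of several symbol lists."""
    prefix = []
    for symbols in zip(*seqs):
        if len(set(symbols)) != 1:
            break
        prefix.append(symbols[0])
    return prefix

def left_factor_once(name, alternatives):
    """One left-factoring pass: A -> αβ1 | αβ2 | γ  becomes
    A -> α A' | γ  and  A' -> β1 | β2.  Returns the new rule set."""
    groups = {}
    for alt in alternatives:                 # group by first symbol
        groups.setdefault(alt[0], []).append(alt)
    rules = {name: []}
    for alts in groups.values():
        if len(alts) == 1:                   # no common prefix: keep as-is
            rules[name].append(alts[0])
            continue
        prefix = common_prefix(alts)
        new_nt = name + "'"
        rules[name].append(prefix + [new_nt])
        # What remains after the prefix; an empty remainder becomes ε
        rules[new_nt] = [alt[len(prefix):] or ["ε"] for alt in alts]
    return rules

# Example 1 from the text:  expr -> term + expr | term
print(left_factor_once("expr", [["term", "+", "expr"], ["term"]]))
```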
    


  6. Example 1: Simple Arithmetic Expression

  7. Before (problematic):
        expr -> term + expr
        expr -> term
    
    Problem: Both start with "term" — the parser can't decide!
    

    After (left-factored):
        expr  -> term expr'
        expr' -> + expr
        expr' -> ε
    


    (Diagram: before left factoring, expr's two alternatives share the common
    prefix "term"; after, expr -> term expr' defers the choice between "+ expr"
    and ε to expr'.)

  8. Example 2: The Dangling Else Problem

  9. Before (problematic):
        stmt -> if ( expr ) stmt else stmt
        stmt -> if ( expr ) stmt
        stmt -> while ( expr ) stmt
        stmt -> id = expr ;
    
    Problem: Both "if" rules share a long common prefix!
    

    After (left-factored):
        stmt  -> if ( expr ) stmt stmt'
        stmt  -> while ( expr ) stmt
        stmt  -> id = expr ;
        stmt' -> else stmt
        stmt' -> ε
    


    Input After Parsing "if ( expr ) stmt"   Action
    Next token is else                       Use stmt' -> else stmt
    Next token is something else             Use stmt' -> ε


  10. Example 3: Multiple Groups of Common Prefixes

  11. Before (problematic):
        decl -> int id ;
        decl -> int id [ num ] ;
        decl -> float id ;
        decl -> float id [ num ] ;
    

    After (left-factored):
        decl  -> int id decl'
        decl  -> float id decl'
        decl' -> ;
        decl' -> [ num ] ;
    

    Even more factored (optional):
        decl  -> type id decl'
        type  -> int
        type  -> float
        decl' -> ;
        decl' -> [ num ] ;
    


  12. Example 4: Three or More Alternatives

  13. Before (problematic):
        print_stmt -> print expr ;
        print_stmt -> print expr , expr ;
        print_stmt -> print expr , expr , expr ;
    

    After (first pass):
        print_stmt  -> print expr print_stmt'
        print_stmt' -> ;
        print_stmt' -> , expr ;
        print_stmt' -> , expr , expr ;
    

    After (second pass):
        print_stmt   -> print expr print_stmt'
        print_stmt'  -> ;
        print_stmt'  -> , expr print_stmt''
        print_stmt'' -> ;
        print_stmt'' -> , expr ;
    


  14. Example 5: Nested Factoring (Iterative Process)

  15. Before (problematic):
        S -> a b c d
        S -> a b c e
        S -> a b f
        S -> a g
        S -> h
    


    Pass   Common Prefix Found   Result
    1      a                     Extract S' for what follows a
    2      b (in S')             Extract S'' for what follows b
    3      c (in S'')            Extract S''' for what follows c

    After (pass 1):
        S  -> a S'
        S  -> h
        S' -> b c d
        S' -> b c e
        S' -> b f
        S' -> g
    
    After (pass 2):
        S   -> a S'
        S   -> h
        S'  -> b S''
        S'  -> g
        S'' -> c d
        S'' -> c e
        S'' -> f
    
    After (pass 3 — final):
        S    -> a S'
        S    -> h
        S'   -> b S''
        S'   -> g
        S''  -> c S'''
        S''  -> f
        S''' -> d
        S''' -> e
    


    (Diagram: the final grammar drawn as a tree: S branches to a S' or h,
    S' to b S'' or g, S'' to c S''' or f, and S''' to d or e; the primed
    symbols are non-terminals, the rest terminals.)

  16. Example 6: A Realistic Function Definition Grammar

  17. Before (problematic):
        func -> id ( ) { stmts }
        func -> id ( ) : type { stmts }
        func -> id ( params ) { stmts }
        func -> id ( params ) : type { stmts }
    

    After (left-factored):
        func    -> id ( func'
        func'   -> ) func''
        func'   -> params ) func''
        func''  -> { stmts }
        func''  -> : type { stmts }
    


    After Seeing   Look At      Decision
    id (           next token   ) → no params; otherwise → has params
    id ( ... )     next token   { → no return type; : → has return type


  18. When Left Factoring Is Not Enough


  19. Problem                Solution
    Common prefixes          Left factoring ✓
    Left recursion           Left recursion elimination
    Ambiguity                Rewrite grammar or use precedence rules
    Inherently non-LL(1)     Use a more powerful parser (LR, GLR, etc.)


  20. Summary




Predictive Parsing

  1. What Is Predictive Parsing?

  2. The Promise of Predictive Parsing:
    
        Current state: I'm trying to expand non-terminal A
        I see token t in the input
        ↓
        I know EXACTLY which production to use (or that there's an error)
        ↓
        No guessing. No trying one rule and backtracking if it fails.
    


  3. The Key Insight: FIRST Sets

  4. Definition:
    
        FIRST(α) = the set of all terminals (tokens) that can appear
                   as the first symbol of any string derived from α
    
        If α can derive ε (empty string), then ε ∈ FIRST(α) as well.
    

    FIRST(if ( expr ) stmt)  = { if }       — always starts with "if"
    FIRST(while ( expr ) stmt) = { while }  — always starts with "while"
    FIRST(id = expr ;)       = { id }       — always starts with an identifier
    FIRST(+ term)            = { + }        — always starts with "+"
    FIRST(ε)                 = { ε }        — the empty string "starts with" nothing
    


  5. The Golden Rule for Predictive Parsing

  6. FIRST(α) ∩ FIRST(β) = ∅    (must be empty!)
    
    If they overlap, seeing a token in the overlap wouldn't tell you which production to use.
    



  7. Example: Good Grammar (Predictive Parsing Works)

  8. Grammar:
        stmt -> if ( expr ) stmt
        stmt -> while ( expr ) stmt
        stmt -> id = expr ;
    
    FIRST sets:
        FIRST(if ( expr ) stmt)     = { if }
        FIRST(while ( expr ) stmt)  = { while }
        FIRST(id = expr ;)          = { id }
    


    Pair                  Intersection   Result
    { if } ∩ { while }    ∅              ✓
    { if } ∩ { id }       ∅              ✓
    { while } ∩ { id }    ∅              ✓

    See "if"    → use stmt -> if ( expr ) stmt
    See "while" → use stmt -> while ( expr ) stmt
    See "id"    → use stmt -> id = expr ;
    See anything else → syntax error!
    


  9. Example: Bad Grammar (Predictive Parsing Fails)

  10. Grammar:
        expr -> term + expr
        expr -> term
    
    Assume term can start with digits (0-9):
        FIRST(term + expr) = { 0, 1, 2, ..., 9 }
        FIRST(term)        = { 0, 1, 2, ..., 9 }
    

    FIRST(term + expr) ∩ FIRST(term) = { 0, 1, 2, ..., 9 }
    
                                        ↑ Complete overlap!
    


    (Diagram: before left factoring, expr sees the token "5" and cannot choose
    between "term + expr" and "term"; after left factoring, expr -> term expr'
    and expr' chooses "+ expr" or ε once it sees "+" or the end of input. ✓)

  11. Computing FIRST Sets: The Algorithm

  12. FIRST(X₁ X₂ ... Xₙ):
    
        1. Start with FIRST(X₁) (excluding ε)
        2. If ε ∈ FIRST(X₁), also add FIRST(X₂) (excluding ε)
        3. If ε ∈ FIRST(X₁) and ε ∈ FIRST(X₂), also add FIRST(X₃) ...
        4. Continue until you hit an Xᵢ where ε ∉ FIRST(Xᵢ)
        5. If ALL of X₁...Xₙ can derive ε, then add ε to the result
    

    Grammar:
        E  -> T E'
        E' -> + T E' | ε
        T  -> F T'
        T' -> * F T' | ε
        F  -> ( E ) | id
    
    Computing FIRST sets (iteratively):
        FIRST(F)  = { (, id }
        FIRST(T') = { *, ε }
        FIRST(T)  = FIRST(F) = { (, id }
        FIRST(E') = { +, ε }
        FIRST(E)  = FIRST(T) = { (, id }
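The iterative computation above can be sketched as a fixpoint loop in Python. The grammar is encoded as lists of symbol lists, 'ε' marks an empty production, and any symbol that is not a rule name is treated as a terminal:

```python
# Grammar from the text; keys are non-terminals, values are alternatives
GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], ["ε"]],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], ["ε"]],
    "F":  [["(", "E", ")"], ["id"]],
}

def compute_first(grammar):
    """Iterate to a fixpoint, applying the FIRST rules from the text."""
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, productions in grammar.items():
            for prod in productions:
                before = len(first[nt])
                for sym in prod:
                    if sym == "ε":
                        first[nt].add("ε")
                        break
                    if sym not in grammar:           # terminal: add and stop
                        first[nt].add(sym)
                        break
                    first[nt] |= first[sym] - {"ε"}  # non-terminal
                    if "ε" not in first[sym]:
                        break
                else:
                    first[nt].add("ε")   # every symbol could derive ε
                if len(first[nt]) != before:
                    changed = True
    return first

print(compute_first(GRAMMAR))
```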
    


  13. FOLLOW Sets: Handling Epsilon

  14. Why we need FOLLOW:
    
        E' -> + T E'
        E' -> ε
    
        If we see something NOT in FIRST(+ T E'), should we use E' -> ε?
        Only if that something is in FOLLOW(E')!
    

    1. $ ∈ FOLLOW(S) where S is the start symbol ($ = end of input)
    
    2. If A -> α B β, then:
       FIRST(β) - {ε} ⊆ FOLLOW(B)
    
    3. If A -> α B, or A -> α B β where ε ∈ FIRST(β), then:
       FOLLOW(A) ⊆ FOLLOW(B)
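The three rules translate into another fixpoint loop. In this sketch the FIRST sets are hardcoded from the earlier section rather than recomputed:

```python
GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], ["ε"]],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], ["ε"]],
    "F":  [["(", "E", ")"], ["id"]],
}
# FIRST sets as computed in the earlier section
FIRST = {"E": {"(", "id"}, "E'": {"+", "ε"}, "T": {"(", "id"},
         "T'": {"*", "ε"}, "F": {"(", "id"}}

def first_of_seq(seq):
    """FIRST of a symbol sequence (terminals stand for themselves)."""
    result = set()
    for sym in seq:
        f = FIRST.get(sym, {sym})
        result |= f - {"ε"}
        if "ε" not in f:
            return result
    result.add("ε")          # the whole sequence can vanish
    return result

def compute_follow(grammar, start):
    follow = {nt: set() for nt in grammar}
    follow[start].add("$")                           # rule 1
    changed = True
    while changed:
        changed = False
        for a, productions in grammar.items():
            for prod in productions:
                for i, b in enumerate(prod):
                    if b not in grammar:             # only non-terminals
                        continue
                    before = len(follow[b])
                    tail = first_of_seq(prod[i + 1:])
                    follow[b] |= tail - {"ε"}        # rule 2
                    if "ε" in tail:                  # rule 3: β can vanish
                        follow[b] |= follow[a]
                    if len(follow[b]) != before:
                        changed = True
    return follow

print(compute_follow(GRAMMAR, "E"))
```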
    


  15. The Parsing Table

  16. Table construction:
    
        For each production A -> α:
            1. For each terminal a ∈ FIRST(α), add A -> α to Table[A, a]
            2. If ε ∈ FIRST(α), for each terminal b ∈ FOLLOW(A), add A -> α to Table[A, b]
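A sketch of this construction in Python, with the FIRST and FOLLOW sets from the earlier sections hardcoded for brevity:

```python
GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], ["ε"]],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], ["ε"]],
    "F":  [["(", "E", ")"], ["id"]],
}
FIRST = {"E": {"(", "id"}, "E'": {"+", "ε"}, "T": {"(", "id"},
         "T'": {"*", "ε"}, "F": {"(", "id"}}
FOLLOW = {"E": {")", "$"}, "E'": {")", "$"}, "T": {"+", ")", "$"},
          "T'": {"+", ")", "$"}, "F": {"*", "+", ")", "$"}}

def first_of_seq(seq):
    """FIRST of a symbol sequence (terminals stand for themselves)."""
    result = set()
    for sym in seq:
        f = FIRST.get(sym, {sym})
        result |= f - {"ε"}
        if "ε" not in f:
            return result
    result.add("ε")
    return result

def build_table(grammar):
    table = {}
    for a, productions in grammar.items():
        for prod in productions:
            f = first_of_seq(prod)
            for t in f - {"ε"}:
                table[(a, t)] = prod          # rule 1: t ∈ FIRST(α)
            if "ε" in f:
                for t in FOLLOW[a]:
                    table[(a, t)] = prod      # rule 2: ε case uses FOLLOW(A)
    return table

table = build_table(GRAMMAR)
print(table[("E'", "+")])   # ['+', 'T', "E'"]
```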
    


          id           +              *              (            )          $
    E     E -> T E'                                  E -> T E'
    E'                 E' -> + T E'                               E' -> ε    E' -> ε
    T     T -> F T'                                  T -> F T'
    T'                 T' -> ε       T' -> * F T'                 T' -> ε    T' -> ε
    F     F -> id                                    F -> ( E )



  17. The Predictive Parsing Algorithm

  18. def predictive_parse(input_tokens, table, start_symbol):
        # The table's keys name the non-terminals; everything else is terminal
        nonterminals = {nt for (nt, _) in table}
        stack = ['$', start_symbol]    # '$' marks the bottom of the stack
        tokens = input_tokens + ['$']  # '$' marks the end of the input
        pos = 0

        while stack:
            top = stack.pop()
            current = tokens[pos]

            if top == '$':
                return current == '$'  # success only if all input is consumed

            elif top not in nonterminals:  # top is a terminal
                if top == current:
                    pos += 1  # match! consume the token
                else:
                    return False  # error: terminal mismatch

            else:  # top is a non-terminal
                if (top, current) in table:
                    rhs = table[(top, current)]
                    # Push the RHS in reverse so its leftmost symbol is on top
                    for symbol in reversed(rhs):
                        if symbol != 'ε':
                            stack.append(symbol)
                else:
                    return False  # error: no table entry means syntax error

        return False  # stack emptied without matching '$'
    
    


  19. Trace: Parsing id + id * id


  20. Stack             Input             Action
    $ E                 id + id * id $    E -> T E' (Table[E, id])
    $ E' T              id + id * id $    T -> F T' (Table[T, id])
    $ E' T' F           id + id * id $    F -> id (Table[F, id])
    $ E' T' id          id + id * id $    Match id
    $ E' T'             + id * id $       T' -> ε (Table[T', +])
    $ E'                + id * id $       E' -> + T E' (Table[E', +])
    $ E' T +            + id * id $       Match +
    $ E' T              id * id $         T -> F T'
    $ E' T' F           id * id $         F -> id
    $ E' T' id          id * id $         Match id
    $ E' T'             * id $            T' -> * F T'
    $ E' T' F *         * id $            Match *
    $ E' T' F           id $              F -> id
    $ E' T' id          id $              Match id
    $ E' T'             $                 T' -> ε
    $ E'                $                 E' -> ε
    $                   $                 Accept! ✓
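The trace can be reproduced end to end with a self-contained sketch that bundles the parsing table and a compact parser loop; the symbol names follow the text:

```python
# LL(1) table for the expression grammar: (non-terminal, lookahead) -> RHS
TABLE = {
    ("E", "id"): ["T", "E'"],      ("E", "("): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"],
    ("E'", ")"): ["ε"],            ("E'", "$"): ["ε"],
    ("T", "id"): ["F", "T'"],      ("T", "("): ["F", "T'"],
    ("T'", "*"): ["*", "F", "T'"],
    ("T'", "+"): ["ε"], ("T'", ")"): ["ε"], ("T'", "$"): ["ε"],
    ("F", "id"): ["id"],           ("F", "("): ["(", "E", ")"],
}
NONTERMINALS = {"E", "E'", "T", "T'", "F"}

def parse(tokens):
    """Table-driven LL(1) parse; returns True on accept, False on error."""
    stack = ["$", "E"]
    tokens = tokens + ["$"]
    pos = 0
    while stack:
        top = stack.pop()
        cur = tokens[pos]
        if top == "$":
            return cur == "$"              # accept only at end of input
        if top not in NONTERMINALS:        # terminal: must match exactly
            if top != cur:
                return False
            pos += 1
        else:                              # non-terminal: consult the table
            rhs = TABLE.get((top, cur))
            if rhs is None:
                return False               # no entry: syntax error
            for sym in reversed(rhs):      # push RHS in reverse order
                if sym != "ε":
                    stack.append(sym)
    return False

print(parse(["id", "+", "id", "*", "id"]))  # True
print(parse(["id", "+", "*"]))              # False
```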


  21. LL(1) — What Does It Mean?


  22. Letter   Meaning
    L          Left-to-right scan of the input
    L          Leftmost derivation (expand the leftmost non-terminal first)
    1          One token of lookahead



  23. After Parsing: Intermediate Representation


  24. Tree-Based IR (AST)                          Linear IR (Three-Address Code)
    Preserves hierarchical, nested structure       Flattens into a sequence of instructions
    Each node = operation, children = operands     Explicit temporaries and jumps
    Good for analysis and transformation           Closer to machine code


    Tree-Based (AST)             Linear IR (Three-Address Code)
          +                          t1 = b * c
         / \                         t2 = a + t1   ; result in t2
        a   *
           / \                       (sequential, explicit temps)
          b   c





A Simple Compiler

  1. About It