Writing a Bytecode VM for PEG Parsing

Dan Castrillo

why not just recursive descent

the first version of Pegasus was recursive descent. every grammar rule was a function. parse_sequence calls parse_element which calls parse_atom which calls parse_char_class and so on. it worked fine. but every alternative in a choice meant a function call, and every function call meant pushing a stack frame, saving registers, the whole dance. for a grammar with 190 rules parsing real JavaScript files, that overhead adds up fast.

a bytecode VM compiles each grammar rule to a flat array of instructions instead of a function. run those instructions in a loop. one switch statement, no function call overhead, and the key part: backtracking becomes an explicit operation on a stack you control instead of something implicit in the call stack.

the instruction set

Pegasus has 34 opcodes. they break down into a few categories:

matching: Char, String, Any, Char_Class. these compare the current input position against something and either advance or fail.

control flow: Jump, Call, Return. these move the program counter around. Call is how grammar rules invoke other rules.

backtracking: Choice, Commit, Fail. this is where it gets interesting.

captures: Capture_Begin, Capture_End. these mark regions of input that matched a rule so you can extract them later.

Opcode :: enum u8 {
    Char, String, Any, Char_Class,
    Jump, Call, Return,
    Choice, Commit, Fail,
    Capture_Begin, Capture_End,
    // ... 22 more
}

the remaining 22 cover things like lookahead, character ranges, repetition hints, and some optimized special cases. but the core logic lives in those first 12.
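the vm loop later reads fields like inst.byte and inst.target off each instruction, which implies a layout along these lines (a sketch; the real Pegasus struct may pack operands differently):

```odin
// hypothetical layout -- the actual Pegasus struct may differ.
Instruction :: struct {
    op:     Opcode, // one of the 34 opcodes
    byte:   u8,     // immediate operand for Char
    target: int,    // jump/branch target for Jump, Call, Choice, Commit
}
```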

the backtracking stack

in recursive descent, backtracking means unwinding the call stack. you return from functions, restore state, try the next alternative. the language runtime manages your backtracking state.

in the VM, backtracking is explicit. a stack holds saved positions. when a Choice instruction executes, it pushes the current input position and an alternative program counter onto the stack. if the current branch succeeds, Commit pops the entry. if it fails, Fail pops the entry and jumps to the alternative.

the entire backtracking mechanism is push, pop, jump.
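concretely, a choice like e1 / e2 compiles to the classic choice/commit pattern (labels and layout illustrative, not Pegasus's actual output):

```
    Choice L1       ; push choice point: current pos + alternative L1
    <code for e1>
    Commit L2       ; e1 matched: pop the choice point, skip over e2
L1: <code for e2>   ; reached only when e1 fails and fail() jumps here
L2: ...
```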

the vm loop

the core execution loop is a for with a switch:

// Stack_Entry does double duty: pos >= 0 marks a choice point,
// pos == -1 marks a call frame (a return address with no saved input).
Stack_Entry :: struct {
    pc:  int,
    pos: int,
}

vm_execute :: proc(program: []Instruction, input: []u8) -> (captures: []Capture, ok: bool) {
    pc := 0                      // program counter
    pos := 0                     // current input position
    stack: [dynamic]Stack_Entry  // shared backtrack + call stack
 
    for pc < len(program) {
        inst := program[pc]
        switch inst.op {
        case .Char:
            if pos < len(input) && input[pos] == inst.byte {
                pos += 1
                pc += 1
            } else {
                pc = fail(&stack, &pos)
            }
        case .Choice:
            // save a choice point: the alternative branch and the input position to rewind to
            append(&stack, Stack_Entry{pc: inst.target, pos: pos})
            pc += 1
        case .Commit:
            // branch succeeded: discard the choice point and jump past the alternative
            pop(&stack)
            pc = inst.target
        case .Fail:
            pc = fail(&stack, &pos)
        case .Any:
            if pos < len(input) {
                pos += 1
                pc += 1
            } else {
                pc = fail(&stack, &pos)
            }
        case .Jump:
            pc = inst.target
        case .Call:
            // pos == -1 marks this as a call frame, not a choice point
            append(&stack, Stack_Entry{pc: pc + 1, pos: -1})
            pc = inst.target
        case .Return:
            entry := pop(&stack)
            pc = entry.pc
        // ... remaining opcodes
        }
    }
 
    return captures[:], pc != FAIL_STATE
}

the fail procedure pops entries off the stack until it finds one with a saved position (a choice point). then it restores that position and jumps to the alternative branch. if the stack is empty, the whole parse fails.
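a minimal sketch of that procedure, assuming call frames are marked with pos == -1 and FAIL_STATE is any pc past the end of the program (which ends the loop):

```odin
// hypothetical sketch -- names and sentinel convention are assumptions,
// grounded only in how the vm loop above pushes stack entries.
fail :: proc(stack: ^[dynamic]Stack_Entry, pos: ^int) -> int {
    for len(stack^) > 0 {
        entry := pop(stack)
        if entry.pos >= 0 {    // a choice point, not a call frame
            pos^ = entry.pos   // rewind the input
            return entry.pc    // jump to the alternative branch
        }
    }
    return FAIL_STATE          // no alternatives left: the whole parse fails
}
```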

what this buys you

benchmarks on the same ES2025 grammar parsing a 4KB JavaScript file. recursive descent: ~180us. bytecode VM: ~95us. roughly 2x faster, which tracks with what the academic papers on parsing machines predict.

the speed isn't the main win. the bytecode is data. you can serialize it, cache it, inspect it, optimize it. i can dump the bytecode for a grammar rule and see exactly what the parser will do. every branch, every backtrack, every capture is a visible instruction.
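a dumper for that representation is only a few lines. a sketch, assuming core:fmt is imported and the Instruction fields used by the vm loop:

```odin
// hypothetical disassembler -- field names follow the vm loop above.
dump :: proc(program: []Instruction) {
    for inst, pc in program {
        fmt.printf("%4d  %v", pc, inst.op)
        #partial switch inst.op {
        case .Char:
            fmt.printf("  '%c'", rune(inst.byte))
        case .Jump, .Call, .Choice, .Commit:
            fmt.printf("  -> %d", inst.target)
        }
        fmt.println()
    }
}
```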

debugging got easier too. when a parse fails i can look at the program counter and the stack and know exactly where the parser was, what alternatives it tried, and where it gave up. with recursive descent you get a stack trace that mixes parser logic with language runtime noise.
