Self-Hosting a Parser: The Bootstrap Problem
Back to blog

Self-Hosting a Parser: The Bootstrap Problem

·Dan Castrillo

the chicken and the egg

Pegasus is a PEG parser generator. you give it a grammar, it compiles that grammar into bytecode, and then you run input through the bytecode VM to parse things. but the grammar itself is text. PEG text. with rules and ordered choice operators and character classes and all that. to do anything at all, Pegasus needs to parse the grammar first.

but Pegasus is the parser. it generates parsers. so how does it parse its own input format?

this is a miniature version of the classic compiler bootstrap problem. the first C compiler couldn't be written in C because there was no C compiler yet. somebody had to write the first one in assembly, then use that to compile a C version of itself. every self-hosting compiler has this origin story: there's always a moment where you have to build the first version with your bare hands.

the hand-written meta-grammar

i skipped the PEG text format entirely for the bootstrap layer. instead of writing the PEG grammar as PEG text and then somehow parsing it, i defined the grammar for PEG syntax directly as Odin data structures. no text parsing needed. just structs.

// The meta-grammar: hand-written Odin structs that define PEG syntax
meta_grammar := Grammar{
    rules = {
        "Grammar" = Sequence{One_Or_More{Rule_Ref{"Rule"}}},
        "Rule" = Sequence{Rule_Ref{"Identifier"}, Literal{"<-"}, Rule_Ref{"Expression"}},
        "Expression" = Sequence{Rule_Ref{"Sequence"}, Zero_Or_More{Sequence{Literal{"/"}, Rule_Ref{"Sequence"}}}},
        "Sequence" = One_Or_More{Rule_Ref{"Prefix"}},
        "Prefix" = Sequence{Optional{Rule_Ref{"Prefix_Op"}}, Rule_Ref{"Suffix"}},
        "Suffix" = Sequence{Rule_Ref{"Primary"}, Optional{Rule_Ref{"Suffix_Op"}}},
        "Primary" = Ordered_Choice{
            Rule_Ref{"Identifier"},
            Rule_Ref{"Literal"},
            Rule_Ref{"Char_Class"},
            Rule_Ref{"Dot"},
            Sequence{Literal{"("}, Rule_Ref{"Expression"}, Literal{")"}},
        },
        // quantifiers, lookahead operators, character classes...
    },
}

the meta-grammar defines what PEG syntax looks like. a rule is an identifier followed by <- followed by an expression. an expression is a sequence of alternatives separated by /. primaries can be identifiers or literals or character classes or grouped sub-expressions. all the structure of PEG, encoded as typed Odin values instead of text.

this meta-grammar doesn't need to be parsed. it's already in memory as the right data structures. the Odin compiler turns these struct literals into the internal representation that Pegasus works with. no parsing step for the bootstrap. you just have the grammar.

what happens when a user writes a grammar

a user writes a PEG grammar as text:

expression <- term (("+" / "-") term)*
term       <- factor (("*" / "/") factor)*
factor     <- number / "(" expression ")"
number     <- [0-9]+

Pegasus takes this text and runs it through the meta-grammar. the meta-grammar knows how to recognize rule definitions, ordered choice with /, sequences, character classes like [0-9], quantifiers like + and *, and all the other PEG operators. it parses the text into an AST: a tree of grammar nodes representing the user's rules.

then the compiler takes that AST and generates bytecode. the bytecode is what actually runs when you parse input later. so there are really two parsers in play: the meta-grammar parser that reads PEG text, and the bytecode VM that runs the compiled grammar against actual input.

why not just use a parser library

i could have used an existing parser to read the PEG grammar format. write a recursive descent parser by hand, grab some regex. for something this simple that would have been fine. PEG syntax isn't that complicated.

but the self-contained approach has a property i like. the meta-grammar uses the same types that user grammars get compiled into. Grammar, Sequence, Ordered_Choice, Rule_Ref. the same structs everywhere. no separate "grammar for reading grammars" system with its own types and its own parser. one system that bootstraps itself.

it also means the meta-grammar is testable in the same way user grammars are. i can run it through the same validation, the same bytecode compiler, the same VM. if something is broken in how Pegasus handles sequences or ordered choice, it'll show up in the meta-grammar too because the meta-grammar uses sequences and ordered choice.

the bootstrap is tiny

the whole meta-grammar is maybe 40 lines of Odin. that's the entire bootstrap. forty lines of hand-written struct literals and then Pegasus can parse any PEG grammar you throw at it, compile it to bytecode, and run it. everything after those 40 lines is self-sustaining.

the gap between "nothing" and "self-hosting" is smaller than you'd expect. you don't need to write a full compiler in assembly. you need enough to get off the ground, and then the system takes over. for Pegasus, "enough" was a handful of Odin structs that describe what PEG looks like.

Related Posts