the chicken and the egg

so here's a fun problem i ran into early in building Pegasus. Pegasus is a PEG parser generator — you give it a grammar, it compiles that grammar into bytecode, and then you can run input through the bytecode VM to parse things. cool. but the grammar itself is text. PEG text. with rules and ordered choice operators and character classes and all that. which means to do anything at all, Pegasus needs to parse the grammar first.

but Pegasus is the parser. it generates parsers. so how does it parse its own input format?

this is a miniature version of the classic compiler bootstrap problem. the first C compiler couldn't be written in C because there was no C compiler yet. somebody had to write the first one in assembly, then use that to compile a C version of itself. every self-hosting compiler has this origin story — there's always a moment where you have to build the first version with your bare hands.

the hand-written meta-grammar

my solution was to skip the PEG text format entirely for the bootstrap layer. instead of writing the PEG grammar as PEG text and then somehow parsing it, i defined the grammar for PEG syntax directly as Odin data structures. no text parsing needed — it's just structs.

// The meta-grammar: hand-written Odin structs that define PEG syntax
meta_grammar := Grammar{
    rules = {
        "Grammar" = Sequence{One_Or_More{Rule_Ref{"Rule"}}},
        "Rule" = Sequence{Rule_Ref{"Identifier"}, Literal{"<-"}, Rule_Ref{"Expression"}},
        "Expression" = Sequence{Rule_Ref{"Sequence"}, Zero_Or_More{Sequence{Literal{"/"}, Rule_Ref{"Sequence"}}}},
        "Sequence" = One_Or_More{Rule_Ref{"Prefix"}},
        "Prefix" = Sequence{Optional{Rule_Ref{"Prefix_Op"}}, Rule_Ref{"Suffix"}},
        "Suffix" = Sequence{Rule_Ref{"Primary"}, Optional{Rule_Ref{"Suffix_Op"}}},
        "Primary" = Ordered_Choice{
            Rule_Ref{"Identifier"},
            Rule_Ref{"Literal"},
            Rule_Ref{"Char_Class"},
            Rule_Ref{"Dot"},
            Sequence{Literal{"("}, Rule_Ref{"Expression"}, Literal{")"}},
        },
        // quantifiers, lookahead operators, character classes...
    },
}

this is the meta-grammar. it knows what PEG syntax looks like — that a rule is an identifier followed by <- followed by an expression, that an expression is a sequence of alternatives separated by /, that primaries can be identifiers or literals or character classes or grouped sub-expressions. all the structure of PEG, encoded as typed Odin values instead of text.

the key insight is that this meta-grammar doesn't need to be parsed. it's already in memory as the right data structures. the Odin compiler handles turning these struct literals into the internal representation that Pegasus works with. so there's no parsing step for the bootstrap — you just have the grammar.

what happens when a user writes a grammar

so the flow looks like this. a user writes a PEG grammar as text:

expression <- term (("+" / "-") term)*
term       <- factor (("*" / "/") factor)*
factor     <- number / "(" expression ")"
number     <- [0-9]+

Pegasus takes this text and runs it through the meta-grammar. the meta-grammar knows how to recognize rule definitions, ordered choice with /, sequences, character classes like [0-9], quantifiers like + and *, and all the other PEG operators. it parses the text into an AST — a tree of grammar nodes representing the user's rules.

then the compiler takes that AST and generates bytecode. the bytecode is what actually runs when you parse input later. so there are really two parsers in play: the meta-grammar parser that reads PEG text, and the bytecode VM that runs the compiled grammar against actual input.

why not just use a parser library

i could have used an existing parser to read the PEG grammar format. grab some regex, write a recursive descent parser by hand, whatever. and honestly for something this simple that would have been fine. PEG syntax isn't that complicated.

but there's something satisfying about the self-contained approach. the meta-grammar is defined using the same types that user grammars get compiled into. it's Grammar, Sequence, Ordered_Choice, Rule_Ref — the same structs everywhere. there's no separate "grammar for reading grammars" system with its own types and its own parser. it's one system that bootstraps itself.

it also means the meta-grammar is testable in the same way user grammars are. i can run it through the same validation, the same bytecode compiler, the same VM. if something is broken in how Pegasus handles sequences or ordered choice, it'll show up in the meta-grammar too because the meta-grammar uses sequences and ordered choice.

the bootstrap is tiny

the whole meta-grammar is maybe 40 lines of Odin. that's the entire bootstrap. forty lines of hand-written struct literals and then Pegasus can parse any PEG grammar you throw at it, compile it to bytecode, and run it. everything after those 40 lines is self-sustaining.

i think that's what makes the bootstrap problem interesting in practice — the gap between "nothing" and "self-hosting" is usually smaller than you'd expect. you don't need to write a full compiler in assembly. you just need enough to get off the ground, and then the system can take over. for Pegasus, "enough" was a handful of Odin structs that describe what PEG looks like. the rest follows from there.