the problem with arrows
so i've been building a javascript parser using PEG (parsing expression grammars) in Odin, and i hit this wall that i think anyone writing a JS parser eventually runs into: arrow functions are ambiguous.
when the parser sees (a, b, c), it has no idea whether that's a parenthesized expression — like (a, b, c) as a comma expression — or the parameter list for an arrow function like (a, b, c) => a + b + c. the only way to know is to look past the closing paren and check if => follows.
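to make the ambiguity concrete, here's the same prefix resolving two different ways in plain javascript (the values are just for illustration):

```javascript
// the prefix "(a, b, c" looks identical in both cases; only the token
// after the closing paren decides what the construct means
const grouped = (1, 2, 3);            // comma expression: evaluates to 3
const arrow = (a, b, c) => a + b + c; // arrow function: parameter list

console.log(grouped);        // 3
console.log(arrow(1, 2, 3)); // 6
```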
this is annoying in any parser but it's especially painful in a PEG parser because of how ordered choice works.
how PEG ordered choice bites you
in a PEG grammar you write alternatives with / (ordered choice). so an expression rule might look something like:
expression <- arrow_function / assignment / ternary / ...
the key thing about PEG is that it tries each alternative in order and commits to the first one that succeeds. if arrow_function starts matching — it sees (, starts parsing parameters, gets through the whole list — but then there's no =>, it fails. the parser backtracks all the way to the start and tries assignment next.
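to make the ordered-choice mechanics concrete, here's a toy javascript sketch — this is not the Odin parser from this post; choice, literal, and the state shape are made up for illustration:

```javascript
// PEG-style ordered choice: try each alternative in order, and on
// failure rewind the position before trying the next one. that rewind
// is the backtracking cost.
function choice(...alts) {
  return (state) => {
    for (const alt of alts) {
      const saved = state.pos;
      const result = alt(state);
      if (result !== null) return result; // commit to the first success
      state.pos = saved;                  // backtrack
    }
    return null;
  };
}

// a toy alternative: match a literal string character by character,
// consuming input as it goes (so a partial match leaves pos advanced)
function literal(text) {
  return (state) => {
    for (const ch of text) {
      if (state.input[state.pos] !== ch) return null; // fail partway through
      state.pos += 1;
    }
    return { type: "literal", text };
  };
}

const expr = choice(literal("(a, b) =>"), literal("(a, b)"));
const state = { input: "(a, b)", pos: 0 };
// the first alternative consumes all of "(a, b)" before failing at the
// missing "=>", so the second alternative re-reads those same characters
console.log(expr(state)); // matched by the second alternative
```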
that backtracking is the killer. for something like (a, b, c, d, e, f, g) as a plain expression, the parser will:
- enter arrow_function, match (, parse all seven identifiers as parameters
- hit ), look for =>, fail
- backtrack to the beginning
- enter the next alternative, re-parse the entire (a, b, c, d, e, f, g) as an expression
you're parsing the same tokens twice. and in real javascript code, most parenthesized expressions are not arrow functions. so you're paying this backtracking cost over and over for the common case.
the lookahead trick
the fix is surprisingly simple. before even attempting the arrow function rule, do a quick linear scan: find the matching ) by counting balanced parens, then check if => follows. if it doesn't, skip the arrow function rule entirely.
scan_arrow :: proc(input: []u8, pos: int) -> bool {
    depth := 0
    i := pos
    for i < len(input) {
        switch input[i] {
        case '(':
            depth += 1
        case ')':
            depth -= 1
            if depth == 0 {
                // skip whitespace after )
                j := i + 1
                for j < len(input) && (input[j] == ' ' || input[j] == '\t' || input[j] == '\n') {
                    j += 1
                }
                // check for =>
                return j + 1 < len(input) && input[j] == '=' && input[j + 1] == '>'
            }
        }
        i += 1
    }
    return false
}

this scan is dirt cheap. it doesn't allocate, doesn't build any AST nodes, doesn't even care about what's inside the parens. it just counts ( and ) until they balance, then peeks ahead. the whole thing is O(n) in the size of the parenthesized content, which we'd have to traverse anyway.
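if you want to poke at the scan outside Odin, here's a direct javascript port of the same logic (scanArrow is my name for it; the behavior mirrors the proc above):

```javascript
// scan from an opening paren: count balanced parens, and when the count
// returns to zero, check whether "=>" follows (skipping whitespace)
function scanArrow(input, pos) {
  let depth = 0;
  for (let i = pos; i < input.length; i++) {
    if (input[i] === "(") {
      depth += 1;
    } else if (input[i] === ")") {
      depth -= 1;
      if (depth === 0) {
        let j = i + 1;
        while (j < input.length && " \t\n".includes(input[j])) j += 1;
        return j + 1 < input.length && input[j] === "=" && input[j + 1] === ">";
      }
    }
  }
  return false; // unbalanced input: never found the matching paren
}
```

the same caveats as the Odin version apply: it counts raw characters and knows nothing about strings or comments.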
then in the expression rule, you gate the arrow function alternative behind this check:
parse_expression :: proc(p: ^Parser) -> ^Node {
    // gate the arrow alternative: only attempt it when => provably follows
    if p.pos < len(p.input) && p.input[p.pos] == '(' && scan_arrow(p.input, p.pos) {
        return parse_arrow_function(p)
    }
    // ... other expression alternatives
    return parse_assignment_expression(p)
}

now when the parser sees (a, b, c) and there's no => after it, it doesn't even attempt the arrow function path. zero backtracking. it goes straight to parsing it as a parenthesized expression.
the results
i benchmarked this against a ~15k line javascript bundle (a minified react app) and the difference was pretty dramatic. without the lookahead, parsing took around 4.2ms. with it, 2.6ms. that's a 38% reduction in parse time just from avoiding failed arrow function attempts.
the reason it's so significant is that parenthesized expressions are everywhere in javascript. every function call has parens. every grouped expression. every if condition. and for each one, the old parser was speculatively trying the arrow function rule first, failing, and backtracking.
why this works in PEG specifically
this trick is particularly relevant to PEG parsers because PEG doesn't have separate lexing and parsing phases — or at least mine doesn't. in a traditional parser with a separate lexer, you might handle this differently, maybe with a cover grammar or by doing a two-pass approach where you parse an ambiguous CoverParenthesizedExpressionAndArrowParameterList and refine it later. that's what the ECMAScript spec actually suggests.
but in a PEG parser where you're pattern-matching directly on the input, the lookahead scan is cleaner. you're already working at the character level, so scanning for balanced parens is natural. and it preserves the nice property of PEG grammars where each rule either succeeds or fails without side effects — the scan doesn't modify any parser state.
i think there's a general lesson here: when you have an ordered choice where one alternative is expensive to fail, see if you can add a cheap guard that eliminates it early. the guard is allowed to be imprecise in one direction — a false positive just costs one failed attempt, which is what you were paying anyway — but it can't have false negatives, or the parser would reject valid programs. for this scan that means a production version needs to skip over string literals and comments, since a stray ) inside either would throw off the paren count.