11pm, same failures
i've adjusted this prompt sixteen times. the model still returns null for half the fields on pages longer than three screens. i fixed that two days ago. now it's back.
this is how prompt tuning works: you fix the cases that annoyed you last, which means you probably broke something you fixed two iterations ago. you can't hold all the failure modes in your head at once. the prompt gets better in one place and worse in another, and you have no way to know which changes helped.
the problem isn't the prompt. it's the loop. a human reading failures and generating hypotheses is a slow, stateful, lossy process.
you are the bottleneck.
the faster loop
the optimizer is a large model running through an API. the candidate is the small local model you're trying to improve.
the loop:
- run the candidate on an eval set, score each output
- collect the worst-performing examples
- send them to the optimizer: "here's the current prompt, here's what the model produced, here's what it should have produced. rewrite the prompt."
- test the new prompt. if it scores higher, keep it. if not, revert.
- repeat.
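the loop above is small enough to sketch in full. everything here is a placeholder for your setup: `candidate` wraps the local model, `optimizer` wraps the API call, `score` is whatever comparison your task needs.

```python
def score(output, expected):
    # placeholder scorer: exact match. a real scorer compares fields.
    return 1.0 if output == expected else 0.0

def run_eval(prompt, candidate, eval_set):
    # run the candidate on every example and score each output
    results = []
    for inp, expected in eval_set:
        out = candidate(prompt, inp)
        results.append((inp, expected, out, score(out, expected)))
    return results

def optimize(prompt, candidate, optimizer, eval_set, iterations=10, k=5):
    # candidate(prompt, input) -> output; optimizer(prompt, worst) -> new prompt
    best = sum(r[3] for r in run_eval(prompt, candidate, eval_set))
    for _ in range(iterations):
        results = run_eval(prompt, candidate, eval_set)
        # collect the k worst-performing examples and hand them to the optimizer
        worst = sorted(results, key=lambda r: r[3])[:k]
        new_prompt = optimizer(prompt, worst)
        new_score = sum(r[3] for r in run_eval(new_prompt, candidate, eval_set))
        if new_score > best:
            prompt, best = new_prompt, new_score  # keep it
        # else: revert by doing nothing — the old prompt stands
    return prompt
```

the accept-or-revert check is the whole safety mechanism: a rewrite that doesn't score higher never survives an iteration.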
the optimizer never sees the candidate's internals. it only sees failures. it's doing exactly what you were doing at 11pm: reading failures and rewriting the prompt. except it runs one iteration in seconds instead of twenty minutes.
that's the reframe: you're not building something clever. you're replacing yourself with a faster version of the same process.
the speed problem
the first version was too slow to matter. each iteration meant waiting 54.8 seconds for the candidate to process the eval set. at that speed you get maybe 10 iterations before you lose patience. roughly where you were with manual tuning.
switching the candidate to MLX brought it to 0.9 seconds. roughly 60x. now 10 iterations takes 9 seconds. you can run 300 before the optimizer API cost becomes the real bottleneck.
the insight isn't "MLX is fast." the loop only works if it's faster than your patience. below that threshold it's a slightly automated version of what you were already doing. above it, something different happens. you let it run overnight and come back to a prompt that's measurably better.
what breaks
MLX OOMs on long inputs. past a certain context length it slows to a crawl or dies. i switched those tasks back to ollama and kept MLX for the shorter inputs. the loop routes by estimated input length now.
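the routing itself is trivial. the cutoff below is a made-up number — tune it to wherever MLX starts thrashing on your hardware — and character count is a cheap proxy for token count, which is all routing needs.

```python
# hypothetical threshold: where MLX starts to OOM or crawl on this machine
MLX_MAX_CHARS = 8000

def pick_backend(text: str) -> str:
    # route short inputs to MLX for speed, long ones to ollama for stability
    return "mlx" if len(text) < MLX_MAX_CHARS else "ollama"
```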
a separate failure: the candidate returned JSON wrapped in markdown fences, partial objects, sometimes nothing. the scorer handled all of these, which blurred the eval signal: some failures were model failures, some were formatting failures, and the optimizer can't tell the difference.
the fix was ollama's format field with a JSON schema. ollama converts it to a GBNF grammar and masks invalid tokens during sampling. the model can't emit a token that would produce invalid JSON. not post-processing. a constraint on the generation itself.
formatting failures went to zero. the remaining failures are semantic. wrong values, wrong nulls. those are the ones worth fixing.
what the loop can't fix
bad eval data. if your held-out examples don't cover the failure modes you care about, the optimizer optimizes for things that don't matter. i spent more time building the eval set than building the loop. that's still the right ratio.
and regression detection is non-negotiable. each new prompt gets tested against a held-out test set, not just the optimization set. a prompt that scores higher on the optimization set and lower on the test set isn't better. it's overfit.
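one way to phrase that gate — `score_fn` and the `tolerance` knob are my assumptions, the source only says a new prompt must be tested on both sets:

```python
def accept(new_prompt, old_prompt, score_fn, opt_set, test_set, tolerance=0.0):
    # must beat the old prompt on the optimization set...
    if score_fn(new_prompt, opt_set) <= score_fn(old_prompt, opt_set):
        return False
    # ...AND hold its ground on the held-out test set, or it's overfit
    return score_fn(new_prompt, test_set) >= score_fn(old_prompt, test_set) - tolerance
```

a nonzero tolerance trades a little test-set score for optimization-set gains; zero is the strict version.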
back to 11pm
the loop ran overnight. in the morning there was a prompt that scored 23 points higher on the eval set than the one i'd been tuning by hand for a week.
i didn't write it. i don't fully understand why it works. the optimizer made structural changes. reordered sections, added examples, changed how the output format was described. changes i wouldn't have tried because they didn't match my mental model of what was failing.
the optimizer isn't constrained by your hypotheses. it tries things you wouldn't try. some of them work.
