the wrong question
the previous post was about finding the right model for the agent loop. minimax m2.5 won on cost and behavior. i shipped it and moved on.
but a few weeks later i kept bumping into the same ceiling. the agent has three distinct phases — planning, extraction, enrichment — and a general-purpose model handles all three with the same approach: read instructions, think out loud, produce output. that works. but it wastes a lot of tokens on reasoning that's irrelevant to the task, and it's inconsistent in ways that are hard to fix with prompting.
planning is a structured prediction problem. given a user prompt and a schema, produce a JSON plan with the right tool steps and a realistic target_count. that's it. it doesn't need to reason about philosophy. it just needs to reliably output valid JSON with the right fields.
extraction is a pattern matching problem. given a scraped page, pull out entities that match a schema. it needs to return an array, not an object. it needs to not hallucinate fields that aren't on the page.
enrichment is a lookup problem. given an entity with missing fields and some scraped content, fill in what's there, leave null what isn't. don't invent. don't over-fill.
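to make the three contracts concrete, here's a sketch of the shapes i mean. the field names and values are illustrative, not the agent's actual schema:

```python
# hypothetical planning output: one JSON object with the four fields
# the planner is graded on (names match the post, values are made up)
plan = {
    "entity_type": "coffee_shop",
    "target_count": 25,
    "city": "berlin",
    "tool_steps": ["search", "scrape", "extract"],
}

# extraction must return an array of entities, never a wrapping object
extracted = [
    {"name": "Kaffee Mitte", "address": "Torstr. 1"},
]

# enrichment fills what the scraped content supports and leaves the
# rest null. it never invents values
enriched = {"name": "Kaffee Mitte", "phone": None, "website": "https://example.com"}
```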
three tasks. three different failure modes. one general model trying to do all of them.
so i decided to fine-tune.
first: the limit bands
before i could generate useful training data i had to get the agent to behave consistently. the agent's behavior is controlled by a set of constants in loop_runner.rs:
```rust
const MAX_ENRICHMENT_ITERATIONS: u32 = 3;
const MAX_DISCOVERY_ROUNDS: u32 = 3;
const MAX_ENRICHMENT_BATCH_SIZE: usize = 15;
const COVERAGE_THRESHOLD: f64 = 0.65;
const MAX_ENRICHMENT_ATTEMPTS_PER_ENTITY: u32 = 2;
const MAX_ADDITIONAL_QUERIES: usize = 20;
const MAX_CONTEXT_CHARS: usize = 60_000;
```

i spent about three days just on these. not because they're complicated — they're just numbers — but because each one has a non-obvious effect on training data quality.
COVERAGE_THRESHOLD is the most important one. it controls when the enrichment phase stops: if 65% of schema fields across all entities are filled, don't bother doing more enrichment rounds. if i set it too high (say 0.90), the agent burns tokens trying to find fields that genuinely don't exist on the web — and those traces become bad training examples where the model learns to over-search. if i set it too low (0.40), the agent stops before it's done any useful enrichment, and the training examples show a model that gives up early.
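the stopping rule itself is simple. a python sketch of the check (the real agent is rust, and i'm assuming each entity dict carries every schema field, with null for unfilled ones):

```python
COVERAGE_THRESHOLD = 0.65  # mirrors the constant in loop_runner.rs

def coverage(entities: list[dict]) -> float:
    """fraction of schema fields filled across all entities."""
    total = sum(len(e) for e in entities)
    filled = sum(1 for e in entities for v in e.values() if v is not None)
    return filled / total if total else 0.0

def should_stop_enrichment(entities: list[dict]) -> bool:
    # once 65% of fields are filled, skip further enrichment rounds
    return coverage(entities) >= COVERAGE_THRESHOLD

entities = [
    {"name": "a", "phone": "123", "website": None},
    {"name": "b", "phone": None, "website": "https://b.example"},
]
# 4 of 6 fields filled, so coverage is ~0.667 and enrichment stops
```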
0.65 is where the distribution of "good enough" examples was highest in my manual review.
MAX_DISCOVERY_ROUNDS has a circuit breaker on top of it: if two consecutive discovery rounds add zero new entities, stop regardless. without that, the agent would sometimes complete all 3 rounds even when the first round already found everything — generating redundant traces that teach the model to keep searching when the work is done.
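the breaker logic, as i understand it, sketched in python (the round runner is a placeholder callable, not the agent's real interface):

```python
MAX_DISCOVERY_ROUNDS = 3

def run_discovery(discover_round) -> int:
    """run up to MAX_DISCOVERY_ROUNDS, stopping early after two
    consecutive rounds that add zero new entities."""
    total, zero_streak = 0, 0
    for _ in range(MAX_DISCOVERY_ROUNDS):
        added = discover_round()
        total += added
        zero_streak = zero_streak + 1 if added == 0 else 0
        if zero_streak == 2:
            break  # circuit breaker: discovery is clearly exhausted
    return total
```

with two empty rounds up front, the third round is never requested, which is exactly the redundant-trace case the breaker exists to prevent.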
MAX_CONTEXT_CHARS at 60,000 was a hard cut i had to add because some runs were accumulating context until the model started dropping earlier information. traces where the model loses track of what it already found are worse than no training data at all. i'd rather truncate and have a clean example than have a 120k-token trace where the model hallucinates in round 4.
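the cut itself is a one-liner. which end survives is an assumption in this sketch, not something i've read out of loop_runner.rs:

```python
MAX_CONTEXT_CHARS = 60_000

def clamp_context(context: str) -> str:
    # hard cut at the limit; this sketch keeps the oldest content and
    # drops the tail, which is an assumption about the real behavior
    return context[:MAX_CONTEXT_CHARS]
```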
the pattern across all of them: too loose produces traces that teach bad habits. too tight produces traces that are too short to be useful. the bands define what a "good run" looks like, and good runs are the only thing worth training on.
the distillation pipeline
once the agent was running within sensible limits i added trace logging. every LLM call writes a JSONL line:
```json
{
  "messages": [
    {"role": "system", "content": "..."},
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "..."}
  ]
}
```

the system and user messages are the inputs to each phase. the assistant message is what the agent actually produced. three task types, three separate JSONL files: planning.jsonl, extraction.jsonl, enrichment.jsonl.
the pipeline then uploads the datasets to Fireworks and kicks off a fine-tune job. i wrote upload.py to handle the Fireworks API specifics — their dataset format has a few quirks (field names with hyphens, an exampleCount field that needs to match exactly) that took a couple of tries to get right.
the eval script compares the fine-tuned model against a baseline. i picked gpt-4o-mini as baseline because it's cheap, fast, and performed well on all three tasks in my earlier testing. if the fine-tuned model can't beat gpt-4o-mini, there's no point shipping it.
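the comparison loop is the uninteresting part, but for completeness, here's its shape as i understand it. the model client and metric functions are placeholders, not eval.py's actual code:

```python
def evaluate(model_call, examples, metrics):
    """average each metric over held-out examples for one model.

    model_call: prompt -> raw model output (placeholder for the client)
    metrics: name -> fn(output, gold) returning a score in [0, inf)
    """
    totals = {name: 0.0 for name in metrics}
    for ex in examples:
        output = model_call(ex["prompt"])
        for name, fn in metrics.items():
            totals[name] += fn(output, ex["gold"])
    return {name: total / len(examples) for name, total in totals.items()}
```

run it once per model over the same 50 held-out examples per task and the two score dicts are directly comparable.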
the first eval
here are the actual results from the first run of eval.py, 50 held-out examples per task:
| task | metric | newt-1 | gpt-4o-mini |
|---|---|---|---|
| planning | valid_json | 0.0 | 1.0 |
| planning | required_fields | 0.0 | 1.0 |
| planning | has_tool_steps | 0.0 | 1.0 |
| extraction | valid_json | 0.0 | 1.0 |
| extraction | field_recall | 0.0 | 1.068 |
| extraction | entity_count_ratio | 0.0 | 0.833 |
| enrichment | valid_json | 0.0 | 1.0 |
| enrichment | field_recall | 0.0 | 0.497 |
| enrichment | null_rate | 1.0 | 0.483 |
newt-1 scores zero on everything. gpt-4o-mini wins cleanly.
this is not a good result but it's an honest one. the fine-tune clearly didn't converge, or there's a problem with how the model is being called (format mismatch, wrong API endpoint, something). i haven't diagnosed it yet. the null_rate: 1.0 for newt-1 on enrichment — meaning it returned null for every field — strongly suggests the model is either not responding or returning something the scorer can't parse.
the more interesting number to me is the baseline: gpt-4o-mini at field_recall: 0.497 on enrichment. that's the ceiling i'm working against. if i can get a much cheaper fine-tuned model to 0.5+ field recall on enrichment with lower null rate, the economics justify the effort.
what the eval revealed about the tasks
even with newt-1 failing completely, running the eval was useful because it forced me to define what "good" means for each task precisely.
for planning, it's binary: does it output valid JSON with the four required fields (entity_type, target_count, city, tool_steps)? gpt-4o-mini gets this right 100% of the time in my test set. the task is well-defined enough that any competent model should hit 1.0 here. if newt-1 can't match that, the fine-tune is broken at the most basic level.
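a sketch of that binary scorer, assuming the three planning metrics mean what their names suggest (this is my reading, not eval.py itself; it also treats non-object JSON as invalid, which is a judgment call):

```python
import json

REQUIRED_FIELDS = {"entity_type", "target_count", "city", "tool_steps"}

def score_plan(raw: str) -> dict:
    """binary planning metrics: valid_json, required_fields, has_tool_steps."""
    try:
        plan = json.loads(raw)
    except json.JSONDecodeError:
        plan = None
    if not isinstance(plan, dict):
        return {"valid_json": 0, "required_fields": 0, "has_tool_steps": 0}
    return {
        "valid_json": 1,
        "required_fields": int(REQUIRED_FIELDS <= set(plan)),
        "has_tool_steps": int(bool(plan.get("tool_steps"))),
    }
```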
for extraction, field_recall: 1.068 means gpt-4o-mini is actually returning more filled fields than the gold standard — it's filling things the gold label left null. that's not necessarily hallucination; it might mean the gold labels are incomplete. i need to audit a sample of those over-recalled fields before i trust the extraction metric.
for enrichment, field_recall: 0.497 with null_rate: 0.483 means gpt-4o-mini fills about half the fields and leaves the other half null. for enrichment that's a reasonable baseline — some fields just aren't findable on the scraped content. a good fine-tuned model should match that recall while reducing the null rate on fields that are present.
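for concreteness, here's one definition of the two metrics that's consistent with recall exceeding 1.0. eval.py may compute them differently; this is a plausible reconstruction, not the actual scorer:

```python
def field_recall(pred: dict, gold: dict) -> float:
    """filled predicted fields over filled gold fields. exceeds 1.0
    when the model fills fields the gold label left null."""
    gold_filled = sum(1 for v in gold.values() if v is not None)
    pred_filled = sum(1 for k in gold if pred.get(k) is not None)
    return pred_filled / gold_filled if gold_filled else 0.0

def null_rate(pred: dict, gold: dict) -> float:
    """fraction of gold fields the model left null (or missing)."""
    return sum(1 for k in gold if pred.get(k) is None) / len(gold)
```

under this definition, a model that returns nothing parseable collapses to recall 0.0 and null rate 1.0, which is exactly the newt-1 signature in the table.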
where this goes
the immediate next step is diagnosing why newt-1 returns nothing. my best guess is a format issue on the Fireworks inference side — the model might be trained on chat-format examples but being called with a completion-format prompt, or vice versa. once i fix that i'll re-run the eval and see what the actual fine-tune quality looks like.
if the numbers are still bad after fixing the format issue, i'll look at the training data itself. i probably need more examples, and the quality of the early traces (before i calibrated the limit bands) is suspect. there's a version of this where i need to retroactively filter the training set to only include runs that completed within budget and hit coverage threshold — which is exactly why i spent three days on those constants before generating data.
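that retroactive filter would be small. a sketch, with run-metadata field names that are assumptions on my part:

```python
COVERAGE_THRESHOLD = 0.65   # mirror the loop_runner.rs constants
MAX_CONTEXT_CHARS = 60_000

def keep_trace(run: dict) -> bool:
    """keep only runs that completed within budget and hit coverage.
    the 'completed'/'coverage'/'context_chars' keys are hypothetical."""
    return (
        run.get("completed", False)
        and run.get("coverage", 0.0) >= COVERAGE_THRESHOLD
        and run.get("context_chars", 0) <= MAX_CONTEXT_CHARS
    )
```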
the goal is a model that's small, fast, and specialized. it doesn't need to be smart in a general sense. it needs to be a reliable JSON-outputting machine for three specific input/output pairs. that's a much more tractable fine-tuning target than general reasoning, and i think it's achievable. just not on the first try.