
Testing every model I could find to make an autonomous scraping agent work

Dan Castrillo

the idea is simple. the implementation is not.

the endpoint i'm building takes a prompt, some optional seed URLs, and a JSON schema. it should figure out what to scrape, scrape it, extract what it needs, decide if it has enough, and either finish or go search for more. autonomous. no human in the loop.

the hard part is not the scraping. rendering a page and getting markdown back is solved. the hard part is the agent loop — the model that decides what to do next, calls the right tools, fills the schema, and knows when to stop.

i've been running this loop against every model i can get my hands on through OpenRouter. this post is what i found.

what the agent needs to do

the loop looks like this:

  1. if there are seed URLs, scrape them first
  2. look at what was extracted, compare against the schema
  3. decide: is the schema satisfied? if yes, finish. if no, what's missing?
  4. search the web for what's missing
  5. scrape the most relevant results
  6. extract again, merge with what we have
  7. go to step 3

the model has three tools available: search(query), scrape(urls[]), and finish(data). it also gets a system prompt explaining the schema, what's been extracted so far, and how many iterations it has left.
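in rough python, that loop looks something like this. this is a sketch, not the real implementation — `call_model`, the action dict format, and the iteration cap are illustrative stand-ins:

```python
# sketch of the agent loop. call_model, search, scrape are stand-ins:
# in a real system they'd hit an LLM, a search API, and a page renderer.
MAX_ITERATIONS = 10

def run_agent(prompt, schema, seeds, call_model, search, scrape):
    extracted = {}
    pages = []
    if seeds:
        pages += scrape(seeds)          # step 1: scrape seed URLs first
    for i in range(MAX_ITERATIONS):
        remaining = MAX_ITERATIONS - i
        # the model sees the schema, what's been found so far, and its budget
        action = call_model(prompt, schema, extracted, pages, remaining)
        if action["tool"] == "finish":
            return action["data"]       # schema satisfied, stop
        if action["tool"] == "search":
            pages += scrape(search(action["query"]))
        elif action["tool"] == "scrape":
            pages += scrape(action["urls"])
        # merge whatever the model extracted this round
        extracted.update(action.get("extracted", {}))
    return extracted                    # hit the limit: return partial result
```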

this is not a complicated tool setup. three tools, clear instructions, a structured schema to fill. and yet getting a model to do this reliably — without hallucinating, without looping forever, without giving up too early — is genuinely hard.

the eval setup

before testing models i needed a way to measure "did this work." i built a simple harness: 20 test cases, each with a prompt, optional seeds, and a known-good expected output. each expected output is a filled schema with N fields. i score each run on:

  • field coverage — what percentage of schema fields are populated
  • field accuracy — are the values actually correct (manual spot-check for the first round, regex patterns for structured fields like phone numbers and URLs)
  • hallucination rate — values in the output that don't appear in any scraped page
  • iterations used — lower is better, up to the limit of 10
  • cost — USD per completed job

most of my test cases are real-world prompts: find the contact details for restaurants in a city, extract product specs from e-commerce pages, get team members from company about pages. the kind of thing a user would actually run.

i ran each model 3 times per test case to account for variance and averaged the results.
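the coverage and hallucination metrics are simple enough to sketch. this is a guess at the shape of the harness, not the actual code — the grounding check in particular is just "does the value appear verbatim in any scraped page," which is crude but catches the worst offenders:

```python
def score_run(expected: dict, actual: dict, scraped_pages: list) -> dict:
    """Score one agent run against a known-good expected schema."""
    fields = list(expected)
    # field coverage: what fraction of schema fields got a non-empty value
    populated = [k for k in fields if actual.get(k) not in (None, "", [])]
    coverage = len(populated) / len(fields)
    # hallucination: a populated value that never appears in any scraped page
    hallucinated = [
        k for k in populated
        if not any(str(actual[k]) in page for page in scraped_pages)
    ]
    rate = len(hallucinated) / len(populated) if populated else 0.0
    return {"coverage": coverage, "hallucination_rate": rate}
```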

model by model

gpt-4o

field coverage was excellent — 94%. accuracy was good. but it has a consistent problem: it over-scrapes. give it 10 iterations and it will use 9 of them even when the schema is fully satisfied after 3. i watched it scrape a 7th URL to "verify" information it had already correctly extracted from the first two. the cost per job was $0.38 on average, which killed the economics for anything high-volume.

the deeper issue is that gpt-4o is too eager to keep working. it treats finish() as something to call only when it's completely certain, and its threshold for certainty is very high. this is probably a good property in a coding assistant. it's a bad property in something that's supposed to stop when the schema is full.

gpt-4o mini

cheaper — average $0.07 per job. but the hallucination rate was the worst i saw: 23% of extracted values couldn't be traced back to any scraped content. it was making things up, confidently, in the right format. a phone number that matched the regex but didn't exist on any page. an address that looked plausible but was wrong. structured hallucination is worse than obvious hallucination because it's hard to catch downstream.

tool calling was also unreliable. it would sometimes call scrape() with a single string instead of an array of URLs, which caused type errors i had to add special handling for. i don't want my infrastructure working around a model's bad habits.
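the special handling amounts to a coercion shim before dispatch. something like this (hypothetical — the real handling isn't shown in this post):

```python
def normalize_scrape_args(args: dict) -> dict:
    """Coerce a bare-string urls argument into the list the tool expects."""
    # some models pass scrape() a single string instead of a list of URLs.
    # coerce it here rather than letting a type error kill the run.
    urls = args.get("urls")
    if isinstance(urls, str):
        args["urls"] = [urls]
    return args
```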

claude 3.5 haiku

fast and cheap ($0.04/job). field coverage was 81%, which sounds ok but the miss pattern was bad — it consistently failed on fields that required chaining two searches. the first search would give it something, it would extract partial data, and then instead of searching for the missing fields specifically, it would call finish() with the partial result. it gave up too early and too confidently.

i tried prompt engineering this away. system prompt variations, explicit instructions to check every field before calling finish. the early-termination problem got better but didn't go away. there's something about haiku's decision boundary on "i have enough" that's tuned too aggressively toward finishing.

claude 3.5 sonnet

the best accuracy i saw — 96% field coverage, lowest hallucination rate (4%), good tool calling. the problem is cost: $0.31/job. similar story to gpt-4o. at that price point the economics of building a product on top of it are very uncomfortable.

it's also slower. latency per iteration was high enough that a 5-iteration job felt sluggish. for a background async job that's tolerable. for anything that wants to feel fast, it's a problem.

gemini 1.5 flash

the context window here is genuinely useful — feeding it the full markdown of 5 scraped pages at once is no problem. field coverage was 88%. but tool calling was inconsistent in a specific way: it would sometimes return tool calls with extra keys that weren't in the schema, which broke my response parser. i added stripping logic. then it started returning tool calls nested one level deeper than expected. i added unwrapping logic. there's a version of this where you spend more time normalizing model output than building the actual product.
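the stripping and unwrapping logic ends up looking something like this (a reconstruction, not the actual parser — `EXPECTED_KEYS` and the nesting heuristic are illustrative):

```python
EXPECTED_KEYS = {"tool", "query", "urls", "data", "extracted"}

def normalize_tool_call(call: dict) -> dict:
    """Unwrap unexpected nesting and drop keys that aren't in the tool schema."""
    # unwrap one (or more) levels of spurious nesting, e.g. {"function": {...}},
    # but only while the payload doesn't yet look like a tool call
    while "tool" not in call and len(call) == 1 \
            and isinstance(next(iter(call.values())), dict):
        call = next(iter(call.values()))
    # strip any extra keys the model invented alongside the real arguments
    return {k: v for k, v in call.items() if k in EXPECTED_KEYS}
```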

the hallucination rate was also higher than i wanted (17%) on extraction tasks. not as bad as gpt-4o mini, but bad enough that i didn't trust the output without a confidence field.

mistral small

i wanted a cheap open-weight option. mistral small via OpenRouter was $0.02/job. field coverage was 68%, which i could have lived with for the price. but the multi-step reasoning was the real problem. it couldn't reliably track what had already been found across iterations — by iteration 4 or 5 it would search for things it had already successfully extracted in iteration 2. the context was there in the prompt. it just wasn't using it.

i tried summarizing the "found so far" state into a more compact format to see if a shorter context would help. slightly better but not enough. this feels like a fundamental capability gap, not a prompting issue.

minimax m2.5

i ended up here almost by accident — someone in a Discord mentioned it for tool use tasks. it's a mixture-of-experts model running through OpenRouter at $0.03/job.

field coverage: 91%. hallucination rate: 6%. iterations to completion: average 3.4 out of 10. tool calling: correct format every time across all runs.

what surprised me was the stopping behavior. it calls finish() when the schema is actually satisfied, not before and not after. it seems to have a well-calibrated sense of "i have what was asked for." gpt-4o over-searches. haiku under-searches. minimax just... stops at the right time.

it's not perfect. on complex prompts with ambiguous schemas it sometimes fills fields with plausible-but-wrong values. but it's the best tradeoff i found between cost, accuracy, and behavior.

the failure modes that surprised me

i expected accuracy differences between models. what i didn't expect:

schema drift — some models would fill the schema correctly for the first 3 fields, then subtly change the key names on fields 4 and 5. phoneNumber becomes phone. websiteUrl becomes website. not wrong exactly, but not what the schema asked for. gpt-4o mini did this constantly.
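one way to defend against this is to canonicalize keys back to the schema before merging. a sketch of the idea (hypothetical mitigation, not what's in the post — the prefix-match fallback is the part doing the work for cases like phoneNumber → phone):

```python
import re

def canonicalize_keys(data: dict, schema_fields: list) -> dict:
    """Map drifted key names back onto the schema's canonical field names."""
    squash = lambda s: re.sub(r"[^a-z0-9]", "", s.lower())
    canon = {squash(f): f for f in schema_fields}
    out = {}
    for k, v in data.items():
        sk = squash(k)
        if sk in canon:                 # exact match after squashing case/punct
            out[canon[sk]] = v
            continue
        # fall back to a prefix match: "phone" -> "phoneNumber"
        hits = [f for s, f in canon.items()
                if s.startswith(sk) or sk.startswith(s)]
        if len(hits) == 1:              # only remap when it's unambiguous
            out[hits[0]] = v
    return out
```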

tool call echo — a few models would call search() with the exact query i gave them in the prompt, then search() again with the same query, then search() a third time. no variation. if the first search didn't get what they needed, trying it again wasn't going to help. this looks like a failure mode where the model can identify what it needs but doesn't know how to rephrase a search that isn't working.
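a cheap guard is to refuse a verbatim repeat and return that refusal as a tool error, forcing the model to rephrase. a sketch (hypothetical mitigation, not described in the post):

```python
def should_allow_search(query: str, past_queries: set) -> bool:
    """Reject a search we've already run verbatim (modulo case/whitespace)."""
    normalized = " ".join(query.lower().split())
    if normalized in past_queries:
        return False   # surface this to the model as a tool error
    past_queries.add(normalized)
    return True
```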

finish-then-continue — i saw this in gpt-4o a few times: it would call finish() with a complete result, and then in the same response also call search(). two tool calls in one response, one of which terminates the loop. i had to add logic to treat any response containing finish() as terminal regardless of other tool calls.
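the "finish is terminal" rule is tiny but worth being explicit about in code. roughly (a sketch of the rule, not the actual loop code):

```python
def select_terminal(tool_calls: list):
    """If any call in the response is finish(), the loop ends there."""
    # ignore whatever else the model asked for in the same turn
    for call in tool_calls:
        if call.get("tool") == "finish":
            return call
    return None
```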

context poisoning — when an early scrape returned a long, noisy page, some models would fixate on information from that page even when later scrapes had better data. the first thing in context exerts disproportionate influence on the extraction. i partially worked around this by summarizing and deduplicating before appending to context, but it's still a real problem on messy pages.
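the dedupe half of that workaround can be as simple as dropping paragraphs that are already in context. a sketch under that assumption (the real summarization step is not shown):

```python
def dedupe_page(page: str, seen_blocks: set) -> str:
    """Drop paragraphs already appended to context from earlier scrapes."""
    # so one noisy early page can't keep repeating itself through every
    # later scrape and dominate the extraction
    kept = []
    for block in page.split("\n\n"):
        key = " ".join(block.split()).lower()
        if key and key not in seen_blocks:
            seen_blocks.add(key)
            kept.append(block)
    return "\n\n".join(kept)
```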

what actually matters for this use case

after all of this, the properties i care about in order:

  1. tool call format reliability — if i have to write defensive normalization code for a model's output, that model is expensive in a way that doesn't show up in the price per token
  2. stopping behavior — a model that loops to the iteration limit on every job costs 3× what it should
  3. hallucination rate — wrong data that looks right is worse than missing data
  4. cost — only after the above

accuracy and field coverage matter but they're table stakes. the behavioral properties are what differentiate models for agentic use cases, and they're not what the benchmarks measure.

where i am now

minimax m2.5 is running in production. the agent endpoint works well enough that i'm building things on top of it. average cost is under $0.05 per job. average iterations is 3–4. field coverage on my test suite is 91%.

i'm keeping the model configurable — it's an environment variable, not hardcoded. the models will keep improving and i expect to swap this out in a few months. the eval harness i built is the thing that's actually valuable here. i can test a new model against the same 20 cases in about 10 minutes and see exactly where it falls down.

if you're building something similar, the main thing i'd tell you is to build the eval harness before you start model-shopping. without concrete pass/fail criteria, you'll spend a week convincing yourself that whatever model you're currently testing is working fine.
