
Testing every model I could find to make an autonomous scraping agent work

Dan Castrillo

the idea is simple. the implementation is not.

the endpoint i'm building takes a prompt, some optional seed URLs, and a JSON schema. it should figure out what to scrape, scrape it, extract what it needs, decide if it has enough, and either finish or go search for more. autonomous. no human in the loop.

the hard part is not the scraping. rendering a page and getting markdown back is solved. the hard part is the agent loop — the model that decides what to do next, calls the right tools, fills the schema, and knows when to stop.

i've been running this loop against every model i can get my hands on through OpenRouter. this post is what i found.

what the agent needs to do

the loop looks like this:

  1. if there are seed URLs, scrape them first
  2. look at what was extracted, compare against the schema
  3. decide: is the schema satisfied? if yes, finish. if no, what's missing?
  4. search the web for what's missing
  5. scrape the most relevant results
  6. extract again, merge with what we have
  7. go to step 3

the model has three tools available: search(query), scrape(urls[]), and finish(data). it also gets a system prompt explaining the schema, what's been extracted so far, and how many iterations it has left.
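in rough python, that loop looks something like this. this is a sketch, not the real implementation — `call_model`, the action dict format, and the iteration cap are illustrative stand-ins:

```python
# sketch of the agent loop. call_model, search, scrape are stand-ins:
# in a real system they'd hit an LLM, a search API, and a page renderer.
MAX_ITERATIONS = 10

def run_agent(prompt, schema, seeds, call_model, search, scrape):
    extracted = {}
    pages = []
    if seeds:
        pages += scrape(seeds)          # step 1: scrape seed URLs first
    for i in range(MAX_ITERATIONS):
        remaining = MAX_ITERATIONS - i
        # the model sees the schema, what's been found so far, and its budget
        action = call_model(prompt, schema, extracted, pages, remaining)
        if action["tool"] == "finish":
            return action["data"]       # schema satisfied, stop
        if action["tool"] == "search":
            pages += scrape(search(action["query"]))
        elif action["tool"] == "scrape":
            pages += scrape(action["urls"])
        # merge whatever the model extracted this round
        extracted.update(action.get("extracted", {}))
    return extracted                    # hit the limit: return partial result
```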

this is not a complicated tool setup. three tools, clear instructions, a structured schema to fill. and yet getting a model to do this reliably — without hallucinating, without looping forever, without giving up too early — is genuinely hard.

the eval setup

before testing models i needed a way to measure "did this work." i built a simple harness: 20 test cases, each with a prompt, optional seeds, and a known-good expected output. each expected output is a filled schema with N fields. i score each run on:

  • field coverage — what percentage of schema fields are populated
  • field accuracy — are the values actually correct (manual spot-check for the first round, regex patterns for structured fields like phone numbers and URLs)
  • hallucination rate — values in the output that don't appear in any scraped page
  • iterations used — lower is better, up to the limit of 10
  • cost — USD per completed job

most of my test cases are real-world prompts: find the contact details for restaurants in a city, extract product specs from e-commerce pages, get team members from company about pages. the kind of thing a user would actually run.

i ran each model 3 times per test case to account for variance and averaged the results.
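the coverage and hallucination metrics are simple enough to sketch. this is a guess at the shape of the harness, not the actual code — the grounding check in particular is just "does the value appear verbatim in any scraped page," which is crude but catches the worst offenders:

```python
def score_run(expected: dict, actual: dict, scraped_pages: list) -> dict:
    """Score one agent run against a known-good expected schema."""
    fields = list(expected)
    # field coverage: what fraction of schema fields got a non-empty value
    populated = [k for k in fields if actual.get(k) not in (None, "", [])]
    coverage = len(populated) / len(fields)
    # hallucination: a populated value that never appears in any scraped page
    hallucinated = [
        k for k in populated
        if not any(str(actual[k]) in page for page in scraped_pages)
    ]
    rate = len(hallucinated) / len(populated) if populated else 0.0
    return {"coverage": coverage, "hallucination_rate": rate}
```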

model by model

gpt-4o

field coverage was excellent — 94%. accuracy was good. but it has a consistent problem: it over-scrapes. give it 10 iterations and it will use 9 of them even when the schema is fully satisfied after 3. i watched it scrape a 7th URL to "verify" information it had already correctly extracted from the first two. the cost per job was $0.38 on average, which killed the economics for anything high-volume.

the deeper issue is that gpt-4o is too eager to keep working. it treats finish() as something to call only when it's completely certain, and its threshold for certainty is very high. this is probably a good property in a coding assistant. it's a bad property in something that's supposed to stop when the schema is full.

gpt-4o mini

cheaper — average $0.07 per job. but the hallucination rate was the worst i saw: 23% of extracted values couldn't be traced back to any scraped content. it was making things up, confidently, in the right format. a phone number that matched the regex but didn't exist on any page. an address that looked plausible but was wrong. structured hallucination is worse than obvious hallucination because it's hard to catch downstream.

tool calling was also unreliable. it would sometimes call scrape() with a single string instead of an array of URLs, which caused type errors i had to add special handling for. i don't want my infrastructure working around a model's bad habits.
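the special handling amounts to a coercion shim before dispatch. something like this (hypothetical — the real handling isn't shown in this post):

```python
def normalize_scrape_args(args: dict) -> dict:
    """Coerce a bare-string urls argument into the list the tool expects."""
    # some models pass scrape() a single string instead of a list of URLs.
    # coerce it here rather than letting a type error kill the run.
    urls = args.get("urls")
    if isinstance(urls, str):
        args["urls"] = [urls]
    return args
```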

claude 3.5 haiku

fast and cheap ($0.04/job). field coverage was 81%, which sounds ok but the miss pattern was bad — it consistently failed on fields that required chaining two searches. the first search would give it something, it would extract partial data, and then instead of searching for the missing fields specifically, it would call finish() with the partial result. it gave up too early and too confidently.

i tried prompt engineering this away. system prompt variations, explicit instructions to check every field before calling finish. the early-termination problem got better but didn't go away. there's something about haiku's decision boundary on "i have enough" that's tuned too aggressively toward finishing.

claude 3.5 sonnet

the best accuracy i saw — 96% field coverage, lowest hallucination rate (4%), good tool calling. the problem is cost: $0.31/job. similar story to gpt-4o. at that price point the economics of building a product on top of it are very uncomfortable.

it's also slower. latency per iteration was high enough that a 5-iteration job felt sluggish. for a background async job that's tolerable. for anything that wants to feel fast, it's a problem.

gemini 1.5 flash

the context window here is genuinely useful — feeding it the full markdown of 5 scraped pages at once is no problem. field coverage was 88%. but tool calling was inconsistent in a specific way: it would sometimes return tool calls with extra keys that weren't in the schema, which broke my response parser. i added stripping logic. then it started returning tool calls nested one level deeper than expected. i added unwrapping logic. there's a version of this where you spend more time normalizing model output than building the actual product.
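the stripping and unwrapping logic ends up looking something like this (a reconstruction, not the actual parser — `EXPECTED_KEYS` and the nesting heuristic are illustrative):

```python
EXPECTED_KEYS = {"tool", "query", "urls", "data", "extracted"}

def normalize_tool_call(call: dict) -> dict:
    """Unwrap unexpected nesting and drop keys that aren't in the tool schema."""
    # unwrap one (or more) levels of spurious nesting, e.g. {"function": {...}},
    # but only while the payload doesn't yet look like a tool call
    while "tool" not in call and len(call) == 1 \
            and isinstance(next(iter(call.values())), dict):
        call = next(iter(call.values()))
    # strip any extra keys the model invented alongside the real arguments
    return {k: v for k, v in call.items() if k in EXPECTED_KEYS}
```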

the hallucination rate was also higher than i wanted (17%) on extraction tasks. not as bad as gpt-4o mini, but bad enough that i didn't trust the output without a confidence field.

mistral small

i wanted a cheap open-weight option. mistral small via OpenRouter was $0.02/job. field coverage was 68%, which i could have lived with for the price. but the multi-step reasoning was the real problem. it couldn't reliably track what had already been found across iterations — by iteration 4 or 5 it would search for things it had already successfully extracted in iteration 2. the context was there in the prompt. it just wasn't using it.

i tried summarizing the "found so far" state into a more compact format to see if a shorter context would help. slightly better but not enough. this feels like a fundamental capability gap, not a prompting issue.

minimax m2.5

i ended up here almost by accident — someone in a Discord mentioned it for tool use tasks. it's a mixture-of-experts model running through OpenRouter at $0.03/job.

field coverage: 91%. hallucination rate: 6%. iterations to completion: average 3.4 out of 10. tool calling: correct format every time across all runs.

what surprised me was the stopping behavior. it calls finish() when the schema is actually satisfied, not before and not after. it seems to have a well-calibrated sense of "i have what was asked for." gpt-4o over-searches. haiku under-searches. minimax just... stops at the right time.

it's not perfect. on complex prompts with ambiguous schemas it sometimes fills fields with plausible-but-wrong values. but it's the best tradeoff i found between cost, accuracy, and behavior.

the failure modes that surprised me

i expected accuracy differences between models. what i didn't expect:

schema drift — some models would fill the schema correctly for the first 3 fields, then subtly change the key names on fields 4 and 5. phoneNumber becomes phone. websiteUrl becomes website. not wrong exactly, but not what the schema asked for. gpt-4o mini did this constantly.
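one way to defend against this is to canonicalize keys back to the schema before merging. a sketch of the idea (hypothetical mitigation, not what's in the post — the prefix-match fallback is the part doing the work for cases like phoneNumber → phone):

```python
import re

def canonicalize_keys(data: dict, schema_fields: list) -> dict:
    """Map drifted key names back onto the schema's canonical field names."""
    squash = lambda s: re.sub(r"[^a-z0-9]", "", s.lower())
    canon = {squash(f): f for f in schema_fields}
    out = {}
    for k, v in data.items():
        sk = squash(k)
        if sk in canon:                 # exact match after squashing case/punct
            out[canon[sk]] = v
            continue
        # fall back to a prefix match: "phone" -> "phoneNumber"
        hits = [f for s, f in canon.items()
                if s.startswith(sk) or sk.startswith(s)]
        if len(hits) == 1:              # only remap when it's unambiguous
            out[hits[0]] = v
    return out
```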

tool call echo — a few models would call search() with the exact query i gave them in the prompt, then search() again with the same query, then search() a third time. no variation. if the first search didn't get what they needed, trying it again wasn't going to help. this looks like a failure mode where the model can identify what it needs but doesn't know how to rephrase a search that isn't working.
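a cheap guard is to refuse a verbatim repeat and return that refusal as a tool error, forcing the model to rephrase. a sketch (hypothetical mitigation, not described in the post):

```python
def should_allow_search(query: str, past_queries: set) -> bool:
    """Reject a search we've already run verbatim (modulo case/whitespace)."""
    normalized = " ".join(query.lower().split())
    if normalized in past_queries:
        return False   # surface this to the model as a tool error
    past_queries.add(normalized)
    return True
```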

finish-then-continue — i saw this in gpt-4o a few times: it would call finish() with a complete result, and then in the same response also call search(). two tool calls in one response, one of which terminates the loop. i had to add logic to treat any response containing finish() as terminal regardless of other tool calls.
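the "finish is terminal" rule is tiny but worth being explicit about in code. roughly (a sketch of the rule, not the actual loop code):

```python
def select_terminal(tool_calls: list):
    """If any call in the response is finish(), the loop ends there."""
    # ignore whatever else the model asked for in the same turn
    for call in tool_calls:
        if call.get("tool") == "finish":
            return call
    return None
```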

context poisoning — when an early scrape returned a long, noisy page, some models would fixate on information from that page even when later scrapes had better data. the first thing in context exerts disproportionate influence on the extraction. i partially worked around this by summarizing and deduplicating before appending to context, but it's still a real problem on messy pages.
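the dedupe half of that workaround can be as simple as dropping paragraphs that are already in context. a sketch under that assumption (the real summarization step is not shown):

```python
def dedupe_page(page: str, seen_blocks: set) -> str:
    """Drop paragraphs already appended to context from earlier scrapes."""
    # so one noisy early page can't keep repeating itself through every
    # later scrape and dominate the extraction
    kept = []
    for block in page.split("\n\n"):
        key = " ".join(block.split()).lower()
        if key and key not in seen_blocks:
            seen_blocks.add(key)
            kept.append(block)
    return "\n\n".join(kept)
```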

what actually matters for this use case

after all of this, the properties i care about in order:

  1. tool call format reliability — if i have to write defensive normalization code for a model's output, that model is expensive in a way that doesn't show up in the price per token
  2. stopping behavior — a model that loops to the iteration limit on every job costs 3× what it should
  3. hallucination rate — wrong data that looks right is worse than missing data
  4. cost — only after the above

accuracy and field coverage matter but they're table stakes. the behavioral properties are what differentiate models for agentic use cases, and they're not what the benchmarks measure.

where i am now

minimax m2.5 is running in production. the agent endpoint works well enough that i'm building things on top of it. average cost is under $0.05 per job. average iterations is 3–4. field coverage on my test suite is 91%.

i'm keeping the model configurable — it's an environment variable, not hardcoded. the models will keep improving and i expect to swap this out in a few months. the eval harness i built is the thing that's actually valuable here. i can test a new model against the same 20 cases in about 10 minutes and see exactly where it falls down.

if you're building something similar, the main thing i'd tell you is to build the eval harness before you start model-shopping. without concrete pass/fail criteria, you'll spend a week convincing yourself that whatever model you're currently testing is working fine.
