How do you combat silent failures?

For example: I'm scraping website A and getting 500+ PDF files. Then the site changes its layout and the ETL breaks; we auto-regenerate it with Claude, but afterwards we get only 450 PDFs. The orchestrator still marks the run as successful, yet we receive only part of the data.

Or: the ETL for website B breaks. We repair it with our agentic solution and the run completes without errors, but we start missing a few fields that were moved to another sub-page.

Did you encounter any such issues?
Quick clarification: the AI agent writes the config once and is out of the loop after that. You run crawls yourself or via cron. So the "auto-regenerate and silently get wrong data" scenario doesn't quite apply since there's no agent in the runtime loop.
But configs going stale is a real problem. Two things help:
1. The agent tests on 5 real pages before saving any config. Empty fields = rewrite before it hits production.
2. `./scrapai health --project <n>` tests all your spiders and flags extraction failures. We run it monthly via cron. Broken spider? Point the agent at it and it re-analyzes the site and fixes the config.
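The pre-save check in point 1 can be sketched roughly like this (a minimal illustration, not the actual implementation; the `title`/`url` field names and the record shape are assumptions):

```python
from typing import Dict, List, Tuple

def config_needs_rewrite(sample_records: List[Dict[str, str]],
                         required: Tuple[str, ...] = ("title", "url")) -> bool:
    """Return True if any sampled page yielded an empty required field.

    Mirrors the pre-save gate: extract from a handful of real pages
    and reject the config before it reaches production if any
    required field came back empty.
    """
    for record in sample_records:
        for field in required:
            value = record.get(field) or ""
            if not value.strip():
                return True
    return False

# One of the sample pages returned an empty title, so this config
# would be sent back to the agent for a rewrite.
samples = [
    {"title": "Annual report", "url": "https://a.example/1.pdf"},
    {"title": "", "url": "https://a.example/2.pdf"},
]
print(config_needs_rewrite(samples))
```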
The remaining gap is a drop in result count (your 500 → 450 example). Health checks catch broken extraction, not "fewer pages matched." We list structural change detection as an open contribution area in the README.
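One way to narrow that gap is a per-spider count baseline (a sketch only; the rolling-median baseline and the 10% tolerance are assumptions, and nothing like this currently ships):

```python
from statistics import median
from typing import List, Optional

def count_drop_alert(history: List[int], current: int,
                     tolerance: float = 0.9) -> Optional[str]:
    """Flag a run whose item count falls well below the recent baseline.

    `history` holds item counts from previous successful runs. If the
    current run yields fewer than tolerance * median(history) items,
    return a warning instead of silently accepting the run.
    """
    if not history:
        return None  # no baseline yet, nothing to compare against
    baseline = median(history)
    if current < tolerance * baseline:
        return f"item count dropped: got {current}, baseline ~{baseline:.0f}"
    return None

# Roughly 500 PDFs historically; the repaired spider now returns 450.
print(count_drop_alert([512, 498, 503, 507], 450))
```

A monthly cron job could run a check like this alongside `./scrapai health` and surface the warning rather than trusting the exit code alone.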