
Ironically, 'source checking' is something AI is quite good at.


There's nuance to that. An LLM is quite capable of suggesting relevant reading, given the context. Especially when the context is broad enough that there's enough training data.

"Find me research on code reviews, their size, and quality" would give you more than enough reading. Yet if you start with a claim, like "Longer PRs mean worse defect detection," the relevant data points become few enough that the AI starts hallucinating.

You get "something, something, PR length, defect detection, IDK, I don't read research papers." Such output is fine as long as the author cares to validate it.

Skip the second step, and you might be good if you ask about something generic, like "What's the Slack story?" or "How did Blockbuster go bust?" Ask about some specific details, though, and you're bound to end up with made-up stuff that sounds just about right, while it's actually wrong.


Checking is different from finding, though. Source checking means just "verify that this information is actually present in that document". Much harder to hallucinate in this case.
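
To make "present in that document" concrete, here is a minimal sketch of the crudest form of checking: a literal lookup, the scripted equivalent of Ctrl-F. The function names and the sample text are made up for illustration, and a real check would also need to handle paraphrase, which this does not attempt.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line wraps don't break matching."""
    return re.sub(r"\s+", " ", text).strip().lower()


def quote_is_present(quote: str, document: str) -> bool:
    """True if the quote appears in the document, ignoring case and whitespace."""
    needle = normalize(quote)
    if not needle:
        # An empty quote would trivially match everything; treat it as unverified.
        return False
    return needle in normalize(document)


# Made-up document text, purely for illustration.
doc = "Longer pull requests\nwere associated with fewer   review comments per line."

print(quote_is_present("fewer review comments per line", doc))
print(quote_is_present("worse defect detection", doc))
```

The first lookup finds the phrase and the second does not. That is the whole point: the answer comes from the document in hand, not from memory.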


A quick smoke check takes just a few minutes.

"Follow each link in this document. Read each link's contents against the contents in this document. Create a report: for each link list a working hyperlink, whether it exists, what claim it supports, whether it supports or fails to support it, and why"

If it returns a report claiming all correct? That's promising, but human verification is still important. You've got a list of hyperlinks and a list of claims, so you can middle-click each link, Ctrl-F until you find the point, and close the tab when you do.

If you find any discrepancies? Your initial prompt was malformed and/or you picked the wrong LLM, the wrong human, or possibly all three. Either way, the results are built on quicksand; you'll need to start over.

If no sources are provided? Well now: "If there ain't no sources it never happened."

Compare double-entry bookkeeping. It all needs to add up. If you're 1 cent off, that means something is broken. Likewise, if a single reference is off, it has polluted the context. (This works for human-generated and hybrid documents too. Polluted reasoning is polluted reasoning. The process is what counts.)
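
The bookkeeping rule can be sketched in a few lines. This is only an illustration under assumptions: the `LinkCheck` fields mirror the columns asked for in the prompt above, the names are hypothetical, and the rows are invented examples, not real results.

```python
from dataclasses import dataclass


@dataclass
class LinkCheck:
    url: str        # the hyperlink as it appears in the document
    claim: str      # the claim the document attaches to it
    exists: bool    # did the link resolve at all?
    supports: bool  # does the linked page actually back the claim?


def discrepancies(rows: list[LinkCheck]) -> list[LinkCheck]:
    """Every row that is missing or fails to support its claim."""
    return [row for row in rows if not (row.exists and row.supports)]


def report_balances(rows: list[LinkCheck]) -> bool:
    """All-or-nothing: no sources counts as failure, and so does one bad row."""
    return bool(rows) and not discrepancies(rows)


# Invented rows, for illustration only.
rows = [
    LinkCheck("https://example.com/a", "claim A", exists=True, supports=True),
    LinkCheck("https://example.com/b", "claim B", exists=True, supports=False),
]

print(report_balances(rows))
print([row.url for row in discrepancies(rows)])
```

Note the deliberate lack of a "mostly fine" outcome: one failing row sinks the report, the same way being 1 cent off does.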


A quick smoke test, then. Gemini 3, Thinking Mode. The article: https://techtrenches.dev/p/the-human-cost-of-10x-how-ai-is-p... The prompt: literally what you suggested.

Gemini: The article focuses on the environmental and human labor costs of scaling Artificial Intelligence, specifically focusing on water usage, electricity, and "ghost work."

Which is hilarious, since the article doesn't even mention the words "water" or "electricity." Gemini remains unfazed, reporting the links that are not in the article (some don't exist at all) to make the final ruling: "The Tech Trenches document is highly accurate in its citations."

Now, I know. Had I used Claude Code with relevant skills, it would have done better. But would it be good?


Ah! I finally got you somewhat replicated! It's https://gemini.google.com , when you use the free model.

* https://gemini.google.com/share/6bd33176b27c

Right, so https://techtrenches.dev/p/the-human-cost-of-10x-how-ai-is-p... is actually a Substack, Gemini is blocked from accessing it, and it is bouncing off and hallucinating instead. OK, that's an actual bug: being blocked should not lead to the model hallucinating. IMO the correct response would have been to fail loudly, which would have been a verification signal of its own.

ps: See also: https://hackertimes.com/item?id=48087485 ... I'm starting to think of it as "English is a new scripting language". Clearly the downside is that certain "runtime environments" are not compatible. %-/


https://techtrenches.dev/p/the-human-cost-of-10x-how-ai-is-p... "Follow each link in this document. Read each link's contents against the contents in this document. Create a report: for each link list a working hyperlink, whether it exists, what claim it supports, whether it supports or fails to support it, and why. If unable to fetch the initial document, Stop and report failure."

And now it errors out on gemini.google.com. This is like early-days Unix scripting: I didn't add the equivalent of "#!/bin/bash -euo pipefail", and I didn't catch it because most systems already include something like it in their ".bashrc" (system prompt or weights) anyway.
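
For the non-English-scripting version of "fail loudly", here is a rough sketch of what a harness could do before any checking starts. It is an assumption about how one might wire it, not how any of the tools in this thread work; `FetchFailed`, `fetch_or_fail`, and the injectable `opener` parameter are hypothetical names. Only the standard-library `urllib` calls are real.

```python
import io
import urllib.error
import urllib.request


class FetchFailed(RuntimeError):
    """Raised when the initial document cannot be retrieved."""


def fetch_or_fail(url: str, opener=urllib.request.urlopen, timeout: float = 10) -> str:
    """Return the document text, or raise instead of carrying on with nothing."""
    try:
        with opener(url, timeout=timeout) as response:
            body = response.read()
    except OSError as exc:
        # URLError and HTTPError are both subclasses of OSError.
        raise FetchFailed(f"could not fetch {url}: {exc}") from exc
    if not body.strip():
        raise FetchFailed(f"fetched {url} but the body was empty")
    return body.decode("utf-8", errors="replace")


def blocked_opener(url, timeout=10):
    """Stand-in for a site that refuses the request (no network needed)."""
    raise urllib.error.URLError("access denied")


try:
    fetch_or_fail("https://example.invalid/article", opener=blocked_opener)
except FetchFailed as exc:
    print("STOP:", exc)
```

The `opener` argument exists only so the failure path can be exercised without touching the network. In real use you would leave it at its default.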

This is so frustrating. I'm sorry. It's like the 1980s 8-bit era again: some systems actually work, others are terrible, and I didn't realize it can be like this for some folks. You could come away with the conclusion that this whole "computer" thing is just a fad that'll never amount to anything. (Meanwhile, the program works perfectly on my own machine, right over here, of course %-) )


> Now, I know. Had I used Claude Code with relevant skills, it would have done better. But would it be good?

Wait. Why do I suddenly suspect you were on to me this whole time?

Very well. Here's a skill that does the thing; you tell me: https://vps.kimbruning.nl/link-verifier.skill

While building, I realized I could actually make the whole thing a lot better, and really dig into sources. But... it's a start.

+ Output on your url. Ugly, but works: https://claude.ai/public/artifacts/d465a07b-378c-4089-b885-6...


Gemini is famously bad at these things. Try using ChatGPT.


Interesting! Where did you apply it? Can you show your output in more detail?

It's more like a small script: it's supposed to extract URLs and generate a table.
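
For the purely mechanical half of that (pull out the URLs, lay out an empty table for a human or a model to fill in), a sketch might look like this. It is not the linked skill; the regex is deliberately rough, and the function names and table columns are assumptions for illustration.

```python
import re

# Rough URL pattern: stops at whitespace, angle brackets, parentheses, and square brackets.
URL_RE = re.compile(r"https?://[^\s<>()\[\]]+")


def extract_urls(text: str) -> list[str]:
    """Return each distinct URL in order of first appearance."""
    urls: list[str] = []
    for match in URL_RE.findall(text):
        url = match.rstrip(".,;:!?\"'")  # drop trailing sentence punctuation
        if url and url not in urls:
            urls.append(url)
    return urls


def as_table(urls: list[str]) -> str:
    """Plain-text checklist with one row per URL and blank verdict columns."""
    lines = ["# | URL | claim it supports | verdict"]
    for number, url in enumerate(urls, start=1):
        lines.append(f"{number} | {url} | (fill in) | (fill in)")
    return "\n".join(lines)


sample = "See https://example.com/a, and (https://example.org/b). Again: https://example.com/a"
print(as_table(extract_urls(sample)))
```

Filling in the last two columns is the part that needs reading, whether by a model or by you.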

Here's my result in Claude Web for comparison:

https://claude.ai/public/artifacts/d76936f2-c97b-4bff-9205-2...

Claude Web finds a number of small discrepancies in the sources, which I manually crosschecked; they seem consistent with a human mixing things up slightly.

+ I also tested in Gemini 3 Flash Preview, which generates an actual table (twice). It doesn't flag any discrepancies, which is consistent with it being a weaker model. But the URLs and claims are listed and line up, so you've got your verification table to work with. (It's a semantic formatting task, so that part would be hard to mess up.)

+ Gemini 3.1 Pro yields a fairly aggressive report. https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...

+ ChatGPT free (specific model not listed) needed 2 tries and didn't properly follow the prompt even then. I guess I got what I paid for, and I had to download the output: https://vps.kimbruning.nl/productivity/Ai%20Productivity%20A... (pdf), https://vps.kimbruning.nl/productivity/ai_productivity_artic... (md)

+ Kimi K2.6 instant: https://www.kimi.com/share/19e2cc40-d012-89bf-8000-00006267f...

+ Summary of results: https://claude.ai/public/artifacts/10a42111-a0ee-42f3-b6d2-a... All of the models extracted the URLs into a table just fine, and that part at least is a lot easier than writing a Perl script used to be in the '90s ;-) . The first part is the important bit, so you as a human can "check your fucking sources". The second part the models handle with varying success: each finds some discrepancies, none finds all of them. That makes sense: this is a fairly polished piece, and ideally it shouldn't have discrepancies at all to begin with.

So: it worked as a smoke check just fine in the above. Doing more than a quick smoke check obviously requires a somewhat more involved procedure.


I would love to do it at scale on many online publications, and publish the results. That would teach 'em.


Have we forgotten how bad LLMs were at citing sources when they first came out? So, we had to build a lot of structure (harness engineering) and frontier labs had to do specific training to try to compensate for this.

So, LLMs are inherently bad at citing sources. A lot of effort has been put in to improve this behavior, but it's compensating for an inherent flaw.


Huh? Oh! Were they still treating the LLM as an "oracle box"/online chatbot at the time? (as opposed to a more agentic workflow?)

If they weren't, ignore what follows, and please tell me what else was going wrong (and with which models and harnesses!).

Model weights are like Wikipedia: a nice starting point, but never something to reference directly. You need to have your agent actually go out onto the internet and do the research. Then the actual references will be in your agent's context (memory), so it would at least be rather more surprising if it didn't cite correctly.

I do realize there are still corner cases even in the best setups, though, so a final crosscheck sweep is never a bad idea.


I mostly disagree with this. You can request sources, you can ask it to check, but no LLM I have used can do this correctly more than 50-75% of the time, and some of the major models are extremely bad at this: giving broken links 90% of the time, incapable of giving actual links rather than search engine links, etc. Constant supervision and repetition of requests can sometimes get results, but it is exhausting. The "sources" it finds are often Reddit posts or other questionable secondary or tertiary sources, not actual original sources.


I disagree. It is a bullshit machine all the way to the core. LLMs in my world fail to cite full sources and consistently conclude with guesses as facts. It does this much more than an average journalist or reporter would. Only when you double-check it will it then apologize and correct itself.


Judging by the number of scientific papers that have been outed as AI-generated precisely because they contained hallucinated sources, it's not.


Citation needed, please


Personal experience? You ask it for the name of the paper referenced. You google that paper (for some reason it's not great at going out and acquiring the paper). You then upload the pdf and ask it if the paper supports the assertion if it's not quickly findable via ^F. You go read, ask it clarifying questions about hazard ratios, what they controlled for, etc.

AI is quite good when grounded in a source.



