
flail · 2026-05-15T16:41:20 1778863280

That thought crossed my mind. However, for such a product to work, there would have to be a human in the loop. With data-starved edge cases, of which there are many in the fact-checking landscape, it would be relatively easy for an LLM to make stuff up or mislabel the context (which it inherently does not understand).

Also, thorough validation would cost a ton in tokens. So it would be expensive in both tech (AI bills) and labor. Now, in whose interest would it be to fund such a product? I don't see too many takers...

flail · 2026-05-15T16:35:50 1778862950

Um, is Snopes wrong about city cleaning crows, though? As that was the context of the original post. Which, by the way, doesn't say "Go, trust Snopes with everything; they can't be wrong!"

tinfoilhatter · 2026-05-21T14:51:00 1779375060

The context of the original post was that people should check the sources they are sharing with others, and Snopes was suggested as a source worth sharing. I disagree that Snopes is a reliable source of information or has any interest in presenting a fair and balanced narrative, because of the conflict of interest created by who / what groups fund them. Same with Politifact, or any other fact-checking organization.

flail · 2026-05-15T16:30:06 1778862606

Fundamentally, yes, it is a different "search engine."

BTW, as critical as I can be of AI, the argument that something didn't work 3 years ago, so it must be crap, doesn't hold in this context. 3 years ago, AI could barely generate several lines of consistent code. Now, it generates working apps with a prompt (it's another discussion how good the code is, but still).

I guess 3 years ago, Gemini couldn't tell how many r's are in the word refrigerator.

Same for research. At some point, I switched from ChatGPT and Gemini to Perplexity as it promised AI-powered search. It worked visibly better. Until it didn't, as GPT and Gemini models made a leap.

Back to the point, as long as we understand that, for now, it's all just a probabilistic machine generating the most likely output, no one should expect bulletproof answers. Search was/is way more deterministic than LLMs.

andai · 2026-05-15T22:11:02 1778883062

> 3 years ago, AI could barely generate several lines of consistent code

Nah, not really. The same summer I made a self-modifying AI programmer. Or more precisely, GPT-4 made it. I copy pasted the classes from the chat one by one.

(I also told it what was wrong with the previous attempt at such a system, so it designed around that.)

That model was also scary good at deobfuscation.

I think it did have that issue about randomly saying the opposite of facts though (the fact but stated in the negative for some reason).

flail · 2026-05-15T15:03:48 1778857428

Ultimate credibility? Sure, they never did. Yet the whole thing Google was built upon was using links as tokens of credibility.

You'd assume an outgoing link from a CNN website has more credibility than one from an anonymous blog. That is, I reckon, still true. Although the credibility either link conveys is degrading. Again, it has been so since we started playing the game of SEO, yet AI-generated content in this context is basically a weapon of mass destruction. The deterioration has sped up dramatically.

flail · 2026-05-15T14:58:08 1778857088

There's nuance to that. An LLM is quite capable of suggesting relevant reading, given the context. Especially when the context is broad enough that there's enough training data.

"Find me research on code reviews, their size, and quality" would give you more than enough reading. Yet, if you start with a claim, like "Longer PRs mean worse defect detection," the relevant data points fall to few enough for AI to start hallucinating.

You get "something, something, PR length, defect detection, IDK, I don't read research papers." Such output is fine as long as the author cares to validate it.

Skip the second step, and you might be good if you ask about something generic, like "What's the Slack story?" or "How did Blockbuster go bust?" Ask about some specific details, though, and you're bound to end up with made-up stuff that sounds just about right, while it's actually wrong.

throw310822 · 2026-05-15T15:15:59 1778858159

Checking is different from finding, though. Source checking means just "verify that this information is actually present in that document". Much harder to hallucinate in this case.

Kim_Bruning · 2026-05-15T15:34:32 1778859272

A quick smoke check takes just a few minutes.

"Follow each link in this document. Read each link's contents against the contents in this document. Create a report: for each link list a working hyperlink, whether it exists, what claim it supports, whether it supports or fails to support it, and why"

If it returns a report claiming all correct? That's promising, but human verification is important. You've got a list of hyperlinks, and a list of claims; so you can click each with middle-mouse, Ctrl-F 'till you find the point, and close the tab when you do.

If you find any discrepancies? Your initial prompt was malformed and/or you picked the wrong LLM, the wrong human, or possibly all three. Whatever the way, the results are built on quicksand; you'll need to start over.

If no sources are provided? Well now: "If there ain't no sources it never happened."

Compare double-entry bookkeeping. It needs to all add up. If you're 1 cent off, that means something is broken. Idem if a single reference is off, it polluted the context. (This works for human-generated and hybrid documents too. Polluted reasoning is polluted reasoning. The process is what counts.)
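
The mechanical half of that smoke check (pull out every link, pair it with the claim it sits next to, leave a column for the human) can be sketched in plain Python. This is only an illustration, assuming the document is Markdown with inline `[text](url)` links; `link_checklist` and the `verified` field are made-up names, and the sentence splitting is deliberately naive.

```python
import re

# Matches inline Markdown links of the form [anchor text](https://example.com).
LINK_RE = re.compile(r"\[([^\]]+)\]\((https?://[^\s)]+)\)")


def link_checklist(markdown):
    """Return one row per link: URL, anchor text, and the sentence it appears in."""
    rows = []
    for paragraph in markdown.split("\n\n"):
        # Naive sentence split on ., ! or ? followed by whitespace (an assumption;
        # good enough for a checklist a human will read anyway).
        for sentence in re.split(r"(?<=[.!?])\s+", paragraph.strip()):
            for anchor, url in LINK_RE.findall(sentence):
                rows.append(
                    {
                        "url": url,
                        "anchor": anchor,
                        "claim": sentence,
                        "verified": None,  # filled in by the human, not the script
                    }
                )
    return rows


if __name__ == "__main__":
    doc = "Teams ship faster [study](https://example.com/a). No link here."
    for row in link_checklist(doc):
        print(row["url"], "|", row["claim"])
```

Whether a source actually supports the claim is still the part that needs a model or a person; the table just makes sure no link is skipped.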

flail · 2026-05-15T17:00:39 1778864439

A quick smoke test, then. Gemini 3, Thinking Mode. The article: https://techtrenches.dev/p/the-human-cost-of-10x-how-ai-is-p... The prompt: literally what you suggested.

Gemini: The article focuses on the environmental and human labor costs of scaling Artificial Intelligence, specifically focusing on water usage, electricity, and "ghost work."

Which is hilarious, since the article doesn't even mention the words "water" or "electricity." Gemini remains unfazed, reporting the links that are not in the article (some don't exist at all) to make the final ruling: "The Tech Trenches document is highly accurate in its citations."

Now, I know. Had I used Claude Code with relevant skills, it would have done better. But would it be good?

Kim_Bruning · 2026-05-15T18:42:34 1778870554

Ah! I finally got you somewhat replicated! It's https://gemini.google.com , when you use the free model.

* https://gemini.google.com/share/6bd33176b27c

Right, so https://techtrenches.dev/p/the-human-cost-of-10x-how-ai-is-p... is actually a substack, gemini is blocked from accessing it, and is bouncing off and hallucinating instead. Ok, that's an actual bug, that should not lead to the model starting to hallucinate. Imo the correct response should have been to fail loudly; which would have been a verification signal of its own.

ps: See also: https://hackertimes.com/item?id=48087485 ... I'm starting to think of it as "english is a new scripting language". Clearly the downside is that certain "runtime environments" are not compatible. %-/

Kim_Bruning · 2026-05-15T20:24:06 1778876646

https://techtrenches.dev/p/the-human-cost-of-10x-how-ai-is-p... "Follow each link in this document. Read each link's contents against the contents in this document. Create a report: for each link list a working hyperlink, whether it exists, what claim it supports, whether it supports or fails to support it, and why. If unable to fetch the initial document, Stop and report failure."

And now it errors out on gemini.google.com. This is like early days unix scripting; I didn't add the equivalent of "#!/bin/bash -euo pipefail"; and I didn't catch it because most systems already include something like it in their ".bashrc" (system prompt or weights) anyway.

This is so frustrating. I'm sorry. It's like the 1980s 8-bit era again: some systems actually work, others are terrible, and I didn't realize it can be like this for some folks. You could come away with the conclusion that this whole "computer" thing is all just a fad that'll never amount to anything. (Meanwhile, the program works perfectly on my own machine, right over here, of course %-) )
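
For comparison, the "stop and report failure" clause looks roughly like this when the fetch is done by ordinary code instead of a prompt. A minimal sketch using only the standard library; `fetch_or_fail`, `SourceUnavailable`, and the 200-character threshold are illustrative assumptions, not anything the tools in this thread are known to do.

```python
import urllib.error
import urllib.request


class SourceUnavailable(RuntimeError):
    """The document under review could not be retrieved."""


def fetch_or_fail(url, opener=urllib.request.urlopen, min_chars=200):
    """Return the page body, or raise instead of silently handing back nothing."""
    try:
        with opener(url, timeout=10) as response:
            body = response.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, TimeoutError) as exc:
        # HTTPError (403, 404, ...) is a subclass of URLError, so blocks land here.
        raise SourceUnavailable(f"could not fetch {url}: {exc}") from exc
    if len(body.strip()) < min_chars:
        # Arbitrary threshold (an assumption): a near-empty page may be a block
        # page or a stub rather than the article.
        raise SourceUnavailable(f"{url} returned only {len(body)} characters")
    return body
```

The `opener` parameter is there only so the function can be exercised without network access; the point is that every later step runs on a real document or not at all.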

Kim_Bruning · 2026-05-15T21:56:31 1778882191

> Now, I know. Had I used Claude Code with relevant skills, it would have done better. But would it be good?

Wait. Why do I suddenly suspect you were on to me this whole time?

Very Well. Here's a skill that does the thing; you tell me: https://vps.kimbruning.nl/link-verifier.skill

While building, I realized I could actually make the whole thing a lot better, and really dig into sources. But... it's a start.

+ Output on your url. Ugly, but works: https://claude.ai/public/artifacts/d465a07b-378c-4089-b885-6...

simianwords · 2026-05-15T17:44:57 1778867097

Gemini is famously bad at these things. Try using ChatGPT.

Kim_Bruning · 2026-05-15T17:17:15 1778865435

Interesting! Where did you apply it? Can you show your output in more detail?

It's more like a small script, and it's supposed to extract urls and generate a table.

Here's my result in Claude Web for comparison:

https://claude.ai/public/artifacts/d76936f2-c97b-4bff-9205-2...

Claude web finds a number of small discrepancies in the sources, which I manually crosschecked and seem consistent with a human mixing things up slightly.

+ I also tested in gemini 3 flash preview, which generates an actual table (twice). It doesn't flag any discrepancies, which is consistent with it being a weaker model. But the urls and claims are listed and line up, so you've got your verification table to work with. (it's a semantic formatting task, so that part would be hard to mess up)

+ Gemini 3.1 pro yields a fairly aggressive report. https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...

+ ChatGPT free (specific model not listed) needed 2 tries, didn't properly follow the prompt even then. I guess I got what I paid for, and I needed to download; https://vps.kimbruning.nl/productivity/Ai%20Productivity%20A... (pdf), https://vps.kimbruning.nl/productivity/ai_productivity_artic... (md)

+ Kimi K2.6 instant: https://www.kimi.com/share/19e2cc40-d012-89bf-8000-00006267f...

+ Summary of results. https://claude.ai/public/artifacts/10a42111-a0ee-42f3-b6d2-a... All of the models extracted the URLs into a table just fine, and that part at least is a lot easier than writing a perl script used to be in the '90s ;-) . The first part is the important bit so you as a human can "check your fucking sources". The second part the models handle variously, each does find discrepancies. None of them find all of them, but that makes sense: this is a fairly polished piece and it ideally shouldn't have discrepancies at all to begin with.

So: it worked as a smoke check just fine in the above. Doing more than a quick smoke check obviously requires a somewhat more involved procedure.

throw310822 · 2026-05-15T15:57:39 1778860659

I would love to do it at scale on many online publications, and publish the results. That would teach 'em.

flail · 2026-05-13T13:43:06 1778679786

Everything else being the same, more resources are better than less. Yet, VC money comes with strings attached.

VC doesn't want a startup to become just a healthy business. It needs to grow at a breakneck speed. In fact, for a VC, it's better to put pressure on somewhat successful startups to take a moonshot at becoming unicorns, even at the grave risk of going bust instead.

The expectation to spend the funding round in 12-18 months is a well-established pattern. So you get millions, but you have to spend it fast.

Running a product development consultancy, I routinely see products/businesses that could have been built for a fraction of what they cost. You don't need to hire hundreds of developers (pre-2026) and instantly have huge misalignment and coordination issues. You don't need to tokenmaxx the crap out of everything (2026), ballooning your AI spend and generating a ton of bloat. That is, unless someone pressures you to spend fast because it's their shot at you becoming a unicorn.

flail · 2026-04-20T23:13:27 1776726807

"Even without AI-generated code, [code review] is already a major failing."

By all means, yes. Yet, it feels like we were playing a catch-up game (to a degree), and no one intentionally shipped unreviewed code. Now, reviewed without comprehension becomes standard, unreviewed & unread increasingly happens. That's a different kind of reckless.

"Open source adds thousands of (unpaid) eyes to code review."

True. And the open source community sees a massive inflow of AI-generated pull requests, which floods their capabilities to review. Leaving the ecosystem as it is means it will be dead. Thus, I assume resistance or evolution. And we definitely see some of the former, with some open source codebases being closed for AI contributions.

"Hiring is now 'My AI versus Your AI'; and the former real need of 'A qualified person for a suitable job' is lost in the fallout."

Yes, that's where hiring has headed. Which, coincidentally, has made everyone worse off (save for AI-for-hiring app providers). Candidates find it harder to land a decent job. Companies talk to people who play the AI hiring game better, not the most suitable candidates. All while having the same number of candidates and the same number of jobs, but 100x as many resumes exchanged: https://brodzinski.com/2025/08/broken-ai-hiring.html

Which basically means that a resume has lost its value as a token of information exchange. And since we base the whole process on this very assumption (resume as a token of information), the system is due to be rewired eventually. And sooner rather than later. One random idea: how about creating limited traffic where people actually care at least enough to pay some token money: https://brodzinski.com/2025/12/pay-for-resume-read.html

"My humble suggestion is that our ultimate question be phrased as 'How much is enough?'"

Perfect question if we start from the grand scheme of things. I am afraid, though, that there is never enough. At some point, another billion means increased status. You could buy everything with the billions you had previously, so right now it's a virtual leaderboard between you and other billionaires. And the status game is, indeed, infinite. If you aren't winning now, you can chase the leader. If you are the leader, you try to escape the chase.

The "enough" question doesn't work quite as well in a finer-grained context. If we want to figure out things like the evolution of a specific profession. Or consider how digital products will be built in the future. Or how well outsourcing your content generation to an AI agent would work in the long run.

flail · 2026-02-27T15:02:57 1772204577

Security is an even bigger issue than it looks at first glance. While security risk by omission was always a thing (AI or not), now we face a whole new level of risks, from prompt injection to creating malicious libraries to be used by coding agents: https://garymarcus.substack.com/p/llms-coding-agents-securit...

The most shallow security, however, seems easier. Now, you can get through an automated AI security audit every day for (basically) free. You don't have to hire specialists to run pen tests.

Which makes the whole thing even more challenging. Safe on the surface while vulnerable in the details creates a false sense of safety.

Yet, all these would be a concern only once a product is at all successful. Once it is, hypothetically, the company behind it should have money to fix the vulnerabilities (I know, "hypothetically"). The maintenance cost hits way earlier than that. It will kick in even for a pet personal project, which is isolated from the broader internet. So I treat it as an early filter, which will reduce the enthusiasm of wannabe founders.

pipejosh · 2026-02-27T15:50:51 1772207451

The automated audit only covers static analysis. When the agent actually runs, hitting MCP servers, making HTTP calls, getting responses back, that's where the real problems show up. Prompt injection through tool responses, malicious libraries that exfiltrate env vars, SSRF from agents that blindly follow redirects. Code audits miss all of it because this is a runtime and network problem, not a code quality problem.

Built Pipelock for this actually. It's a network proxy that sits between the agent and everything it talks to. Still early but the gap is real. https://github.com/luckyPipewrench/pipelock
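
To make the redirect case concrete, here is roughly what the smallest possible egress check looks like. This is not a description of how Pipelock works, just a sketch of the idea; `ALLOWED_HOSTS`, `is_allowed`, and `first_blocked_hop` are hypothetical names, and a real proxy would also have to deal with DNS rebinding, which this ignores.

```python
import ipaddress
from typing import List, Optional
from urllib.parse import urlparse

# Hypothetical allowlist; in practice this would come from configuration.
ALLOWED_HOSTS = {"api.example.com", "docs.example.com"}


def is_allowed(url: str) -> bool:
    """Allow only HTTPS requests to named hosts on the allowlist."""
    parsed = urlparse(url)
    host = parsed.hostname
    if parsed.scheme != "https" or host is None:
        return False
    try:
        ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal, so treat it as a hostname and check the allowlist.
        return host in ALLOWED_HOSTS
    # Literal IPs (cloud metadata endpoints, localhost, ...) are refused outright.
    return False


def first_blocked_hop(chain: List[str]) -> Optional[str]:
    """Given every URL in a redirect chain, return the first one that is refused."""
    for url in chain:
        if not is_allowed(url):
            return url
    return None
```

The check has to run on every hop of a redirect chain, not just the first URL, which is exactly what an agent that blindly follows redirects skips.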

flail · 2026-02-27T16:13:11 1772208791

Yes. And the more autonomously we create code, the more of these (and not only these) vulnerabilities we'll be adding. Combine that with the AI-automation in attacks, and you have an all-out security mess.

It's like a Petri dish for inventing new angles of security attacks.

Oh, and let's not forget that coding agents are non-deterministic. The same prompt will yield a different result each time. Especially for more complex tasks. So it's probably enough to wait till the vibe-coded product "slips." Ultimately, as a black hat hacker, I don't need all products to be vulnerable. I can work with those few that are.

pipejosh · 2026-02-27T16:18:58 1772209138

Agreed. The non-determinism makes traditional testing basically useless here. You can't write a test suite for "the agent decided to do something unexpected this time." Logging and runtime checks are the only way to catch the weird edge cases.
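
A runtime check in this sense can be as small as a wrapper around each tool the agent is handed: log every call, refuse the ones outside policy. A minimal sketch with invented names (`guarded`, `PolicyViolation`, the `agent.audit` logger); a real setup would need a much richer policy than a set of tool names.

```python
import logging
from typing import Any, Callable, Set

logger = logging.getLogger("agent.audit")


class PolicyViolation(RuntimeError):
    """The agent tried to use a tool it was not granted."""


def guarded(tool_name: str, tool: Callable[..., Any], allowed: Set[str]) -> Callable[..., Any]:
    """Wrap a tool so every call is logged and disallowed tools are refused."""

    def wrapper(*args: Any, **kwargs: Any) -> Any:
        # Log before deciding, so blocked attempts also leave a trace.
        logger.info("tool=%s args=%r kwargs=%r", tool_name, args, kwargs)
        if tool_name not in allowed:
            logger.warning("blocked tool=%s", tool_name)
            raise PolicyViolation(f"tool {tool_name!r} is not permitted")
        return tool(*args, **kwargs)

    return wrapper
```

It won't tell you why the agent went off-script on this particular run, but the attempt is logged and stopped regardless of which run it happens on.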

flail · 2026-02-19T17:02:21 1771520541

The question is not whether we like or want subscriptions, but rather whether we're used to them. And the answer is yes.

Given the choice, we'd be using Spotifys and Netflixes for free, and have ad-free Google. I don't expect that choice to be given to us.

AI tools won't change anything on that account. At best, we'll switch one subscription for another one, except that the latter will add a bill for the tokens we use.

flail · 2025-12-12T17:07:16 1765559236

There's a huge difference between nurses or teachers and Ivy League students. Namely, the former are not remotely as prestigious. I highly doubt there are 20 candidates for each nurse or teacher job.

Affirmative action happens when we discuss privileged positions. Spots at Ivy League colleges definitely are positions of privilege.

So if the situation under consideration were nursing, there wouldn't be such a discussion because there wouldn't be affirmative action in place.