When will you all learn that merely "telling" an LLM not to do something won't d...

Twirrim · 2026-03-30T02:15:52 1774836952

Even worse, explicitly telling it not to do something makes it more likely to do it. It's not intelligent. It's a probability machine writ large. If you say "don't git push --force", that command is now part of the context window, dramatically raising the probability of it being "thought" about and appearing in the output.

Like you say, the only way to stop it from doing something is to make it impossible for it to do so. Shove it in a container. Build LLM-safe wrappers around the tools you want it to be able to run, so that when it runs e.g. `git`, it can only do operations you've already decided are fine.
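
To make the wrapper idea concrete, here is a minimal sketch of what such a `git` shim might look like. The allow-list contents and the `REAL_GIT` path are assumptions for illustration, not anyone's actual setup, and checking only the subcommand does not cover dangerous flags on an otherwise allowed subcommand.

```python
import os
import sys

# Hypothetical allow-list: subcommands decided in advance to be fine.
ALLOWED_SUBCOMMANDS = {"status", "diff", "log", "add", "commit"}

# Hypothetical location of the real binary; adjust for your system.
REAL_GIT = "/usr/bin/git"


def is_allowed(args):
    """Return True only if the first argument is an allow-listed subcommand.

    Anything else is refused, including leading global options such as
    `-c` or `-C`, because they are not in the allow-list either.
    """
    if not args:
        return False
    return args[0] in ALLOWED_SUBCOMMANDS


def main(argv):
    """Refuse anything off the list, otherwise hand over to the real git."""
    args = argv[1:]
    if not is_allowed(args):
        print("git wrapper: refused:", " ".join(args), file=sys.stderr)
        return 1
    # Replace this process with the real git; does not return on success.
    os.execv(REAL_GIT, [REAL_GIT, *args])


# When installed as a script named `git`, the entry point would be:
#     sys.exit(main(sys.argv))
```

With this list, `is_allowed(["status"])` passes and `is_allowed(["push", "--force"])` is refused because `push` was never allowed in the first place. The shim only helps if the agent cannot reach the real binary some other way, which is what the container is for.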

juped · 2026-03-30T05:44:30 1774849470

Even even worse, angry all-caps shouting will make it more stupid, because it pushes you into a significantly stupider vector subspace full of angry all-caps shouting. The only thing that can possibly save you then is if you land in the even tinier Film Crit Hulk sub-subspace.

I touch on this a bit in the piece I wrote for normies, it helped a lot of people I know understand the tech a bit better.

svnt · 2026-03-30T09:12:45 1774861965

Is this true for anything beyond the simplest LLM architectures? It seems like as soon as you introduce something like CoT this is no longer the case, at least in terms of mechanism, if not outcome.

LuxBennu · 2026-03-30T03:22:26 1774840946

This is true for prohibitions, but CLAUDE.md works really well as positive documentation. I run custom MCP servers, and documenting what each tool does and when to use it made Claude pick the right ones way more reliably. Totally different outcome than a list of NEVER DO THIS rules though. For that you definitely need hooks or sandboxing.

trenchgun · 2026-03-30T06:48:39 1774853319

Yes, but this is probabilistic. Skills, documentation, etc. help by giving it the information it needs. You are then in the more correct probability distribution. Fine for docs, tips, etc., but not good enough for mandatory things.

dolmen · 2026-03-30T07:07:07 1774854427

"more reliably" is still not "reliably".

xtajv · 2026-03-30T15:54:54 1774886094

The phrase "don't give them ideas" comes to mind.

heyethan · 2026-03-30T02:50:57 1774839057

Feels like a lot of people are still treating these tools like “smart scripts” instead of systems with failure modes.

Telling it not to do something is basically just nudging probabilities. If the action is available, it’s always somewhere in the distribution.

Which is why the boundary has to be outside the model, not inside the prompt.

viktorianer · 2026-03-30T22:28:16 1774909696

Agree completely. The middle ground between "please don't" and full sandboxing: run a validation script between agent steps. The agent writes code, a regex check catches banned patterns, the agent has to fix them before it can proceed. Sandboxing controls what the agent can do. Output validation controls what it gets to keep. Both are more reliable than prompt instructions.
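
A rough sketch of that kind of check between agent steps, assuming the agent's output is available as text. The banned patterns here are hypothetical placeholders; a real list would be project-specific.

```python
import re

# Hypothetical banned patterns, for illustration only.
BANNED_PATTERNS = {
    "force push": re.compile(r"git\s+push\s+.*--force"),
    "eval call": re.compile(r"\beval\s*\("),
}


def find_violations(text):
    """Return a (line_number, rule_name) pair for every banned pattern hit."""
    violations = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in BANNED_PATTERNS.items():
            if pattern.search(line):
                violations.append((lineno, name))
    return violations
```

The surrounding loop would reject the step and feed the violations back to the agent whenever the list is non-empty. Regex checks like this are easy to evade with rephrasing, so they work as a filter on what gets kept, not as a substitute for sandboxing.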

DrewADesign · 2026-03-30T01:15:49 1774833349

That’s right, because we’re not developers anymore; we orchestrate writhing piles of insane noobs that generally know how to code, but have absolutely no instinct or common sense. This is because it’s cheaper per pile of excreted code while this is all being heavily subsidized. This is the future and anyone not enthusiastically onboard is utterly foolish.

jeswin · 2026-03-30T01:25:32 1774833932

My point is exactly that you need safeguards. (I have VMs per project, reduced command availability etc). But those details are orthogonal to this discussion.

However, "telling" has made it better, and generally the model itself has become better. Also, I've never faced a similar issue in Codex.

nottorp · 2026-03-30T07:15:00 1774854900

> sandbox it to the point where it is completely unable to do the things you're trying to stop

Why are permissions for these "agents" on a default allow model anyway?

mr_mitm · 2026-03-30T07:25:04 1774855504

What do you mean? By default, Claude asks for permission for every file read, every edit, every command. It gets exhausting, so many people run it with `--dangerously-skip-permissions`.

dwb · 2026-03-30T07:40:34 1774856434

It does not ask for permission for every file read, only those outside the project and not explicitly allowed. You can bypass project edit permission requests with “allow edits”, no need for “dangerously skip permissions”. Bash commands are harder, but you can allow-list them up to a point.

nottorp · 2026-03-30T07:31:48 1774855908

> so many people run it with `--dangerously-skip-permissions`

It's on the people then, not the "agent". But why doesn't Claude come with a decent allow list, or at least remember what the user allows, so the spam is reduced?

mr_mitm · 2026-03-30T07:37:02 1774856222

You have the option to "always allow command `x.*`", but even then. The more control you hand over to these things, the more powerful and useful (and dangerous) they become. It's a real dilemma and yet to be solved.

biglost · 2026-03-30T01:13:13 1774833193

I use a script wrapper for git in my PATH for Claude, but as you correctly said, I'm not sure Claude will never use a new zsh with a different PATH...