I honestly think this is going to be a big part of what remains when AI is doing everything we currently think of as our work. Legally or morally, some things need a human in the loop.
Am I the only one who is mystified by this whole idea? People aren't CPUs. Good luck getting them to follow the code that you thought you were using to define their roles. On the contrary, what makes any complex system work is flexibility. And yes, if that calls into question the whole regulatory regime some companies (believe they) live under ... well, yes.
Also, would you even want it to? I've worked for companies with very rigorous compliance before. They are dead companies walking in most cases. As soon as their business model requires any significant change, they are toast. This is because these types of rules can't possibly cover all cases, just the ones the managers know about. Innovation requires flexibility and creativity, and rules-based systems are the opposite of that. By their very nature, they introduce the exact situations the rules can't cover.
That does raise the question of what the value is of a "skill" vs a "command". Claude Code supports both, and it's not entirely clear to me when we should use one vs the other - especially if skills work best as, well, commands.
The practical distinction I've found: commands are atomic operations (lint, format, deploy), while skills encode multi-step decision trees ("implement feature X" which might involve reading context, planning, editing multiple files, then validating).
For context window management, skills shine when you need progressive disclosure - load only the metadata initially, then pull in the full instructions when invoked. This matters when you have 20+ capabilities competing for limited context.
That said, the 56% non-invocation rate mentioned elsewhere in this thread suggests the discovery mechanism needs work. Right now "skill as a fancy command" may be the only reliable pattern.
IMO the value and differentiating factor is basically just the ability to organize them cleanly with accompanying scripts and references, which are only loaded on demand. But a skill just by itself (without scripts or references) is essentially just a slash command with metadata.
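To make that concrete, the on-disk difference looks roughly like this (paths and filenames here are illustrative, and the exact frontmatter fields can vary between versions):

    .claude/commands/lint.md            # slash command: a single prompt file, loaded whole
    .claude/skills/release-notes/
        SKILL.md                        # name + description metadata up front, instructions below
        scripts/collect_prs.py          # only read if the skill actually needs it
        references/style.md             # ditto - pulled in on demand, not at startup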
Another value add is that theoretically agents should trigger skills automatically based on context and their current task. In practice, at least in my experience, that is not happening reliably.
That doesn't work very well if your developers are on Windows (and most are). Uneven Git support for symbolic links across platforms is going to end up causing more problems than it solves.
It's all about managing context. The bitter lesson applies over the long haul - and yes, over the long haul, as context windows get larger or go away entirely with different architectures, this sort of thing won't be needed. But we've defined enough skills in the last month or two that if we were to put them all in CLAUDE.md, we wouldn't have any context left for coding. I can only imagine that this will be a temporary standard, but given the current state of the art, it's a helpful one.
I use Claude pretty extensively on a 2.5m loc codebase, and it's pretty decent at just reading the relevant readme docs & docstrings to figure out what's what. Those docs were written for human audiences years (sometimes decades) ago.
I'm very curious to know the size & state of a codebase where skills are beneficial over just having good information hierarchy for your documentation.
Claude can always self-discover its own context. The question is whether it's more efficient to have it grep, ls, and randomly poke around to build a half-baked context, or whether a tailor-made, dynamic context injection can speed that up.
In other words, if you run an identical prompt on a test task that requires discovering deeply how your codebase works, one with the skill and one without, which one performs better, and by how much?
It's not about one with the skill and one without, but one with the skill vs. one with regular old human documentation for the stuff you need to know to work on a repo/project. Or, for an even more accurate comparison, take the skill, don't load it as a skill, and just put it in the repo as plain context.
I think the main conflict in this thread is whether skills are anything more than structuring documentation your repo was lacking anyway, regardless of whether it was written for Claude or for Steve starting from scratch.
Well, the key difference is that one is auto-injected into your context for dynamic lookup, while the other is loaded on demand and is contingent on the LLM discovering it.
That difference alone likely accounts for some not insignificant discrepancies. But without numbers, it's hard to say.
To clarify, when I mentioned the bitter lesson I meant putting effort into organising the "skills" documentation in a very specific way (headlines, descriptions, etc).
Splitting the docs into neat modules is a good idea (for both human readers and current AIs) and will continue to be a good idea for a while at least. Getting pedantic about filenames, documentation schemas and so on is just bikeshedding.
Why not replace the context tokens on the GPU during inference when they are no longer relevant? I.e. some tool reads a 50k-token document, the LLM processes it, and then you just flush those document tokens out of the active context, rebuild the KV caches, and store only a short log entry in the context like "I already did this ... with this result"?
> Context editing automatically clears stale tool calls and results from within the context window when approaching token limits.
> The memory tool enables Claude to store and consult information outside the context window through a file-based system.
But it looks like nobody has it as part of the inference loop yet: I guess it's hard to train (i.e. you need a training set that is a good match for how people actually use context in practice) and it makes inference more complicated. I guess higher-level context management is just easier to implement - and it's one of the things that "GPT wrapper" companies can do, so why bother?
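For what it's worth, the "high-level" version is simple enough to sketch. This just rewrites the message list between model calls; the message shape and the pruning heuristic are assumptions, not any particular SDK's API:

    # Minimal sketch of high-level context management: instead of touching the
    # KV cache, the wrapper rewrites the message list before the next model call.
    def prune_stale_tool_results(messages, keep_last=3, max_chars=500):
        """Replace old, oversized tool results with a one-line stub."""
        tool_results = [m for m in messages if m.get("role") == "tool"]
        stale = {id(m) for m in tool_results[:-keep_last]}  # all but the most recent few
        pruned = []
        for m in messages:
            if id(m) in stale and len(m.get("content", "")) > max_chars:
                pruned.append({
                    "role": "tool",
                    "content": f"[result of {m.get('name', 'tool')} pruned; "
                               f"first 120 chars: {m['content'][:120]}...]",
                })
            else:
                pruned.append(m)
        return pruned

    # In the agent loop, before each new model call:
    # messages = prune_stale_tool_results(messages)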
I don't think so; those things happen when the agent yields control back at the end of its inference call, not during active inference with multiple tool calls in flight. These days an agent can finish a whole task with thousands of tool calls during a single inference call, without yielding control back to whatever called it so it can do some housekeeping.
For agent, read sub-agent. E.g. the contents of your .claude/agents directory. When Claude Code spins up an agent, it provides the sub-agent with a prompt that combines the agent's prompt and information composed by Claude from the outer context, based on what Claude thinks needs to be communicated to the agent. Claude Code can either continue, with the sub-agent running in the background, or wait until it is complete. In either case, by default, Claude Code effectively gets to "check in" on messages from the sub-agent without seeing the whole thing (e.g. tool call results etc.), so only a small proportion of what the agent does will make it into the main agent's context.
So if you want to do this, the current workaround is basically to have a sub-agent carry out the tasks you don't want polluting the main context.
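For reference, an entry in .claude/agents is just a markdown file with some frontmatter and a system prompt, e.g. something like .claude/agents/report-writer.md below (the name, wording, and exact frontmatter fields are illustrative; check the current docs):

    ---
    name: report-writer
    description: Digests large source material, writes a full report to disk, and returns only a short summary.
    ---
    You will be given files or topics to investigate. Write your full findings to
    a file under reports/, then reply to the caller with the report path and a
    summary of at most ten lines. Never paste the full source material back.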
I have lots of workflows that get farmed out to sub-agents that then write reports to disk and produce a summary for the main agent, which will then selectively read parts of the report instead of having to process the full source material or even the whole report.
OK, so you are essentially using sub-agents as summarizing tools for the main agent, something you could implement with specialized tools that wrap independent LLM calls with the prompts of your sub-agents.
That is effectively how sub-agents are implemented at least conceptually, and yes, if you build your own coding agent, you can trivially implement sub-agents by having your coding agent recursively spawn itself.
Claude Code and others have some extras, such as the ability for the main agent to put them in the background, spawn them in parallel, and use tool calls to check on their status (so basic job control), but "poor man's sub-agents" only requires the ability for the coding agent to run an executable, the equivalent of e.g. "claude --print <someprompt>" (the --print option is real, and enables headless use; in practice you'd also want --stream-json, set allowed tools, and specify a conversation id so you can resume the sub-agent's conversation).
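A minimal version of that, using only the --print flag mentioned above (the run_subagent helper and the prompt are made up here, and the flags for streaming output, allowed tools, and session resume differ between CLI versions, so check --help):

    import subprocess

    # "Poor man's sub-agent": shell out to a fresh headless instance of the
    # same CLI and keep only its final text output in the caller's context.
    def run_subagent(prompt: str, timeout: int = 600) -> str:
        result = subprocess.run(
            ["claude", "--print", prompt],
            capture_output=True, text=True, timeout=timeout,
        )
        return result.stdout.strip()

    summary = run_subagent(
        "Read docs/architecture/ and write a report to reports/auth-flow.md. "
        "Reply with a ten-line summary and the report path only."
    )
    print(summary)  # only the summary enters the main agent's context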
And calling it all "summarising" understates it. It is delegation, and a large part of the value of delegation in a software system is abstraction and information hiding. The party that does the delegation does not need to care about all of the inner detail of the delegated task.
The value is not the summary. The value is the work done that the summary describes without unnecessary detail.
How is it different from, or better than, maintaining an index page for your docs? Or a folder full of docs and giving Claude an instruction to `ls` the folder on startup?
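For concreteness, the low-tech alternative is a couple of lines in CLAUDE.md along these lines (wording and paths are just an example):

    Before starting a task, run `ls docs/` and read docs/INDEX.md.
    Each file in docs/ covers one subsystem; read only the ones relevant
    to the task at hand.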
It's hard to tell unless they give some hard data comparing the approaches systematically. This feels like a grift, or, more charitably, like trying to build a presence/market around nothing. But who knows anymore; apparently saying "tell the agent to write its own docs for reference and context continuity" is considered a revelation.
Not sure why you’re being downvoted so much, it’s a valid point.
It’s also related to attention: invoking a skill “now” means the model has all the relevant information fresh in context, so you’ll get much better results.
What I’m doing myself is writing skills that invoke Python scripts that “inject” prompts. This way you can set up multi-turn workflows for e.g. codebase analysis, deep thinking, root cause analysis, etc.
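A minimal sketch of what such a script could look like: the filename, stage names, and prompt wording are all invented here; the only idea taken from above is that whatever the script prints becomes the next instruction in the conversation.

    #!/usr/bin/env python3
    # analyze.py - hypothetical helper a skill might call between turns.
    # The skill tells the model to run this script; its stdout becomes the
    # next set of instructions, which is how the multi-turn workflow advances.
    import sys

    STAGES = {
        "survey": "List the entry points and main modules involved, with file paths. "
                  "Then run: python analyze.py hypothesize",
        "hypothesize": "For each suspect module, state one concrete hypothesis for the bug "
                       "and what evidence would confirm it. Then run: python analyze.py verify",
        "verify": "Check each hypothesis by reading the relevant code and tests. "
                  "Report the root cause with file:line references.",
    }

    if __name__ == "__main__":
        stage = sys.argv[1] if len(sys.argv) > 1 else "survey"
        print(STAGES.get(stage, STAGES["survey"]))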
I'm one of those really odd beasts that feels some sort of loyalty to Microsoft, so I started out on Copilot and was very reluctant to try Claude Code. But as soon as I did, I figured out what the hype was about. It's just able to work over larger code bases and over longer time horizons than Copilot. The last time I tried Copilot, just to compare, I noticed that it would make some number of tool calls (not even involving tokens!) and then decide, "Nah, that's too many. We're just not going to do any work for a while." It was bizarre. And sometimes it would decide that a given bog-standard tool call (like read a file or something) needed to get my permission every. single. time. I couldn't do anything to convince it otherwise. I eventually gave up. And since then, we've built all our LLM support infrastructure around Claude Code, so it would be painful to go back to anything else.
I don't really like how Claude Code kind of obscures the actual code from you - I guess that's why people keep putting out articles about how certain programmers have absolutely no idea what's going on inside the code.
It's truly more capable, but still not capable enough that I'm comfortable blindly trusting the output.
That's the big difference for me. I use Github Copilot because I want to see the output and work with it. For people who are fine just shooting a prompt out and getting code back, I'm sure Claude Code is better.
> Claude Code kind of obscures the actual code from you
Not sure what you mean; I have vscode open and make code changes in between claude doing its thing. I have had it revert my changes once, which was amusing. Not sure why it did that. I've also seen it make the same mistake twice after being told not to.
This is not a problem when you assume the role of an architect and a reviewer and leave the entirety of the coding to Claude Code. You'll pretty much live in the Git Changes view of your favorite IDE leaving feedback for Claude Code and staging what it managed to get right so far. I guess there is a leap of faith to make because if you don't go all the way and you try to code together with Claude Code, it will mess with your stuff and undo a lot of it and it's just frustrating and not optimal. But if you remove yourself from the loop completely, then indeed you'll have no idea what's going on. There still needs to be a human in the loop, and in the right part of it, otherwise you're just vibe coding garbage.
This is an N of 1, of course, but I can relate to the other folks who've been expressing their frustration with the state of Claude over the last couple weeks. Maybe it's just that I have higher expectations, but... I dunno, it really seems like Claude Code is just a lot WORSE right now than it was a couple weeks ago. It has constant bugs in the app itself, I have to babysit it a lot tighter, and it just seems ... dumber somehow. For instance, at the moment, it's literally trying to tell me, "No, it's fine that we've got 500 failing tests on our feature branch, because those same tests are passing in development."