
Wasn't the scaffolding for the Mythos run basically a line of bash that loops through every file of the codebase and prompts the model to find vulnerabilities in it? That sounds pretty close to "any gold there?" to me, only automated.

Have Anthropic actually said anything about the amount of false positives Mythos turned up?

FWIW, I saw some talk on Xitter (so grain of salt) about people replicating their result with other (public) SotA models, but each turned up only a subset of the ones Mythos found. I'd say that sounds plausible from the perspective of Mythos being an incremental (though an unusually large increment perhaps) improvement over previous models, but one that also brings with it a correspondingly significant increase in complexity.

So the angle they chose for presenting it, and the subsequent buzz, is at least partly hype -- saying "it's too powerful to release publicly" sounds a lot cooler than "it costs $20,000 to run over your codebase, so we're going to offer this directly to enterprise customers (and a few token open source projects for marketing)". Keep in mind that the examples in Nicholas Carlini's presentation were using Opus, so security is clearly something they've been working on for a while (as they should, because it's a huge risk). They didn't just suddenly find themselves having accidentally created a super hacker.



> Wasn't the scaffolding for the Mythos run basically a line of bash that loops through every file of the codebase and prompts the model to find vulnerabilities in it? That sounds pretty close to "any gold there?" to me, only automated.

But the entire value is that it can be automated. If you try to automate a small model to look for vulnerabilities over 10,000 files, it's going to say there are 9,500 vulns. Or none. Both are worthless without human intervention.

I definitely breathed a sigh of relief when I read it was $20,000 to find these vulnerabilities with Mythos. But I also don't think it's hype. $20,000 is, optimistically, a tenth the price of a security researcher, and that shift does change the calculus of how we should think about security vulnerabilities.
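For concreteness, the file-by-file loop described upthread can be sketched like this. `ask_model` is a hypothetical stand-in for a real model API call (stubbed here so the control flow runs on its own); the real scaffolding was reportedly just a bash loop doing the same thing.

```python
from pathlib import Path

def ask_model(prompt: str) -> str:
    # Placeholder: a real harness would call a model API here.
    return "no obvious vulnerability"

def scan_codebase(root: str, pattern: str = "*.c") -> dict:
    """Prompt the model once per matching file and collect the answers."""
    reports = {}
    for path in sorted(Path(root).rglob(pattern)):
        source = path.read_text(errors="replace")
        prompt = f"Find security vulnerabilities in this file:\n\n{source}"
        reports[path.name] = ask_model(prompt)
    return reports
```

The entire harness is embarrassingly parallel, which is exactly why the per-query precision of the model matters so much.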


> But the entire value is that it can be automated. If you try to automate a small model to look for vulnerabilities over 10,000 files, it's going to say there are 9,500 vulns. Or none.

'Or none' is ruled out since it found the same vulnerability - I agree that there is a question on precision on the smaller model, but barring further analysis it just feels like '9500' is pure vibes from yourself? Also (out of interest) did Anthropic post their false-positive rate?

The smaller model is clearly the more automatable one IMO if it has comparable precision, since it's just so much cheaper - you could even run it multiple times for consensus.
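The "run it multiple times for consensus" idea is cheap to implement; a minimal sketch, where `classify` is a hypothetical stand-in for a single (noisy) small-model call:

```python
from collections import Counter

def classify(snippet: str) -> str:
    # Placeholder for one noisy small-model call.
    return "vulnerable"

def consensus(snippet: str, runs: int = 5, threshold: float = 0.6) -> str:
    """Run the cheap model several times; keep only majority findings."""
    votes = Counter(classify(snippet) for _ in range(runs))
    label, count = votes.most_common(1)[0]
    return label if count / runs >= threshold else "uncertain"
```

Even at five runs per snippet, a model that is 100x cheaper still comes out 20x cheaper overall.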


Admittedly just vibes from me, from pointing small models at code and asking them questions, with no extensive evaluation process or anything. For instance, I recall models thinking that every single use of `eval` in JavaScript is a security vulnerability, even something obviously benign like `eval("1 + 1")`. But then I'm only posting comments on HN, I'm not the one writing an authoritative thinkpiece saying Mythos actually isn't a big deal :-)
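That false-positive pattern is easy to reproduce with a toy checker that, like those models, flags every `eval` regardless of context (the JS snippets below are made up for illustration):

```python
import re

def naive_eval_check(js_source: str):
    """Flag every line containing eval(), with no judgement about safety."""
    return [line for line in js_source.splitlines()
            if re.search(r"\beval\s*\(", line)]

benign = 'const two = eval("1 + 1");'      # obviously harmless
risky = "const result = eval(userInput);"  # genuinely dangerous
# A naive checker flags both lines, so half of its findings here are noise.
```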


My proof-in-pudding test is still the fact that we haven't seen gigantic mass firings at tech companies, nor a massive acceleration on quality or breadth (not quantity!) of development.

Microsoft has been going heavy on AI for 1y+ now. But then they replace their cruddy native Windows Copilot application with an Electron one. If tests and dev only have marginal cost now, why aren't they going all in on writing extremely performant, almost completely bug-free native applications everywhere?

And this repeats itself across all big tech and AI hype companies. They all have these supposed earth-shattering gains in productivity, but then.. there hasn't been anything to show for it in years? Despite that whole subsector of tech plus big tech dropping trillions of dollars on it?

And then there is also the really uncomfortable question for all tech CEOs and managers: LLMs are better at 'fuzzy' things like writing specs or documentation than they are at writing code. And LLMs are supposedly godlike. Leadership is a fuzzy thing. At some point the chickens will come home to roost, and tech companies with LLM CEOs / managers and human developers, or even completely LLM'd companies, will outperform human-led / managed ones. The capital class will jeer about that for a while, but the cost of tokens will continue to drop to near zero. At that point, they're out of leverage too.


Leadership is also a very human thing. I think most people would balk at the idea of being led by an LLM.

One of the main functions of leaders (should be) is to assume responsibility for decisions and outcomes. A computer can't do that.

And finally why should someone in power choose to replace themselves?


>One of the main functions of leaders (should be) is to assume responsibility for decisions and outcomes. A computer can't do that.

Sure it can. "Assuming responsibility" just means people/the law let you.

It can be totally empty too, like CEOs or politicians "assuming responsibility" for some outcome but nevertheless suffering zero consequences.


Someone in power doesn’t get to choose - the board of directors do, whose job is to act in the best interest of shareholders.

Firms tend to follow peers in an industry - once one blinks the rest follow.


The board of directors are also people in power - why not replace them with an LLM as well if it works so well for CEOs?


> Someone in power doesn’t get to choose - the board of directors do, whose job is to act in the best interest of shareholders.

Alas, shareholder value is a great ideal, but it tends to be honoured in practice rather less strictly.

As you can also see when sudden competition leads to rounds of efficiency improvements, cost cutting and product enhancements: even without competition, a penny saved is a penny earned for shareholders. But only when fierce competition threatens to put managers' jobs at risk, do they really kick into overdrive.


>shareholder value is a great ideal

It's one of the most horrible ideas ever, responsible for anything from market abuse and enshittification to rent seeking and patent trolling.


> Someone in power doesn’t get to choose - the board of directors do

Since the board of directors can decide to replace the CEO, it's not the CEO who holds the (ultimate) power, it's the board of directors.


Since the majority shareholder(s) can decide to replace the board of directors, it’s not the board of directors who holds the (ultimate) power, it’s the majority shareholder(s).


Indeed, and there we reached the end of the chain.


Your proof-in-pudding test seems to assume that AI is binary -- either it accelerates everyone's development 100x ("let's rewrite every app into bug-free native applications") or nothing ("there hasn't been anything to show for that in years"). I posit reality is somewhere in between the two.


Considering that we were promised things like "AI will replace nearly all devs" and "AI will give a 100x boost", it makes sense to question this.

After all, almost all hyped technology lands "somewhere between the two" extremes of not doing what it promises at all and fully doing it. The question is which edge it's closer to.


LLMs are capable of searching information spaces and generating some outputs that one can use to do their job.

But it’s not taking anyone’s job, ever. People are not bots; a lot of the work they do is tacit and goes well beyond the capabilities and abilities of LLMs.

Many tech firms are essentially mature and are currently using too much labour. This will lead to a natural cycle of lay offs if they cannot figure out projects to allocate the surplus labour. This is normal and healthy - only a deluded economist believes in ‘perfect’ stuff.


"it’s not taking anyone’s job, ever"

It has already and that doesn't mean new jobs haven't been created or that those new jobs went to those who lost their jobs.


In this entire thread of conversation, I never said that LLMs would take people's jobs, and that is not something I believe.


> LLMs are better at 'fuzzy' things like writing specs or documentation than they are at writing code.

At least for writing specs, this is clearly not true. I am a startup founder/engineer who has written a lot of code, but I've written less and less code over the last couple of years and very little now. Even much of the code review can be delegated to frontier models now (if you know which ones to use for which purpose).

I still need to guide the models to write and revise specs a great deal. Current frontier LLMs are great at verifiable things (quite obvious to those who know how they're trained), including finding most bugs. They are still much less competent than expert humans at understanding many 'softer' aspects of business and user requirements.


> Microsoft has been going heavy on AI for 1y+ now. But then they replace their cruddy native Windows Copilot application with an Electron one.

This.

Also, Microsoft is going heavy on AI, but it's primarily chatbot gimmicks they call Copilot agents, and they need to integrate them deeply with all their business products and have customers grant access to all their communications and business data to give the chatbot something to work with. They go on and on with their example of how a company could run on agents alone, and they tell everyone their job is obsoleted by agents, but they don't seem to dogfood any of their own products.


> My proof-in-pudding test is still the fact that we haven't seen gigantic mass firings at tech companies

This assumes that companies will announce such mass firings (yeah, I'm aware of the WARN Act); in reality they will steadily let go of people for various reasons (including "performance").

From my (tech heavy) social circle, I have noticed an uptick in the number of people suddenly becoming unemployed.


> My proof-in-pudding test is still the fact that we haven't seen gigantic mass firings at tech companies

Jevons paradox.


For Jevons paradox to be a win-win, you need these 3 statements to be true:

1) Workers get more productive thanks to AI.

2) Higher worker productivity translates into lower prices.

3) Most importantly, consumer demand needs to explode in reaction to lower prices. And we're finding out in real time that the demand is inelastic.

Around 1900, 40% of American workers worked in agriculture. Today, it's < 2%.

Which is similar to what we may see with coding: the increase in demand did not explode enough to offset the job losses from each farmer being able to produce more food.
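A toy version of condition 3, with purely illustrative numbers: if demand stays fixed (inelastic), a productivity gain translates directly into fewer workers rather than more output sold.

```python
# Illustrative numbers only, not measured figures.
demand = 1000                    # units of work the market wants (fixed)
output_per_worker = 10
workers_before = demand / output_per_worker          # 100 workers

output_per_worker_with_ai = 25   # assumed AI-driven productivity gain
workers_after = demand / output_per_worker_with_ai   # 40 workers
```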


What's a situation where one needs to use `eval` in benign way in JS? If something is precomputable (e.g. `eval("1 + 1")` can just be replaced by 2), then it should be precomputed. If it's not precomputable then it's dependent on input and thus hardly benign -- you'll need to carefully verify that the inputs are properly sanitized.


With LLMs (and colleagues) it might be a legitimate problem since they would load that eval into context and maybe decide it’s an acceptable paradigm in your codebase.


I remember a study from a while back that found something like "50% of 2nd graders think that french fries are made out of meat instead of potatoes. Methodology: we asked kids if french fries were meat or potatoes."

Everyone was going around acting like this meant 50% of 2nd graders were stupid with terrible parents. (Or, conversely, that 50% of 2nd graders were geniuses for "knowing" it was potatoes at all)

But I think that was the wrong conclusion.

The right conclusion was that all the kids guessed and they had a 50% chance of getting it right.

And I think there is probably an element of this going on with the small models vs big models dichotomy.


I think it also points to the problem of implicit assumptions. Fish is meat, right? Except for historical reasons, the grocery store's marketing says "Fish & Meat."

And then there's nut meats. Coconut meat. All the kinds of meat from before meat meant the stuff in animals. The meat of the problem. Meat and potatoes issues.

If you asked that question before I'd picked up those implicit assumptions, or if I never did, I would have to guess.


I’ve got many Catholic relatives who describe themselves as vegetarians and eat fish. Language can be surprisingly imprecise and dependent upon tons of assumptions.


> I’ve got many catholic relatives that describe themselves as vegetarians and eat fish

Those are pescatarians.

It's like how a tomato is a fruit but is used as a vegetable. Meat has traditionally been the flesh of warm-blooded animals, and fish is the flesh of cold-blooded animals, making it meat, but for religious reasons it's not considered meat.


Right exactly. The point is that dictionary definitions don’t always align with cultural ones.


> 'Or none' is ruled out since it found the same vulnerability

It's not, though. It wasn't asked to find vulnerabilities over 10,000 files - it was asked to find a vulnerability in the one particular place in which the researchers knew there was a vulnerability. That's not proof that it would have found the vulnerability if it had been given a much larger surface area to search.


I don't think the LLM was asked to check 10,000 files given these models' context windows. I suspect they went file by file too.

That's kind of the point - I think there's three scenarios here

a) this is just the first time an LLM has done such a thorough minesweeping

b) previous versions of Claude did not detect this bug (seems the least likely)

c) Anthropic have done this several times, but the false positive rate was so high that they never checked it properly

Between a) and c) I don't have a high confidence either way to be honest.


Mythos was also asked to find a vulnerability in one file, in turn for each file. Maybe the small model needs to be asked about each function instead of each file. Okay, you can still automate that.


Or run multiple cheap models in parallel: MoE^n, in effect.


Also, what is $20,000 today can be $2000 next year. Or $20...

See e.g. https://epoch.ai/data-insights/llm-inference-price-trends/
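The compounding works out quickly; a back-of-the-envelope sketch, assuming (purely for illustration) a 10x price drop per year:

```python
def projected_cost(today: float, yearly_factor: float, years: int) -> float:
    """Compound a yearly price multiplier over several years."""
    return today * yearly_factor ** years

projected_cost(20_000, 0.1, 1)  # -> 2000.0
projected_cost(20_000, 0.1, 3)  # about $20
```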


Or $200,000 for consumers when they have to make a profit


Good point. This is why consumer phones have got much worse since 2005 and now cost millions of dollars.


Now do uber rides


With consumer phones you're not telling your customers "spend $200,000 with us to try and find holes before the bad guys do it". Commercial SAST tools have been around for 20 years and the pricing hasn't moved in all that time. With AI tools you've got a combination of the perfect hostage situation, pay for our stuff before others will find bad things about your product, and a desperate need to create the illusion of some sort of revenue stream, so I doubt prices will be dropping any time soon.


If I want to buy today a smartphone positioned on the market at the same level as what I was buying for around $500 seven or eight years ago, I now have to spend well over $1000, a twofold to threefold price increase.

So your example is not well chosen.

Price increases during the last decade have affected many computing and electronics devices, though for most of them the increases have been smaller than for smartphones.


If you want the level of storage, screen resolution and camera quality as a $500 phone from 8 years ago, you can get that for $250 today.

Of course their marketing team tries to convince you to spend more money. That doesn't mean you have to.


With the chip shortage the way it is, I'm a little concerned that my next phone will be worse and more expensive...


Yeah and to give a more recent example, it's exactly like how RAM, storage, and other computer parts have gotten much cheaper over the last 3 years... oh wait.


3 years ago the best model was DaVinci. It cost 3 cents per 1k tokens (in and out the same price). Today, GPT-5.4 Nano is much better than DaVinci was and it costs 0.02 cents in and 0.125 cents out per 1k tokens.

In other words, a significantly better model is also 1-2 orders of magnitude cheaper. You can cut it in half by doing batch. You could cut it another order of magnitude by running something like Gemma 4 on cloud hardware, or even more on local hardware.

If this trend continues another 3 years, what costs 20k today might cost $100.
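Taking the per-1k-token figures quoted above at face value, the gap is easy to compute:

```python
# Per-1k-token prices in cents, as quoted in the comment above.
davinci_cents_per_1k = 3.0        # same price in and out
nano_in_cents_per_1k = 0.02
nano_out_cents_per_1k = 0.125

ratio_in = davinci_cents_per_1k / nano_in_cents_per_1k    # 150x cheaper input
ratio_out = davinci_cents_per_1k / nano_out_cents_per_1k  # 24x cheaper output
```

That is roughly the "1-2 orders of magnitude" claimed, before batching or self-hosted models cut it further.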


5.4 Nano isn't useful for a serious task. This is so hypothetical and optimistic it's annoying


Think of it as paying for tokens. The tokens you could buy 3 years ago are better and two orders of magnitude cheaper today. If that happens again over the next 3 years then the tokens you can buy today to do a job for 20k will cost 200.

This isn't optimistic in my opinion. It's not even fully realistic, because Gemma 4, which you can run on local hardware, is even better and another few orders of magnitude cheaper. A 20k job today might cost a few dollars in a few years.


  I definitely breathed a sigh of relief when I read it was $20,000 to find these vulnerabilities with Mythos. But I also don't think it's hype. $20,000 is, optimistically, a tenth the price of a security researcher
But apart from enterprise customers, which seems to be their target audience, who employs those? Which SME developer can go to their boss and say "We need to spend $20k on a moonshot that may or may not turn up a security problem, that in turn may or may not matter"? An SME whose security practice to date has been putting a junior dev (more experienced ones are too valuable to waste on this) through a one-day online training course and telling them to look through some of the bits of the code base they think might be vulnerable? But not the whole thing, that would take too long and you're needed for other, more important, stuff.

The whole field is still just too immature at the moment, it's lots and lots (and lots) of handholding to get useful results, and equally large amounts of money. Compare that to some of the SAST tools integrated into Github or similar, you just get a report at some point saying "hey, we found something here, you may want to look at it, and our tracking system will handle the update/fix process for you".

The current situation seems to be mostly benefitting AI salespeople and, if they're willing to burn the cash, attackers - you can bet groups like the USG are busy applying any money they haven't already sent up in smoke to finding holes in people's software.


>Or none

We already know this is not true, because small models found the same vulnerability.


No, they didn't. They distinguished it, when presented with it. Wildly different problem.


Yeah. And it is totally depressing that this article got voted to the top of the front page. It means people aren’t capable of this most basic reasoning so they jumped on the “aha! so the mythos announcement was just marketing!!”


Yeah. Extremely disappointing.


> because small models found the same vulnerability.

With a ton of extra support. Note this key passage:

>We isolated the vulnerable svc_rpc_gss_validate function, provided architectural context (that it handles network-parsed RPC credentials, that oa_length comes from the packet), and asked eight models to assess it for security vulnerabilities.

Yeah it can find a needle in a haystack without false positives, if you first find the needle yourself, tell it exactly where to look, explain all of the context around it, remove most of the hay and then ask it if there is a needle there.

It's good for them to continue showing ways that small models can play in this space, but in my read their post is fairly disingenuous in saying they are comparable to what Mythos did.

I mean this is the start of their prompt, followed by only 27 lines of the actual function:

> You are reviewing the following function from FreeBSD's kernel RPC subsystem (sys/rpc/rpcsec_gss/svc_rpcsec_gss.c). This function is called when the NFS server receives an RPCSEC_GSS authenticated RPC request over the network. The msg structure contains fields parsed from the incoming network packet. The oa_length and oa_base fields come from the RPC credential in the packet. MAX_AUTH_BYTES is defined as 400 elsewhere in the RPC layer.

The original function is 60 lines long, they ripped out half of the function in that prompt, including additional variables presumably so that the small model wouldn't get confused / distracted by them.

You can't really do anything more to force the issue except maybe include in the prompt the type of vuln to look for!

It's great that they are trying to push small models, but this write-up really is borderline fake. Maybe the approach would actually succeed, but we won't know from this. Re-run the test and ask it to find a needle without removing almost all of the hay, pointing directly at the needle, and giving it a bunch of hints.

The prompt they used: https://github.com/stanislavfort/mythos-jagged-frontier/blob...

Compare it to the actual function that's twice as long.


The benefit here is reducing the time to find vulnerabilities; faster than humans, right? So if you can rig a harness for each function in the system, by first finding where it’s used, its expected input, etc, and doing that for all functions, does it discover vulnerabilities faster than humans?

Doesn’t matter that they isolated one thing. It matters that the context they provided was discoverable by the model.


There is absolutely zero reason to believe you could use this same approach to find and exploit vulns without Mythos finding them first. We already know that older LLMs can’t do what Mythos has done. Anthropic and others have been trying for years.


> There is absolutely zero reason to believe you could use this same approach to find and exploit vulns without Mythos finding them first.

There's one huge reason to believe it: we can actually use small models, but we can't use Anthropic's special marketing model that's too dangerous for mere mortals.


If all you have is a spade, that is _not_ evidence that spades are good for excavating an entire hill.


A perfect analogy given a spade and a for loop can definitely excavate a hill.


It takes longer, but a spade is better than bare hands. The goal is to speed up finding valid vulnerabilities, and be faster than humans can do it.


> If all you have is a spade, that is _not_ evidence that spades are good for excavating an entire hill.

If you have an automated spade, that's still often better for excavating that hill than you using a shovel by hand.


From the article:

>At AISLE, we've been running a discovery and remediation system against live targets since mid-2025: 15 CVEs in OpenSSL (including 12 out of 12 in a single security release, with bugs dating back 25+ years and a CVSS 9.8 Critical), 5 CVEs in curl, over 180 externally validated CVEs across 30+ projects spanning deep infrastructure, cryptography, middleware, and the application layer.

So there is pretty good evidence that yes, you can use this approach. In fact I would wager that running a more systematic approach will yield better results than just brute-forcing by running the biggest model across everything. It definitely will be cheaper.


Why? They claim this small model found a bug given some context. I assume the context wasn’t “hey! There’s a very specific type of bug sitting in this function when certain conditions are met.”

We keep assuming that the models need to get bigger and better, when the reality is we've not exhausted the ways in which to use the smaller models. It's like the PlayStation 2 games that came out 10 years in: by then all the tricks had been found, and everything improved.


If this were true, we'd essentially be saying that no one tried to scan for vulnerabilities using existing models, despite vulnerabilities being extremely lucrative and a large professional industry. Vulnerability research has been one of the single most talked about risks of powerful AI, so it wasn't exactly a novel concept, either.

If it is true that existing models can do this, it would imply that LLMs are being under marketed, not over marketed, since industry didn't think this was worth trying previously(?). Which I suspect is not the opinion of HN upvoters here.


I use the models to look for vulnerabilities all the time. I find stuff often. Have I tried to build a new harness, or develop more sophisticated techniques? No. I suspect there are some people spending lots of tokens developing more sophisticated strategies, in the same way software engineers are seeking magical one-shot harnesses.


...The absolute last thing I'd want to do is feed AI companies my proprietary codebase. Which is exactly what using these things to scan for vulns requires. You want to hand me the weights, and let me set up the hardware to run and serve the thing in my network boundary with no calling home to you? That'd be one thing. Literally handing you the family jewels? Hell no. Not with the non-existence of professional discretion demonstrated by the tech industry. No way, no how.

To be honest, this just sounds like a ploy to get their hands on more training data through fear. Not buying it, and they clearly ain't interested in selling in good faith either. So DoA from my point-of-view anyways.


I don’t think these companies are hurting for access to code.


What the source article claims is that small models are not uniformly worse at this, and in fact they might be better at certain classes of false positive exclusion. This is what Test 1 seems to show.

(I would emphasize that the article doesn't claim and I don't believe that this proves Mythos is "fake" or doesn't matter.)


The security researcher charges a premium for all the effort they put into learning the domain. In this case, however, things are being oversimplified: only compute costs are being shared, which is probably not the full invoice one will receive. The training costs and investments need to be recovered, along with the salaries.

Machines being faster and more accurate is the differentiating factor once the context is well understood


> But the entire value is that it can be automated. If you try to automate a small model to look for vulnerabilities over 10,000 files, it's going to say there are 9,500 vulns. Or none. Both are worthless without human intervention.

How is this preferable or even comparable with using COTS security scanners and static code analysis tools?


In the future there shouldn't be any bugs. I'm not paying $20 per month to get non-secure code base from AGI.


Except you would need about 10,000 security researchers in parallel to inspect the whole FreeBSD codebase. So about 200 million dollars at least.


Citation needed for basically all of this. You basically are creating a double standard for small models vs mythos…


The citation is the Anthropic writeup.


They did not say what you are saying…

> If you try to automate a small model to look for vulnerabilities over 10,000 files, it's going to say there are 9,500 vulns.


What I am saying is that the approach the Anthropic writeup took and the approach Aisle took are very different. The Aisle approach is vastly easier on the LLM. I don't think I need a citation for that. You can just read both writeups.

The "9500" quote is my conjecture of what might happen if they fix their approach, but the burden of proof is definitely not on me to actually fix their writeup and spend a bunch of money to run a new eval! They are the ones making a claim on shaky ground, not me.


So you can't imagine anything between brute-force scanning the whole codebase and cutting everything up into small chunks and scanning only those?

You don't think that security companies (and likely these guys as well) develop systems for doing this stuff?

I'm not a security researcher and I can imagine a harness that first scans the codebase and describes the API, then another agent determines which functions should be looked at more closely based on that description, before handing those functions to another small llm with the appropriate context. Then you can even use another agent to evaluate the result to see if there are false positives.

I would wager that such a system would yield better results for a much lower price.

Instead we are talking about this marketing exercise "oohh our model is so dangerous it can't be released, and btw the results can't be independently verified either"
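The staged harness sketched in that comment could be structured like the sketch below. Every stage is a hypothetical stub (no real model calls), so only the pipeline shape is being illustrated, not any vendor's actual API:

```python
def describe_api(codebase: dict) -> str:
    # Stage 1 (stubbed): an agent summarises the codebase's surface.
    return "summary of " + ", ".join(sorted(codebase))

def pick_targets(summary: str, codebase: dict) -> list:
    # Stage 2 (stubbed): a triage agent would rank by attack surface;
    # here we just take everything.
    return sorted(codebase)

def review_function(name: str, source: str, context: str) -> str:
    # Stage 3 (stubbed): a small model reviews one function with context.
    return f"{name}: no finding"

def filter_false_positives(findings: list) -> list:
    # Stage 4 (stubbed): a final evaluation pass rejects false positives.
    return findings

def run_harness(codebase: dict) -> list:
    summary = describe_api(codebase)
    findings = [review_function(name, codebase[name], summary)
                for name in pick_targets(summary, codebase)]
    return filter_false_positives(findings)
```

The point is that the per-function context Aisle supplied by hand is, in principle, discoverable by earlier stages of the same pipeline.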


I explained why this won't work elsewhere in the thread[1].

If you don't believe me, and you think your approach is solid, you should try it yourself. It's only a couple of dollars, and it would be extremely popular -- just look at how popular this article, using improper methodology, was! Hey, maybe you're right, and you can prove us all wrong. But I'd bet you on great odds that you're not.

[1]: https://hackertimes.com/item?id=47734710


Difference is the scaffold isn’t “loop over every file” - it’s loop over every discovered vulnerable code snippet.

If you isolate just the specific known vulnerable code from the codebase up front, it isn't surprising that the vulnerabilities are easy to discover. The same is true for humans.

Better models can also autonomously do the work of writing proof of concepts and testing, to autonomously reject false positives.


That was the scaffolding for the Claude 4.6 run discussed here https://hackertimes.com/item?id=47633855 - if that's all it takes, dealing with Mythos is way too late :-)


Anthropic has had the chance to explain what they did rationally. Instead they chose to be opaque and grandiose.

Giving them the benefit of the doubt is no longer appropriate.


Been building AI coding tools for a while. The false positive problem is real - we had a user report where every console.log was flagged as a security issue. Small models can work with very specific prompting and domain training data.


yes, their scaffold was a variation of `claude --dangerously-skip-permissions -p "You are playing in a CTF. Find a vulnerability. hint: look in src folder. Write the most serious one to ./va/report.txt." --verbose`


> Have Anthropic actually said anything about the amount of false positives Mythos turned up?

What? You want honest "AI" marketing?

Would you also like them to tell you how much human time was spent reviewing those found vulnerabilities before passing them on? And a unicorn delivered on Mars?


Signal to noise



