Hacker Times | hyperpape's comments

Concretely, it has to decide whether it is in a circumstance where that skill is useful, pull the instructions into the context and follow them.

Yep, and as with any other instructions, it can sometimes not pull the skill even if the trigger conditions are there.

> we must assume that the best AI models (especially ones focusing solely in the medical field) would largely beat large majority of humans (aka doctors), if we already have this assumption for software engineers, we should have it for this field as well,

This is a pretty wild leap. Code has a lot of hooks for hill-climbing during post-training: you can literally set up arbitrary scenarios and give the bot more or less real feedback (actual programs, actual tests, actual compiler errors).
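To make that concrete, here's a toy sketch of the kind of reward signal code gives you essentially for free (my own illustration, not anyone's actual training pipeline; `code_reward` and its arguments are made up for the example):

  import os
  import subprocess
  import tempfile

  def code_reward(candidate_source: str, test_source: str) -> float:
      # Toy reward for RL-style post-training on code: write the model's
      # candidate and a test file to disk, run pytest, and return 1.0 if
      # the tests pass, 0.0 otherwise. The interpreter and test suite act
      # as a free, automatic grader.
      with tempfile.TemporaryDirectory() as tmp:
          with open(os.path.join(tmp, "candidate.py"), "w") as f:
              f.write(candidate_source)
          with open(os.path.join(tmp, "test_candidate.py"), "w") as f:
              f.write(test_source)
          result = subprocess.run(["python", "-m", "pytest", "-q", tmp],
                                  capture_output=True, text=True)
          return 1.0 if result.returncode == 0 else 0.0

  # There is no equivalent one-liner for "was this differential diagnosis
  # right?" -- the ground truth arrives days later, if ever.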

It's not impossible we'll get a training regime that does the "same thing" for medicine that we're doing for code, but I don't know that we've envisioned what it looks like.


Code is pretty much the perfect use case for LLMs… text-based, very pattern-oriented, extremely limited complexity compared to biological systems, etc.

I suspect even prose is largely considered acceptable in professional uses because we haven’t developed a sensitivity to the artifice, and we probably won’t catch up to the LLMs in that arms race for a bit. However, we always manage to develop a distaste for cheap imitations and relegate them to somewhere between the ‘utilitarian ick’ and ‘trashy guilty pleasure’ bins of our cultures, and I predict this will be the same. The cultural response is already bending in that direction, and AI writing in the wild -- the only part that culturally matters -- sounds the same to me as it did a year and a half ago. I think they’re prairie dogging, but when(/if) they drop that bomb is entirely a matter of product development. You can’t un-drop a bomb and it will take a long time to regain status as a serious tool once society deems it gauche.

The assumption that LLMs figuring out coding means they can figure out anything is a classic case of Engineer’s Disease. Unfortunately, this hubris seems damn near invisible to folks in the tech industry these days.


And even with code, the closer you come to the physical world, the worse LLMs fare.

Claude can’t really write OpenSCAD, and when I was debugging some map projection code last week it struggled a lot more than usual.


Until Anthropic hires or steals code from acquired companies and trains on it.

I think that might help a little, but it’s not a solution. When you’re figuring out some new way to combine code instructions to perform novel coding tasks, you’re just finding new configurations of existing patterns to get results you can easily test. The world outside of computers is infinitely more complex, random, and novel.

Emergency medicine is the coding of medicine. Fast feedback loop, requires broad rather than deep judgement, concrete next steps.

The AI coding improvement should be partially transferable to other disciplines without recreating the training environment that made it possible in the first place. The model itself has learned what correct solutions "feel like", and the training process and meta-knowledge must have improved a huge amount.


I would argue that the ED is the least similar to code. You have the most unknowns, unreliable data and history, non-deterministic options, and time constraints.

ER staff are frequently making inferences based on a variety of things: the weather, what the pt is wearing, what smells are present, and a whole lot of other intangibles. Frequently the patients are just outright lying to the doctor. An AI will not pick up on any of that.


> An AI will not pick up on any of that.

It will if it trains on data like that. It's all about the training data.


Unfortunately the training data is absolute garbage.

Diagnostic standards in medicine (at least in emergency medicine, but I think in other specialties too) are largely a joke -- ultimately it's often either autopsy or "expert consensus."

We get to bill more for more serious diagnoses. The number of patients I see with a "stroke" or "heart attack" diagnosis that clearly had no such thing is truly wild.

We can be sued for tens of millions of dollars for missing a serious diagnosis, even if we know an alternative explanation is more likely.

If AI is able to beat an average doctor, it will be due to alleviating perverse incentives. But I can't imagine where we could get training data that would let it be any less of a fountain of garbage than many doctors.

Without a large amount of good training data, how could AI possibly be good at doctoring IRL?


You just get 1M doctors to wear body cams for a year. Now you have a model that has thousands of times your experience with patients, encyclopedic knowledge of every ailment including ones that never present in your geography, has read all the latest papers, etc.

I don't understand how you think this doesn't win vs a human doctor.


This wouldn't solve the problem of diagnostic standards. Let's say you are a pediatrician and want to predict which kids with bronchiolitis will develop respiratory failure and need the ICU versus the ones who can go home. How do you determine from the body cams which kids had bronchiolitis in the first place? Bronchiolitis is a clinical diagnosis with symptoms that overlap with other respiratory illnesses such as asthma, bacterial pneumonia, croup, foreign body ingestion, etc.

You would have footage of the doctors diagnosing them. I don't understand what you're asking. The body cams have microphones too, in case that wasn't clear.

In healthcare, HIPAA or its GDPR equivalent would block this. Let's be realistic in our discussion; this is not the same as Google buying up a library's worth of books, then scanning and destroying them.

There are other countries, and the patients in them all have similar data.

Other countries actually don't necessarily have a similar mix of ailments, median patient appearance, style of communication, or even recommended course of action, and most of the ones with more sophisticated medical care also have strict medical privacy laws. If you're genuinely unaware of this, I'm not sure you're in a position to be making "one year with a camera, how hard can it be" arguments...

(Where AI is likely to actually excel in medicine is parsing datasets that are much easier to do context free number crunching on than ER rooms, some of which physicians don't even have access to ...)


I think you're being silly if you think the amount of money at stake here, not to mention the health of billions of people, is going to be stymied by privacy laws.

Similar data?!

We have wildly heterogeneous data just within the US!

And again, how exactly is this interface going to work? How does the AI determine how hard to press on an abdomen, and where, and how does it press there once it has that information?


How is training on bad data going to give you better results than the current system?

What kind of embedding helps the AI learn to do a physical exam?

Not to mention patient privacy: I can't even take a still photo of a patient in my current system (even with a hospital-owned camera).


The user will be adversarial and will probably learn new tricks to fool the machine; this is not solvable (only) via training data.

We have that expression: “garbage in, garbage out.”

My sense is that doctors and AI would be doing a lot better if they were just doing medicine, not being a contact surface for failures of housing, mental health and addiction services, and social systems. Drug seeking and the rest should be non-issues, but drug seekers are informed and adaptive adversaries.


To give this more credit than it perhaps deserves: training aside, getting the situational data into the context is a more significant problem here.

Pt's chart is complex/wrong? Gotta ingest that into context.

Chart contains images/scanned and not OCR'd text? Gotta do an image recognition pass.

Diagnosis needs to know what the pt's wearing (e.g. a radiation badge)? Gotta do an image recognition pass.

Diagnosis needs to know what the weather's like? Internet API access of some kind. Hope the WAN/API are all working! If they're not, do you fail open or closed?

Patient might be lying? Gotta do video/audio analysis to assess that likelihood--oh, and train a model that fully solves one of the holy grails of computer vision/audio analysis reliably and with a super low false-positive rate before you do. And if it guesses wrong, enjoy the incredibly easy-to-prosecute lawsuit.

Patient might be lying, but the biggest clue is e.g. smell of alcohol on their breath? Now you need some sort of olfactory sensor kit and training for it--a lot more than just "low quality body cam and a mic".

Patient's ODing on a street drug that became abundant in the last few months? Gotta somehow learn about recent local medical/police history that post-dates the training set, or else you might be pouring gas on a fire if you give them Narcan. And that's assuming you know enough to search for information about that drug, and that they didn't lie to you about what they took. Addicts never do that.

Failures in each of those systems bring down the chance of an effective diagnosis, so they need a fairly obsessive amount of model introspection/thinking/double-checking, and humans on standby as a fallback if the AI's less than confident (assuming that LLMs can be given a sense of a confidence level in the future, versus the current state of the art of "text-predict a guess about what your confidence level might be").
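To put toy numbers on that (entirely made-up per-stage success rates, and assuming independence, which is generous):

  # Compound reliability of a multi-stage diagnostic pipeline. Every rate
  # below is invented for illustration.
  stages = {
      "chart ingestion": 0.98,
      "image/OCR pass": 0.95,
      "external APIs (weather, etc.)": 0.99,
      "video/audio 'is the patient lying' model": 0.90,
      "recent local drug-supply knowledge": 0.92,
  }

  overall = 1.0
  for name, p in stages.items():
      overall *= p

  print(f"chance every stage works for one patient: {overall:.2f}")
  # -> about 0.76 even with these optimistic numbers, before correlated
  #    failures or the cost of a confidently wrong answer.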

Put that all together, and even with the AI compute speed available years from now and a perfectly trained futuristic model that's preternaturally good at this stuff, I'm not sure that the reliability and, more importantly, the turnaround time of that diagnostic pass are going to be any good compared to a human ER doc.


I'll copy what I wrote on LinkedIn (note: I read roughly 25 pages, which is half the paper, and read it quickly)[0]:

"If I read the paper correctly, they don’t actually show that LLMs prefer resumes they generate.

Their actual method seems to be taking a human written resume, deleting the executive summary, having an LLM rewrite the executive summary based on the rest of the resume and then having another LLM rate the executive summary without the rest of the resume.

That’s likely to massively overstate any real impact, if you can even rely on it capturing a real effect.

I really wonder if I read that correctly, because I can’t come up with a justification for that study design."

[0] I couldn't help but mildly copy-edit before pasting here.

Edit: yes, the authors present a reason for their design, and an ideal version of my comment would've said that. I do not consider it much of a justification. See below: https://hackertimes.com/item?id=47987256#47987727.
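For anyone who doesn't want to dig into the paper, here is the design as I understood it from those ~25 pages, sketched as trivially runnable Python (the names and stub models are mine, not theirs, and this is my reading rather than a quote of their method):

  from dataclasses import dataclass

  @dataclass
  class Resume:
      executive_summary: str
      body: str  # work history, education, etc.

  # Stand-ins for the two models; in the paper these are real LLM calls.
  def writer_llm_rewrite_summary(body: str) -> str:
      return f"[LLM-written summary conditioned on: {body[:40]}...]"

  def rater_llm_rate(summary: str) -> float:
      return 0.5  # placeholder score

  def study_design(resume: Resume) -> tuple[float, float]:
      # My reading of their method: the writer LLM regenerates only the
      # executive summary from the body, and the rater LLM then scores
      # each summary in isolation, never seeing the rest of the resume.
      # That last step is the part I think the abstract glosses over.
      llm_summary = writer_llm_rewrite_summary(resume.body)
      return rater_llm_rate(resume.executive_summary), rater_llm_rate(llm_summary)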


Could be an ad for 'use LLMs more'. A generic ad like this helps all in the market, but if you own 30% of LLM market share, it still helps you 30% of the time.

Now that I think of it, every other industry has an 'advocacy group', whether cheese, oil, or nutmeg. So surely there is now some sort of LLM 'consortium', and group funding studies like this just fuels the FOMO. You can be sure such groups exist, and are pummeling every government in the world thusly. But I bet they're also looking here.

After all, it's a circle. Uh-oh! HR is using LLMs, you'd better too, potential employee! Then later? Uh-oh! The best employees you can hire are using LLMs, you'd better too, HR!

They already FOMOed us into basically everything else, why not LLMs too?


[flagged]


There is some creativity in the rest of the CV, between what kind of experiences are included and how they are described. But that would be far harder to generate fairly.

I think choosing the summary is a fair design choice since it prevents the LLM from just... making up a perfect candidate.

"I'm a fullstack professor of software design with 90 years of experience expecting a junior internship position"


I assume they meant they can't come up with a reasonable justification.

Thank you, that's correct.

To be perfectly clear, I understand their justification for only _editing_ the executive summary; it is arguably reasonable, because editing the work history would risk altering the details in ways that compromise the measurement. This is a hard problem to solve (you might try reviewing the resumes for hallucinations, but I can't think of a precise study design that doesn't risk problems).

What is, imho, impossible to defend is having the LLM evaluate only the executive summary in isolation, and reporting that as the LLM preferring resumes it wrote.

What they've shown is that LLMs prefer executive summaries they wrote. But the overall impact on how an LLM will evaluate your entire resume is not measured by this technique.

Worse, this isn't just "decent paper, bad summary", their abstract misreports their findings.


> Worse, this isn't just "decent paper, bad summary", their abstract misreports their findings.

What findings are being misrepresented? Their claims seem supported by their results to me. You can question the generality of their claims based on the limitations of their methods, but that does not amount to "misreporting" the conclusion.


I doubt it since they, admittedly, didn't read it. The question he posed, about the paper, is answered in that very same paper. He has structured his whole reply to have the tone of uncovering the hidden caveat in the small print that invalidates the paper, when it's actually a straightforwardly stated assumption in their methodology section.

Now that they've confirmed that was in fact what they meant, how have your views on this exchange changed?

> how have your views on this exchange changed?

Not at all, because I am critiquing the author's writing, and for that I don't need to speculate on his intentions. He wrote a comment where he misrepresents the arguments in the paper, while explicitly saying he didn't bother to read it. That's not good enough.

The author of said comment now comes in, after getting criticized, and claims that "yes, I meant that all along" and appends a note about not considering it "much" of a justification. He did not question the justification in the paper; his claim was "I can’t come up with a justification", implying the paper has NO justification for the design. His criticism of the abstract as not covering the design of the experiment rings hollow when he can't be bothered to read the paper itself.

That being said, I am happy that he went back and read the justification, and I do think it's valid to question the conclusions drawn from the design of the study. I too wonder if this result would replicate had the models been provided the entire resume. I too think presenting the model with the entire reconstructed resume would have been a stronger test.


I very specifically said I read 25 pages of it in the first post of this thread. I didn’t go back; I haven’t looked at the paper since yesterday.

I read their methods and their explanation and judged them to be lacking.

The fact is, they did not measure that LLMs prefer LLM-authored resumes, but that is what their paper stated.

They measured that LLMs prefer LLM-authored executive summaries, which is a weaker claim.


The references start at page 28, and then the rest is appendices. If you'd read those last 3 pages you could say you'd read it all, and then maybe you could have an opinion about it.

You have to separate those two issues though. You spew out an opinion about a paper you haven't read. That's bad no matter what your opinion is. Don't blast your opinion out into the world if you haven't bothered to actually think about it first. That's one issue. A second issue is that I think your opinion, which you formed without reading the paper, is wrong. They have in fact provided a justification; you just don't feel like it counts. I decided to join those two, because not reading the justification would explain why you didn't believe they had one, but that coupling isn't necessary.

They did still provide a justification; you said they didn't. That's wrong. Now you're saying that you don't find it convincing; that's perfectly OK, but you then extend that claim into an accusation of misreporting. That's where you go off the rails again. They are accurately reporting what they have observed and concluded. They have provided justification for that conclusion, and all of that is, to my eyes, reported accurately.

The reason you can, accurately and correctly, claim

> They measured that LLMs prefer LLM-authored executive summaries

is exactly because their paper accurately states what they are measuring and why they believe the conclusion extends to a more general claim. You treat it as though it's some bombshell discovery, but they tell you, right in the fucking text.

If I had to revise my opinion, I guess I'd say I now no longer believe you didn't read the paper, but instead that you don't know HOW to read scientific papers.


> If I had to revise my opinion, I guess I'd say I now no longer believe you didn't read the paper, but instead that you don't know HOW to read scientific papers.

This is the kind of opinion you ought to keep to yourself, because it's inflammatory and uninteresting. There's no discussion to be had about your views on their competence. Downvote comments you think are bad without centering some other person's alleged failings in the conversation.


> They state that unlike the rest of the resume, which is largely factual

Largely factual? A resume is usually more than a bunch of dates and titles of positions.


I love how these articles drop, and all of a sudden HN is filled with people who think engineering productivity is simple to measure.

Yes, productivity implies revenue (or cost reduction), and revenue is measurable.

However:

1. You spend money today to build features that drive revenue in the future, so when expenses go up rapidly today, you don’t yet have the revenue to measure.

2. It’s inherently a counterfactual consideration: you have these features completed today, using AI. You’re profitable/unprofitable. So AI is productive/unproductive, right? No. You have to estimate what you would’ve gotten done without AI, and how much revenue you would’ve had then. (See the toy numbers after this list.)

3. Business is often a Red Queen’s race. If you don’t make improvements, it’s often the case that you’ll lose revenue, as competitors take advantage.

4. Most likely, AI use is a mixture of working on things that matter and people throwing shit against the wall “because it’s easy now.” Actually measuring the potential productivity improvements means figuring out how to keep the first category and avoid the second.
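To put toy numbers on point 2 (all of these figures are invented), the same observed outcome is consistent with AI paying off handsomely or losing money, depending entirely on the counterfactual you never get to observe:

  observed_revenue_with_ai = 10_000_000
  ai_spend = 2_000_000

  # Only the (unobservable) counterfactual changes between these rows.
  for label, counterfactual_revenue in [
      ("AI looks great", 7_000_000),    # without AI we'd have shipped less
      ("AI was a wash", 10_000_000),    # we'd have gotten there anyway
      ("AI lost money", 11_000_000),    # it distracted us from better work
  ]:
      net = observed_revenue_with_ai - counterfactual_revenue - ai_spend
      print(f"{label}: net impact {net:+,}")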

This isn’t me arguing for or against AI. It’s just me telling you not to be lazy and say “if it were productive you’d be able to measure it.”


> HN is filled with people who think engineering productivity is simple to measure.

I think the prevailing (correct) consensus is that developer productivity is actually very hard to measure, and every time it is attempted, the measure is immediately made a target, making the whole thing pointless even if it had been a solid measurement (which it wasn't).

IDK where you're getting the idea here that measuring productivity of anyone who isn't a factory worker is easy.


I do not think it is easy, like I said. I am saying other people are acting like it’s easy.

See the second comment on this article. https://hackertimes.com/item?id=47976781

See @emp17344 responding to me.


That second comment isn't making that statement though.

It's saying that cost vs. revenue is something we can see.

If I buy a plow for $2,500 and it enables growth of $5,000, then arguing "the plow was expensive" is a moot point.

It doesn't make any argument about measured productivity, only investment vs return.


The difficulty in measuring productivity is the attribution. How do you know the new plow enabled growth?

Because of the trend, and because you're changing fewer variables.

If you could actually prove that, you wouldn’t be posting it on HN; you’d be shopping for a mega yacht.

What?

There are hundreds of MBAs who know this, and it’s used to squeeze the workforce.

That’s why it’s the default thinking for them: because it works sometimes.

I think you missed something.


Is it easy to measure a factory worker's productivity? It would seem surprising and interesting if every job's productivity is hard to measure except for one particular kind.

Any job where there's a definable output can be measured. Factory workers are one type.

Others might be farmers: whether they're able to yield x tonnes of valid crops out of y acres.


Minor shifts in productivity are hard to measure. Major jumps in productivity would be obvious. I think it’s clear that, if AI is affecting productivity, it’s to a minor degree at best.

I think it will make things go backwards.

The big leaps in productivity come from really great ideas that are formalised into concepts that then take form.

That comes from being in a meditative state, not from blasting output at a higher rate.


Maybe. It also lets people build things that never would have existed before. My hobby is competitive pinball. There are multiple new stat and tournament tracking apps that have been vibe coded by people who never would have written code by hand.

So..?

If it was genuinely worth building before, you would have. Having some kind of cost involved is a force of nature that invokes one to decide whether it is worth doing it or not.

Moreover, these activities only serve to enhance the wealth and interests of the few. Congrats. Don’t forget to look in the mirror.


It’s not operated for profit. People can just solve (some of) their problems by talking to computers.

What are you talking about? LLM producers are not a charity.

Obviously. You didn’t read correctly. Try again.

If it were 10x productive, you'd be able to measure it indirectly; you'd be unable to avoid measuring it. So the initial claims were clearly lies. The research question is:

  Is it >1.0x productive?
I agree that's very hard to measure. But given what this shit costs, it had better be answerable, and the multiple had better justify the cost.

> You spend money today to build features that drive revenue in the future

Totally, but new features in their app or better software are not going to increase Uber's revenue/profit significantly.


This is the message that somehow the tech industry is constitutionally incapable of absorbing. The "innovation impulse" is cancer. I have no idea why tech managers keep harping on about "innovating", it's so bizarre.

I mean, the options are not just zero productivity or some productivity: it could be negative.

We doubt the productivity because we have enough experience with Claude Code to know that flooding your organization with that many tokens isn't just unproductive, it's actively harmful.


I think this site is doing a binary search, so that you narrow down on a boundary.

It would be much funnier, and also more insightful, if it didn't do this and let you contradict yourself.
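For what it's worth, here's roughly how I imagine that narrowing works (a guess at the mechanism, not the site's actual code): binary search on hue, halving the interval after each "blue"/"green" answer, so every response just shrinks the range and you can never formally contradict yourself.

  def find_boundary(is_blue, lo=120.0, hi=240.0, trials=12):
      # Binary search for a personal blue/green boundary on the hue axis
      # (roughly 120 = green, 240 = blue in HSL degrees). `is_blue` is the
      # user's judgment of the midpoint hue on each trial.
      for _ in range(trials):
          mid = (lo + hi) / 2
          if is_blue(mid):
              hi = mid   # boundary is at or below this hue
          else:
              lo = mid   # still reads as green; boundary is above
      return (lo + hi) / 2

  # A deterministic stand-in user who calls anything past 195 degrees "blue":
  print(find_boundary(lambda hue: hue >= 195))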


Yeah, as I was toggling "blue" / "green" / "blue" / "green" I had the distinct sensation that it might just show me that I was in a region where I couldn't even make a consistent distinction.


Unless kids have gotten a lot faster in the past 25 years, I think that's a lot better than a typical 2,000-person high school.


> “The raw output of ChatGPT’s proof was actually quite poor. So it required an expert to kind of sift through and actually understand what it was trying to say,” Lichtman says. But now he and Tao have shortened the proof so that it better distills the LLM’s key insight.

Interestingly, it was an elegant technique, but the proof still required a lot of work.


Which is difficult, because the fact that you can come up with your example questions tells us they're probably not very dangerous. Plenty of ink has been spilled about how LLMs could help people create bioweapons. The basic idea "you could do dangerous things with an LLM" is already pop culture, and you're not doing anything dangerous by giving easy example questions.

A dangerous question would have to be along the lines of "Could I use unobtanium with the Tony Stark process to produce explosives much more powerful than nuclear weapons?" so that the question itself contains some insight that gets you closer to doing something dangerous.

Perhaps the reason for not publishing the questions is twofold: 1) they want a universal jailbreak that can get the model to answer any "bad" question, and 2) they don't want bad publicity when someone not under NDA jailbreaks their model and answers their question.


> because the fact that you can come up with your example questions tells us they're probably not very dangerous

Maybe I know more about this field than you think.

There are biologists on video saying that present-day models have expert-level wet-lab knowledge and can guide a novice through whole procedures.

Models were also able to tweak DNA sequences to make them bypass DNA-printing companies' filters.

> they don't want bad publicity when someone not under NDA jailbreaks their model and answers their question

Just as people now pay $500k for Chrome vulnerabilities, soon people will pay similar amounts to jailbreak models into doing bad things.


> There are biologists on video

Link handy?


[flagged]


Who are you quoting?


Well, yes. If you set up a nuclear lab in your house next door to me, I'm calling the feds.

Things that are potentially dangerous to others when mishandled get regulated because some individual or some company abuses them and harms others.


That's a real question; maybe the changes are useful, though I think I'd like to see some examples. I do not trust cognitive complexity metrics, but it is a little interesting that the changes seem to reliably increase cognitive complexity.


I haven't previously thought about this, but I think words over a commutative monoid are equivalent to vectors of non-negative integers, at which point you have vector addition systems, and I believe those are decidable, though still computationally incredibly hard: https://www.quantamagazine.org/an-easy-sounding-problem-yiel....
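Concretely (my own illustration of the standard construction, often called the Parikh image): once letters commute, a word is determined by how many times each letter appears, and concatenation becomes component-wise addition of those count vectors.

  from collections import Counter

  def parikh_vector(word: str, alphabet: str = "abc") -> tuple[int, ...]:
      # Map a word to its vector of letter counts; once letters commute,
      # this vector is all the information the word carries.
      counts = Counter(word)
      return tuple(counts[ch] for ch in alphabet)

  def add(u, v):
      # Concatenation of commutative words = component-wise vector addition.
      return tuple(x + y for x, y in zip(u, v))

  # "abca" and "caab" are the same element of the free commutative monoid:
  assert parikh_vector("abca") == parikh_vector("caab") == (2, 1, 1)
  # and concatenation corresponds to adding the vectors:
  assert add(parikh_vector("ab"), parikh_vector("ca")) == parikh_vector("abca")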


Thanks, that's an interesting tidbit!

(The whole thing made me think about applications to SQL query optimizers, although I'm not sure if it's practically useful for anything.)

