Hacker News | _fbpp's comments

Given a set of instructions, an instruction fine-tuned/aligned LLM is able (conditional on size and training quality) to reason through a set of steps to produce a desired output.

This is plainly wrong. The model's growing size makes it better at guessing the outcome of a reasoning task, but little to no actual reasoning is performed.

It's trivial to prove this as well, as LLMs will still fail miserably at (larger) math problems that even basic computer algebra systems will handle with ease.


> The model's growing size makes it better at guessing the outcome of a reasoning task, but little to no actual reasoning is performed.

If there's no observable difference between the behaviours, why not call it as the post did?

> LLMs will still fail miserably at (larger) math problems

They're neither trained on such problems, nor is that a goal for LLMs. They can however tell you how to convert that problem into steps that can be run in an algebra system.
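For instance, the handoff might look like the following sketch (SymPy is my choice of algebra system here, not something from the post):

```python
import sympy as sp

x = sp.symbols('x')

# A computer algebra system solves this exactly, no matter how large
# the coefficients get - precisely where guessing the outcome breaks down.
solutions = sp.solve(sp.Eq(x**2 - 10**40, 0), x)
```

The CAS returns the exact integer roots ±10^20 rather than a floating-point approximation.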


You're missing the point; there is a difference: the answers are often wrong, and more wrong the more complex the question gets.

They're only able to answer simple, straightforward reasoning questions (relative to the model's size). Which is a nice party trick, but not broadly useful.

> They can however tell you how to convert that problem into steps that can be run in an algebra system.

Usually they can't do that very well either. Converting a problem from one description to another is algebraic reasoning, subject to the issues already mentioned.

What they can do is summarize general instructions and documentation, provided adequate training data was available.

> They're neither trained on such problems, nor is that a goal for LLMs

Yes. But LLMs keep being pushed for tasks that heavily involve abstract reasoning, which is dangerous as they're unsuited for it (e.g. any code generation that isn't mere empty boilerplate).


OpenAI is experimenting with process supervision, which corrects many of these issues:

Problem: Let x^8 + 3x^4 − 4 = p_1(x) p_2(x) ⋯ p_k(x), where each non-constant polynomial p_i(x) is monic with integer coefficients and cannot be factored further over the integers. Compute p_1(1) + p_2(1) + ⋯ + p_k(1).

Model attempt

Here, GPT-4 successfully performs a complex series of polynomial factorizations. The use of the Sophie-Germain identity in step 5 is an important step that could be considered insightful.

1. I notice that the given polynomial has even degree and only even powers of x, so I can try to make a substitution to simplify it.

2. Let y = x^4; then the polynomial becomes y^2 + 3y − 4, which is a quadratic in y.

3. I can factor this quadratic as (y + 4)(y − 1), so the original polynomial is (x^4 + 4)(x^4 − 1).

4. Now I need to factor each of these factors further, if possible, over the integers.

5. For the first factor, x^4 + 4, I recall the Sophie Germain identity, which says that a^4 + 4b^4 = (a^2 + 2b^2 + 2ab)(a^2 + 2b^2 − 2ab).

6. Applying this identity with a = x and b = 1, I get x^4 + 4 = (x^2 + 2x + 2)(x^2 − 2x + 2).

7. For the second factor, x^4 − 1, I can use the difference of squares formula, which says that a^2 − b^2 = (a + b)(a − b).

8. Applying this formula with a = x^2 and b = 1, I get x^4 − 1 = (x^2 + 1)(x^2 − 1).

9. Then I can apply the difference of squares formula again with a = x and b = 1, to get x^2 − 1 = (x + 1)(x − 1).

10. So the original polynomial is (x^4 + 4)(x^4 − 1) = (x^2 + 2x + 2)(x^2 − 2x + 2)(x^2 + 1)(x + 1)(x − 1).

11. These are all the irreducible factors over the integers, so p_1(x) = x^2 + 2x + 2, p_2(x) = x^2 − 2x + 2, p_3(x) = x^2 + 1, p_4(x) = x + 1, and p_5(x) = x − 1.

12. To find the sum of their values at x = 1, I just need to plug x = 1 into each factor and add them up.

13. I get p_1(1) = 1^2 + 2·1 + 2 = 5, p_2(1) = 1^2 − 2·1 + 2 = 1, p_3(1) = 1^2 + 1 = 2, p_4(1) = 1 + 1 = 2, and p_5(1) = 1 − 1 = 0.

14. So the final answer is 5 + 1 + 2 + 2 + 0 = 10.

15. Answer: 10
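The factorization chain above can be checked mechanically; a quick sketch with SymPy (my choice of CAS, not part of the original excerpt):

```python
import sympy as sp

x = sp.symbols('x')

# Factor x^8 + 3x^4 - 4 into monic irreducible polynomials over the integers
_, factors = sp.factor_list(x**8 + 3*x**4 - 4)

# Evaluate each irreducible factor at x = 1 and sum the values
total = sum(p.subs(x, 1) for p, _mult in factors)
```

`factor_list` returns the same five factors, and the sum comes out to 10, matching the model's answer.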


It's an impressive result, but shouldn't be seen as a "correction". Framing it as a (drastic) reduction in mistakes is more useful here.

If the model is productionized (read: dumbed down so it isn't as expensive to run), the reasoning abilities drastically decline again.

And these reasoning abilities are still around a language model, rather than around abstract models.

This is a very effective party trick for general math, whose language quite directly maps onto these abstract concepts, but there are some holes. Information about e.g. which values may be zero isn't encoded in the language, and so this approach is liable to blundering around division-by-zero issues.

If you want a particular example to toy around with, LLMs are not fond of quaternions and their conversion to other representations.
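For anyone who wants to see why this is a good stress test: the quaternion-to-rotation-matrix conversion is short but full of sign conventions to silently get wrong. A sketch (Hamilton convention, w-first; the function is mine, purely illustrative):

```python
import math

def quat_to_rotmat(w, x, y, z):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    n = math.sqrt(w*w + x*x + y*y + z*z)
    w, x, y, z = w/n, x/n, y/n, z/n  # normalize defensively
    return [
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ]
```

Flipping any one sign still yields a plausible-looking matrix, which is exactly the kind of error that slips through when the answer is guessed rather than derived.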


Which means they're this close to being able to reach out to an algebra system, run the steps, and return you the result. I was just talking about this problem with someone the other day - how can it recognize that it doesn't have the answer but knows where it can get data so that it can form an answer? This seems to be the path Google is taking.


There's some argument to be made that a form of reasoning happens in a roundabout way when the AI is told to explain its reasoning.

For example if you tell it "Do <thing>" and then open a new context and say "Do <thing>, explain your reasoning beforehand." you will often get a more accurate response.

Granted, it's not that any "Hmm, let me think about that." Deep Thought reasoning occurs, but simply that predicting what the reasoning would look like and then predicting what comes after that reasoning results in a more accurate - and ironically, reasoned - response.
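In other words, the trick is pure prompt construction; a minimal sketch (the wrapper and its wording are hypothetical, not any particular API):

```python
def with_reasoning(task: str) -> str:
    # Asking for the "reasoning" first means the final answer is then
    # predicted conditioned on that generated reasoning text.
    return task + "\n\nExplain your reasoning step by step, then give the final answer."

prompt = with_reasoning("Do <thing>")
```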

Kinda funny actually, it's a bit like how in Hitchhiker's Guide they just had to tell the probability machine to calculate the odds of an improbability drive in order to create it.


This is where the terminology becomes a bit annoying, but there is a key difference in the kinds of reasoning at work here.

When you ask LLMs to provide reasoning, the actual reasoning performed is linguistic; the LLM has (is) a model of language and performs some (limited) reasoning on that model to get an output.

But that is explicitly different from reasoning about the abstract question at hand, thus the answer is mostly a guess.

The key difference to observe is that "semantic reasoners" like computer algebra or prolog, always maintain correctness within the axioms provided. They may slow down significantly as questions get more complex, but they do not start providing wrong answers. Computers are flawless mathematicians, provided they are programmed correctly.
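A tiny illustration of that "always correct within the axioms" property, using exact rationals from the Python standard library: the same sum computed in two different orders stays exactly equal, whereas floating point would accumulate order-dependent error.

```python
from fractions import Fraction

# Exact rational arithmetic: no error accumulates, regardless of step count
forward = sum(Fraction(1, n) for n in range(1, 1001))
backward = sum(Fraction(1, n) for n in range(1000, 0, -1))
assert forward == backward  # exactly equal, always
```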

LLMs do provide increasingly more-wrong answers as the question gets more complex. Thus we can observe that LLMs do not abstractly reason about the question and its model.


>Thus we can observe that LLMs do not abstractly reason about the question and its model.

Your conclusion makes no sense. Humans provide increasingly wrong answers as questions get more complex too. Jumping from that to "incapable of abstract reasoning" is silly. You have not "trivially proven" anything at all.

>The LLM has (is) a model of language and performs some (limited) reasoning on that model to get an output.

LLMs generalize to non linguistic patterns.

https://general-pattern-machines.github.io/


> Humans provide increasingly wrong answers as questions get more complex too.

Human this, Human that. LLMs aren't humans. "My model is crap but the human brain isn't very good at this either" is irrelevant when we have machines that are not only very good at these tasks but almost perfect at them.

Humans make such mistakes precisely because they are not perfect reasoning machines. To compare LLMs to humans is not only disingenuous, but proves my point.

(And no, I will not humour you with an argument about how the amount of wrong answers is drastically lower from human mathematicians)

> Jumping from that to "incapable of abstract reasoning" is silly.

They are language models. It is explicitly what they are designed to do.

If these LLMs are not, as I claim, merely reasoning on language rather than on the abstract model of the query, then how come they fail miserably in exactly the ways you would expect were that the case?

> LLMs generalize to non linguistic patterns.

Yes, congratulations, if you turn a problem into a linguistic one LLMs can deal with them. This does not in any way go against what I said about the capabilities of LLMs.

The same levels of actual abstract reasoning can be achieved on a graphing calculator running off literal potatoes.


>Human this, Human that. LLMs aren't humans.

You said you trivially proved something and made up nonsensical lines of reasoning to justify it. If your "proof" can't port to Humans then it's not proof. You are just rambling.

>Humans make such mistakes precisely because they are not perfect reasoning machines.

Nobody is calling LLMs perfect reasoning machines. Your "point" was that they don't reason at all, which none of your ramblings has been able to "prove".

>If these LLMs are not, as I claim, merely reasoning on language rather than on the abstract model of the query, then how come they fail miserably in exactly the ways you would expect were that the case?

They don't. The idea that you must make no mistake reasoning before you can be considered to be reasoning has no ground.

>LLMs generalize to non linguistic patterns. Yes, congratulations, if you turn a problem into a linguistic one LLMs can deal with them.

Can you read? Did you even bother looking at the link? LLMs don't need patterns to be linguistic to reason over them lol. None of those patterns are turned linguistic. Some of them are arbitrary numbers that resemble nothing like the data they've been trained on.


> If your "proof" can't port to Humans then it's not proof

Learn to take a hint. I'm not going to argue this on human terms because you're playing a dumb um-akshually game.

Computer reasoning systems can solve vastly more complex problems perfectly. Expert mathematicians can solve vastly more complex problems with only minimally increased errors. The ability of LLMs to solve reasoning problems completely disintegrates when the problems get more complex.

Trying to argue that LLMs are like humans because you can put these three into the buckets of "no mistakes" and "some mistakes" is ridiculous.

> Nobody is calling LLMs perfect reasoning machines.

Yes.

You said humans make mistakes; my point here is that humans make mistakes precisely because they stop reasoning and start blindly pattern-matching an estimate of the answer.

> The idea that you must make no mistake reasoning before you can be considered to be reasoning has no ground.

Reading comprehension.

I did not say no mistakes. I said that the failure pattern follows that of estimated guesses: rapidly increasing errors as the size of the problem increases.

Whereas with computer reasoning, the rate of errors does not increase at all. And with (expert) humans the rate only goes up a little.

> Did you even bother looking at the link?

You are missing the point.

I am not referring to literally English or any other language. I'm referring to the structure of language problems, which is vastly simpler than any moderately complex math or programming problem.

To spell out the reason for my unimpressedness more explicitly: they trained a pattern-repeating machine and found that it will repeat some of their patterns, some of which were patterns it was trained on.

This does not demonstrate the ability to reason abstractly about new models, so I do not care.


Seems like a blurry line between "reason" and "guessing."

Kind of like how an educated guess by a professional is often more accurate than a well-reasoned opinion of a layman.

The professional may not have reasoned it so much as intuited, but within that intuition is a lot of wisdom.

I suppose "predicting" is a more precise word than guessing or reasoning.

Guessing implies an arbitrary nature, reasoning implies understanding the concepts at some level.


You're assuming here that there has to be "real" value at the root. This isn't really true.

Astrology, Tarot, the I Ching, or any other kind of divination all serve the same purpose: To provide certainty where there is none. To measure the unknowable.

People fear the unknown and risk, divination lets them feel like they have some certainty about the future.

Myers Briggs, DISC, and all the other "personality tests" are the same thing, for contemporary times.

They provide no actual measurement of applicants; Myers-Briggs is especially easy to cheat and scientifically dubious.

The benefit is that managers feel like they're taking less risk when hiring, but that is mere delusion.


All of these old ideas are simply ways to look at yourself from the outside. Any value is in the insight they provide about oneself.


The issue with them is that it's not simply "looking at oneself".

If you were using divination for that purpose, then it's no issue. Harmless superstition is fine.

But things like personality tests and other pseudoscience see regular use in hiring and promotion. And that's just ridiculous, damaging for both "honest" applicants and the company, as such processes favour dishonest people.


Do personality tests see regular use in hiring and promotion? What are some examples of that? My workplaces have offered those sorts of things as professional development, but not for direct promotion or hiring practices. I would be fascinated to see the outcome of a place that does use those things in that way.


Long ago, the first place I worked developed a HyperCard stack for Myers-Briggs evaluations for a specific company. The company used it as a tool to improve communications between existing employees. The purpose is to give everyone the same language. Fundamentally, one is not simply in one category but can move through all the categories based on their current state and context.

Helping people have language to express their thoughts has value.

Misuse of any tool is a problem.


In my experience, yes - at an earlier role, all employees were subjected to a DISC assessment at hire. This was at the headquarters office of a large real estate franchise. Results were kept in your file, and were a big component during reviews.

The biggest flavor-aid drinkers at the company used their assessment results as shorthand to justify shitty behavior: "oh, person X is a 'High-D', of course that's why they co-opted the meeting, were abrasive, made everyone else feel small and insignificant". If you did not test with a high decisiveness level, it was absolutely brought up in promotion conversations. High-D, High-S, etc. all became quick qualifiers to know where someone's career was headed.

Knowing strong dominance was likely an attribute valued by the company, I took the assessment with that in mind, resulting in a high dominance level (I'm probably middle of the road). That it was so easy to game made me lose all respect for their application of these assessments.


These are more like frameworks for imagining & interacting with the complexities of reality. Similar to an interactive philosophy. Worldviews cannot be avoided & nobody holds a purely objective viewpoint...anyone who claims to hold a purely objective viewpoint is a liar or delusional.


> anyone who claims to hold a purely objective viewpoint is a liar or delusional.

Is this purely objective? Can you think of no exceptions, under any scenarios?


Confusing the subjective for the objective is a widely studied phenomenon. It's called reification and people generally deny doing it. There's a strong ego defense mechanism that raises when it's pointed out that shuts down conversation.


The utterly bizarre thing is how so many (I would say the VAST majority of) smart people are unable to overcome it when discussing certain subjects.

Like sure, I can certainly understand the initial incident, heuristics are a bitch...but what is so bizarre is that when people are in this state, there seems to be literally nothing that can draw them out of it. I have done many, many thousands of experiments in this area, it is uncanny.


> They provide no actual measurement of applicants, Myers-Briggs is especially easy to cheat.

Myers-Briggs aside, I take issue with this specific argument in any context: just because a dishonest party can cheat a test doesn't mean the test itself is worthless. I can pass a math test with a calculator without even knowing math (oh, put this symbol next to that symbol and hit the = key?), but that doesn't mean the test is worthless.


I was once skeptical of the usefulness of personality tests, but reading Principles by Ray Dalio convinced me they can be used to build well-oiled organizations.


The I Ching is very ambiguous and open to interpretation, as Philip K. Dick shows in The Man in the High Castle, or you can try for yourself. Whatever else it's doing, it is not providing certainty.


That's the trick. It's about feelings of certainty, not actual measurable, reproducible predictions.

Most long-lived divination methods are very vague. Anything providing concrete predictions is easily proven wrong and discredited, only the vague survives.

But people rarely take ambiguous answers for what they are, and instead interpret them into something more certain.

And this lets divination exploit all kinds of biases. On top of the regular old confirmation bias, whenever the interpretation turns out wrong, people don't write off the divination method, but assume they merely "interpreted it wrong" (and often, the vagueness means they can retcon an interpretation that is true), and worse yet, assume that now they're better at interpreting so next time it's going to be a correct prediction.

Observe how little the personality tests actually say, they're just as ambiguous.


Nitpick: The five-factor scale (extroversion, agreeableness, openness, conscientiousness, and neuroticism) both reproduces and makes good life predictions.

MBTI and most other personality tests unfortunately seem to be astrology for the scientifically oriented.


The correlation between MBTI and the Big-5 is surprisingly strong (except for neuroticism, which has no representation in the MBTI):

https://en.wikipedia.org/wiki/Myers%E2%80%93Briggs_Type_Indi...

I look at personality typologies in general as approximations. If you actually want to get to know a person, get to know them. But when you need to make a snap judgment about how someone might see the world and react to you, on very little information, it's handy to have a general archetype as guidelines. You can fill in all the details later.

It's not unlike Carmack's fast inverse square root, a Bloom filter, or how Google hasn't used PageRank since 2006 (instead substituting a cheaper-to-calculate approximation). Yes, they give wrong answers. But the answers are usually close enough that when you lack the computing power to get a better result, they'll do.
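To make the analogy concrete, here's a toy Bloom filter (illustrative only): it answers membership queries in tiny memory, at the cost of occasional false positives, and never false negatives.

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # use a big int as the bit array

    def _positions(self, item):
        # Derive k bit positions from salted SHA-256 digests
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # False positives are possible; false negatives are not
        return all(self.bits & (1 << pos) for pos in self._positions(item))
```

Like the personality-archetype shortcut, the answer is "probably" rather than "certainly" - but it's cheap.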


The biggest issue w/ MBTI and others like it is the desire to put you into a category as an outcome. This is really bad for reliability of the test - take it again and change just a few answers, you may be in a different category altogether! But it's often great for sparking team discussions on norms, behaviors, etc. But the sheen of science and validity can be misleading.


That's reasonable.

As I see it, even assuaging fear of the unknown seems like it could be a valuable benefit. Especially if freedom from fear helps someone to make better decisions based on the information that they have.


I see this as why humans love stories so much. An explanation with a beginning, a middle, and an end. The human condition is facing the unknown.


Heuristics also provide the same benefits, and are the source of the "facts" in your comment.

Rationalism is a lot like astrology in many ways, interestingly.


A transformer can only memorize; it doesn't learn to do.

For what that concerns us here: LLMs will never learn to fact-check anything. They'll blindly regurgitate the facts they have been "taught", but never consider or evaluate "the paper cited for this fact on wikipedia is a bunch of bullshit".

Any attempt to use them to produce "facts" is ultimately just folly, in the same way Google's attempt to do so with its search engine index is.


> [LLMs] never consider or evaluate "the paper cited for this fact on wikipedia is a bunch of bullshit".

Nor do people, though! This is setting the bar way too high.

The whole point of having edited reference sources like "encyclopedias" is that we can rely on the expertise of the editors in lieu of having to develop that expertise ourselves[1].

No, an LLM that simply knows a priori (via prompt hacking) which sources are trustworthy would be absolutely comparable to the way an educated-but-non-expert human approaches sources.

[1] Which is a chicken and egg problem anyway. Everyone starts with edited reference sources as tutorial material. Quite frankly everyone starts learning with wikipedia.


> This is setting the bar way too high.

No. If these things are claimed to be sources of truth, then the bar needs to be that high.

It is precisely because people don't fact-check that the bar has to be so high.


> If these things are claimed to be sources of truth

That's a strawman, though. No service, nor human, "claims to be a source of truth" in the kind of profound sense you seem to be using. It stops, everywhere, at "Wikipedia (or whatever) said it and I trust it".

The only way to get access to deeper expertise is to (1) BE an expert and (2) engage in a discussion with another.


No, a transformer is a universal function approximator and is capable of learning to do anything to some degree of accuracy.

GPT doesn't do math correctly but it also doesn't just memorize it.


All this under the auspices of "ethics", which as a reminder, is just an arbitrary set of rules which someone is trying to pass off as having a divine origin.

They are arbitrary only in the sense that, sans a religion's God laying down ethics from on high, all ethics are arbitrary human creation.

But we care about ethics all the same. Engineers care about ethics because if they do not, they kill people. Civil engineers care because faulty work kills people. Electrical engineers care because faulty work kills people. Mechanical engineers care because faulty work kills people.

And as software engineers, we should care because our careless work kills people.

And if we do not care, the government will force our hand. And they will not listen to pleas for them to be reasonable.

Software engineers abused people's personal data, the EU's GDPR has outlawed using all but the absolute minimum personal data. Meta has been told in court by the CJEU that advertising is not an acceptable use of personal data, even if needed to pay the bills. Ad-tech is a doomed industry.

So it's your choice. Start caring about ethics. Or the government locks our field in so much regulatory gridlock that you will wish they just outlawed it entirely.


I wouldn't use "ethics" to describe any of the rules in place to ensure the quality of the designs or deliverables in civil/mechanical/electrical engineering.

I'm glad that there are building inspectors, and safety standards for consumer electronics. I'm also glad that those rules are written clearly in terms of lbs and amp*hours and volts, and not philosophical terms.

If you want rules, say you want rules, but don't try to hide the fact that some of your rules will have no objective justification, and will be based on the personal philosophies of those in charge.


> your rules will have no objective justification, and will be based on the personal philosophies of those in charge.

The "objective" safety standards are often a lot more philosophical than you think. There is no "objective" truth for road design, it is a trade-off between how important you deem the safety of pedestrians and cyclists, versus the convenience and throughput of cars.

But also, just look at fields like journalism. Journalistic ethics exist because, without them, journalists get people killed.

Not ratting out your anonymous sources isn't some technical requirement laid down in the physics of the universe, it's a philosophical belief.

And yes. Choosing to not take people's personal data is a philosophical belief. But the harm isn't philosophical.


For personal ethics, they are mere opinion as you get to choose what those ethics are. For professional ethics, those ethics are the opinions of the relevant professional associations and regulatory bodies.

They are facts in the sense that "The Bar believes that it is unethical to lie to the court" is a fact. It is simply factually true that the legal profession holds that ethics belief.

And thus the point, anyone who seeks to join such a profession has to accept their ethics "as if" they were facts. You can't choose other professional ethics in the sense that you can choose to hold a different opinion.


Remember that database rights are a thing.

One cannot hold copyright on facts, but one can "copyright" a collection of facts like a search index or a map.


But it isn't different. People have been using things like Markov chains to experiment with NPC dialogue for well over a decade.

It just never got widespread adoption because it's just not interesting, and LLMs are no different here. The dialogue is still empty, despite being deeper and more grammatically complex than previous attempts.

If every farmer in an RPG hands out the same "collect 20 bear asses" quest it doesn't matter if they all have "detailed" randomly generated backstories and can opine about the game world, real world philosophy, or the 2024 US elections.
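For reference, the decade-old Markov-chain approach alluded to above fits in a few lines; a toy sketch (corpus and names invented):

```python
import random

def build_chain(corpus: str) -> dict:
    """Map each word to the list of words that follow it in the corpus."""
    words = corpus.split()
    chain = {}
    for cur, nxt in zip(words, words[1:]):
        chain.setdefault(cur, []).append(nxt)
    return chain

def babble(chain: dict, start: str, max_words: int = 10) -> str:
    """Generate NPC 'dialogue' by randomly walking the chain."""
    out = [start]
    while len(out) < max_words and out[-1] in chain:
        out.append(random.choice(chain[out[-1]]))
    return " ".join(out)

chain = build_chain("the bears took my crops the bears took my sheep")
line = babble(chain, "the")
```

The output is locally grammatical and globally empty - which is the same complaint being made about LLM NPCs, just at a deeper level.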


I actually think it makes a world of difference to opine about the game world. It's so much more immersive.

Have you ever gone to a living history museum? (Old Sturbridge Village is one example, my favorite I've been to). All these people in character, able to talk about the period, it makes for an amazing experience.

In traditional video games, if we try to or even accidentally push any deeper, we see the cracks in the universe. "Oh, I spoke to this person again, and they said the same thing to me." AI can help fix those cracks, and fill them in wherever the player ventures.

This certainly doesn't change Fortnite, but I think it could change immersive RPGs and MMOs.


"Living History" is a well crafted written experience, not procedurally generated slop.

The issue here is that LLMs can only act in-character if the world has already been built and written, if the prompts are so pre-chewed that you may as well just write the dialogue directly and get even better results.

Take Solaire of Astora. He's an interesting NPC not because of any depth of the dialogue, but because of how well in-tune he is to the world and game itself. A true believer in the old god, a beacon of optimism in a depressed dying world, and someone who sets the tone of the co-op multiplayer to be silly and fun.

You can't get that out of an LLM.


“‘Living History’ is a well crafted written experience, not procedurally generated slop”

Having known people who lived/worked at a living history museum, their experience was much closer to improvisational comedy than a scripted interaction. Sure, they were riffing on their historical knowledge instead of cracking jokes, but it was not scripted.


> It just never got widespread adoption because it's just not interesting.

True. There have been NPC systems where the NPCs had motivations and a life of their own, even when no one was around. Those haven't helped gameplay much.

The current problem is that LLMs don't know enough about the game world. Recent progress on that.[1]

[1] https://arxiv.org/abs/2304.03442


The fun part is that the GDPR already does. The answer is you're not allowed to use personal data for AI. (And "personal data" here covers things like all public social media posts)

Facebook recently got told by the CJEU that, no, they can't use people's posts to target advertisements. Even if those ads are what's paying for the platform. That you can't claim such processing as "part of the contract" unless it is absolutely necessary in the same way the post office needs an address to send a parcel.

If Facebook can't even do that, there is no way LLMs will be allowed. (And remember. The GDPR does not care if your system doesn't distribute personal data. Any kind of processing at all falls under the GDPR's requirements)

OpenAI is already being chased by the EU's privacy agencies. Right now they're in the process of asking pointed questions, things will heat up after that.


End result: EU AI enjoyers use a VPN plus a US-based credit card borrowed from a friend.


The entire fair use claim is derived not from any legal basis, but rather, that "it has to be fair use" because it would be legally catastrophic for OpenAI et al if it weren't true.

If you look at the core argument in favour of fair use, it's that "LLMs do not copy the training data", yet this is obviously false.

For GitHub Copilot and ChatGPT, examples of them reciting large sections of training data are well known. Plenty can be found on HN. It doesn't generate a new valid Windows serial key on the fly; it's memorized them.

If one wants to be cynical, it's not hard to see OpenAI/etc patching in filters to remove copyrighted content from the output precisely because it's legally catastrophic for their "fair use" claim to have the model spit out copyrighted content. As this is both copyright infringement by itself, and evidence that no matter how the internals of these models work, they store some of the training data anyway.


It actually doesn’t even matter if LLMs reproduce copyrighted data from their training. The issue is that a human copied the data from its source into memory for use in training, and this copy was likely not fair use under cases like MAI Systems.

The Supreme Court hasn’t ruled on a software case like this, as far as I know. But given the recent 7-2 decision against Andy Warhol’s estate for his copying of photographs of Prince, this doesn’t seem like a Court that’s ready to say copying terabytes of unlicensed material for a commercial purpose is OK.

I’m going to guess this ends with Congress setting up some kind of clearinghouse for copyrighted training material: You opt in to be included, you get fees from OpenAI when they use what you added. This isn’t unprecedented: Congress set up special rules and processes for things like music recordings repeatedly over the years.

https://scholarship.law.edu/cgi/viewcontent.cgi?referer=&htt...


How does that align with Google Books scanning libraries full of copyrighted text, offering full reproductions of sections of the work, and then having the supreme court declare it all to be Fair Use? I think that is a far more relevant precedent here: https://en.m.wikipedia.org/wiki/Authors_Guild,_Inc._v._Googl....


The Supreme Court declined to hear the case on appeal, which is a shade different from endorsing the decision after a hearing.

That being said, it doesn’t take a lot of effort to differentiate these cases. Google was indexing copyrighted works and providing access to limited extracts. They weren’t transforming them into new works and then selling access to those new works over APIs.


OpenAI is also providing access to limited extracts. Google wasn't selling this over an API, they were providing "free" access to it while displaying ads to the user. Would the courts see this manner of monetization to be different enough that settled case law wouldn't apply?


OpenAI isn’t doing anything like what Google was doing with Books. It’s not hard for laymen to see that, and it’s going to be obvious to any judge who hears a case.

Imagine OpenAI had invented a software program that turned any written text into an animated cartoon enacting the text. That would obviously be creating a derivative work and outside fair use bounds. That they mix a bunch of works (copyrighted and otherwise) into a piece of software doesn’t allow them to escape that basic analysis.

Google showed a “clip” of the original work, no different in scope than Siskel & Ebert showing a clip of a film as they reviewed it. The uses are not comparable.


Google also bought copies of each book, I believe, which makes it another step removed from standard ML practice.


So how is that supposed to work with people sending it legally obtained copyrighted materials for analysis?


That copy (the “send”) would be evaluated under the same fair use criteria.

“Write a review of this short story: …” – probably fine.

“Rewrite this short story to have a happier ending: …” – probably not.


OpenAI's bias research on DALL-E revealed that most examples of regurgitation come from repeated copies of the same image in the training set. When they filtered out duplicates, DALL-E stopped drawing training examples.

The problem is that filtering the training set is naively O(n^2) and n is already extremely large for DALL-E. For LLMs, it's comically huge, plus now you have to do substring search. I've yet to hear OpenAI talk about training set deduplication in the context of LLMs.

As for the legal basis... nobody's ruled on AI training sets in the US. Even the Google Books case that I've heard cited in the past (even by myself) really only talks about searching a large corpus of text. If OpenAI's GPT models were really just a powerful search engine and not intelligent at all, they'd actually be more legally protected.

My money's still on "training is fair use", but that actually doesn't help OpenAI all that much either, because fair use is not transitive. Right now, such a ruling would mean that using AI art is Russian roulette: if your model regurgitates, the outputs are still infringing, even if the model is fair use. Novel outputs aren't entirely safe, though. A judge willing to commit the Butlerian Jihad[0] might even say that regurgitation does not matter and that all AI outputs are derivative works of the entire training set[1].

This logic would also apply in the EU. Last I checked the TDM exception only said training is legal, not that you could sell the outputs. They don't really respect jurisprudence the way the Anglosphere obsesses over "precedent", so copyright exceptions are almost always decided by legislatures and not judges over there, and the likelihood of a judge saying that all outputs are derivative works of the training set regardless of regurgitation is higher.

[0] In the sci-fi novel Dune, the Butlerian Jihad is a galaxy-wide purge of all computer technology for reasons that are surprisingly pertinent to the AI art debate.

Yes, this is also why /r/Dune banned AI art. No, I have not read Dune.

[1] If the opinion was worded poorly this would mean that even human artists taking inspiration to produce legally distinct works would be violating copyright. The idea-expression divide would be entirely overthrown in favor of a dictatorship of the creative proletariat.

[2] "Music and Film Industry Association of America" - an abbreviation coined for an April Fools joke article about the MPAA and RIAA merging together.


> A judge willing to commit the Butlerian Jihad[0] might even say that regurgitation does not matter and that all AI outputs are derivative works of the entire training set[1].

A judge can’t “commit” the Butlerian Jihad. A jihad is a mass event caused by some fraction of the population believing in some cause.

Which kinda gets to a point that seems to be missed. Copyright law is not “intrinsic” - nobody thinks that copyright is a natural law - it is just a pragmatic implementation which balances various public and private goods. If the world changes such that the law no longer does a good job of balancing the various goods, then either the law will get changed or people will ignore the law.


Copyright is a unique case in which the law represents a bargain struck in the 1970s that hasn't been updated since. Everyone ignores it because it's nearly impossible to actually enforce copyright on individual infringers. But that doesn't mean copyright is meaningless: any activity which is large enough to be legible[0] to the state will be forced to bend itself to fit within the copyright bargain.

And AI training is extremely legible. This is not like a bunch of people downloading stuff off BitTorrent. All of the large foundation models we use were trained by a large corporation with a source of venture capital funding which could be easily shut off by a sufficiently motivated government. Weights-available and liberally licensed models exist, but most improvements on them are fine-tuning. Anonymous individuals can fine-tune an LLM or art generator with a small amount of data and compute, but they cannot make meaningful improvements on the state of the art.

So our sufficiently motivated copyright judge could at least effectively freeze AI art in time until Big Tech and the MAFIAA agree on how to properly split the proceeds from screwing over individual artists.

"Butlerian Jihad" is a term from a book, so you don't need to take "jihad" literally. However, I will point out that there is a significant fraction of the population that does want to see AI permanently banned from creative endeavors. The loss of ownership over their work from having it be in the training set is a factor, but their main argument is that they specifically want to keep their current jobs as they are. They do not want to be replaced with AI, nor do they want to replace their existing drawing work with SEO keyword stuffed text-to-image prompts.

[0] https://en.wikipedia.org/wiki/Seeing_Like_a_State


Butlerian jihad is a good reference point. Something so bad happened that a large enough portion of the population was convinced to destroy thinking machines, and this no-computer norm was held in human society for a crazy long time (been too long for me to remember how long elapsed before Chapterhouse, which I think is the book where thinking machines start returning). It was a core belief of humanity that computers were bad, not a law imposed by a judge or legislature.

So say a US judge did impose severe restrictions on LLMs through US copyright law. The giant companies that are using LLMs will just move to another country. And just like tax law, others will be happy to have them. Would the US start blocking inbound internet traffic from countries that don’t have the same interpretation of copyright? That seems very unlikely.

The point is that the only way LLMs get the butlerian jihad treatment is if the people rise up against them. Right now, that is nowhere close to happening.


> The problem is that filtering the training set is naively O(n^2)

There are standard ways to do it that are O(n), FYI.
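For exact duplicates, for instance, a single pass with a hash set gets you from the naive all-pairs comparison down to expected O(n). This is only a minimal sketch (near-duplicates would need something like MinHash/LSH, which is still roughly linear), and `dedup_exact` is a name invented here for illustration:

```python
import hashlib

def dedup_exact(examples):
    """One pass over the corpus: O(n) expected time instead of
    the naive O(n^2) all-pairs comparison."""
    seen = set()
    unique = []
    for text in examples:
        # Hash a whitespace/case-normalized form so trivial variants collapse.
        key = hashlib.sha256(" ".join(text.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique
```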


Prompt "engineering" is just writing prayers to forest faeries.

Whilst BASIC/JavaScript/etc are all magic incantations to a child, a child will soon figure out there's underlying logic, and learn the ability to reason about what code does, and what certain changes will do.

With prompts, it's all faerie logic. There is nothing to learn, there are only magic incantations that change drastically if the model is updated.

Worse yet, the incantations cannot be composed. E.g. take the SQL statement "SELECT column FROM table WHERE column = [%s]". For any given string you insert here, the output is predictable. You can even know which characters would trigger an injection attack.

With prompts you cannot predict results. Any word, phrase, or sequence of characters may upset the faeries and cause the model to misbehave in who knows what way. No processing of user-input will stop injection attacks.
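The SQL side of that contrast is easy to make concrete. With a parameterized query the placeholder binds user input as pure data, so no character sequence can change the query's structure — a guarantee prompts have no equivalent of. A minimal sketch using Python's stdlib sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col TEXT)")
conn.execute("INSERT INTO t VALUES ('safe')")

def lookup(user_input):
    # The "?" placeholder binds user_input as data, never as SQL syntax,
    # so the query's structure is fixed regardless of the input string.
    return conn.execute(
        "SELECT col FROM t WHERE col = ?", (user_input,)
    ).fetchall()

print(lookup("safe"))         # matches the stored row
print(lookup("' OR '1'='1"))  # the injection attempt is inert: no rows
```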

Whilst it's dubious to call current software development practices "engineering", it's utterly ridiculous to do so for prompt-writing.


I don't get where this sentiment comes from. I build software specifically on the concept of predictable results from LLMs being composable.

Sure, the results are not deterministic (the exact prompt won't return the exact same result every time), but you can tune your prompts so that 100% of the time they give you a valid result in the result category you were seeking, with a specific probability distribution over the available choices.

Prompts are functions that take concrete input and produce a probabilistic output that can be automated upon, especially if you only need to output one token, i.e. a number, boolean, word, or object reference. And for obvious reasons, the further out you forecast in a sequence, the less accurate you will be.

As long as you don't change the underlying model, in a massive model with billions of parameters, there are definitely mechanisms and behaviors to discover that you can reason about.
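The "valid result in a category" idea can be sketched as a thin validation wrapper: sample until the completion falls in an allowed set, or give up. Everything here is hypothetical — `call_model` is a random stand-in for a real LLM API call, not any vendor's actual interface:

```python
import random

def call_model(prompt, temperature=0.7):
    # Stand-in for a real LLM call; a real implementation would hit an API.
    return random.choice(["yes", "no", "maybe"])

def classify(prompt, valid=frozenset({"yes", "no"}), retries=5):
    """Treat the prompt as a probabilistic function: resample until the
    output lands in the allowed category set, rather than trusting any
    single completion."""
    for _ in range(retries):
        answer = call_model(prompt).strip().lower()
        if answer in valid:
            return answer
    raise ValueError("no valid answer within retry budget")
```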


> but you can tune your prompts so that 100% of the time they give you a valid result in the result

You can't, though; that's the issue. Illustrative here are tokens like "SolidGoldMagikarp", but this does happen to "normal" sequences of tokens as well.

There is no filter you can build to keep out such mistakes, any set of otherwise normal tokens could trigger the model to produce wrong output.

Because of how large these models and most prompts are, even slight changes in things like attention can cascade into extremely different results.

> there are definitely mechanisms and behaviors to discover that you can reason about.

It's faerie logic. The behaviours are mere trends and observations, not underlying truth.

The faeries reward you for offering them fruit. But offer them an apple that fell from the tree exactly 74 hours ago, down to the second, and they'll kill you. There is no way to know ahead of time which things will upset them.

The risk here is that you're fooled into believing these systems are understandable, that you know how they work, and that you'll mistakenly use them for something where the wrong results have consequences. You'll stop double-checking the output, all humans are lazy like that, and then you'll have disaster on your hands.


You can reasonably expect an LLM to respond appropriately often. Which percentage of the time depends on the details, but it’s not much more magic than expecting the bridge you built to hold up.


You could do a sort of validation of the output by prompting the LLM repeatedly with the same prompt and then comparing the responses to eliminate outliers. I do feel like this stuff is magic though; just wanted to provide a counterpoint.
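That repeated-sampling idea is essentially majority voting, which is a few lines with `collections.Counter`. A minimal sketch, where `llm` is any callable standing in for a model query:

```python
from collections import Counter

def majority_vote(llm, prompt, n=7):
    """Query the same prompt n times and keep the most common answer,
    discarding outlier completions. Refuse to answer if no response
    wins an outright majority."""
    answers = [llm(prompt) for _ in range(n)]
    winner, count = Counter(answers).most_common(1)[0]
    if count <= n // 2:
        raise ValueError("no majority: answers too unstable to trust")
    return winner
```

For example, `majority_vote(some_llm, "Is 17 prime? Answer yes or no.")` returns the modal answer across 7 samples, which filters out occasional stray completions at the cost of n times the inference budget.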


In "The Information," James Gleick discusses a concept related to our current discourse. In the days when computers were merely an array of switching circuits, luminaries such as Claude Shannon believed that "thinking" could be captured in a structured format of logical representation.

However, even with formally composable languages like JavaScript, a semblance of unpredictability — akin to the "faerie logic" metaphor — still persists. Languages evolve over time; Python, for instance, with its various imports that constantly disrupt my code, serves as a good example. This is perhaps the reason behind the emergence of containers to ensure code consistency.

While some elements may be more "composable" than others, it appears increasingly unrealistic in today's world to encapsulate thought processes or interactions with systems within a rigid logical framework. Large Language Models (LLMs) will keep evolving and improving, making continual interaction with them unavoidable. The notion that we can pass a set of code or words through them once and expect a flawless result is simply illogical.

I firmly believe that any effective system should incorporate a robust user interaction component, regardless of the specific task or problem at hand.


It's not so much about formal logic, but general predictability.

> even with formally composable languages like JavaScript, a semblance of unpredictability — akin to the "faerie logic" metaphor — still persists

And they're ridiculed for it, and as you state, we design around them or replace such systems entirely.

> making continual interaction with them unavoidable

Technology is never unavoidable or "inevitable". We can choose not to use it, or when to use it.

> The notion that we can pass a set of code or words through them once and expect a flawless result is simply illogical.

Yet that is what we expect when we put these systems into production use, especially when many proposed use cases are user-facing and subject to injection attacks.

Whether it be the writing of ad copy, the processing of loan applications, or generating code, mistakes in these tasks have very real consequences.


I don't disagree we can choose to use it or not, but my point was more meant to indicate that, if we want a good experience with LLMs, we have to continue to interact with them to achieve good results.

Reminds me of raising kids...


You're too right.

We need to move away from prompt-engineering - it's AI-Management. You pretend you're speaking to another (albeit confusing/confused) person when extracting work from a model. You're coaxing things out of it based on hearsay and mysticism that work most of the time. Sounds a lot like AGILE and free pizza to get a junior to stay late and deliver on time.

That's not engineering, that's management.


It’s so refreshing to see someone actually write this about prompt writing. It makes a welcome change from Twitter AI influencers posting their ridiculous prose as some marvel of harnessing LLMs.


You cannot predict results in _any_ domain with 100% accuracy, especially not in most engineering domains.

Why do you think rockets explode, bridges collapse, etc.


This was magical; it really made my day. Thanks for this.

