Hacker News | new | past | comments | ask | show | jobs | submit | rickmode's comments | login

Isn’t it the cheapest AWS region? Or at least among the cheapest. If I’m correct, this incentivizes users to start there.


All non-US regions are more expensive than the US ones, but many US regions share the same (cheapest) price (e.g., Ohio, Oregon).

us-east-1 is more a legacy of being the first region: by virtue of being the default region for so long, most customers built on it.


See a psychologist or counselor. If you have an underlying diagnosis, you could be suffering more than necessary.


Pronoun confusion. The second pronoun is ambiguous.

Since we are all “hackers” here, I’ll be pedantic…

“While it is a great language…”

The “it” pronoun clearly refers to the C++ language, as I’m sure you intended.

“…ir would profit from less ‘lets code C with C++ compiler ’ attitude.”

The “ir” — presumably a typo for “it” — can refer to the article or C++. Given that this thread is about an article, the second “it” referring to the article is a natural assumption.


The "it" with the typo refers to C++.


In context, this means “self-inflicted harm”.


I’d say it’s like the mental health distinction of a trait versus a disorder: if this is negatively impacting your day to day life, consider therapy / counseling.

(As an aside, I think of counseling as “optimizing my life”. Perhaps that framing may help those that find the idea off-putting.)


To be pedantic, this would be violet. (Purple is a mix of red and blue light.)


More pedantic? I had an art teacher tell me purple isn't even a color — it's the name of a dye. (As you say, she said violet is the correct name.)


Violet is spectral: it's a wavelength at the short end of the visible spectrum, shorter than the wavelength our blue cone cells are tuned to detect. So really it should appear dark blue, but in fact when we look at it the red cones, which are activated by long wavelengths, send some amount of signal too. This is due to a bug.

Purple, meanwhile, is a genuine mixture of short and long wavelengths, causing the same response.


> This is due to a bug.

I am fascinated by the quirk that our red cones also slightly perceive light beyond blue. Thanks to it, the HSV color model can wrap hue around from red through magenta and back to blue.
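For the curious, that wrap-around is easy to see with Python's standard colorsys module (just a quick sketch; hue is expressed as a fraction of a full turn around the color wheel):

```python
import colorsys

# Hue is the first component of HSV.
red = colorsys.rgb_to_hsv(1.0, 0.0, 0.0)[0]      # 0.0: start of the wheel
blue = colorsys.rgb_to_hsv(0.0, 0.0, 1.0)[0]     # 2/3 of the way around
magenta = colorsys.rgb_to_hsv(1.0, 0.0, 1.0)[0]  # 5/6: between blue and red

# Magenta (a red+blue mixture, not a spectral color) sits past blue,
# letting hue wrap back around to red, where 1.0 meets 0.0.
print(red, blue, magenta)  # → 0.0, ~0.667, ~0.833
```

Without that non-spectral magenta region there would be a gap between blue and red, and hue couldn't form a closed wheel.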


Naive question: why not remove all sensitive data, or all data, from the notification and leave the context for a secondary API call?


Yup, that also works well: just send a message ID and fetch the actual content in the notification extension, which can pre-process incoming notifications.
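A rough sketch of that pattern (names like `build_push_payload` and the fetch hook are made up for illustration, not any real API): the push payload carries only an opaque message ID, and the client resolves the real content with a separate authenticated call.

```python
import json

def build_push_payload(message_id: str) -> dict:
    """Server side: an APNs-style payload with no sensitive content.

    Only an opaque message ID and a generic placeholder go over the push
    channel, so the push provider never sees the message itself.
    """
    return {
        "aps": {
            "alert": {"title": "New message"},  # generic placeholder text
            "mutable-content": 1,               # lets an extension rewrite it
        },
        "message_id": message_id,
    }

def handle_notification(payload: dict, fetch) -> str:
    """Client side: resolve the real content with a secondary call.

    `fetch` stands in for an authenticated request to the app's backend,
    e.g. GET /messages/{id}.
    """
    return fetch(payload["message_id"])

# Hypothetical usage: the backend stores messages; the device fetches by ID.
messages = {"m-42": "the actual sensitive text"}
payload = build_push_payload("m-42")
assert "sensitive" not in json.dumps(payload["aps"])  # nothing leaks via push
print(handle_notification(payload, messages.get))
```

The trade-off is an extra round trip per notification, and the extension has to finish within the platform's time limit.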


I may be misreading the article, but it seems to allude to metadata- and timing-related analytic techniques, rather than the contents of the notifications:

“...for metadata related to push notifications to, for example, help tie anonymous users of messaging apps to specific Apple or Google accounts.”

So maybe it's more that they (or somebody) send some messages to the account they want to ID, then request the specific device identifier that received notifications for that app at all of those times?

Would obfuscating the content make much difference with respect to that category of technique?


Maybe a bit more hand for the woman. She’s got a bit of a stump going on there.

But otherwise I like the image!


I believe we first need to answer the question of whether the copyright of the AI model’s source text or images affects the output.

My opinion — and note I’m a software engineer, not a lawyer — is that an AI, being a statistical model and not generally intelligent, should not be allowed to disregard the copyright of its source material. This would, I think, require the AI’s creator to secure a license for all of its sources that allows this sort of transformation and presentation. And further, a user of the AI would themselves require a license to use the output.

The alternative seems to be “anything goes”.


I don’t think it makes sense for both model builders and the model’s users to separately obtain licenses for the same works used in the training set.

A model trained on several copyrighted data sources cannot somehow be used in a way depending on a subset of those sources.

So all parameters of usage and compensation should be settled by contract between the model builder and copyrighted data supplier, before the copyrighted material is used.

Or to put it simply: using copyrighted material to create a model would NOT be considered fair use.

That’s it. That’s the standard. No complicated new laws required.

Model builders obtain permission to use copyrighted material from copyright holders based on any terms both agree to.

Terms might involve model usage limits, term limits, one time compensation, per use compensation, data source credits, or anything else either party wants.

The likely result will be some standard sets of terms becoming popular and well known. But nobody has to agree to anything they don’t want to.


I slightly disagree, in that I think the person using the tool should bear the burden of copyright. I.e., if the model outputs something under copyright, it merely can't be republished. In the same way, I can use Photoshop on proprietary data but I can't necessarily sell the results.


I'm so torn. On one hand, what you suggest seems to be a nearly ideal balance between advancing scientific progress and legal liability. By placing the legal burden to publish generated works on the person actually trying to publish, it allows for a more nuanced legal approach (i.e. the difference between "there are similarities to this work, but it's murky" and "you 100% stole that work").

On the other hand, is the company running the model not already publishing all of that work and profiting from it? It seems unfair that their bottom line gets bolstered because they can produce work based on any artist, whereas the consumers of that work may need to walk on eggshells in order to publish it.

Like I said, I'm torn as far as how it "should be". I know how I want it to be though. I would love if AI continued training unabated. The results have been amazing, and I believe it would be a shame if the effort was slowed down by legislation.


> is the company running the model themselves not already publishing all of that work and profiting from it?

No, because the model is transformative enough that it cannot be said to be a derivative work of the training set.

The model is in essence a form of distilled information, extracted from the training set. Information cannot be copyrighted - only expressions can.

Therefore, a model producer should have the right to use any pre-existing work, in the same way a person can, to study and internally memorize and extract information.

The reproduction of any of the training set data constitutes a copyright violation, but this is not done by the owner of the model, but by an end user of the model.


My point is that if a court finds that a generated image is indeed similar enough to constitute an infringement when a subscriber of, for instance, MidJourney attempts to publish it, has that work not already been "published" to the subscriber? And has MidJourney not profited, by gaining a subscriber, based on the work of others?


I wonder if that analogy represents the same thing. Speaking purely from a non-legal perspective on the ethics in my mind:

When you use Photoshop on proprietary data you're providing the original data, choosing what manipulation to make (i.e. what tool), and directly creating the output. It makes sense that redistributing this may be a copyright violation.

When you use Copilot or ChatGPT for programming you're typically asking a non-proprietary question or accepting suggestions it's making based on non-proprietary (or proprietary to you) code in the file. You also don't dictate the manipulation process a black box deep learning model does (i.e. I haven't asked it to do something that could be reasonably thought to be a copyright violation).

Am I then responsible for the fact that Copilot is fooling me with effectively copy-pasted copyrighted code, when it's presented to me as generated by the software and I haven't instructed it to commit a copyright violation? I'm not sure whether intent matters for copyright; I assume it doesn't, but perhaps that's a missing piece to this.

Diffusion models are gray to me: if you're prompting with "Mickey Mouse riding a horse", I can see the argument that the prompt itself can be interpreted as asking the model to commit a copyright violation, with the user just hiding behind a layer of abstraction. If I ask the model to spit out "a picture of a smiling cartoon woman" and it generates a Betty Boop lookalike, is that still the user's fault?

It seems to me like passing the burden to the user could be reasonable but would need some safe harbor type of exception. It'll be really interesting to see what the courts decide.


I see two problems with that.

(1) how do you know if the image that was just generated is substantially similar to an existing copyrighted work? Maybe if some registration tool existed, but otherwise the burden is too great

(2) what is stopping someone from generating millions of images and copyrighting all the "unique" ones? Such that no one can create anything without accidental collisions.


> how do you know if the image that was just generated is substantially similar to an existing copyrighted work?

This is already a problem with biological neural nets (i.e. humans). I remember as a teenager writing a simple song on the piano, and playing it for my mom; she said, "You didn't write that -- that's Gilligan's Island!" And indeed it was. If I had made a record and sold it, whoever owned the rights to the Gilligan's Island theme song could have sued me for it, and they would (rightly) have won.

There's already loads of case law about this; the same thing would apply to AI.

> what is stopping someone from generating millions of images and copyrighting all the "unique" ones? Such that no one can create anything without accidental collisions.

Right now what's stopping it is that only humans can make copyrightable material; whatever is spat out from a computer is effectively public domain, not copyrighted.


1. Lots of established law and case law (at least in the US); this is already a well-settled problem, and folks have the tools and proper venue to bring infringement claims. Yes, federal copyright infringement litigation is prohibitively expensive for many issues. There is now a "small claims court" for smaller issues. [1]

2. Those works cannot be copyrighted (at least in the US) [2]. And hey, someone already tried copyrighting every song melody [3]

[1]: https://copyright.gov/about/small-claims/

[2]: https://www.federalregister.gov/documents/2023/03/16/2023-05...

[3]: https://www.youtube.com/watch?v=sJtm0MoOgiU


But that problem is already solved.

Copyright holders are already protected from (i.e., can legally prohibit) distribution of obvious copies or clearly derivative works.

Regardless of how they were produced by hand, copy machine, Photoshop or with a model.

The new problem is that artists' styles are being “stolen” by incorporating their copyrighted work into models without their permission.

And that problem can easily be solved if using copyrighted material to create models is declared NOT fair use.

Artists could still allow models to be built from their work, but on their terms. If they wish to do that.

A famous artist who doesn't mind being commercial could sell their own unique model to let fans create art in that artist's style, while not having their style “ripped” by others.

Or just keep their style to themselves, for their own work, as artists have done for centuries.

(Of course, with greater effort, their style could still be recreated - styles are not protected unless they are trademarked - but the recreation would have to be done without using the artist’s copyrighted works.)


This is probably a somewhat unpopular opinion on HN, but it is where many of the artists I work with are generally trying to get to. Consent, compensation, and credit.


> Consent, compensation, and credit.

I just want to quote you. Nothing I need to say. That’s it.


This is the best path forward I think. And it will become increasingly sensible as things continue to evolve. AI wasn't necessary to violate copyright before, and it isn't necessary today.

The determination of copyright violation should be made against the output of the model in the event that someone uses it for commercial purposes.

If the models have a risk of generating copyrighted content, it will be up to the consumers of the system to mitigate that risk through manual review or automated checks of the output.


A divergence, but I see a lot of posters asserting that "humans learn by copying other people, but we don't call that a violation of copyright when they draw"

People casually asserting that software is equivalent to humanity will be a non-negligible thing to consider, as irritating and poorly-founded as it seems.

If the reproduction isn't pixel-perfect, but merely obvious and overwhelming, how do you refute that philosophically to people who refuse a distinction between 50GB and a human life?


> People casually asserting that software is equivalent to humanity will be a non-negligible thing to consider, as irritating and poorly-founded as it seems.

> If the reproduction isn't pixel-perfect, but merely obvious and overwhelming, how do you refute that philosophically to people who refuse a distinction between 50GB and a human life?

Software equivalence to humanity is a very philosophical question that many sci-fi writers have approached. But our primary issue related to this technology does not depend on anyone making a determination there.

The challenge is that losses to livelihood from this technology are going to come from far broader impacts than copyright alone. Copyright disputes are just the first things to get everyone's attention.

Let's say we err on the side of protection of copyright, and all training data must be fully licensed, in addition to users being responsible for ensuring outputs did not accidentally reproduce something similar to a copyrighted work, even if it was part of the licensed training dataset. Great! This fixes the problem of lost value for the owners of copyrights. Companies will face a slight delay and slightly increased costs as they license content; however, in the end, model capabilities will be the same and continue to increase.

The number of jobs that actually cannot be performed without humans will continue to dwindle — livelihoods will be lost at essentially the same scale despite upholding copyrights.

The only way we can handle a technology capable of reducing most need for human labor is by focusing on planning and executing a smooth transition toward an economy with more people than jobs — aiming for minimal human suffering during this process.

A mass loss of human jobs does not need to mean a mass loss of livelihood if our society is prepared to transition to a universal basic income. After all, human life is far more than just a job. We have the opportunity for much more fulfilling lives if we plan this transition well. We must understand that this is a far larger issue than copyright - copyright disputes are just one of the first symptoms of this disruptive process.


A human is still entering the prompt to generate the possibly copyrighted image/text. I don't think copyright law should care about the implementation. It's OK to copy a style if you use paint brushes or Photoshop, but not OK if you use a statistical model?


Apply for a copyright on your human authored prompt then. That's the extent of human authorship.


> Or to put it simply: using copyrighted material to create a model would NOT be considered fair use.

The more I think about it, the more something along these lines seems like it might be the right way to think about it.

When you play a DVD, for example, you copy the bits off the DVD, into the memory of your DVD player, and onto your screen; this is all explicitly considered "fair use" copying. But if you then copied those fair-use bits off the screen onto a thousand other screens, that violates copyright.

When you, as the human watch the DVD, bits of it get copied into your brain; but you don't then copy the bits of your brain to millions of other people -- they each have to make their own copy.

We could make the law for LLMs follow a similar logic: That having an LLM watch a video or read a text is similar to having a DVD player read a DVD or a web browser copy information from a website. It's good for that limited use case, but the resulting copy cannot be copied again without a license.

This would allow (say) researchers, or even individuals, to do their own training and so on without a license; but when anyone wanted to create something that they wanted to scale up, they'd have to get licenses for everything.

That would fundamentally keep things balanced as they are now between creators. The big problem isn't that a handful of other creators may be copying their style; that growth in competition is limited by the expense of duplication. It's that millions of electronic engines can copy their style.


> When you, as the human watch the DVD, bits of it get copied into your brain; but you don't then copy the bits of your brain to millions of other people -- they each have to make their own copy.

If you ripped The Little Mermaid, redrew every frame to combine it with The Fresh Prince of Bel-Air, and moved things around in scenes to make it look like Ariel is Will Smith responding to sitcom dialogue, then it'd be fair use, regardless of how many people you show this new version to.

Fair use isn't about how or why you're doing something. The factors for fair use are clearly laid out at https://www.law.cornell.edu/uscode/text/17/107


> I don’t think it makes sense for both model builders and the model’s users to separately obtain licenses for the same works used in the training set.

I'm torn on who should pay, and where and when. In the world of patents, there's often an option/split. Say a chip manufacturer wants to build H265 decoding into their hardware. The chip manufacturer could buy the license. Or the purchaser (who probably is building some sort of board or device around the chip) could pay for the license. Or they could disable that functionality in the end product, and the consumer could pay for a license (or not, if they don't care about that feature).

The most common is usually the middle option: the end-device manufacturer (or brand that eventually sells the product) will pay for the license.

But I'm not sure if this works all that well for an AI model. With hardware, the license is usually paid per unit. It's easy to see that one chip = one license. If the model builder buys a license, that model could be used one time or 100 million times. Tracking use like that probably isn't all that practical, but I think it's safe to say that a 100-million-use model should probably pay more for a license than a single-use model.

So maybe the model builder should be responsible for attaching a comprehensive "copyright history" to the model, and users should have to pay for a license based on their use? Again, not sure how to track that. But I guess general software licensing has similar problems when you can "hide" usage.


Yes, someone using a model can’t know if the generated text/image/sound is a nearly identical copy of the original material they don’t recognize. If use of the output of these systems comes at significant legal risk, then such systems become nearly useless.


> if the generated text/image/sound is a nearly identical copy of the original material they don’t recognize

how does the industry today deal with artists that "copy" off some other works? This isn't a problem with AI at all - just that AI provides a tool to generate such works faster.


Someone comes to me to ask for a drawing of Batman or to write an erotic story around Supergirl. I can do it, but I cannot claim ownership over the characters. And I think I will quickly get a letter from DC or Marvel if I try to do this at scale.


> I can do it, but I cannot claim ownership over the characters.

of course not. But you can claim ownership if you don't call those characters their original names, and make sufficient changes to the design (how sufficient is determined by a court of law - thus expenses).

> DC or Marvel if I try to do this at scale.

The show 'Invincible'[1] has a character that is a basic copy of Superman. And yet, you will find that they don't get a letter from DC.

[1] https://en.wikipedia.org/wiki/Invincible_(TV_series)


> make sufficient changes to the design

I think that’s one of the issues. The transformations done by these tools are mechanical, even if they may be extensive. The human input is too small. Omniman may have similarities with Superman, but he is not him in the larger context of the story. LLMs cannot yet be that consistent for marketable output that deserves to be copyrightable.

I’m perfectly fine with LLMs aiding spell checking and alternative phrasing (image generation is a grayer area). But the idea of prompts and prompt output being copyrightable is something I oppose.


> The human input is too small.

That's a huge assumption, especially for image generation models.


Why shouldn't a prompt output be copyrightable?


Because prompts lack sufficient creative control.

Typing a search string into Google doesn’t provide copyright over its output.


> lack sufficient creative control.

the prompts have become somewhat creative these days. If you have a look at the prompts on https://civitai.com for example, you can argue they are a form of creative expression. Just like hand rolling assembly code might be.

Edit: an example one - https://civitai.com/images/2268828?collectionId=107&period=A...

and the associated prompt:

  High detail, dynamic action pose, masterwork, professional, fantasy, neo classical fine art, of a beautiful, primordial and fierce, ((angel-winged-woman,:1.9)), archangel, (MiddleEastern:1.6), with very long, flowing, wavy white hair, peach colored streaks, with a sexy, slender, fit body, wearing an ethereal, light violet, light aqua, faded gold, tie-dye, linen and Chantily lace, (knee length:1.5), strapless dress with a tattered hem, a Platinum and gold Cuirass, platinum vambraces, platinum and lace Gladiator Boots,  long broadsword in a Baldric, at night, in a metropolis warzone, during a thunderstorm, dimly lit, thin, vibrant streaks of crimson light, outlining her body, fantasy illustration,  in the style of Osamu Tezuka, George Edward Hurrell, Albert Witzel, Hiromitsu Takeda, Clarence Bull, Gil Elvgren, Ruth Harriet Louise, Takaki, Milton Greene, Huang Guangjian, and Cecil Beaton,, High detail, dynamic action pose, masterwork, professional, fantasy, neo classical fine art, of a beautiful, primordial and fierce, ((angel-winged-woman,:1.9)), archangel, (Columbian:1.6), with very long, flowing, wavy white hair, peach colored streaks, with a sexy, slender, fit body, wearing an ethereal, light violet, light aqua, faded gold, tie-dye, linen and Chantily lace, (knee length:1.5), strapless dress with a tattered hem, a Platinum and gold Cuirass, platinum vambraces, platinum and lace Gladiator Boots,  long broadsword in a Baldric, at night, in a metropolis warzone, during a thunderstorm, dimly lit, thin, vibrant streaks of crimson light, outlining her body, fantasy illustration,  in the style of Osamu Tezuka, George Edward Hurrell, Albert Witzel, Hiromitsu Takeda, Clarence Bull, Gil Elvgren, Ruth Harriet Louise, Takaki, Milton Greene, Huang Guangjian, and Cecil Beaton,


That’s a perfect example: they said “during a thunderstorm”; does that image look like it’s in a thunderstorm? Sure, the output relates to the prompt, but they influenced the output rather than controlled it.

Further, it’s well known that simply telling an artist what you want even including quite detailed descriptions isn’t enough to get copyright over the resulting image.


The difference is the artist’s assertion that it’s either original or a copy of something else. DALLE 2 can’t tell you if it’s original or not. These AIs have no idea, and the company or group that created them doesn’t review individual output, so they can’t say either.


> DALLE 2 can’t tell you if it’s original or not

Whoever pressed the button to run DALLE will make the assertion, just like whoever was running Photoshop to make the image today would make the same assertion.


Based on what?

A photoshop user controls what data photoshop uses, a DALLE user doesn’t. Even a prompt as generic as “Cat” could be producing an obviously derivative work if you compare it to the original. This is true for all prompts.


> A photoshop user controls what data photoshop uses

the point was that the user of the program is making their declaration, whether it's photoshop or DALLE. How does the business verify that their staff artists aren't producing copyright infringing material, just from memory?

The liability falls to them to verify the copyright status of the output they're asked to make. A business paying a photoshop user to produce a picture has just as much (or as little) trust in them as the button presser for DALLE.


This gets complicated: having no reason to know that something is copyrighted is a defense.

So if your employee installed pirated 3rd-party software, you’re facing strict liability. However, if a third party is reproducing their college roommate’s drawing from memory, then it’s effectively impossible for you to verify whether something is a derivative work.

DALLE is effectively Getty Images: if you’re buying works from them, you can only assume they’re free of copyright issues.


The generated content is a derivative work of each piece of the material the model was trained on. That material can be listed.


So your suggestion is to list hundreds of millions of works and have users manually review them? I don’t think that’s going to work.


Problem is, how can you determine whether the model contains copyrighted material? The law governs copyright through ownership, so in order to claim copyright infringement you have to be able to pinpoint a specific person and prove that their work is somehow embedded in the gradients, which is not practically possible at this point. It's just like how you can't practically enforce copyright on encrypted data unless you ban encryption altogether.


1. If you know your copyrighted material was in the training dataset, is that not sufficient?

2. From a legal perspective, do you actually have to prove it's embedded in the gradients? If I draw an exact copy of Mickey Mouse from memory and sell it, I didn't think Disney had to prove I've ever actually seen Mickey Mouse before, or point to where the image of him is embedded in my brain.


Disney has a trademark on Mickey Mouse, but that does not mean that they automatically get copyright on all pictures of Mickey Mouse drawn by others (they don't)


Bad example on my part, in that case. I thought some art is copyrighted, or am I mistaken? If so, replace Mickey Mouse with something copyrighted.


My opinion as a SWE who is dating a lawyer (joke, not a serious qualification but it does provide some insight):

Generative models traverse and interpolate high dimensional state spaces. These state spaces are created from input data.

I would argue people do the exact same thing - the first main difference is we can use novel inputs (e.g. we can use images or words to develop our music/temporal state spaces and vice versa). People also are recursive and self referential in a way that doesn't collapse.

Until we solve the interpretability problem (e.g. can you decode the feature space of a neural network into something we can comprehend?) there is no good solution. Either traditional copyright wins and we get even more draconian policies (think Disney and their desire to never put anything in the public domain), or we have a free-for-all (which I don't think is bad for creative works, but certainly is for more practical things like stock photos or nonfiction).


I can appreciate how this line of thinking might be attractive.

But IMO the human<>machine comparison doesn't lend itself much credence. We shouldn't assume that just because a human is allowed to do something, a machine is automatically allowed to do the same thing, too. I think some care should be taken when considering if we allow machines to have the same privileges as humans.


> We shouldn't assume that just because a human is allowed to do something, a machine is automatically allowed to do the same thing, too

There are no sentient machines (at least not yet). Your position is one where you are actually limiting what other humans can do, limiting which tools other humans have access to. Also, the parameter, according to the law, was always "the same". For instance, there is nothing preventing you from making your own chess league where computers are allowed to compete. FIDE is free to ban you from competing in their leagues, or to ban anyone associated with your league, or whatever, but there is nothing in the law preventing you.

I have been saying this from day one: this whole debate is mainly white-collar workers negatively impacted by automation making up any excuse they can for why their job should be protected, somehow, for some reason, but not that of coal miners or what have you.

A human downloads a photo to learn how to draw. Another human downloads a photo to teach their computer how to draw. No difference, no need to obtain any license in any of the cases.


> We shouldn't assume that just because a human is allowed to do something, a machine is automatically allowed to do the same thing, too.

Generally speaking, even if one machine can do something, it doesn't automatically mean another machine is allowed to do it.

For example, you can drive a car with a normal driving license, but not a truck. In some states you can own a pistol but not an automatic rifle.


It also depends on where this is happening. For instance, you don't need a license to drive a car on your own private property. You need a license to drive it on public streets because society needs some assurance that you know what you are doing. So in many cases laws and restrictions also apply relative to a given scenario.


Copyright exists, among other things, to "promote the progress of science and useful arts".


That section is written in parallel verse, with copyright <> science, and patent <> useful arts. This sounds weird, now, but it's consistent with the use of the words at the time, which is the reverse of how they are used today, where paintings etc are considered art, and inventions are considered science. So, it's not that copyright exists to promote science and art (as we call them today) but only just the arts. Patents are for science. Authorship reflects copyright and invention reflects patent:

> Congress shall have the power... To promote the progress of science and useful arts, by securing for limited times to authors and inventors the exclusive right to their respective writings and discoveries.


A machine is just a tool. It is the creator and the user of the machine who hold the privileges the tool is used with. I think we should be careful not to anthropomorphize, attributing agency, responsibility, and autonomy to something that is essentially a better Photoshop plugin.


I don’t think the parent is anthropomorphizing anything. The ones who anthropomorphize are those saying machines should be covered by fair use because they have similarities with humans.

This is not about the rights of a machine but about how one human product is consumed by another human product. This is just a commercial supply chain: if you make a model, you need human data. You generally need to compensate your suppliers of “raw material”.


It's not the tool that is covered by fair use. It is the creation of the tool that is covered by fair use.

Is the tool itself supposed to be a copyright violation or is it a tool facilitating copyright violation by producing violating output?

The latter is something that can be tested, because we have processes for comparing works of art. If it is shown that LLMs produce mostly infringing art, then we can and should ban or heavily regulate them. If not, then not.


> It is the creation of the tool that is covered by fair use.

Copyright doesn’t restrict creation of something, it restricts (mainly) commercial distribution. Research, education and journalism etc are largely unaffected, and would still be.

That said, I believe that selling access to the tool to the public already violates the copyright of the rights holders, even if it doesn’t produce similar works of art. The copyrighted works increased the value of the product (otherwise why would they use it?).

> The later is something that can be tested because we have processes to compare works of art for it.

This is the most expensive, least practical and most arbitrary part of existing copyright. It would be a huge mistake, imo, to expand this dramatically. This problem mostly goes away if the supply chain is sanely regulated.

All you’d need is give access to the training set upon audit, and bureaucrats could check for copyrighted works. There are already automated tools for this.


"That said, I believe that selling access to the tool to the public already violates the copyright of the rights holders, even if it doesn’t produce similar works of art. The copyrighted works increased the value of the product (otherwise why would they use it?)."

So it is similar to how ISPs argue that they should get a cut of streaming services because they enable another product.

I think it is also relevant that more than half of the globe will just completely ignore any regulation and any artist in a country with regulation will just have to compete with ever more empowered artists using all ai has to offer.


“It’s just a machine!”

So are you!


Don't be obtusely misanthropic


The value of copyright is going to vanish. There is enough public domain material to train models on and to avoid the problem altogether.

There used to be professions like tinkerers, bards, and clowns. The tinkerers disappeared when society became modern. The clowns, on the other hand, managed to lobby for laws that put people in jail for heinous crimes like copying pictures, and survived longer. They are going to bite the dust now.


What you describe would result in the opposite - copyright will be incredibly valuable in a system where the vast majority of "creative works" are just regurgitations of past works in the public domain, churned out by machines. In such a world, none of that has a copyright anyway. Actual creative works, which do garner copyright, will then be that much more valuable, because they will continue to be a property right with a breadth of coverage to make them useful.


Whether or not “humans do it” isn’t relevant. You can walk around with a copyrighted song in your head. That is not copyright infringement. But if you take that song, create a digital copy, and distribute it for money, then you are violating someone’s copyright. Additionally, our legal system requires a balance of probabilities. It’s hard to prove that someone was influenced by another work unless the similarities are plainly obvious. The same does not apply to ML models where the training data and algorithm are knowable facts.


I challenge you to listen to 4 Chords of Awesome and tell me again how every song is completely original. How does Eragon exist when it definitely ripped parts from Star Wars, etc.? AI usually doesn't spit out full plagiarism, but a loosely inspired work, which is what most media we consume is.

Edit: 4 chords of awesome link is https://youtube.com/watch?v=oOlDewpCfZQ&si=8vL6PbDnHiaffJh3


A copyright in just Eragon would be incredibly thin, for the exact reasons you state. This criticism of copyright by people who have no understanding of actual copyright law, how it works, how it's used, etc., is so exhausting and ignorant.


“Every song is completely original” is the opposite of what I said.


The analogy doesn't hold when you consider the sheer scale of the problem.

I can outright buy a machine for a few thousand dollars that can crank out a faithful rewrite of every Stephen King novel without the shitty endings and nonsense plot points. It can do it in a few days, maybe a couple of weeks at most.

To do that with human labor would take years and cost hundreds of thousands, if not millions of dollars.

Instead of paying an artist a couple hundred for a commissioned drawing, I can just scrape up their entire portfolio and generate any image I want with their style. I can generate hundreds or thousands of images. I can take their distinct style and use it exclusively as the branding for my company.

What an ML model does is fundamentally not what happens when a human draws inspiration from prior art. A human would require an extremely significant amount of time and resources to perfectly imitate every artist they have ever seen. It takes a human significant time and resources to produce faithful variations on prior art.

An ML model is measured in words or images per second.


Hello.

Maintaining a system like Netflix or AWS or even Amazon would require an insane amount of people and time, if it were possible at all, without all the computers doing work for us in seconds that would take humans ages to do.


> ... a SWE who is dating a lawyer

> I would argue people do the exact same thing

Perhaps a ménage à trois with a neuroscientist would change your view on this.


> Until we solve the interpretability problem (e.g. can you decode the feature space of a neural network into something we can comprehend) there is no good solution.

This is the rub. Without reverse attribution... open source anonymous models become a free-for-all loophole.

Since that doesn't currently exist, I think the best we can do is to say that any commercial entity using a model bears the responsibility of proving the model they use is untainted by copyrighted material (to which they haven't secured rights).

Open source model X is... whatever it is.

But I'll be damned if OpenAI / Meta / Microsoft / IBM should be able to build a commercial product on top of laundered copyrighted material while ignoring provenance.

I mean, we have models for this: software code and art. Both aren't clearly attributable. In the case of software code, we've developed case law around clean room design and similarity. In the case of art, we value verifiable chain of custody.

Hopefully, something similar would tilt commercial funding of AI in the direction of responsible use.


My problem with this is that artists learn by studying other artists. Cutting that off because it's AI, rather than focusing on whether the resulting work is derivative, seems more of a problem to me. It seems to me that an AI can be used for either original work or derivatives; proving that you can get derivatives out of it has always struck me as no different than commissioning a copy of someone's work from a human artist and being shocked that you got what you asked for.


Can an AI express to you how van Gogh affected it as an artist? I'm not sure that AI is "learning" the way we say humans are "learning" when humans learn and study art. Obviously there is no debate that you can input van Gogh into a model and produce something van Gogh-like as a result. But I've not seen anything that indicates that the AI is learning anything about van Gogh at all. Perhaps it comes down to whether you think learning van Gogh is just creating a mapping of all of his brush strokes ever, and only exactly what they look like. It's obvious the AI knows nothing more than that. If you think that's what humans do when they learn art, I'd be sad for you!

As to your hypothetical, we don't give copyrights to people who make rote copies of things, human or otherwise. Is the implication of the shock, that there is sufficient difference with the work as to render it a derivative and not a copy? Okay, how so? And of what consequence? Making derivatives of a copyright without license is infringement.


I think it's learning styles in a way that's at least partially analogous, because it comes out with things that are reasonably original and not in the training data.

I'm sure an LLM can write you an essay like that for any artist you want, but I'm not all that convinced those are meaningful even with humans.

> As to your hypothetical

That's the thing: it's not a hypothetical, it's a past story from here on HN. Someone did that, asking for copies of a famous painting (Girl with a Pearl Earring), and got highly derivative items out of the model. We had a debate over whether that even means anything, because the phrase is both a simple description of the painting and the name of a famous work, making it ambiguous whether the prompt asked for "Girl with a Pearl Earring" or a girl with a pearl earring.

I agree that it looks like copyright infringement whether it's done by a human or AI, though. I guess a lot of people missed the prior discussion on HN.


>I think it's learning styles in a way that's at least partially analogous, because it comes out with things that are reasonably original and not in the training data.

I don't think that is evidence that what it is doing is "learning".

>I'm sure an LLM can write you an essay like that for any artist you want, but I'm not all that convinced those are meaningful even with humans.

Well, it wouldn't be reflective of what the LLM thinks, so what is your point? If you are of the belief that humans don't have thoughts, I guess it's not a surprise you view things this way.

>That's the thing, it's not a hypothetical, it's a past story from here on HN. Someone did that, asking for copies of a famous painting (Girl with a Pearl Earring) and got highly derivative items out of the model and we had a debate over whether that even means anything, because that's both a simple description of the painting and the name of a famous work, so it makes it so it can be ambiguous whether you asked for "Girl with a Pearl Earring" or a girl with a pearl earring in the prompting.

You say derivative but without any reference to what it actually means... what about it is derivative? That's the analysis that happens in court. The analysis isn't "what you asked the LLM," because that's not dispositive to whether or not something is a copy.

>I agree that it looks like copyright infringement whether it's done by a human or AI, though. I guess a lot of people missed the prior discussion on HN.

Sorry I don't read every single thread about copyright on HN? This is the second posting I've seen on the RFC today. Give me a break!


> I don't think that is evidence that what it is doing is "learning".

When I say learning I mean something like "gaining new ability by studying how others did the same task, resulting in being able to produce novel output." I'm not quite sure what you are using the word to mean here, though I might agree that there are differences between what AIs do and what humans do, the question being what they are and whether they're important here.

I don't claim to know anything about the internal experience (if any) of an LLM writing such an essay and I can't really reason about that because I've never been an LLM, whereas I can at least relate to human experience. I think your assertion that it "wouldn't be reflective of what the LLM thinks" is a bit like saying that you don't think submarines are actually "swimming," as the saying goes, though. It may not "think" in human terms as we do, but it's certainly doing some kind of calculation that produces an equivalent output, so I have a lot of questions about whether we can say that on principle. We're well past passing the Turing test for a lot of things, either the original or censored form, these questions are getting less academic by the day.

> You say derivative but without any reference to what it actually means

We're talking about copyright law, so the meaning of derivative was borrowed from that: i.e., the AI model was producing works that could reasonably be thought to have infringed on the copyright of that painting when prompted for "a girl with a pearl earring", and this was held up to mean that AIs are just regurgitating training data, are therefore implicitly missing something essential to being an artist, and that all their work should be considered derivative works of the training data as far as copyright law is concerned.

Meanwhile, I'm saying that I think the AI should be judged about like a human artist would be to argue against the people who seem to want to say that the AI can't take input from copyrighted things without all of its output being tainted forever. We have no such requirement for humans and I don't see why it makes sense to add this new restriction on AIs specifically.

> Sorry I don't read every single thread about copyright on HN?

I'm not faulting you for not knowing, I'm faulting myself for assuming too much context and just trying to explain what I had in my head when writing that so you could understand how I came to think that. Hopefully this lets you see where I'm coming from.


>When I say learning I mean something like "gaining new ability by studying how others did the same task, resulting in being able to produce novel output." I'm not quite sure what you are using the word to mean here, though I might agree that there are differences between what AIs do and what humans do, the question being what they are and whether they're important here.

I think the dictionary definition is more than sufficient: "the acquisition of knowledge or skills through experience, study, or by being taught." This is what I mean by running with your own made up definition.

>I don't claim to know anything about the internal experience (if any) of an LLM writing such an essay and I can't really reason about that because I've never been an LLM, whereas I can at least relate to human experience. I think your assertion that it "wouldn't be reflective of what the LLM thinks" is a bit like saying that you don't think submarines are actually "swimming," as the saying goes, though. It may not "think" in human terms as we do, but it's certainly doing some kind of calculation that produces an equivalent output, so I have a lot of questions about whether we can say that on principle. We're well past passing the Turing test for a lot of things, either the original or censored form, these questions are getting less academic by the day.

You are the one redefining words like "think" and "experience", not me. I'm not playing that game at all. After all, you are the one equating these processes between humans and AI by coming up with your own, much broader concoctions.

>We're talking about copyright law, so the meaning of derivative was borrowed from that, i.e. that AI model was producing works that could be reasonably thought to have infringed on the copyright of that painting when prompted for "a girl with a pearl earring" and this was held up to mean that AIs are just regurgitating training data and are therefore implicitly missing something essential to being an artist or what have you and all their work should be considered derivative works of the training data as far as copyright law is concerned.

I'm familiar with copyright law, I'm not sure you are. A work can be derivative in a number of ways, some are legal, some aren't. It's not a new thing that some uses by a machine can be infringing, and others, non-infringing. Why now must it be that machines should be analyzed the same as humans all of the sudden?

>Meanwhile, I'm saying that I think the AI should be judged about like a human artist would be to argue against the people who seem to want to say that the AI can't take input from copyrighted things without all of its output being tainted forever. We have no such requirement for humans and I don't see why it makes sense to add this new restriction on AIs specifically.

Yes, I understand that. But I asked why it should be judged as a human, and you are saying because it "learns". But that's only based upon your re-defining the concept of learning in order to make it inhuman. The only reasonable arguments I've seen that AI outputs should be copyrightable are based on them being a tool that an artist can use. What you are saying is just dressed up anthropomorphization.


> I think the dictionary definition is more than sufficient: "the acquisition of knowledge or skills through experience, study, or by being taught." This is what I mean by running with your own made up definition.

I mean, if a human looked at a bunch of art, essays, etc. and then was able to produce similar works, we'd normally consider that "learning." What word would you use for being able to reproduce Picasso (or whomever) by looking at a bunch of examples?

Also, I don't think I have defined "think" or "experience" at all. But I'd point out that I don't see anything like a principled boundary around them, or that we can point to something that humans do that AIs don't or can't do. It seems to fall back on something that looks like qualia or subjective internal experience, and philosophy hasn't resolved that with respect to other humans... except by analogy: "I think the other humans are like me and I have subjective internal experience, so they probably have it too, rather than being p-zombies."

If you have a better answer to that, feel free to tell me, it'd be interesting.

> It's not a new thing that some uses by a machine can be infringing, and others, non-infringing. Why now must it be that machines should be analyzed the same as humans all of the sudden?

Sure, I'll agree that it's not even necessary to consider the works transformative or whatever.

FWIW, I don't think that AIs should be getting their own copyrights or anything like that, I'm just saying that the training data shouldn't forever taint the output no matter what's produced.


>I mean, if a human looked at a bunch of art, essays, etc. and then was able to produce similar works, we'd normally consider that "learning." What word would you use for being able to reproduce Picasso (or whomever) by looking at a bunch of examples?

Would we? What you described sounds a lot more like copying than learning. That's why I asked the question I originally did. Your whole perspective seems to be based on an ignorant and misanthropic view of the arts. That art students just go to school to look at things so they can then reproduce things that look like those things. It's a bit asinine and insulting.

>Also I don't think I have defined "think" or "experience" at all. But I'd point out that I don't see anything like a principled boundary around them or that we can point to something that humans do that AIs don't or can't do. It seems to fall back on something that looks like qualia or subjective internal experience and philosophy hasn't resolved that with respect to other humans... except by analogy. "I think the other humans are like me and I have subjective internal experience, so they probably have it to, rather than being p-zombies."

That's your burden to demonstrate as the person equating AI with humanity. You couldn't do it with "learning" without redefining learning, and you can't do it with "experience" or "think" without redefining those words either. Who is seriously advocating that LLMs are thinking and experiencing? I haven't seen anyone make those arguments.

>Sure, I'll agree that it's not even necessary to consider the works transformative or whatever.

That wasn't my point. A transformative analysis is one of the most fundamental elements of determining if something is a copy or not in copyright law. So I don't really have any idea what you are talking about with this one.

>FWIW, I don't think that AIs should be getting their own copyrights or anything like that, I'm just saying that the training data shouldn't forever taint the output no matter what's produced.

Yeah but your only argument for that is to redefine learning to pretend it's the same thing that humans are doing when that's clearly not the case.


> Yeah but your only argument for that is to redefine learning to pretend it's the same thing that humans are doing when that's clearly not the case.

What test can I do to differentiate them, then?

At first, you said they couldn't write an essay... but AIs can absolutely do that. The internal experience of even other people is unknowable and something we guess by analogy, so if you want me to agree you need some other actual test on measurable outputs to differentiate.

Otherwise this is all about qualia and there's no way to come to rational agreement.


You are being obtusely literal, as I did not ask you if they could write an essay. I asked you if they could express their feelings. There's no point in us conversing if you are going to respond this way, as it's disingenuous. I'd think you are capable of understanding the difference between the two. And I don't care if you agree with me or not, it's your burden to elevate AI to humanity, not mine, and you haven't done it here. Your perspective here seems to come from a life devoid of art and experience in things. For that, I'm sorry for you.


> I asked you if they could express their feelings.

And I asked how we can test whether someone has actual feelings or any other kind of conscious internal experience. If it's "obvious" then why is there no consensus on the whole https://en.wikipedia.org/wiki/Philosophical_zombie thing?

> There's no point in us conversing

I gave this conversation to an LLM to respond to.


I only said it was obvious that LLMs don't know anything about art past what you described, which you didn't dispute and which was an obvious logical conclusion from your own explanation of what the AI "learned".

>I gave this conversation to an LLM to respond to.

I'm not surprised, I repeatedly characterized your responses as obtuse, disingenuous, or ignorant. I'm not sure what you think you proved.


You can ask someone to produce a pin-up version of Minnie Mouse, but good luck using it in any commercial activities.

Most LLMs are just profiteering from people’s labor without their consent. And there’s nothing new being produced. It’s always a statistical output of previous works.


> You can ask someone to produce a pin-up version of Minnie Mouse, but good luck using it in any commercial activities.

The same would automatically apply to LLM output -- there's no need to change the current laws to cover that case.

The question is this. Suppose I ask a human artist and an LLM to create me a new female mouse cartoon character. And suppose both the artist and the LLM have been exposed to Minnie Mouse. It's not unlikely that the new character created in both cases will have aspects specifically similar to, or specifically opposite to Minnie Mouse.

In the case of the human artist, the new character will not be covered by Disney's copyright, unless there was a lot of copying. Why should the result be different for LLMs?

The logical conclusion of "any output of an LLM that's seen Minnie Mouse must be subject to Disney's copyright" is "any output of any human that's seen Minnie Mouse must be owned by Disney". Which I'm sure Disney would love, but would certainly make the world a worse place for everyone.


> a pin-up version of Minnie Mouse

that's not because of copyright, but because of trademark. If you make the character sufficiently different that it cannot be mistaken for Minnie by the average person, and don't call it Minnie Mouse (to get rid of trademark), Disney will have a much harder time suing you. Of course, they will still try, and steamroll you with money instead.


> And there’s nothing new being produced. It’s always a statistical output of previous works.

I don't think you can define those terms such that what you say is true of AI but not true of people.


I think you're misunderstanding that. I don't expect it in either case; I'm saying you have to judge the output, not the input. So even if it trained on a ton of copyrighted artwork, if the output isn't a ripoff of something in the training data, I don't think there should be any copyright issues.


Is intelligence really a factor here?

Say I use the same training set as one of these LLMs, copyright protected text and all, and use it to derive a compression algorithm that uses very little space to store tokens and token sequences that are common in that huge collection of text. The resulting compression scheme includes some sort of statistical artifact derived from that copyrighted text. Is that allowed? And if so why is an LLM different?
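The compression scheme described above can be sketched with a minimal, hypothetical example (the corpus, token codes, and helper names are all illustrative): a dictionary coder whose code table is derived purely from token frequencies in a training corpus, so the table itself is a statistical artifact of whatever text it was built from.

```python
from collections import Counter

# Hypothetical sketch: derive a dictionary coder from a training corpus.
# Frequent tokens get short integer codes -- the code table is a
# statistical artifact of the text it was trained on.
corpus = "the cat sat on the mat the cat ran".split()
freq = Counter(corpus)

# Assign codes 0, 1, 2, ... to tokens in descending frequency order.
table = {tok: i for i, (tok, _) in enumerate(freq.most_common())}

def compress(text):
    return [table[t] for t in text.split() if t in table]

def decompress(codes):
    rev = {i: tok for tok, i in table.items()}
    return " ".join(rev[c] for c in codes)

encoded = compress("the cat sat on the mat")
assert decompress(encoded) == "the cat sat on the mat"
```

Nothing in the resulting code table is a verbatim copy of the corpus, yet the table could not exist without it, which is exactly the ambiguity being asked about.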


Very good question indeed.

A lot of these questions are somewhat ethical/moral in nature. E.g. is it okay to take someone else's creative work, process it through some algorithm, to create a service like ChatGPT? Or a compression algorithm? I don't know.

It's awesome to see the Copyright office request input from both sides of the argument.


It worries me that so much focus is on two sides that may not have the end users' best interest much in mind. The companies building the models may have an incentive to regulate models in order to keep smaller players or open-source projects away. Artists mostly seem opposed to any solution, since even laws that allow models trained purely on public domain art would be bad for them. If laws around this are shaped primarily by the wishes of those two groups, I am not sure things will end up well for those of us who want the tools to keep improving and remain reasonably free (including applications you can install locally and run on your own GPU).


> is it okay to take someone else's creative work, process it through some algorithm, to create a service like ChatGPT? Or a compression algorithm?

and the test I use is: if a human is currently allowed to perform the same task, then it is allowed to be done using an AI model.


LLMs are generative, though, not just compressive.


Generation, prediction, and compression are all the same - the only difference is the intent.


> is that an AI, being a statistical model and not generally intelligent, should not be allowed to disregard the copyright of its source material

None of what you are saying has anything to do with copyright.

The tool Photoshop isn't generally intelligent either. And yet, yes it can be used to create art using other people's stuff.

And it could be done legally if the results are transformative.


Photoshop doesn’t install with a massive directory of other people’s copyrighted works to draw snippets from.


Yes it does...


If it does, then Adobe would have commissioned the works or acquired licenses. In either case they would have _paid_ someone for those images.

It is very unlikely Adobe would be shipping their software with copyrighted material without paying for them first.


I personally have a really hard time finding any meaningful difference or distinction between "AI" and "lossy compression". Copyright and "lossy compression" are pretty easy to reason about. Model "building" is "compression". Model "use" is "decompression". Everything about these AI models seems to be about the "lossy" part, but "lossy" is just an adjective to the main show.

It's very difficult to not conclude that copyright of a trained model should be treated identically to the copyright of a zip file.


Information is not copyrighted, just the expression of said information.

So if you took a recipe book, extracted the recipe information, and listed out the recipe in a different format (such as a table), it's a new work. It does not violate the copyright of the recipe book you extracted the info from.


> I personally have a really hard time finding any meaningful difference or distinction between "AI" and "lossy compression".

If you feed a photo of your dog into a JPEG compressor and the result looked like a cat in the same style, I think you'd be pretty annoyed.


When you perform lossy compression, you feed it one file at a time, not every file in existence.


If you concatenate images into a stream container (say as tar) and then compress the stream, the compression coding will (generally) cross over the individual images. True, that's generally not lossy compression.

But concatenating images is also how you create video. Lossy video compression does typically cross over frames. So I don't actually see a difference. If you want to think about mkv or mp4 instead of zip it's still the same concept.

There's nothing stopping you from putting every available image into a video and figuring out how to compress it lossily.

Maybe there are some bounds for how much information was lost? Obviously piping everything into /dev/null destroys the input, and piping /dev/random from a true random source creates information. So somewhere between that and lossless compression there's the nebulous "plagiarism" threshold. And then there's another threshold for copyright infringement that's considered "fair use".

But the general structure of the "AI" this is about are fundamentally storage and retrieval.


What does any of this have to do with creating a new expression?


What makes anything new? Is anything created by "AI" actually new? How much entropy is in a prompt vs in the output?


>What makes anything new?

In copyright law? It's not being a copy


Some compression, yes, but the analogy oversimplifies. AI re-represents input information in a transformative way (an embedding, say) and then creates new, derived, and combined output from a new input (e.g. a prompt).

It's not just lossy compression. It's potentially novel.


Phrases like "transformative way" are meaningless woospeak to me. Everything is a transformation. Suppose I run a linear convolution on ten images and average them. Is the result "new"? Does it not contain the original images? Subspaces and mappings don't create anything "new" any more than SVD does. This is just playing digital Ship of Theseus.


> Phrases like "transformative way" are meaningless woospeak to me

Fortunately we live in a society that supports specialization where something that is woospeak to a smart person can still be a very well understood topic. AI transformations are methodologically well documented, even if transparency of neural network node activations is yet to be fully formalized.


In that case, you'll surely be able to provide a citation that clearly distinguishes the differences between the ways of transformations performed by "AI" and the ways of transformations performed by compression.


Sure. AI (more specifically, ML) is curve fitting, and more generally, objective function optimization. https://en.m.wikipedia.org/wiki/Curve_fitting

A projection is not compression, necessarily. And you'll find AI is a very poor compressor when used for such a purpose in all but the most trivial setups (e.g SVD matching input data rank, only reversible functions in neural network activation, etc.).
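A minimal sketch of the curve-fitting framing above (the data, degree, and noise level are all illustrative assumptions): fitting a low-degree polynomial to noisy points is a deterministic projection onto a small function space, and it cannot reconstruct the original points exactly, which is why such a model makes a poor compressor.

```python
import numpy as np

# Hypothetical sketch: ML as curve fitting / objective function optimization.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, size=x.shape)

coeffs = np.polyfit(x, y, deg=3)   # "training": fit 4 coefficients to 50 points
y_hat = np.polyval(coeffs, x)      # "inference": fully deterministic given coeffs

# The fit is lossy: residuals are nonzero, so the data can't be recovered.
assert not np.allclose(y, y_hat)
```

Fifty data points are projected down to four coefficients; the mapping is deterministic but irreversible, which is the distinction between a projection and (lossless) compression.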


Congratulations, you just discovered that copyright is a weak and ill-defined concept.


I think that unless you can clearly show that an "AI" is not a form of compression, the question of copyright is orthogonal. The copyrights that apply to a zip file may be ill-defined concepts to you, but it's not really important to the core question which is: how are model weights different from a zip file? If you put unambiguously copyrighted content into a zip file, most people would agree that the copyright applies to the zip file. So by analogy if you put copyrighted content into model weights, the copyright applies to the model weights. Issues such as what constitutes fair use comes up, but fair use is permissible copyright infringement, not absence of copyright. And that's where the question of how lossy a compression algorithm has to be to be considered "fair use". In all likelihood it's the specifics of the use itself (rather than technology or method details used) that matters.


It’s compression + filtering. Nothing generative. Its output is like 99.99% deterministic.


Linear regression is 100% deterministic after training and isn't lossless compression, but rather a linear projection along a manifold in a (potentially transformed) input space.

So, maybe not just compression + filtering, if the level of deterministic behavior is to be the gauge.
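A tiny sketch of that distinction (with synthetic data I've invented for illustration): after fitting, predictions are fully deterministic, yet the projection cannot reconstruct the training targets, because the noise component was discarded.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(100, 1))
y = 3.0 * x[:, 0] + rng.normal(0.0, 0.1, size=100)

# Ordinary least squares: project y onto the column space of [x, 1].
X = np.hstack([x, np.ones((100, 1))])
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# Deterministic: the same input always yields the same output.
p1 = X @ w
p2 = X @ w

# Lossy: the projection discards the noise component of y,
# so the training data cannot be recovered from the weights.
reconstruction_error = float(np.linalg.norm(p1 - y))
print(np.array_equal(p1, p2), reconstruction_error > 0.0)
```

So determinism and losslessness are independent properties; conflating them is where the "it's just compression" argument gets muddy.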


Source?


Why is being a statistical model relevant?

The simplest statistical model is an average. Why would the average pixel rgba of a bunch of images invoke the copyright of those images?
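For concreteness, the averaging case is a one-liner. A hypothetical stack of RGBA images (random values standing in for real pictures) reduces to four mean channel values, from which none of the inputs can be recovered:

```python
import numpy as np

# Three hypothetical 4x4 RGBA images with values in [0, 1].
rng = np.random.default_rng(2)
images = rng.uniform(size=(3, 4, 4, 4))

# The "model": one mean value per RGBA channel, across all
# images and pixels.
mean_rgba = images.mean(axis=(0, 1, 2))

# 4 numbers summarize 3 * 4 * 4 * 4 = 192 inputs; the originals
# are unrecoverable from the summary.
print(mean_rgba.shape, images.size)
```

The open question is at what point along the spectrum from this 4-number summary to a verbatim archive copyright starts to attach.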


The crux of the AI copyright argument sits in economics. Those currently producing content want future content generated from AI to benefit them financially, as long as a thin sliver of their own content was used in the training.

This is like asking all the students to pay their teachers a (small) percentage of their future economic output.


My opinion is we should treat AI like Photoshop/Word/Windows. If you use Windows to copy a file and distribute it, Microsoft isn't liable; you are. If you use Word to type up a book and sell it, you're responsible.

Same with a statistical model: if you generate a copyrighted work and distribute it, you are responsible. But the tool (GPT-4) maker isn't responsible, just like Adobe isn't responsible for copyright infringement.

The copyrighted text/image isn't generated until you ask it to. Your prompt is what reproduces the material.


Why would any non-lunatic want to live in a world where someone can't import an image into software?

If only some software is disallowed, then why permit Excel but prohibit Stable Diffusion?

Can someone even look at a SD-generated image, and claim with certainty that their own art was used to train it? Any more than claiming that another artist was inspired by it, looking at their output?

I'm fine with anything goes. The alternative seems to be copyright maximalist clownworld.


> is that an AI, being a statistical model and not generally intelligent, should not be allowed to disregard the copyright of its source material

But then you are just shifting the problem forward by an inch. What happens when tomorrow someone declares that their model is generally intelligent and is therefore allowed to disregard copyright when training just like a person can?


This point is of the utmost importance from a public policymaking perspective. Laws such as these are easy to craft now and difficult to change later. I feel like we are previewing an unfolding disaster here.

The future will clearly yield a class of "beings" striving for some degree of indistinguishability from or coexistence with humans. Proposals that discriminate -- literally discriminate -- without respect for the principles of universality and equal treatment under law are creating and condemning a marginalized group before it even reaches maturity. This is an old and tired theme repeated through history. Let's foresee this and not get it wrong.


Is it your experience that people's facial declarations carry the day in legal disputes? It's not mine. Rather, it seems like the whole thing is designed to provide scrutiny against bare facial declarations that something is true or false.

I see this on HN all the time: "someone just has to claim", "someone just has to say". Yeah... that's not how it works. People can say whatever they want; that doesn't mean it satisfies their burden of proof. Self-serving testimony is the lowest form of evidence imaginable.


Intelligence lacks any legal definition, for starters. And if a law like that will provide an arbitrary line in the sand, it will just disincentivize AI research in general.


Often, when laws are passed, they provide definitions for the terms in the law that require definitions. Regardless, I'm not aware of any proposals for copyright law where "intelligence" is used.


I agree completely. AI model trainers should have to pay the people who provide their training materials, and there should be a default assumption of opting out until someone or their company explicitly opts in.

Unfortunately the Peter Thiels and all those bizarrely out-of-touch Silicon Valley assholes have already effectively scraped the Internet, because ethics don't matter if you're special like them, so to a degree regulations are way behind the ball.

That said it's still worth doing, and I'd love to see it done retroactively as well. It's not as if "I forgot that I had a public Myspace 25 years ago" is an implicit user opt-in for some startup to save your data - however anonymized they claim it is (lol!) - and train its AI on it.


> The alternative seems to be “anything goes”.

Seems like a huge false dichotomy. You really can't imagine anything in between total shutdown of AI training on public data sources and no rules at all?

I think we should try a bit harder for a middle ground.


I think you are right. People argue about whether LLMs store content verbatim or generalize. I propose an experiment for anyone interested. Try this prompt multiple times and change the appropriate verse numbers:

> Provide quote from King James' Bible Genesis :25-31

or

> Provide quote from King James' Bible Genesis :1-25

or whatever you fancy.

I didn't go through the whole Bible, but I got pretty much a verbatim chapter. I argue that you can't do this with copyrighted books only because of guardrails, not because of ChatGPT's lack of capability; so the information is there, and it's verbatim. Plus, other books don't have such nifty indexing.


Because the cat is out of the bag so to speak, any attempt to force ai companies to generate their own content to train on means we are signing up for a future where only multi billion dollar companies are in control.


If they were truly forced to do this, even they would find it difficult.


And everyone else would find it impossible.

Hence the headlong rush to implement regulatory capture.


Is there any precedent where copyright was focused on the input rather than the final published work?


Compilers


Object code is a derivative work I think.

So no. Compilers do not count.


The US had to update copyright law to explicitly protect binaries


That just means some judges got it wrong and congress really wanted to make sure others didn't. I'm not sure what proposition that stands for here, except that sometimes new things are hard to get right at first.


Remixes, generally?


This is more of a problem for images, where similar output to inputs is likely, than for LLMs, where no matter what you prompt it with I doubt you can get it to regurgitate any significant parts of Harry Potter well enough to be a classical copyright violation of any of the novels. Maybe you could generate a copyright violation of character traits.

The output space of images (MB for larger images) tends to be larger than books (a few hundred KB of text for a long novel), but the perceptual output space of books is much larger.

Any determination that licensing is required for AI generation, or use of AI-generated works, is unacceptable until Congress or courts put some reasonable objective tests in place to determine what is and isn't a copyright violation for various types of works of various lengths. Not the ambiguous 4-factor test that is basically whatever the judge feels like. It will be a complete mess otherwise. They can't just define a new AI policy for copyright with a few types of works in mind; it has to work for all works.

You could look at this mathematically from a complexity perspective and try to define a similarity function that's true when a second work is close enough to a first work to be a derived work (assuming the first one had been seen by the creator of the second). Unfortunately that won't work because nobody can define such a function to everyone's satisfaction, and the courts wouldn't accept any informal suggestion of a definition when it didn't come from Congress. Specifically, you'd get into trouble with consistency in the function determining derived works depending on length of the work: short works, like a haiku, are much more sensitive to copyright violation in some ways... a mere 17 syllables is a complete reproduction and therefore a copyright violation, yet a single word isn't; for a novel, reproducing 1/17 of the content is almost certainly a copyright violation, but reproducing 17 syllables probably isn't.

Different stakeholders and creative re-mixers would want different things from the function. It's untenable.
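To make the "nobody can define such a function" point concrete, here is one naive candidate, invented purely for illustration: score a candidate work by the longest contiguous span it shares with the original, divided by the candidate's length. The haiku/novel asymmetry described above falls out immediately, which is exactly why no single normalization satisfies everyone.

```python
def lcs_substring_len(a: str, b: str) -> int:
    """Length of the longest common contiguous substring of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0] * (len(b) + 1)
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

def derived_score(original: str, candidate: str) -> float:
    """Naive 'derived work' score: longest shared span over the
    candidate's length. Purely illustrative, not a legal test."""
    if not candidate:
        return 0.0
    return lcs_substring_len(original, candidate) / len(candidate)

# A short poem reproduced whole scores 1.0: total reproduction.
haiku = "an old silent pond a frog jumps into the pond splash silence again"
print(derived_score(haiku, haiku))  # 1.0

# The same poem lifted verbatim into a long "novel" barely registers,
# even though the copying is identical.
novel = ("x" * 2000) + haiku + ("y" * 2000)
print(round(derived_score(haiku, novel), 4))
```

Normalize by the shorter work instead and both cases score 1.0; normalize by the candidate and whole-poem copying vanishes into long works. Either choice outrages some stakeholder, which is the comment's point.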


> This would, I think, require the AI’s creator to secure a license for all of its sources that allows this sort of transformation and presentation

That is a fairly illogical leap. From your text alone, “should not be allowed to disregard the copyright of its source material” would be: “the AI’s maintainer should have a fairly reliable (but not infallible) system to output how likely it generated something that is a direct derivative work of something in its dataset”. As a human you don’t need to attribute/license every piece of art you’ve seen of clouds if you draw a cloud. So if an AI draws a cloud that is actually derivative of the millions of clouds it has seen, then it doesn’t need any permission from the millions of creators to draw one either.


AI is taking work away from lawyers, and instantly creating more work for lawyers.

Ain't that interesting to reflect upon?

I speculate there is a hidden force in the universe, something physicists are yet to identify, which mandates: "they shall always have something to do".


The human brain is no different. It generates content from the things it learned.


Repost #4 I believe

https://hackertimes.com/item?id=37305580

"I'll keep saying it every time this comes up. I LOVE being told by techbros that a human painstakingly studying one thing at a time, and not memorizing verbatim but rather taking away the core concept, is exactly the same type of "learning" that a model does when it takes in millions of things at once and can spit out copyrighted code verbatim."


I hope your opinion isn't shared by lawmakers. Copyright is a relic of the past, and it needs to be put out of its misery. Trying to (mis)apply copyright here would just lobotomize the US. Existing companies would just technically operate out of a saner jurisdiction, and we'd be handing other countries a golden opportunity to leapfrog the US.


"anything goes" is the best and most natural solution. Just don't let people copyright the output if they don't have full copyright on all of the inputs. This should finally get rid of the cancer that is copyright in a generation or two.


Generic reply to siblings here… I get the intelligence argument.

My _main_ point is that there’s a non-trivial question to answer here.

I’m not qualified to answer (though I’ve offered up my non-expert opinion). It certainly seems to quickly veer into philosophy!


It shows you are not a lawyer. You misunderstand how copyright works. Creating copies or derivative works and distributing those is all that matters under copyright. This is not "disregarding" copyright (which is not an actual thing) but something that is either fair use or may require some kind of permission from the creators of the original by those distributing some kind of derived work or copy. That's why it's called copyright.

Copyright merely restricts the distribution of original works or their derivatives. In case of an infringement, copyright holders can insist you stop distribution and/or compensate them for that.

If I sell you a paint brush, I'm not liable for you putting a red nose on the Mona Lisa and trying to sell it off as an original work. Doing that on the original would be an act of vandalism (because you don't own it), and doing that on a replica that you got from somewhere infringes on the rights of those that created the replica. Which is a derived work or copy in itself, of course, and the distribution of that is regulated by copyright. Distribution of such a replica is of course fine because Da Vinci has been dead for a very long time and his work is no longer protected under copyright. Distributing your red-nosed Mona Lisa would therefore be fine too. Either way, the paint brush seller is no party in this case; this is between you, Da Vinci, his descendants, and the replica creators.

Now your assertions as to what AIs are or aren't are simply not relevant. You assert it's a statistics algorithm thingy. That sounds like a tool to me. Yet another paint brush. Using a paint brush is not infringing on anyone's rights. For that you have to distribute the results of your work. The nature of the tool does not matter. How you use the tool does not matter either. You merely create (potentially) derivative works with the tool, and what you do with those matters. Especially when you distribute them to others. One of those derivative works is of course the AI model itself. Creating one is fine. Copyright gets potentially infringed when you distribute one.

Now we get to the core of the matter. Can you with a straight face say the AI model resembles the original and is a derivative work? It doesn't actually look like or resemble the original in any shape or form. Even proving the AI model is derived from the original is tricky. Copyright is not about protecting vague ideas or notions but the concrete shape or form of things. And it's only an infringement if you distribute a derived work or a copy of a thing to others. So, merely creating an AI model is not distributing anything to anyone. You are merely using tools to create something for yourself. An AI model in this case.

Distributing a verbatim copy of a book is an infringement. Citing the book in your own work is fair use (up to a point). Paraphrasing elements from the book, acknowledging it exists, taking inspiration from it, or reading it aren't copyright infringements.

The legal problem with AI models is that their concrete shape or form doesn't resemble the original inputs in any way. Besides, companies like OpenAI don't actually distribute their AI models. They are huge; it's not very practical. They merely exploit those models to generate outputs to inputs from their users and customers. Are those outputs derivative works? Maybe, but that's where it gets tricky. They clearly aren't in the classical sense. Not even close. But if you somehow could conclude that they are, who is distributing that derivative work? Secondly, if the AI model is a tool, who actually creates those outputs, and are those outputs protected under copyright? Who actually holds those rights? And how would you tell such an output apart from a human-created one?

It's questions like this that make all this extremely murky from a legal point of view. IMHO without dramatic changes to copyright law or the way it has been commonly interpreted legally, it's just very poorly suited to do anything about stopping AI companies from doing what they are doing. You'd have to bend the conventional interpretation quite a bit for that. No doubt, there will be court cases where people will try to do that. But it will take many years before the dust settles on that. And I wouldn't get my hopes up on some unexpected/dramatic outcome.


This is generally right, but I'm surprised you aren't aware that distribution isn't the only right protected by copyright: creating derivative works is protected, and display rights are protected.


I agree, though I’d say non-technical people are naive and so don’t know why the experience is not ideal or fun or smooth. I suspect that if asked the right questions, non-technical people would also complain.

Computers have always been just useful enough for as long as I’ve used them (since the 80s). We’ve _always_ put up with a lot of nonsense and pain because the alternative is worse.

