> It's not enough to argue about how machine learning works. ML can do tons of things, and GitHub's current methodology leads to something close to copy-paste. Maybe it can learn a semi-original way to write common things like searching in a list, but for more exotic uses like complex algorithms, being unable to actually understand how to code, it basically has to act as a search engine for existing implementations.
I don't disagree that it is over-trained on certain sequences of words, but overall I think the generalization is fine. It's often pre-prompted with content it has not seen before, which results in new, unique content. There's nothing copy-pasted about this, just a statistical understanding of what usually happens next.
If the pre-prompt is something very specific, e.g. "a dog runs in the park in the middle of the day, while the sky is the color _____", it will obviously output "blue". The same can be true when only a few specific known algorithms exist for a task and some have been used far more frequently than others. And in the cases where it does output something arguably comparable to copyright infringement, the responsibility would probably be on the programmer for deciding to use it, not on the model.
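To make the "what usually happens next" point concrete, here is a toy sketch. It is a plain word-count model, nowhere near what GitHub's model actually does, and the corpus and the function names (`train`, `distribution`, `predict`) are made up for illustration. It only shows that when one continuation dominates the training data, that is what comes out, and when nothing dominates, the output is just one of several plausible options.

```python
from collections import Counter, defaultdict

def train(corpus, n=2):
    """Count which word follows each n-word context in the corpus."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for i in range(len(words) - n):
            context = tuple(words[i:i + n])
            counts[context][words[i + n]] += 1
    return counts

def distribution(counts, context):
    """Return the observed probability of each next word for a context."""
    followers = counts.get(tuple(context))
    if not followers:
        return {}
    total = sum(followers.values())
    return {word: count / total for word, count in followers.items()}

def predict(counts, context):
    """Return the most frequent next word, or None for an unseen context."""
    followers = counts.get(tuple(context))
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Hypothetical training data: "blue" dominates after "sky is",
# while nothing dominates after "park is".
corpus = [
    "the sky is blue",
    "the sky is blue",
    "the sky is blue",
    "the sky is grey",
    "the park is green",
    "the park is busy",
]
counts = train(corpus)

print(predict(counts, ["sky", "is"]))        # the dominant continuation
print(distribution(counts, ["sky", "is"]))   # heavily skewed toward one word
print(distribution(counts, ["park", "is"]))  # evenly split, no single answer
```

A real model generalizes far beyond exact context matches like this, but the skewed-versus-flat distinction is the same idea as the "sky is the color _____" example.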
> The potentially illegal part is offering a service that creates "new" Disney-like movies by assembling parts of Disney's IP.
It only outputs copyrighted content in very specific cases. Most of the time it is just outputting a generalization of what is expected. It isn't Disney's content, but human content. Also, just creating content that has some similarity to existing work isn't copyright infringement. Satire has been well accepted as fair use.