@dotsie @arstechnica Besides, _I_ can remember much of the Hitchhiker's Guide to the Galaxy word by word. Is my brain illegal now?
No LLM stores a copy of any text; that wouldn't even be possible, since an LLM retains only a few bits of information from an entire book. What it stores are patterns, and it can reproduce those patterns. The same goes for image generators, video generators, and audio generators. An image generator can be made to reproduce an image from its training set if you give it the right prompt, and while the reproduction won't be perfect, it will often be close enough that you need to see the original and the AI reproduction side by side to spot the differences. But that doesn't mean the original image is stored in the diffusion model, which probably retains less than a single bit from each training image in the end; it means the machine has learned how to reproduce the patterns in the original. Just like my brain doesn't store the texts I have read letter by letter or word by word, it simply doesn't work that way.
If there were somebody who had memorised an entire library and could recite every page of every book at will, would you call that copyright infringement? If not, what makes machine learning any different, and why?
I always thought "intellectual property" was a legal abomination designed to keep people from remixing and modifying the culture around them unless they had the money to pay for all the necessary licenses. So I don't think we should use I.P. to try to stop the big AI companies; instead we should use this opportunity to attack the very existence of I.P. and work towards a culture where everything is in the public domain, every piece of software is open source, and the AI companies don't own the AI, everybody does. Just imagine how much energy, raw material, and human labour is wasted reinventing the wheel again and again because all those commercial enterprises insist on building their own machine learning models.
If all AI were open source and everybody could use and modify it for free, we wouldn't need to spider all the websites and run all the computing centres at full capacity to train yet another huge neural network; we could first look at the models already in existence to see whether they can do what we want. If a model were almost but not quite right, we could tinker with it, fine-tune it, maybe train some LoRAs that could also be used with similar models derived from the same ancestor. We could install it on any sufficient hardware and run it, and instead of making AI models bigger and bigger, we could make them more efficient, so that they run on a single laptop or maybe even a Raspberry Pi without an external computing centre doing all the heavy lifting. And AI training could be a collective effort, with people donating CPU and GPU cycles to a model they want to see finished, like BOINC with all those "@home" projects (SETI@home, Folding@home, etc.).

We need to take the tools away from the rich and put them into everybody's hands. Intellectual property is a trap. It doesn't even help the artists and musicians and writers, except for a few rich and famous superstars; it keeps them from remixing our modern culture the way creative folks have always done, because some piece of music you want to include in your own is owned by some record company, some character you want to include in your story is owned by some publisher or media conglomerate, some element in your image is a registered trademark of some multinational corporation. AI is just the entire sampling and remixing war coming back, only this time it's multi-billion-dollar companies arguing that their machines should be allowed to do what humans may not. The answer to that should not be "no, your machines can't do that" but "fuck it, let's scrap I.P. altogether, free digital and analogue culture for all, everybody wins."