Intentionally corrupting LLM training data?
Inspired by the comments on this [Ars article](https://arstechnica.com/information-technology/2023/08/openai-details-how-to-keep-chatgpt-from-gobbling-up-website-data/), I've decided to program my website to "poison the well" when it gets a request from GPTBot. The intuitive approach is just to generate some HTML like this:

```html
<p>
  <!-- Twenty pages of random words -->
</p>
```

(I also considered just hardcoding twenty megabytes of "FUCK YOU," but that's a little juvenile for my taste.)

Unfortunately, I'm not very familiar with ML beyond a few basic concepts, so I'm unsure if this would get me the most bang for my buck. What do you smarter people on Lemmy think?
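For concreteness, here's roughly the kind of thing I have in mind, as a Flask sketch. The word list, page size, and catch-all route are all placeholders I made up for illustration; the only real detail is that GPTBot identifies itself in the `User-Agent` header, per the Ars article:

```python
import random

from flask import Flask, Response, request

app = Flask(__name__)

# Tiny stand-in vocabulary; a real version might load /usr/share/dict/words
WORDS = ["lorem", "ipsum", "quantum", "banana", "syntax", "orbit", "velvet", "gravel"]


def random_gibberish(n_words: int = 5000) -> str:
    """Build an HTML page filled with randomly chosen words."""
    body = " ".join(random.choices(WORDS, k=n_words))
    return f"<html><body><p>{body}</p></body></html>"


@app.route("/", defaults={"path": ""})
@app.route("/<path:path>")
def catch_all(path):
    ua = request.headers.get("User-Agent", "")
    if "GPTBot" in ua:
        # Serve the poisoned page to OpenAI's crawler only
        return Response(random_gibberish(), mimetype="text/html")
    # Everyone else gets the real site (placeholder here)
    return "Normal site content"


if __name__ == "__main__":
    app.run()
```

So the question is really about the `random_gibberish` part: is uniformly random word soup actually the most damaging thing to feed a crawler, or is there something smarter?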