So why aren't the big AI companies more transparent about what's in the data that they use to train their models?
One reason, experts say, is because they're afraid they'd get in trouble if people found out. https://www.washingtonpost.com/technology/interactive/2023/ai-chatbot-learning/
@willoremus @kevinschaul @nitashatiku It’s really very good; subject of repeated discussion in the ML track at the open source lawyer conference I’m at today.
The challenge I keep coming back to: this media scrutiny is critical to functioning societal oversight of this new tech, but such scrutiny incentivizes other companies to stop disclosing their data sets. That’s a bad spiral to be in.
@willoremus Remember when Google scanned and digitized All the World’s Books (approx.) without asking authors for permission or offering any compensation? Many people thought that was just fine. They said it would only give authors more “visibility.” They said it was “fair use.”
I’m still bitter about it. This was one of Google’s principal motivations, and few understood that at the time.
@ricci @willoremus Not according to their EULAs.
The biggest shame about this stuff is that people are freaking out about corporate access to their content even though the technology in question could actually be useful, and even though this sort of scraping and automatic appropriation has been standard practice since social media became the norm on the Internet.
Our current IP regulation is broken, but not because of this.