Mastodawn

Crista 💬 Nov 8, 2022

Will be following this case closely, for many reasons:
https://www.theverge.com/2022/11/8/23446821/microsoft-openai-github-copilot-class-action-lawsuit-ai-copyright-violation-training-data

The lawsuit against Microsoft, GitHub and OpenAI that could change the rules of AI copyright

Microsoft, GitHub, and OpenAI are being sued for allegedly violating copyright law in the creation of GitHub Copilot — an AI coding assistant trained on open-source code. In an interview with The Verge, the lawyers filing the suit explain their motivations and the future of AI copyright.

The Verge

Jed Brown Nov 8, 2022

@crista I'd like to see this discussion center the concept of plagiarism.

If you give a human source material and they turn around and write a paragraph that's near-verbatim from the source material, that's plagiarism. Same for code. And the human can even *understand* what the computer has no concept of.

They could have followed clean-room reimplementation practices, with separate systems observing source material and generating output. But despite the AI mythology, that's too hard.

Crista 💬 Nov 8, 2022

@jedbrown It is a fascinating problem! When it comes to parroting open source code (which Copilot does), the least they could have done was to search for the original source and attach both the source and the license.

Jed Brown Nov 8, 2022

@crista Yes, but admitting that would undermine the AI hype, risking an informed public and healthy regulatory posture.

I'd have expected them to swap synonyms and apply light control structure isomorphisms so it's not so obviously parroted. MOSS already exists to test how much obfuscation is enough. I'd love to hear an internal account about why that wasn't done; it would have made legal challenges much less compelling.
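
To illustrate why swapping identifier names alone is weak obfuscation, here is a minimal sketch of token-based similarity checking in the general spirit of tools like MOSS. It is not MOSS's actual algorithm or API; the function names, the `ID` placeholder, and the choice of `k=4` are assumptions made for illustration only.

```python
import io
import keyword
import tokenize


def normalized_tokens(source: str) -> list[str]:
    """Tokenize Python source, mapping every non-keyword identifier to 'ID'."""
    skipped = (tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
               tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER)
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in skipped:
            continue  # layout and comments carry no structural signal here
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            tokens.append("ID")  # renaming a variable changes nothing
        else:
            tokens.append(tok.string)
    return tokens


def kgrams(tokens: list[str], k: int = 4) -> set[tuple[str, ...]]:
    """Return the set of all length-k windows over the token stream."""
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def similarity(a: str, b: str, k: int = 4) -> float:
    """Jaccard similarity of the two sources' normalized k-gram sets."""
    ga, gb = kgrams(normalized_tokens(a), k), kgrams(normalized_tokens(b), k)
    if not ga and not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)


original = "def total(values):\n    acc = 0\n    for v in values:\n        acc += v\n    return acc\n"
renamed = "def sum_all(items):\n    s = 0\n    for x in items:\n        s += x\n    return s\n"
unrelated = "def greet(name):\n    print('hello', name)\n"

print(similarity(original, renamed))    # 1.0: renaming is invisible after normalization
print(similarity(original, unrelated))
```

Under this sketch, a renamed copy scores identically to a verbatim one, so obfuscation would have to change the token structure itself (for example by rewriting control flow) to lower the score.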

Crista 💬 Nov 8, 2022

@jedbrown Postprocessing for source obfuscation would be equally embarrassing, not to mention upsetting open source devs. Same rules as those I use in my courses: copy-paste all you like; just don't omit the source, pay attention to the license, and (for students) show understanding of the code by explaining it to a TA. The AI hype is in need of a snap to reality, and this lawsuit may very well be it.

danny "disco" mcClanahan

@crista @jedbrown there are known examples of source obfuscation in Copilot, which i have provided, along with technical background, to the litigation team

Jed Brown Nov 8, 2022

@hipsterelectron @crista Oh, fascinating. I'm curious how you demonstrated that (maybe a conversation for a different modality), but that would mean they do obfuscate, only poorly. (Tim Davis' examples are so verbatim, I figured they must skip it.)

danny "disco" mcClanahan Nov 8, 2022

@crista @jedbrown well, actually license obfuscation, which is separate from but related to this conversation, although more aligned with the arguments currently made in the suit

danny "disco" mcClanahan Nov 8, 2022

@crista @jedbrown sorry for being imprecise. in general i think attacking license obfuscation, where licensing is directly and obviously attached to the source code, is a stronger argument than debating the copyrightability of individual source snippets, and i'm very glad they chose that legal approach. but i do agree that source obfuscation is a large part of the human cost of this tool