TL;DR: #AI & #ML data aggregators that ingest data unlawfully are breaking the law, whether or not the output of their systems is considered fair use. Let's talk about the underlying data, not just the tools!

Most big AI & ML data aggregators aren't sharing their data openly in compliance with #FOSS or open-content licenses, or adhering to the #TOS of the data they scrape from less litigious content creators. They are trying to avoid legal liability by signing "content deals" with big media companies, but this just papers over the fact that commercializing data in violation of copyright holders' license terms, terms of service, or contracts of adhesion is a criminal act. Hiding that data inside proprietary databases and models is simply one more way these companies are attempting to dodge lawsuits and liability.

Copyright theft hurts independent writers, bloggers, educators, and journalists more than it hurts media moguls. Signing content licensing agreements with the likes of GitHub, Hachette Book Group, or the New York Times is all well and good, but it typically neither compensates the actual content producers nor brings the unlawful aggregation into compliance. These sorts of deals insult every #opensource and #opencontent creator. #DRM isn't the answer, and neither is paying off big media. #OpenAI and others should pay for that data directly, make the data publicly available under the share-alike clauses common to most open source/content licenses, or exclude it altogether.

Many years ago, Canada took an interesting approach that wasn't based on chasing individual "pirates." Instead, it levied a tax on storage media like recordable CDs. The main flaw in that approach was that the proceeds still mostly went to large companies, which could collect meaningful amounts from the levy, rather than to small, independent artists. Nevertheless, it was (and is) a better model for society & the creative commons than paying "protection money" to big media conglomerates while continuing to build for-profit business models that violate the copyrights and licensing terms of anyone who isn't deemed a significant legal threat.

As a society, we can and must do better. If our copyright laws (including commercial software licenses and terms of service) are outdated and no longer serve society, then they should be updated or fixed. However, deliberate and open theft should never be permitted to become business-as-usual. Whether or not the output of such systems is sufficiently transformative to be labeled theft or plagiarism, almost all of today's commercial options rely on very large data sets filled with material that belongs to others who weren't compensated for their work, and most copyright owners lack the visibility to even determine whether their work was unlawfully used.

@todd_a_jacobs The institute addressed this issue publicly during the last round of hearings by the US Copyright Office, but the problem continues to grow. The #privatesector in the USA has not addressed it sufficiently since then, while the #EU has begun setting a stronger agenda for #AIgovernance.
The Artificial Intelligence and Data Act (AIDA) – Companion document


@loleg This is interesting. I haven’t had the opportunity yet to think through all the ramifications, but I’m glad that some countries are attempting the difficult balancing act here. It matters less whether they succeed than that they try.

I’m against neutering AI systems in a fruitless attempt to make them harmless, but I also think risk appetite is always a balancing act. The real failure is when the only risks are to others. That’s the legal, social, and financial path of highest profit, but it’s also not genuine capitalism.

Real capitalism requires people to risk loss in order to gain, but currently the technology industry in the US offloads all the risk to consumers and taxpayers. Every company is now “too big to fail,” but individual consumers, artists, publishers, et al. don’t get bailed out or spared systemic risk.

Collectively, we can do better. I’m interested to see which of these many initiatives do better by doing things better.

@todd_a_jacobs after a brief conversation, here are 2 cents from @cohere about this:

Canada's AI and Data Act (AIDA) presents a unique opportunity for the civic tech community to shape a responsible and profitable AI ecosystem. By embracing a proactive regulatory approach, AIDA aims to mitigate systemic risks associated with AI, in contrast to the US model, where risks often burden consumers and taxpayers.

Civil society leaders can play a pivotal role in this landscape by advocating for and implementing the following actions, developing innovative solutions that align with Canada's collective-responsibility ethos:

- promoting transparent and ethical AI practices that build public trust
- engaging with policymakers to ensure regulations are adaptive and industry-friendly
- educating citizens about AI's potential and risks
- fostering collaborations between tech companies, academics, and community groups

@todd_a_jacobs I've reformatted for readability - my conversation is here: https://coral.cohere.com/share/359bc225-5b30-419a-97df-e4aa153179d1

@todd_a_jacobs @loleg The takeaway is that laws don't matter and you're a sucker if you follow them.

After "it's not technically illegal if we pretend it is an app" like Uber, Airbnb or Crypto we now have the "it's not technically illegal if we pretend no one can really know what went into it" AI.

The lesson is to brazenly ignore the law with some kind of excuse, create enough shareholder value that it would be inconvenient to penalize you, and enjoy the money made on the backs of fair people.

@todd_a_jacobs I agree in all points but please don't call it copyright theft, it's *infringement*. "Copyright theft" is a propaganda term pushed by the RIAA etc.

@Zarkonnen That’s a fair point. I’m deliberately using the term the way many people outside of IP law think of it, precisely because it has been used this way by large corporations.

It seems hypocritical of big tech to call it “theft” when individuals do it, but only “infringement” when they do it themselves. While that is the correct legal term, it sounds far too clean and clinical for the deliberate pillaging of the commons and of private individuals, describing the act in a far less accusatory way than would be the case if the roles were reversed.

If a private individual helped themselves to Meta’s unlicensed source code, would Meta’s press release call it infringement, fair use, or a right to repair or make backups? Or would they just call it “theft”?

This is the fundamental quandary of accepting asymmetrical terms of engagement. Why should a corporation—legally a person under US law—be afforded more gray area or benefit of the doubt than an actual human being? That’s the real ethical and legal question. Allowing the infringing party to frame it otherwise is a strategically unsound position from which to argue an injustice.