Mastodawn

Oh, so now scraping data without permission is bad for AI training? 😂 how ironic 😉

Anthropic accuses Alibaba of using thousands of fraudulent accounts to extract Claude AI model capabilities and data. Anthropic urged Congress to penalise the companies behind scraping attacks like this and to ramp up measures to prevent US tech from being stolen. https://www.bbc.com/news/articles/cwyklykn5dwo
How about Anthropic pay first for the stolen books, and all the content out there, that went into its shitty AI?

Anthropic accuses Chinese rival Alibaba of illicitly extracting AI capabilities

The firm alleged that Alibaba used fraudulent accounts to access data from its Claude AI model.

lnicola 2d ago

@nixCraft @alexmu Model distillation is not scraping, and your source understandably doesn't even mention the latter. Trying to equate them is bad reporting and muddies the waters.

Scraped content can be protected by copyright, while LLM outputs aren't. Anthropic presents distillation as an "attack" because they're worried they'll fall behind the Chinese. This is in line with their previous policy (calling their model a "cyberweapon" and so on). They're fishing for regulation and protectionism.

Becca

@lnicola piss off tech bro.

stop trying to muddy the waters. back to the root: this is ALL the fruit of a poisoned tree.

no amount of manipulation can undo the stolen origin

@nixCraft @alexmu

Alex M 2d ago

@bweller @lnicola @nixCraft Even if you don't agree with @lnicola, it's ok to keep a civilized tone. Yes, he made a strong statement that he hasn't backed up (that mixing scraping and distilling is bad reporting). I'm still waiting for him to say exactly why the difference matters in this context.

lnicola 2d ago

@alexmu @nixCraft Not sure if you saw my whole thread, but scraping is bad mainly for three reasons: 1. copyright concerns of the website owners, 2. unauthorized access to insecure web apps, 3. CPU and bandwidth consumption, especially with less optimized apps like Forgejo.

1 doesn't apply because Anthropic holds no copyright. 2 is not an issue because it's all authorized. 3 doesn't apply because you can distill a model at very reasonable rate limits.

1/3

lnicola 2d ago

@alexmu @nixCraft In addition, while scraping is easy to define (exhaustively retrieving all known URLs from a server), I've shown that model distillation is indistinguishable from very common tasks like evaluating other models or take-home assignments.

Scraping has clear, direct downsides (resource consumption), while distillation has none and is indistinguishable from a permitted workload. The only thing they have in common is what, making remote calls to a server?

2/3

lnicola 2d ago

@alexmu @nixCraft So if you feel they're the same thing, the onus is on you to explain why. They're different things with different purposes, working mechanisms and downsides.

3/3

Alex M 2d ago

@lnicola @nixCraft While technically (almost) true, the difference doesn't seem relevant in this context. I say "almost" because you seem to imply that scraping inherently breaks copyright laws, which it does not. I point you to Google, if you have any doubts.

Sure, saying scraping when the original source used distilling is sloppy. But that doesn't make it "bad journalism".

lnicola 2d ago

@alexmu @nixCraft I don't really see your point. Anthropic did not mention scraping, so why would you, as a journalist, bring it up instead of using the correct term?

It's like bringing up your neighbour's dog that bit you when reporting on an article about a new cat disease. Yes, cats and dogs can both be pets, but there's no closer relationship.

And if you think web scraping is legal and almost harmless, may I refer you to all the complaints about "AI scrapers"?

Alex M 2d ago

@lnicola @nixCraft There are many ways of scraping. Just because LLM companies have aggressive scrapers that disregard robots.txt and don't throttle requests doesn't mean all scrapers are badly behaved. Generalising from "people complain about LLM scrapers" (rightly so) to "all scraping is bad" is just as sloppy as mixing up scraping with distilling. (I think you may also have implied that all scraping breaks copyright, which is a non sequitur.)

lnicola 2d ago

@alexmu @nixCraft I did not say it's illegal or against copyright. As Cloudflare puts it, the concerns are "content theft" and "degraded site performance" (https://www.cloudflare.com/learning/ai/how-to-prevent-web-scraping/). It also "wastes application resources, skews analytics, compromises user accounts, and forces developers to build and maintain brittle, custom security logic" (https://www.cloudflare.com/products/bot-mitigation/). I think they're valid concerns, regardless of legality.

None of these applies to Anthropic, so can you explain why scraping is relevant at all?

Alex M 2d ago

@lnicola @nixCraft It's not. You made a point of the difference. You could have ignored the sloppy wording. But you chose to have a go at what you perceived as a post biased against LLMs.

lnicola 2d ago

@alexmu @nixCraft Where does bias come into this? If it reported that Anthropic describes usage indistinguishable from legitimate use as an "attack", would it be biased for or against LLMs?