Oh, so now scraping data without permission is bad for AI training? 😂 how ironic 😉

Anthropic accuses Alibaba of using thousands of fraudulent accounts to extract Claude AI model capabilities and data. Anthropic urged Congress to penalise the companies behind scraping attacks like this and to ramp up measures to prevent US tech from being stolen. https://www.bbc.com/news/articles/cwyklykn5dwo
How about Anthropic first pays for the stolen books and all the content out there that went into its shitty AI?

Anthropic: We scraped the entire internet, art, books, music and called it fair use. We argued that that's how humans learn and we won't pay anyone anything.

Alibaba: Cool, same.

Anthropic: Wait, no, not like that.

@nixCraft Indeed, Alibaba at least paid for tokens, which I suppose is a fairly nice sum.

@nixCraft In my experience it's not enough to merely #block said #AIscrapers; it's literally necessary to fight back by sending them *malicious data* with *EVERY REQUEST* whilst rate limiting them to a crawl to combat their literal DDoS attacks!

(max. 1 connection at 75 bit/s per IP and request, max. 1 request per IP, 120s crawl-delay enforced, redirecting them to EICAR "Malware" every time they violate said limits, then blackholing at upstream / IX level)
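
A minimal sketch of the per-IP crawl-delay part of that policy, assuming a 120 s minimum gap between requests from the same IP. The class name `CrawlDelayLimiter` and the `"allow"` / `"penalize"` return values are hypothetical; what "penalize" means (tarpit, EICAR redirect, blackhole) is left to the caller, and bandwidth throttling is not modelled here.

```python
import time


class CrawlDelayLimiter:
    """Tracks the last allowed request per IP and flags early repeats."""

    def __init__(self, min_interval=120.0, clock=time.monotonic):
        self.min_interval = min_interval  # seconds between allowed requests
        self.clock = clock                # injectable so it can be tested
        self.last_allowed = {}            # ip -> timestamp of last allowed request

    def check(self, ip):
        """Return "allow" if the IP waited long enough, else "penalize"."""
        now = self.clock()
        last = self.last_allowed.get(ip)
        if last is not None and now - last < self.min_interval:
            # Violation: the timestamp is left unchanged, so the client only
            # has to wait out the delay measured from its last allowed request.
            return "penalize"
        self.last_allowed[ip] = now
        return "allow"
```

In a real deployment this state would live in the reverse proxy or firewall rather than in application code; the sketch only shows the decision logic.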

@[email protected] @[email protected] you can't enforce that policy.
you don't know which IP is a bot and which is a user.
and they make only one request per hour... they just have millions of IPs (proxied through an Android app or something like that).
but for cloud IPs, yes, it's a good policy.

@oldsysops @nixCraft WATCH ME!
Cuz this is my daily doing!
- First of all, I'd block all non-Consumer Networks (i.e. entire ASNs associated with said #AIslop corpos - like #aws & #Azure!)
- Then I do check for #UserAgent and block known #bots like #ByteSpider and dynamically block entire IP allocations (minimum /24) as they get used.
- Whenever I can, I geoblock places that are notorious for #Cybercrime (Russia, "P.R." China, USA)

https://www.youtube.com/watch?v=Hi5sd3WEh0c
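
The user-agent and /24 steps listed above could be sketched roughly as follows. This is an illustration under assumptions, not the poster's actual setup: `ScraperBlocklist` and `BLOCKED_AGENT_MARKERS` are hypothetical names, only the `ByteSpider` marker comes from the thread, and the /24 default only makes sense for IPv4. ASN-level and geo blocking need external data and are not covered.

```python
import ipaddress

# Lower-case substrings of known bot user agents; only "bytespider" is taken
# from the thread, anything else would have to be added by the operator.
BLOCKED_AGENT_MARKERS = ("bytespider",)


class ScraperBlocklist:
    """Blocks a whole IPv4 allocation once one address in it shows a bot UA."""

    def __init__(self, prefix_len=24):
        self.prefix_len = prefix_len
        self.blocked_networks = set()

    def _network_of(self, ip):
        # strict=False masks the host bits, e.g. 203.0.113.7 -> 203.0.113.0/24
        return ipaddress.ip_network(f"{ip}/{self.prefix_len}", strict=False)

    def handle(self, ip, user_agent):
        """Return True if the request may proceed, False if it is blocked."""
        network = self._network_of(ip)
        if network in self.blocked_networks:
            return False
        if any(marker in user_agent.lower() for marker in BLOCKED_AGENT_MARKERS):
            # Dynamically block the entire surrounding allocation.
            self.blocked_networks.add(network)
            return False
        return True
```

As the earlier reply points out, this kind of blocking also hits any legitimate users who share a blocked range, so the prefix length is a trade-off.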

@oldsysops @nixCraft I already worked with others to get #KiwiFarms banned off #ClownFlare, and if I can get a known #RogueISP (that AFAICS only Cybercriminals use!) to fire a client, then I can get #ISP|s to go after "#ResidentialProxy" setups by #AIslop firms for violating their #ToS and creating a shitton of expensive traffic…
- The best way to get corpos to move their asses, is to make something their [expensive!] problem!

Don't forget to automate abuse reporting!

#CloudFlare

@oldsysops @kkarhan @nixCraft
Fine sers,
May I interest you in #iocaine?

I'm feeding slop all the damn day to AI. It's cathartic.

@nixCraft Hah, crooks stealing from crooks is not a crime.

Slop. slop. sloppity slop!

@nixCraft
Ahahahah schadenfreude

@nixCraft

I wonder if Anthropic would be willing to trade, say, 30-40% of its shares to the US government in return for that protection?

Not that the US government should accept a deal that shitty, but just as a way to know if Anthropic really believes that it's in danger from Alibaba.

I think the pleading to Congress for help is performative bullshit to divert the attention of its soon-to-be-bankrupt-and-highly-litigious shareholders.

@nixCraft wasn't Anthropic trying to argue they should _destroy_ books after scanning them?

"We scrape, so you don't have to."

— EvilCorp

@nixCraft L'arroseur arrosé... ("the sprinkler sprinkled", i.e. the biter bit)

@nixCraft @alexmu Model distillation is not scraping, and your source understandably doesn't even mention the latter. Trying to equate them is bad reporting and muddies the waters.

Scraped content can be protected by copyright, while LLM outputs aren't. Anthropic presents distillation as an "attack" because they're worried they'll fall behind the Chinese. This is in line with their previous policy (calling their model a "cyberweapon" and so on). They're fishing for regulation and protectionism.

@nixCraft @alexmu So imagine you're a tech news tabloid using Claude to write articles. You want to save money so you ask Claude for 1000 topics from the past, then you ask GLM, DeepSeek and Kimi to write articles on them in your preferred style, and use Claude to rate which one is best.

To Anthropic, this is indistinguishable from distillation, which they've just called an attack.

@lnicola @nixCraft In what way is the difference meaningful?
@alexmu @nixCraft It doesn't involve scraping copyrighted content (like downloading Harry Potter and training on it). It harms no one, because you're paying for your usage and you respect the rate limits (unlike hacking around OAuth to use your subscription with another harness, which is against their terms of use), so it's not an attack. And they can't even tell if you're doing something "wrong" (distilling a model) or something harmless like evaluating other models.

@lnicola piss off tech bro.

stop trying to muddy the waters. back to the root: this is ALL the fruit of a poisoned tree.

no amount of manipulation can undo the stolen origin

@nixCraft @alexmu

@bweller @lnicola @nixCraft Even if you don't agree with @lnicola, it's ok to keep a civilized tone. Yes, he made a strong statement that he hasn't backed (that mixing scraping and distilling is bad reporting). I'm still waiting for him to say exactly why the difference matters in this context.

@alexmu @nixCraft Not sure if you saw my whole thread, but scraping is bad mainly for three reasons: 1. copyright concerns of the website owners, 2. unauthorized access to insecure web apps, 3. CPU and bandwidth consumption issues, especially with less optimized apps like Forgejo.

1 doesn't apply because Anthropic holds no copyright. 2 is not an issue because it's all authorized. 3 doesn't apply because you can distill a model at very reasonable rate limits.

1/3

@alexmu @nixCraft In addition, while scraping is easy to define (exhaustively retrieving all known URLs from a server), I've shown that model distillation is indistinguishable from very common tasks like evaluating other models or take-home assignments.

Scraping has clear, direct downsides (resource consumption), while distillation has none and is indistinguishable from a permitted workload. The only thing they have in common is what, doing remote calls to a server?

2/3

@alexmu @nixCraft So if you feel they're the same thing, the onus is on you to explain why. They're different things with different purposes, working mechanisms and downsides.

3/3

@lnicola @nixCraft While technically (almost) true, the difference doesn't seem relevant in this context. The caveat is because you seem to imply that scraping inherently breaks copyright laws, which it does not. I point you to Google, if you have any doubts.

Sure, saying scraping when the original source used distilling is sloppy. But that doesn't make it "bad journalism".

@alexmu @nixCraft I don't really see your point. Anthropic did not mention scraping, so why would you, as a journalist, bring it up instead of using the correct term?

It's like bringing up your neighbour's dog that bit you when reporting on an article about a new cat disease. Yes, cats and dogs can be pets, but there's no closer relationship.

And if you think web scraping is legal and almost harmless, may I refer you to all the complaints about "AI scrapers"?

@lnicola @nixCraft There are many ways of scraping. Just because LLM companies have aggressive scrapers that disregard robots.txt and don't throttle requests doesn't mean all scrapers are badly behaved. But generalising from "people complain about LLM scrapers" (rightly so) to "all scraping is bad" (I think you may have implied that all scraping breaks copyright as well, which is a non sequitur) is just as sloppy as mixing up scraping with distilling.

@alexmu @nixCraft I did not say it's illegal or against copyright. As Cloudflare puts it, scraping leads to "content theft" and "degraded site performance" (https://www.cloudflare.com/learning/ai/how-to-prevent-web-scraping/). It also "wastes application resources, skews analytics, compromises user accounts, and forces developers to build and maintain brittle, custom security logic" (https://www.cloudflare.com/products/bot-mitigation/). I think they're valid concerns, regardless of legality.

None of these applies to Anthropic, so can you explain why scraping is relevant at all?

@lnicola @nixCraft It's not. You made a point of the difference. You could have ignored the sloppy wording. But you chose to have a go at what you perceived as a post biased against LLMs.
@alexmu @nixCraft Where does bias come into this? If it reported that Anthropic describes usage indistinguishable from legitimate use as an "attack", would it be biased for or against LLMs?
@nixCraft explains why Alibaba's models are way cooler, taking stolen capability from the proprietary hoarders and giving it back to the public for free where it always belonged
@nixCraft Oh no [the world's tiniest violin starts playing]
@nixCraft it's almost like we need copyright laws to protect information
@nixCraft But wtf just burn down everything already!
@nixCraft Well, obviously. Alibaba is trying to kidnap what Anthropic has rightfully stolen!

@nixCraft "Hello, police? I parked my stolen car in my driveway last night, and I went out this morning and someone had stolen it!!!"

Like a drug dealer calling the police to report a burglary.

@nixCraft "pot calls kettle black"

@nixCraft

The irony of it indeed…

You cannot touch my stolen content because it's mine!

@nixCraft Alibaba and the forty thieves 😁
@nixCraft BBC missed an opportunity with that title. "Anthropic, developed by pirating data from writers, complains when other companies collect data on its functionality."

@nixCraft same ol shit

Rules apply to thee not me

@nixCraft Meanwhile I log on to the internet this morning and most of the sites I visit are either down, or making me pass cloudflare checks because of AI scraping, with Anthropic being one of the worst culprits for that.

The absolute nerve of these people.

@nixCraft Anthropic is doing a full Disney here.
@nixCraft Yeah, I was thinking the same. A thief is calling out another thief.
@nixCraft

"How dare you try to steal what I've rightly stolen!"

Though this sounds like it's less about the content Anthropic scraped than the models they used to process that scraped data.

@nixCraft

'we steal everything, how dare they steal from us...HOW DARE THEY'

*head desk*

@nixCraft I'm still waiting on my check from them. Well, me and about 18,000 other authors.
@nixCraft oh no. Someone used their theft machine to steal from my theft machine