The whole LLM-as-a-service business model has a fundamental flaw: the cost of operating the data centres is an order of magnitude higher than the profit.

But if models get efficient enough to bring the costs down, then they become efficient enough to run locally. So, either it’s too expensive to operate, or nobody will want to use it as a service because running your own gives you privacy and flexibility.

The fact that investors don't get this is frankly incredible.

#llm #economy

@yogthos That, or they believe that you won't have the same capacity to pirate the data required for the models to be useful.

They get a carve out, you get a parrot that still needs training.

@yoasif I mean large open models already exist thanks to Chinese companies releasing them, and we're past the point where shoving more data into models actually improves anything. The future is going to be in architectural improvements.

@yogthos That seems doubtful, with Jensen Huang claiming we've "reached AGI" - the existing models are going to pay off before we see even more investment into something that is just a glint in the eye of most engineers working in this field - it ain't happening without MUCH more investment than we're already getting.

For the existing models, they tend to become less useful as the "state of the art" changes, so ongoing piracy is required (see the deals with Wikipedia from big tech).

@yogthos Training will always be a bottleneck, and ongoing piracy from China isn't guaranteed, especially as they pose a risk to administration-affiliated businesses.

See the ban on foreign routers from yesterday: https://www.usatoday.com/story/tech/news/2026/03/24/fcc-bans-new-router-imports/89300646007/

FCC bans imports of new foreign-made routers over security fears

China is estimated to control at least 60% of the U.S. market for home routers, boxes that connect computers, phones and smart devices to the internet.

USA TODAY

@yoasif yet training can be done differently from the way we do it now. There is already plenty of research, coming out of China incidentally, on how to train models more intelligently. Here's one example: https://arxiv.org/abs/2512.24873

Chinese companies don't need to do distillation from US models given that Chinese models are already competitive. Having more data isn't the bottleneck at this point. It's how you analyze the data that matters.

Let It Flow: Agentic Crafting on Rock and Roll, Building the ROME Model within an Open Agentic Learning Ecosystem

Agentic crafting requires LLMs to operate in real-world environments over multiple turns by taking actions, observing outcomes, and iteratively refining artifacts. Despite its importance, the open-source community lacks a principled, end-to-end ecosystem to streamline agent development. We introduce the Agentic Learning Ecosystem (ALE), a foundational infrastructure that optimizes the production pipeline for agentic models. ALE consists of three components: ROLL, a post-training framework for weight optimization; ROCK, a sandbox environment manager for trajectory generation; and iFlow CLI, an agent framework for efficient context engineering. We release ROME, an open-source agent grounded by ALE and trained on over one million trajectories. Our approach includes data composition protocols for synthesizing complex behaviors and a novel policy optimization algorithm, Interaction-Perceptive Agentic Policy Optimization (IPA), which assigns credit over semantic interaction chunks rather than individual tokens to improve long-horizon training stability. Empirically, we evaluate ROME within a structured setting and introduce Terminal Bench Pro, a benchmark with improved scale and contamination control. ROME demonstrates strong performance across benchmarks like SWE-bench Verified and Terminal Bench, proving the effectiveness of ALE.

arXiv.org

@yogthos Sorry, how does this not continue to rely on piracy?

> We select approximately one million high-quality GitHub repositories based on criteria such as star counts, fork statistics, and contributor activity. Following Seed-Coder, we concatenate multiple source files within the same repository to form training samples at the project-level code structure, preventing the model from learning only isolated code snippets and promoting understanding of real-world engineering context.

@yoasif first of all, training on open repos on GitHub isn't piracy. But the point you've evidently missed is that they don't need more data than what they already have available. What the paper actually says is that the structure of the network is what matters. Their innovation is in how relationships in the data are expressed within the model.

@yogthos LOL training on open repos isn't piracy?

I'm clearly wasting my time with you.

@yoasif lol clearly you haven't read the MIT license, or even the GPL for that matter, which says that as long as you're making your derivative work open you're compliant. I don't see what you're even trying to say here when talking about open models. You sound confused.

@yogthos Sorry, I'm not going to waste my time trying to explain things to you when I'm avoiding working on an explainer post on the topic of LLM piracy and open source.

PS: I am not confused.

@yoasif do feel free to stop trying to explain things to me that you clearly have little understanding of