#Apertus is out.

Fully open, multilingual, and privacy-first SotA 8B and 70B LLMs from the SwissAI Initiative, an academic consortium led by @EPFL, ETHZ, and CSCS. Apache 2.0 licensed.

You can:

- Try it here: https://publicai.co/

- Run it locally on HF Transformers: https://huggingface.co/swiss-ai/Apertus-70B-Instruct-2509

- Read the technical report: https://github.com/swiss-ai/apertus-tech-report/blob/main/Apertus_Tech_Report.pdf
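Running the instruct checkpoint locally follows the standard Transformers chat workflow. A minimal sketch (assumptions: the model uses the standard chat template and fits on your hardware; swap in the 8B checkpoint, `swiss-ai/Apertus-8B-Instruct-2509`, for smaller machines):

```python
MODEL_ID = "swiss-ai/Apertus-70B-Instruct-2509"

def build_chat(user_prompt: str) -> list[dict]:
    """Build a chat message list in the standard Transformers chat format."""
    return [{"role": "user", "content": user_prompt}]

def generate_reply(user_prompt: str, max_new_tokens: int = 256) -> str:
    """Load the model and generate a single reply (downloads weights on first call)."""
    # Heavy imports kept local so build_chat stays importable without torch installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")
    inputs = tokenizer.apply_chat_template(
        build_chat(user_prompt),
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    outputs = model.generate(inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)

# Example (needs serious GPU memory for the 70B model):
# print(generate_reply("Give me a brief history of the Swiss Confederation."))
```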

1/🧵

And now, as a member of the Apertus team (Security and Safety Coordinator), here is what we are really proud of:

- Respect for data owners' consent. The pretraining data was sourced from the 100% open FineWeb 2 (https://arxiv.org/abs/2506.20920); we then scanned every source domain's robots.txt and removed the domain from the pretraining dataset if it excluded **ANY** LLM crawler.
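The consent check above can be sketched with the standard library's robots.txt parser. A toy version (assumption: the actual pipeline and its crawler list differ; these user agents are just well-known examples of LLM crawlers):

```python
from urllib.robotparser import RobotFileParser

# Illustrative examples of LLM crawler user agents, not the project's real list.
LLM_CRAWLERS = ["GPTBot", "ClaudeBot", "CCBot", "Google-Extended"]

def excludes_any_llm_crawler(robots_txt: str, url: str = "https://example.com/") -> bool:
    """Return True if robots.txt disallows the URL for ANY known LLM crawler."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return any(not parser.can_fetch(agent, url) for agent in LLM_CRAWLERS)

# A domain that blocks even one LLM crawler would be dropped from the dataset:
robots = "User-agent: GPTBot\nDisallow: /\n"
print(excludes_any_llm_crawler(robots))  # → True
```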

- Respect for private data. We additionally filtered the data for PII (personally identifiable information) and used training techniques that discourage memorization.
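To make the PII-filtering idea concrete, here is a deliberately toy sketch (assumption: the real pipeline is far more sophisticated; this only illustrates regex-based scrubbing of emails and phone-like numbers):

```python
import re

# Simple illustrative patterns, not production-grade PII detectors.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d ()-]{7,}\d")

def scrub_pii(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL.sub("<EMAIL>", text)
    return PHONE.sub("<PHONE>", text)

print(scrub_pii("Contact jane.doe@example.com or +41 44 123 45 67."))
# → Contact <EMAIL> or <PHONE>.
```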

2/🧵

- Truly multilingual performance evaluation and reporting. In addition to the standard benchmarks, we included a diverse set of language-specific benchmarks.

- In-depth safety and security testing. In addition to standard benchmarks, we ran spot evaluations for specific issues that could have blocked the model release, and we evaluated both the model's suitability for generating disinformation and the detectability of its outputs, to make sure it cannot easily be misused for disinformation.

3/🧵

So do LLMs work without stolen data?

You bet they do!

Apertus actually does way better than most other models on international cultural knowledge!

4/🧵

So stop ripping the web apart with your crawlers and git gud at using the data you have.

5/🧵

So is there anything Apertus isn't good at?

Hint: This image is supposed to be a pelican riding a bicycle (kudos to @simon for a pretty intricate model code-writing test).

6/🧵

A -Thinking version, trained with RL to improve code generation and answer quality through scaffolded reasoning ("thinking"), is in the pipeline, as are future model iterations!

Want to help out with it?

Report any model-output issues you find on the official generation-bugs repo: https://github.com/swiss-ai/Apertus-Generation-Issues-Reports

7/🧵

Finally, if you want to run the model locally on ollama / llama.cpp: we are working on it...

The GGUF files are ready, but our model uses the xIELU activation function, which is not yet implemented in llama.cpp or ollama; we are implementing it right now.

8/🧵