StepFun 3.5 Flash is #1 cost-effective model for OpenClaw tasks (300 battles)

https://app.uniclaw.ai/arena?tab=costEffectiveness&via=hn

OpenClaw Arena | UniClaw

A public benchmark for evaluating whether AI agents can complete real workflows. Compare model performance and cost-effectiveness on real agent tasks.

According to openrouter.ai, StepFun 3.5 Flash looks like the most popular model at 3.5T tokens, vs GLM 5 Turbo at 2.5T tokens. Claude Sonnet is in 5th place with 1.05T tokens. Which isn't super surprising, as StepFun is roughly 5% the price of Sonnet.
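To put that ratio in perspective, here's a rough cost sketch. The per-million-token prices below are hypothetical; only the ~5% ratio and the 3.5T token volume come from the thread.

```python
# Hypothetical USD price per 1M input tokens (only the ~5% ratio is from the thread).
SONNET_PRICE_PER_MTOK = 3.00
STEPFUN_PRICE_PER_MTOK = SONNET_PRICE_PER_MTOK * 0.05

def run_cost(tokens: float, price_per_mtok: float) -> float:
    """Cost in USD for a given number of tokens at a per-1M-token price."""
    return tokens / 1_000_000 * price_per_mtok

# At these assumed prices, pushing 3.5T tokens through StepFun costs
# about the same as pushing only 5% of that volume through Sonnet.
stepfun_total = run_cost(3.5e12, STEPFUN_PRICE_PER_MTOK)
sonnet_equiv = run_cost(3.5e12 * 0.05, SONNET_PRICE_PER_MTOK)
print(stepfun_total, sonnet_equiv)
```

In other words, at a 20x price gap, heavy usage of the cheap model is almost free by comparison, which is one plausible driver of the token-count rankings.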

https://openrouter.ai/apps?url=https%3A%2F%2Fopenclaw.ai%2F

The really surprising part to me is that, despite being the cheapest model on the board, StepFun often scores high on pure performance. Other models in the same price range (e.g. Kimi) fail to do that.

> the most popular model

It was free for a long time. That usually skews the statistics. It was the same with grok-code-fast-1.

Exactly. When I read the headline I thought: "Ofc it is, it's free."
I should have clarified I didn't use the free version...
It looks like Unsloth had trouble generating their dynamic quantized versions of this model, deleted the broken files, then never published an update.

StepFun is an interesting model.

If you haven’t heard of it yet, there’s some good discussion here:
https://news.ycombinator.com/item?id=47069179

Step 3.5 Flash – Open-source foundation model, supports deep reasoning at speed | Hacker News

Since that discussion, they released the base model and a midtrain checkpoint:

- https://huggingface.co/stepfun-ai/Step-3.5-Flash-Base

- https://huggingface.co/stepfun-ai/Step-3.5-Flash-Base-Midtra...

I'm not aware of other AI labs that have released base checkpoints for models in this size class. Qwen released some base models for 3.5, but the biggest one is the 35B checkpoint.

They also released the entire training pipeline:

- https://huggingface.co/datasets/stepfun-ai/Step-3.5-Flash-SF...

- https://github.com/stepfun-ai/SteptronOss

Thanks for the info. Before running the bench I had only tried it on arena.ai-type tasks, and it was not impressive. I didn't expect it to be that good at agentic tasks.
Yet when I tried it, it performed abysmally compared to Gemini 2.5 Flash.
what kind of tasks did you try?
Tried the free version on OpenRouter with pi.dev, and it's competent at tool calling. Creative writing is "good enough" for me (more natural, Claude-level, not robotic GPT slop), but it makes some grave mistakes (it once emitted Hanzi in the output, plus typos in words), so it may be fine for "simple" agentic workflows, but it's definitely not made for programming or long-form writing.
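For anyone who wants to poke at its tool calling themselves: OpenRouter exposes an OpenAI-compatible chat completions endpoint, so a request looks like the sketch below. The model slug and the `get_time` tool definition here are illustrative assumptions, not from the thread.

```python
import json

# Hypothetical model slug on OpenRouter; check the site for the real one.
MODEL = "stepfun-ai/step-3.5-flash:free"

# A minimal tool definition in the OpenAI-compatible "tools" format.
tools = [{
    "type": "function",
    "function": {
        "name": "get_time",  # illustrative tool, not a real API
        "description": "Return the current time in a given IANA timezone.",
        "parameters": {
            "type": "object",
            "properties": {"tz": {"type": "string"}},
            "required": ["tz"],
        },
    },
}]

payload = {
    "model": MODEL,
    "messages": [{"role": "user", "content": "What time is it in Tokyo?"}],
    "tools": tools,
}

# To actually send it (requires an OPENROUTER_API_KEY in the environment):
# import os, urllib.request
# req = urllib.request.Request(
#     "https://openrouter.ai/api/v1/chat/completions",
#     data=json.dumps(payload).encode(),
#     headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}",
#              "Content-Type": "application/json"},
# )
# print(json.load(urllib.request.urlopen(req)))

print(json.dumps(payload, indent=2))
```

If the model is good at tool calling, the response should come back with a `tool_calls` entry naming `get_time` rather than a free-text answer.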