StepFun 3.5 Flash is #1 cost-effective model for OpenClaw tasks (300 battles)

https://app.uniclaw.ai/arena?tab=costEffectiveness&via=hn

OpenClaw Arena | UniClaw

A public benchmark for evaluating whether AI agents can complete real workflows. Compare model performance and cost-effectiveness on real agent tasks.

According to openrouter.ai, StepFun 3.5 Flash looks like the most popular model at 3.5T tokens, vs GLM 5 Turbo at 2.5T tokens. Claude Sonnet is in 5th place with 1.05T tokens. Which isn't super surprising, as StepFun is roughly 5% the price of Sonnet.
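To put that ratio in perspective, here's a rough cost sketch. The per-million-token prices below are hypothetical; only the ~5% ratio and the 3.5T token volume come from the thread.

```python
# Hypothetical USD price per 1M input tokens (only the ~5% ratio is from the thread).
SONNET_PRICE_PER_MTOK = 3.00
STEPFUN_PRICE_PER_MTOK = SONNET_PRICE_PER_MTOK * 0.05

def run_cost(tokens: float, price_per_mtok: float) -> float:
    """Cost in USD for a given number of tokens at a per-1M-token price."""
    return tokens / 1_000_000 * price_per_mtok

# At these assumed prices, pushing 3.5T tokens through StepFun costs
# about the same as pushing only 5% of that volume through Sonnet.
stepfun_total = run_cost(3.5e12, STEPFUN_PRICE_PER_MTOK)
sonnet_equiv = run_cost(3.5e12 * 0.05, SONNET_PRICE_PER_MTOK)
print(stepfun_total, sonnet_equiv)
```

In other words, at a 20x price gap, heavy usage of the cheap model is almost free by comparison, which is one plausible driver of the token-count rankings.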

https://openrouter.ai/apps?url=https%3A%2F%2Fopenclaw.ai%2F

The really surprising part to me is that, despite being the cheapest model on the board, StepFun often scores high on pure performance. Other models in the same price range (e.g. Kimi) fail to do that.

> the most popular model

It was free for a long time. That usually skews the statistics. It was the same with grok-code-fast-1.

Exactly. When I read the headline I thought: "Ofc it is, it's free."
I should have clarified I didn't use the free version...
It looks like Unsloth had trouble generating their dynamic quantized versions of this model, deleted the broken files, then never published an update.

StepFun is an interesting model.

If you haven’t heard of it yet, there’s some good discussion here:
https://news.ycombinator.com/item?id=47069179

Step 3.5 Flash – Open-source foundation model, supports deep reasoning at speed | Hacker News

Since that discussion, they released the base model and a midtrain checkpoint:

- https://huggingface.co/stepfun-ai/Step-3.5-Flash-Base

- https://huggingface.co/stepfun-ai/Step-3.5-Flash-Base-Midtra...

I'm not aware of other AI labs that have released base checkpoints for models in this size class. Qwen released some base models for 3.5, but the biggest one is the 35B checkpoint.

They also released the entire training pipeline:

- https://huggingface.co/datasets/stepfun-ai/Step-3.5-Flash-SF...

- https://github.com/stepfun-ai/SteptronOss

Thanks for the info. Before running the bench I had only tried it on arena.ai-type tasks, and it was not impressive. I didn't expect it to be that good at agentic tasks.
Yet when I tried it, it performed abysmally compared to Gemini 2.5 Flash.
what kind of tasks did you try?
Tried the free version on OpenRouter with pi.dev, and it's competent at tool calling. Creative writing is "good enough" for me (more natural, Claude-level, not robotic GPT slop), but it makes some grave mistakes (it once emitted Hanzi in the output, plus typos in words), so it may be fine for "simple" agentic workflows, but it's definitely not made for programming or long-form writing.
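For anyone who wants to poke at its tool calling themselves: OpenRouter exposes an OpenAI-compatible chat completions endpoint, so a request looks like the sketch below. The model slug and the `get_time` tool definition here are illustrative assumptions, not from the thread.

```python
import json

# Hypothetical model slug on OpenRouter; check the site for the real one.
MODEL = "stepfun-ai/step-3.5-flash:free"

# A minimal tool definition in the OpenAI-compatible "tools" format.
tools = [{
    "type": "function",
    "function": {
        "name": "get_time",  # illustrative tool, not a real API
        "description": "Return the current time in a given IANA timezone.",
        "parameters": {
            "type": "object",
            "properties": {"tz": {"type": "string"}},
            "required": ["tz"],
        },
    },
}]

payload = {
    "model": MODEL,
    "messages": [{"role": "user", "content": "What time is it in Tokyo?"}],
    "tools": tools,
}

# To actually send it (requires an OPENROUTER_API_KEY in the environment):
# import os, urllib.request
# req = urllib.request.Request(
#     "https://openrouter.ai/api/v1/chat/completions",
#     data=json.dumps(payload).encode(),
#     headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}",
#              "Content-Type": "application/json"},
# )
# print(json.load(urllib.request.urlopen(req)))

print(json.dumps(payload, indent=2))
```

If the model is good at tool calling, the response should come back with a `tool_calls` entry naming `get_time` rather than a free-text answer.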