American AI startup Arcee has released its open-source reasoning model Trinity-Large-Thinking. It scored 91.9 on the Agent capability benchmark PinchBench, trailing only Opus 4.6’s 93.3, and it posted the top score among all compared models on the Tau2-Airline Agent task benchmark at 88.0. The model uses a 400B sparse mixture-of-experts architecture, and its API pricing of $0.90 per million output tokens is about 96% cheaper than Opus 4.6. The weights are available for download under the Apache 2.0 license. (Compiled by 動區動趨)
(Backgrounder: Decoding OpenRouter’s 100-trillion-token research report, covering what humans actually use AI for, the rise of Chinese models, and the secret behind user retention)
(Additional context: Claude Opus 4.6 is here, writing compilers, making PowerPoint decks, casually uncovering 500 zero-day vulnerabilities, and eyeing every part of your job)
Arcee, an American AI startup with fewer than 100 employees, has turned in Agent evaluation scores that nearly match Anthropic’s flagship model, at just 4% of the price.
Arcee has never been a mainstream focus, but its newly released Trinity-Large-Thinking has already pushed into the top tier across multiple Agent benchmarks.
PinchBench, developed by Kilo, is currently an important industry yardstick for measuring a model’s real-world capability in Agent workflows. On this test, Trinity-Large-Thinking scored 91.9 against 93.3 for the reigning leader, Opus 4.6, a gap of just 1.4 points.
It went a step further on Tau2-Airline, a benchmark that simulates real customer-service scenarios, taking the top score of 88.0 among every model in the comparison. This suggests the open-source model genuinely performs at a high level on practical Agent tasks that demand multi-turn dialogue and repeated tool calls.
Meanwhile, Arcee’s API pricing is $0.90 per million output tokens, which the company says is about 96% cheaper than Opus 4.6. For applications where Agents run autonomously for long stretches and burn through tokens continuously, that cost difference may matter more than the gap in benchmark scores.
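To put the pricing claim in concrete terms, here is a quick back-of-the-envelope calculation. Only the $0.90-per-million figure and the roughly 96% discount come from the article; the Opus 4.6 price is merely what those two numbers imply, and the monthly token volume is a purely hypothetical workload:

```python
# Back-of-the-envelope cost comparison implied by the article's figures.
# Stated: $0.90 per million output tokens, "about 96% cheaper than Opus 4.6".
# Everything derived below (Opus price, token volume) is illustrative only.
trinity_per_m = 0.90                              # USD / 1M output tokens (stated)
implied_opus_per_m = trinity_per_m / (1 - 0.96)   # ~= $22.50/M if the 96% claim holds

monthly_tokens_m = 500  # HYPOTHETICAL: an agent emitting 500M output tokens a month
print(f"Trinity-Large-Thinking: ${trinity_per_m * monthly_tokens_m:,.0f}/month")      # ~$450
print(f"Opus 4.6 (implied):     ${implied_opus_per_m * monthly_tokens_m:,.0f}/month")  # ~$11,250
```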
According to Arcee AI’s official blog, the key to this cost-performance lies in an architectural choice. Trinity-Large-Thinking uses a sparse MoE (mixture-of-experts) design: it contains 256 expert modules but activates only 4 of them for each token. That means the massive 400B model carries only about a 13B-parameter compute load during actual inference, making it roughly 2 to 3 times as efficient to run as a dense model of comparable scale.
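To illustrate the routing idea, here is a minimal NumPy sketch of token-choice top-k routing using the article’s numbers (256 experts, 4 active per token). The layer sizes, router, and expert shapes are illustrative assumptions, not Arcee’s actual implementation:

```python
# Minimal sketch of sparse MoE top-k routing (token-choice style).
# Expert counts come from the article; everything else is illustrative.
import numpy as np

N_EXPERTS = 256   # total expert modules (per the article)
TOP_K = 4         # experts activated per token (per the article)
D_MODEL = 64      # hidden size, kept tiny for the demo

rng = np.random.default_rng(0)
router_w = rng.standard_normal((D_MODEL, N_EXPERTS)) * 0.02
# Each "expert" here is a single weight matrix; real experts are full FFN blocks.
experts = rng.standard_normal((N_EXPERTS, D_MODEL, D_MODEL)) * 0.02

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route each token to its top-k experts and mix their outputs."""
    logits = x @ router_w                               # (tokens, N_EXPERTS)
    topk = np.argsort(logits, axis=-1)[:, -TOP_K:]      # indices of chosen experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        chosen = logits[t, topk[t]]
        gates = np.exp(chosen - chosen.max())
        gates /= gates.sum()                            # softmax over the k winners only
        for gate, e in zip(gates, topk[t]):
            out[t] += gate * (x[t] @ experts[e])        # only k of 256 experts ever run
    return out

tokens = rng.standard_normal((8, D_MODEL))
y = moe_layer(tokens)
print(f"active expert fraction per token: {TOP_K / N_EXPERTS:.3%}")  # ~1.6%
```

Because only 4 of the 256 expert blocks run for any given token, most of the 400B parameters sit idle on each forward pass; that is the mechanism behind the roughly 13B active-compute figure.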
Compared with the Preview predecessor released at the end of January this year, the biggest upgrade is the addition of a chain-of-thought reasoning stage.
The Preview was only instruction-tuned; the Thinking version “thinks first” before answering, with clear gains in the stability of multi-turn tool calls and in coherence over long contexts. Arcee puts it plainly: this model is built not to fall apart during long-running Agent loops.
Training the base model cost $20 million and took 33 days; post-training for the Thinking version then took another 9 months of refinement.
Arcee CEO Lucas Atkins wrote in the release post: “Getting here took difficult technical work, hard calls… Nobody did that. They kept pushing.”
Of course, specializing in Agents comes with trade-offs. On general reasoning benchmarks, Trinity-Large-Thinking’s results are far less impressive: it scored 76.3 on GPQA-D versus 86.9 for Kimi K2.5 and 89.2 for Opus 4.6, gaps of roughly 11 and 13 points, and its MMLU-Pro score of 83.4 is likewise the lowest among the compared models.
But Arcee doesn’t seem to plan on competing head-on there. Its official line is that “Trinity-Large-Thinking is the strongest open-source model outside China across many dimensions,” and it has said its rivals aren’t Opus or GPT but the Chinese open-source camp: DeepSeek, Kimi, and their peers.
Trinity-Large-Thinking is also listed on OpenRouter and is free to use in OpenClaw for its first 5 days; the earlier Preview will remain free as well.
As for that Preview version: since going live at the end of January, it has processed more than 3.37 trillion tokens on the OpenRouter platform, and OpenClaw’s statistics rank it the #1 open-source model by usage in the United States and #4 globally. For a startup of this size, that level of adoption is proof the model is cheap and easy to use, and that real market demand exists.
Model weights are published on Hugging Face under the Apache 2.0 license, and anyone can download, modify, and deploy them commercially.
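For anyone who wants to try that, below is a hedged sketch using the huggingface_hub client. The repo id is a guess based on the model name, not a confirmed path; check Arcee’s Hugging Face organization page for the actual repository. A full 400B checkpoint is far too large for a casual download, so this example pulls config and tokenizer files only:

```python
# Hedged sketch: fetching files for the open weights from Hugging Face.
# The repo_id is an ASSUMPTION based on the model name, not a confirmed path.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="arcee-ai/Trinity-Large-Thinking",   # hypothetical repo id
    allow_patterns=["*.json", "tokenizer*"],     # skip the multi-hundred-GB weights
)
print(f"Config and tokenizer files downloaded to: {local_dir}")
```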