emschwartz an hour ago

Very interesting! I especially appreciated the test of running models on the following year's version of the same benchmark, and the point about the per-token discount being negated by models needing more tokens to reach the answer.

Generalization:

> Maybe Chinese models generalise to unseen tasks less well. (For instance, when tested on fresh data, 01’s Yi model fell 8pp (25%) on GSM - the biggest drop amongst all models.)

> We can get a dirty estimate of this by the “shrinkage gap”: look at how a model performs on next year’s iteration of some task, compared to this year’s. If it finished training in 2024, then it can’t have trained on the version released in 2025, so we get to see what they’re like on at least somewhat novel tasks. We’ll use two versions of the same benchmark to keep the difficulty roughly on par. Let’s try AIME:

> Almost all models get worse on this new benchmark, despite 2025 being the same difficulty as 2024 (for humans). But as I expected, Western models drop less: they lost 10% of their performance on the new data, while Chinese models dropped 21%. p = 0.09.

> Averaging across crappy models for the sake of a cultural generalisation doesn’t make sense. Luckily, rerunning the analysis with just the top models gives roughly the same result (9% gap instead of 11%).
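
To make the shrinkage-gap arithmetic concrete, here's a minimal sketch in Python. The per-model scores below are made up (chosen only so the group averages land near the quoted 10% and 21%), and the Welch t-test is just one guess at how a p-value like 0.09 might be computed; the post doesn't say which test it used.

```python
from statistics import mean
from scipy.stats import ttest_ind  # Welch test is an assumption; the post doesn't name its test

# Hypothetical (accuracy on AIME 2024, accuracy on AIME 2025) pairs per model.
# Made-up numbers, picked only so the group means match the quoted 10% / 21%.
western_scores = [(0.80, 0.74), (0.70, 0.62), (0.60, 0.53)]
chinese_scores = [(0.78, 0.60), (0.66, 0.53), (0.55, 0.44)]

def relative_drop(old: float, new: float) -> float:
    """Shrinkage gap: fraction of performance lost on the unseen year.
    E.g. Yi's 8pp fall being a 25% relative drop implies a ~32pp baseline."""
    return (old - new) / old

western_drops = [relative_drop(o, n) for o, n in western_scores]
chinese_drops = [relative_drop(o, n) for o, n in chinese_scores]

print(f"Western mean drop: {mean(western_drops):.0%}")  # ~10%
print(f"Chinese mean drop: {mean(chinese_drops):.0%}")  # ~21%

# With the full set of models you could test whether the gap is real;
# this toy sample is far too small to reproduce the article's p = 0.09.
t_stat, p_value = ttest_ind(western_drops, chinese_drops, equal_var=False)
print(f"p = {p_value:.2f}")
```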

Cost-effectiveness:

> Distinguish intelligence (max performance), intelligence per token (efficiency), and intelligence per dollar (cost-effectiveness).

> The 5x discounts I quoted are per-token, not per-success. If you had to use 6x more tokens to get the same quality, then there would be no real discount. And indeed DeepSeek and Qwen (see also anecdote here about Kimi, uncontested) are very hungry.
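
The per-token vs per-success distinction is just arithmetic, but it's worth writing down. A minimal sketch with made-up prices and token counts (not the article's figures):

```python
# All numbers are illustrative placeholders, not measured figures.

def cost_per_success(price_per_mtok: float, tokens_per_success: float) -> float:
    """Average dollars spent to get one correct answer.
    'Intelligence per dollar' is the reciprocal, holding quality fixed."""
    return price_per_mtok * tokens_per_success / 1e6

# 5x cheaper per token, but needing 6x the tokens to reach the same answer:
western = cost_per_success(price_per_mtok=10.0, tokens_per_success=5_000)
chinese = cost_per_success(price_per_mtok=2.0, tokens_per_success=30_000)

print(f"Western: ${western:.3f} per success")  # $0.050
print(f"Chinese: ${chinese:.3f} per success")  # $0.060, the per-token discount vanishes
```

On these toy numbers the nominally 5x-cheaper model ends up about 20% more expensive per success, which is exactly the quote's point.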