Agent benchmarks finally start measuring something useful, three open-weights releases narrow the gap with frontier closed models, and a quiet paper from DeepMind reframes how to think about evals.