The 60-second read
- 01 Anthropic and OpenAI both shipped agent-focused evals this week, the first that score multi-step task completion rather than single-turn answers.
- 02 Three open-weights model releases (a 70B reasoning model, a 32B coder, and a 7B speech model) all narrow the gap with closed models on benchmarks at a fraction of the cost.
- 03 DeepMind's new paper argues most evals are measuring memorization, not capability, and proposes a framework for capability-isolating tests.
- 04 Two well-funded agent startups quietly pivoted from 'autonomous everything' to narrow vertical workflows, a pattern worth watching.
- 05 Vercel and Cloudflare both shipped durable-execution primitives for agent workflows; the infra layer for agents is converging.
For builders
- 01 OPENAI
OpenAI's Agents SDK 1.0 ships durable runs and structured handoffs
First-class durable execution, structured handoffs between agents, and tracing that maps cleanly to OpenTelemetry. If you're building multi-step agents, the ergonomics here are worth a serious look — especially the handoff primitive, which removes a lot of glue code.
- 02 LATENT SPACE
A practical eval harness for tool-using agents
Walks through building task-completion evals for agents that use tools, with concrete code. The framing — score the trajectory, not just the final answer — is the right mental model for anyone shipping agents to production.
- 03 HUGGING FACE
Llama-Reasoner-70B: open weights, frontier-adjacent reasoning
Closes most of the gap with the leading closed reasoning models on math and code benchmarks, MIT-licensed, and runs on a single H100 with quantization. The cost-per-token math now favors self-hosting for many reasoning workloads.
- 04 CLOUDFLARE
Cloudflare Workflows hits GA with agent-shaped APIs
Durable, step-based execution at the edge, with first-class support for long-running LLM calls and human-in-the-loop steps. Worth comparing against Inngest and Temporal if you're picking infra for an agent product this quarter.
- 05 ANTHROPIC
How to actually measure agent reliability in production
Engineering post on the metrics that matter once an agent is live: task completion rate, intervention rate, time-to-recovery from a stuck state. Pragmatic and refreshingly honest about how often agents get stuck.
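To make the three metrics in that last item concrete, here is a minimal sketch of how they could be computed from run logs. The `AgentRun` schema and its field names are hypothetical, invented for illustration; they are not taken from the Anthropic post, and a real logging pipeline will likely look different.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class AgentRun:
    """One production run of an agent (hypothetical schema, for illustration)."""
    completed: bool            # did the task finish successfully?
    interventions: int = 0     # number of times a human had to step in
    # One (stuck_at, recovered_at) pair, in seconds, per stuck episode.
    stuck_episodes: list[tuple[float, float]] = field(default_factory=list)


def reliability_metrics(runs: list[AgentRun]) -> dict[str, float]:
    """Aggregate completion rate, intervention rate, and mean time-to-recovery."""
    if not runs:
        raise ValueError("need at least one run")
    n = len(runs)
    recoveries = [end - start for r in runs for start, end in r.stuck_episodes]
    return {
        # Share of runs that finished the task.
        "task_completion_rate": sum(r.completed for r in runs) / n,
        # Share of runs that needed at least one human intervention.
        "intervention_rate": sum(r.interventions > 0 for r in runs) / n,
        # Average seconds between getting stuck and recovering (0.0 if never stuck).
        "mean_time_to_recovery_s": mean(recoveries) if recoveries else 0.0,
    }


runs = [
    AgentRun(completed=True),
    AgentRun(completed=True, interventions=1, stuck_episodes=[(10.0, 40.0)]),
    AgentRun(completed=False, interventions=2, stuck_episodes=[(5.0, 15.0)]),
    AgentRun(completed=True),
]
metrics = reliability_metrics(runs)
print(metrics)
```

Counting intervention rate per run rather than per step is a design choice in this sketch; per-step rates may be more useful for long trajectories.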
Deep dive
Are we measuring capability, or memorization?
DeepMind takes aim at the quiet rot in modern model benchmarks — and proposes a framework for telling real progress from contamination.
DeepMind's new paper, Capability-Isolating Evaluation, starts from a simple observation: as training corpora absorb every public test set, scores increasingly reflect leakage rather than ability. The authors propose a methodology that builds evals from procedurally generated tasks with no public-text exposure, then show that frontier models score substantially lower on these tests than on public benchmarks.
The interesting move isn't the benchmark itself — others have tried procedural generation before — but the framework for deciding whether a benchmark is leak-resistant. They define three properties (novelty, isomorphism resistance, and grounding) and offer a checklist you can run against any eval suite before trusting its scores.
For practitioners, the practical takeaway is uncomfortable: most public benchmarks you cite in pitch decks are probably contaminated. The pragmatic response is to build a small private eval set tied directly to your product's tasks, version it, and treat it as the source of truth — public benchmarks become a sanity check, not a north star.
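One way to read "version it" is to derive the version from the eval set's content, so that any edit to a case produces a new version string and old scores cannot be silently compared against new cases. The sketch below assumes a tiny exact-match eval; the case IDs, prompts, and the `toy_model` stand-in are all hypothetical and not drawn from the paper.

```python
import hashlib
import json
from typing import Callable

# Hypothetical private eval cases tied to a product's own tasks.
EVAL_CASES = [
    {"id": "refund-001", "prompt": "Customer asks for a refund after 60 days.", "expected": "deny"},
    {"id": "refund-002", "prompt": "Customer asks for a refund after 3 days.", "expected": "approve"},
]


def eval_set_version(cases: list[dict]) -> str:
    """Short content hash of the eval set; changes whenever any case changes."""
    canonical = json.dumps(sorted(cases, key=lambda c: c["id"]), sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def run_eval(model: Callable[[str], str], cases: list[dict]) -> dict:
    """Score a model by exact match and tag the result with the eval-set version."""
    passed = sum(model(c["prompt"]).strip().lower() == c["expected"] for c in cases)
    return {
        "version": eval_set_version(cases),
        "passed": passed,
        "total": len(cases),
        "accuracy": passed / len(cases),
    }


def toy_model(prompt: str) -> str:
    """Stand-in for a real model call (assumption for this example)."""
    return "approve" if "3 days" in prompt else "deny"


result = run_eval(toy_model, EVAL_CASES)
print(result)
```

Storing the version alongside every score makes it obvious when two numbers come from different eval sets and should not be compared.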
For researchers, the paper is an invitation to a more honest era of measurement. Expect a wave of follow-on work redefining what 'state of the art' means once you control for leakage.
Everything else
- MISTRAL Mistral releases a permissive 32B coder: Apache-licensed, beats prior open coding models on HumanEval+ and SWE-bench Lite. #models #opensource
- HUGGING FACE A 7B speech model that runs on a phone: on-device transcription and synthesis quality jumps materially. Real-time voice agents on mobile are now plausible. #models #opensource #voice
- POLITICO EU AI Act: first compliance deadline lands this month. Provider-of-GPAI obligations kick in. Quick legal-team primer linked. #policy
- HACKER NEWS The narrow-vertical agent thesis: two YC-backed agent companies pivot from horizontal to vertical. Recurring pattern across the cohort. #products #agents
- ANTHROPIC A practical guide to prompt caching: concrete patterns and anti-patterns. If you're making more than ~10 calls per session with shared context, you're probably leaving money on the table. #infra #tooling
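For a rough sense of the prompt-caching math in that last item, here is a back-of-the-envelope sketch. The multipliers (a cache write costing 1.25x a normal input token, a cache read 0.10x) are illustrative assumptions, as are the token counts; check your provider's current pricing before relying on them. The sketch also assumes the cache stays warm for the whole session.

```python
def session_input_cost(calls, shared_tokens, unique_tokens, cached,
                       write_mult=1.25, read_mult=0.10):
    """Input cost of one session, in units of the base per-token input price.

    write_mult and read_mult are assumed multipliers for cache writes and
    cache reads; they are placeholders, not quoted prices.
    """
    if calls < 1:
        raise ValueError("calls must be >= 1")
    # Per-call tokens that are never cached are always billed at full price.
    unique_cost = calls * unique_tokens
    if not cached:
        return float(calls * shared_tokens + unique_cost)
    # First call writes the shared prefix to the cache; later calls read it.
    shared_cost = shared_tokens * write_mult + (calls - 1) * shared_tokens * read_mult
    return shared_cost + unique_cost


uncached = session_input_cost(10, 8000, 500, cached=False)
with_cache = session_input_cost(10, 8000, 500, cached=True)
savings = 1 - with_cache / uncached
print(f"uncached={uncached:.0f} cached={with_cache:.0f} savings={savings:.0%}")
```

Under these assumed numbers, a 10-call session with an 8,000-token shared prefix spends roughly 74% less on input tokens with caching. A single-call session costs more with caching, because the write premium is never recovered.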