Weekly AI Signals

// 01 · The 60-second read

// 01 Anthropic and OpenAI both shipped agent-focused evals this week, the first to score multi-step task completion rather than single-turn answers.
// 02 Three open-weights releases (a 70B reasoning model, a 32B coder, and a 7B speech model) all narrow the gap with closed models on benchmarks at a fraction of the cost.
// 03 DeepMind's new paper argues that most evals measure memorization, not capability, and proposes a framework for capability-isolating tests.
// 04 Two well-funded agent startups quietly pivoted from 'autonomous everything' to narrow vertical workflows, a pattern worth watching.
// 05 Vercel and Cloudflare both shipped durable-execution primitives for agent workflows; the infra layer for agents is converging.

// 02 · For builders

01
OPENAI
OpenAI's Agents SDK 1.0 ships durable runs and structured handoffs
First-class durable execution, structured handoffs between agents, and tracing that maps cleanly to OpenTelemetry. If you're building multi-step agents, the ergonomics here are worth a serious look — especially the handoff primitive, which removes a lot of glue code.
#agents #infra #tooling
02
LATENT SPACE
A practical eval harness for tool-using agents
Walks through building task-completion evals for agents that use tools, with concrete code. The framing — score the trajectory, not just the final answer — is the right mental model for anyone shipping agents to production.
#agents #evals #tooling
03
HUGGING FACE
Llama-Reasoner-70B: open weights, frontier-adjacent reasoning
Closes most of the gap with the leading closed reasoning models on math and code benchmarks, MIT-licensed, and runs on a single H100 with quantization. The cost-per-token math now favors self-hosting for many reasoning workloads.
#models #opensource #infra
04
CLOUDFLARE
Cloudflare Workflows hits GA with agent-shaped APIs
Durable, step-based execution at the edge, with first-class support for long-running LLM calls and human-in-the-loop steps. Worth comparing against Inngest and Temporal if you're picking infra for an agent product this quarter.
#infra #agents
05
ANTHROPIC
How to actually measure agent reliability in production
Engineering post on the metrics that matter once an agent is live: task completion rate, intervention rate, time-to-recovery from a stuck state. Pragmatic and refreshingly honest about how often agents get stuck.
#agents #evals
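
The three production metrics named in item 05 can be computed from plain run logs. A minimal sketch, assuming a hypothetical `AgentRun` record; the field names and the exact metric definitions are illustrative assumptions, not taken from the Anthropic post.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class AgentRun:
    """One production run. All field names are hypothetical."""
    completed: bool                        # task finished successfully
    interventions: int                     # times a human had to step in
    stuck_seconds: Optional[float] = None  # time to recover, if the run got stuck


def reliability_report(runs: list[AgentRun]) -> dict[str, float]:
    if not runs:
        raise ValueError("need at least one run")
    # Only runs that actually got stuck contribute to time-to-recovery.
    recoveries = [r.stuck_seconds for r in runs if r.stuck_seconds is not None]
    return {
        "task_completion_rate": sum(r.completed for r in runs) / len(runs),
        "intervention_rate": sum(r.interventions > 0 for r in runs) / len(runs),
        "stuck_rate": len(recoveries) / len(runs),
        "mean_time_to_recovery_s": mean(recoveries) if recoveries else 0.0,
    }


runs = [
    AgentRun(completed=True, interventions=0),
    AgentRun(completed=True, interventions=1, stuck_seconds=30.0),
    AgentRun(completed=False, interventions=2, stuck_seconds=90.0),
    AgentRun(completed=True, interventions=0),
]
report = reliability_report(runs)
print(report)
```

Here intervention rate counts runs with at least one human intervention; counting interventions per run is an equally reasonable definition, so pick one and keep it fixed across releases.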

// 03 · Deep dive

// DEEP DIVE · ESSAY

Are we measuring capability, or memorization?

DeepMind takes aim at the quiet rot in modern model benchmarks — and proposes a framework for telling real progress from contamination.

DeepMind's new paper, Capability-Isolating Evaluation, starts from a simple problem: as training corpora absorb every public test set, benchmark scores increasingly reflect leakage rather than ability. The authors build evals from procedurally generated tasks with no public-text exposure, then report that frontier models score substantially lower on these tests than on public benchmarks.
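
One way to picture a procedurally generated eval: a seeded generator builds each task from random parameters, so the exact instances are unlikely to appear verbatim in any training corpus. This is an illustrative sketch only, not the paper's method; the task family (modular arithmetic) and the names `make_task`, `make_suite`, and `score` are assumptions.

```python
import random
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    prompt: str
    answer: int


def make_task(rng: random.Random) -> Task:
    # Draw fresh operands for every instance.
    a = rng.randint(10**6, 10**7)
    b = rng.randint(10**6, 10**7)
    m = rng.randint(97, 997)
    return Task(prompt=f"Compute ({a} * {b}) mod {m}.", answer=(a * b) % m)


def make_suite(seed: int, n: int) -> list[Task]:
    rng = random.Random(seed)  # same seed gives the same suite, so runs are reproducible
    return [make_task(rng) for _ in range(n)]


def score(suite: list[Task], solver) -> float:
    """Fraction of tasks where solver(prompt) equals the known answer."""
    return sum(solver(t.prompt) == t.answer for t in suite) / len(suite)


def reference_solver(prompt: str) -> int:
    # Stand-in for a model call: parses the three integers and computes the answer.
    a, b, m = map(int, re.findall(r"\d+", prompt))
    return (a * b) % m


suite = make_suite(seed=7, n=5)
print(score(suite, reference_solver))
```

Fresh instances only address exposure to the exact text. A model that has memorized the pattern can still solve them, which is presumably why the framework looks at more than novelty.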

The interesting move isn't the benchmark itself — others have tried procedural generation before — but the framework for deciding whether a benchmark is leak-resistant. They define three properties (novelty, isomorphism resistance, and grounding) and offer a checklist you can run against any eval suite before trusting its scores.


For practitioners, the practical takeaway is uncomfortable: most public benchmarks you cite in pitch decks are probably contaminated. The pragmatic response is to build a small private eval set tied directly to your product's tasks, version it, and treat it as the source of truth — public benchmarks become a sanity check, not a north star.
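
A private, versioned eval set can be very small to start. The sketch below is one possible shape, assuming exact-match scoring and a hypothetical `EvalCase` record; the content hash serves as the version, so any edit to the cases shows up in every reported score.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    prompt: str
    expected: str


def fingerprint(cases: list[EvalCase]) -> str:
    """Short content hash of the eval set; independent of case order."""
    payload = json.dumps(
        sorted([c.case_id, c.prompt, c.expected] for c in cases),
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def run_eval(cases: list[EvalCase], model) -> dict:
    # `model` is any callable mapping a prompt string to an answer string.
    passed = sum(model(c.prompt).strip() == c.expected for c in cases)
    return {
        "eval_version": fingerprint(cases),
        "n_cases": len(cases),
        "pass_rate": passed / len(cases),
    }


# Hypothetical product-specific cases and a stub model for illustration.
cases = [
    EvalCase("refund-01", "Refund requested after 40 days. Allowed? yes/no", "no"),
    EvalCase("refund-02", "Refund requested after 10 days. Allowed? yes/no", "yes"),
]
result = run_eval(cases, model=lambda prompt: "no")
print(result)
```

Storing `eval_version` next to each score makes it obvious when two numbers are not comparable because the set changed between runs.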

For researchers, the paper is an invitation to a more honest era of measurement. Expect a wave of follow-on work redefining what 'state of the art' means once you control for leakage.

#research #evals #papers

// 04 · Everything else

MISTRAL
Mistral releases a permissive 32B coder
Apache-licensed; beats prior open coding models on HumanEval+ and SWE-bench Lite.
#models #opensource
HUGGING FACE
A 7B speech model that runs on a phone
On-device transcription and synthesis quality jumps materially. Real-time voice agents on mobile are now plausible.
#models #opensource #voice
POLITICO
EU AI Act: first compliance deadline lands this month
Provider-of-GPAI obligations kick in. Quick legal-team primer linked.
#policy
HACKER NEWS
The narrow-vertical agent thesis
Two YC-backed agent companies pivot from horizontal to vertical. Recurring pattern across the cohort.
#products #agents
ANTHROPIC
A practical guide to prompt caching
Concrete patterns and anti-patterns. If you're making more than ~10 calls per session with shared context, you're probably leaving money on the table.
#infra #tooling
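
The "money on the table" claim in the prompt-caching item is easy to sanity-check with back-of-envelope arithmetic. A rough sketch under stated assumptions: the shared context is written to the cache once at a premium and read at a discount on later calls. The multipliers and token counts are placeholders, not quoted prices; substitute your provider's current numbers.

```python
def session_cost(
    calls: int,
    shared_tokens: int,
    unique_tokens: int,
    price_per_token: float = 1.0,  # arbitrary unit; only ratios matter here
    write_mult: float = 1.25,      # assumed premium for writing the cache
    read_mult: float = 0.10,       # assumed discount for reading the cache
    cached: bool = True,
) -> float:
    """Input-token cost of one session, with or without caching the shared context."""
    if not cached:
        return calls * (shared_tokens + unique_tokens) * price_per_token
    write = shared_tokens * write_mult                # first call writes the cache
    reads = (calls - 1) * shared_tokens * read_mult   # later calls read it
    fresh = calls * unique_tokens                     # per-call tokens are never cached
    return (write + reads + fresh) * price_per_token


baseline = session_cost(10, 8000, 500, cached=False)
with_cache = session_cost(10, 8000, 500, cached=True)
savings = 1 - with_cache / baseline
print(f"baseline={baseline:.0f} cached={with_cache:.0f} savings={savings:.0%}")
```

Under these assumed multipliers a single-call session costs more with caching than without, because of the write premium. The savings only appear once the shared context is reused.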

Curated, skeptical, short.

The week the agents grew up

One issue every Monday. No filler.