A weekly read of what changed in AI

Curated, skeptical, short.

Each Monday: a tiered read of the week's most consequential research, launches, and shifts. Built for executives skimming, builders shipping, and practitioners going deep.

Free · No tracking · One-click unsubscribe

Transmission: // 001
Filed: 04.05.2026
Read time: 6 min
Status: CURATED

The week the agents grew up

> Agent benchmarks finally start measuring something useful, three open-weights releases tighten the gap with frontier closed models, and a quiet paper from DeepMind reframes how to think about evals.

~/issues/2026-05-04 — bash
signals@weekly ~ $ cat ./issue-meta.json
{ "transmission": "001", "filed": "04.05.2026", "read_min": 6 }
signals@weekly ~ $ run-pipeline --status
sources ingested · ranked · deduped
human review: OK
ready for readers
signals@weekly ~ $

Published May 4, 2026.

// 01 · The 60-second read

  1. Anthropic and OpenAI both shipped agent-focused evals this week, the first to score multi-step task completion rather than single-turn answers.
  2. Three open-weights releases (a 70B reasoning model, a 32B coder, and a 7B speech model) all narrow the gap with closed models on benchmarks, at a fraction of the cost.
  3. DeepMind's new paper argues that most evals measure memorization, not capability, and proposes a framework for capability-isolating tests.
  4. Two well-funded agent startups quietly pivoted from 'autonomous everything' to narrow vertical workflows, a pattern worth watching.
  5. Vercel and Cloudflare both shipped durable-execution primitives for agent workflows; the infra layer for agents is converging.
// 02 · For builders

// 03 · Deep dive

// DEEP DIVE · ESSAY

Are we measuring capability, or memorization?

DeepMind takes aim at the quiet rot in modern model benchmarks — and proposes a framework for telling real progress from contamination.

DeepMind's new paper, Capability-Isolating Evaluation, starts from a blunt diagnosis: as training corpora absorb every public test set, scores increasingly reflect leakage rather than ability. The authors build evals from procedurally generated tasks with no public-text exposure, then show that frontier models score substantially lower on these tests than on public benchmarks.
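To make "procedurally generated tasks" concrete, here is a toy sketch in Python. It is not the paper's method: the arithmetic template, the `Task` type, and the `generate_suite` and `score` helpers are all hypothetical names chosen for illustration. The only idea it shows is that each task instance is produced from a seeded random generator with a known answer, so the exact prompts never have to appear in public text.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    prompt: str
    answer: int


def generate_task(rng: random.Random) -> Task:
    """Build one arithmetic task from random operands (toy template, an assumption)."""
    a, b, c = (rng.randint(100, 999) for _ in range(3))
    prompt = f"Compute ({a} + {b}) * {c}. Reply with the integer only."
    return Task(prompt=prompt, answer=(a + b) * c)


def generate_suite(seed: int, n: int) -> list[Task]:
    """The seed fixes the suite, so a run is reproducible without publishing the tasks."""
    rng = random.Random(seed)
    return [generate_task(rng) for _ in range(n)]


def score(tasks: list[Task], model_answers: list[str]) -> float:
    """Exact-match accuracy of the model's answers against the generated ground truth."""
    if not tasks:
        raise ValueError("cannot score an empty suite")
    correct = sum(
        1 for task, reply in zip(tasks, model_answers)
        if reply.strip() == str(task.answer)
    )
    return correct / len(tasks)
```

A fresh instance is not the same as a fresh pattern: arithmetic templates like this one are common in public text, so a real leak-resistant suite would need far less familiar task structure than this sketch uses.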

The interesting move isn't the benchmark itself, since others have tried procedural generation before, but the framework for deciding whether a benchmark is leak-resistant. The authors define three properties (novelty, isomorphism resistance, and grounding) and offer a checklist you can run against any eval suite before trusting its scores.


For practitioners, the takeaway is uncomfortable: most public benchmarks cited in pitch decks are probably contaminated. The pragmatic response is to build a small private eval set tied directly to your product's tasks, version it, and treat it as the source of truth. Public benchmarks become a sanity check, not a north star.
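One way to act on that advice, sketched in Python under stated assumptions: the `EvalCase` shape, the exact-match grading, and the content-hash versioning scheme are illustrative choices, not something the paper prescribes. The point is that every result is stamped with a version derived from the suite's contents, so scores from different revisions of the eval set are never compared by accident.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    prompt: str
    expected: str


def suite_version(cases: list[EvalCase]) -> str:
    """Content hash of the suite; editing any case yields a new version string."""
    ordered = sorted(cases, key=lambda c: c.case_id)
    payload = json.dumps([asdict(c) for c in ordered], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def run_suite(cases: list[EvalCase], model: Callable[[str], str]) -> dict:
    """Run every case through `model` and report exact-match results."""
    if not cases:
        raise ValueError("cannot run an empty suite")
    failed = [c.case_id for c in cases if model(c.prompt).strip() != c.expected]
    passed = len(cases) - len(failed)
    return {
        "suite_version": suite_version(cases),
        "total": len(cases),
        "passed": passed,
        "pass_rate": passed / len(cases),
        "failed_ids": failed,
    }
```

Here `model` is any callable that maps a prompt to a reply, such as a thin wrapper around whatever API your product already calls. Exact match is the crudest possible grader, and most product tasks will need something more forgiving.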

For researchers, the paper is an invitation to a more honest era of measurement. Expect a wave of follow-on work redefining what 'state of the art' means once you control for leakage.

// 04 · Everything else

Subscribe

One issue every Monday. No filler.

// END OF TRANSMISSION
━━━ ✦ ━━━
Weekly AI Signals · TX · 2026