Shuo Qiu’s Notes

Jul 12, 2026

How We Learned to Measure Coding Agents, and Why We Still Can't

Nobody can tell you which coding agent is best, and a bigger benchmark won't fix that. Every evaluation decides what to test, how to grade it, and which tasks to sample, so each one covers only a narrow slice of agent behavior. What works instead is assembling several imperfect measurements and knowing each one's blind spot.

Apr 26, 2026

Don’t Shop for Evaluators. Let Your Coding Agent Build One.

Don't shop for pre-built LLM judges. Have a coding agent read your real task material (code, docs, traces) and write the judge. It's faster than shopping, and on tau-bench telecom, the judge's agreement with ground truth more than doubled.

Apr 5, 2026

Your Agent and Harness Aren't the Asset, Your Eval Is

The durable asset in agent development is not the prompt or the harness. It is the eval: the specification of what good looks like, where agents fail, and what customers actually need.