Shuo Qiu’s Notes on AI Agent Evaluation

Opinion

Your Agent and Harness Aren't the Asset, Your Eval Is

The durable asset in agent development is not the prompt or the harness. It is the eval: the specification of what good looks like, where agents fail, and what customers actually need.

Benchmark

Comparing Ways to Create Eval Test Cases: Skip Frameworks, Just Prompt

We tested DeepEval's Synthesizer and Promptfoo-style templates against plain prompts across 3 LLMs. Neither framework helped: a basic prompt matched or beat both every time. What actually moved the needle was which model generated the test cases.