I think product development has a phase where eval is extremely valuable, and a phase where it is much less valuable. The mistake many teams make is either starting too early, before they have anything stable enough to evaluate, or continuing too long, after real user data has become the better source of truth.

The way I think about it is a path from 0 to 2.

The Window

From 0 to 0.5, you are still exploring. You are trying to find out whether the product can work at all. You are building the agentic core loop, the system prompt, the tools, the UX, the frontend and backend glue, and the basic product idea. At this stage, the main question is not “how do we measure quality at scale?” The question is “does this thing work at all?” Evals are usually not the bottleneck here. Product understanding is.

From 0.5 to 1, the direction is clearer. You have something that basically works. The core workflow is fairly stable, even if everything around it is still changing quickly. This is where eval starts to matter. You are still pre-launch, you do not have real user traffic yet, and manual testing is no longer enough. This is the point where you should invest in simulation-based eval so you can test changes across ten, twenty, or fifty cases instead of only checking a few examples by hand.
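To make the shape of this concrete, here is a minimal sketch of a simulation-based regression check: run a batch of simulated cases through two versions of the system and compare pass rates. All of the names here (`run_agent_v1`, `run_agent_v2`, `passes`) are hypothetical stand-ins for your real agent and your real checks; the stubs just make the harness runnable.

```python
# Hedged sketch: comparing two versions across a batch of simulated cases.
# The two "agents" are trivial stubs standing in for real system calls.

def run_agent_v1(case: str) -> str:
    return case.upper()          # stub for the current version

def run_agent_v2(case: str) -> str:
    return case.upper().strip()  # stub for the candidate change

def passes(case: str, output: str) -> bool:
    # A simple deterministic check; real checks would encode product rules.
    return output == case.strip().upper()

cases = [f"  query {i} " for i in range(20)]  # simulated inputs

def pass_rate(agent) -> float:
    return sum(passes(c, agent(c)) for c in cases) / len(cases)

v1, v2 = pass_rate(run_agent_v1), pass_rate(run_agent_v2)
print(f"v1: {v1:.0%}  v2: {v2:.0%}  {'regression' if v2 < v1 else 'ok'}")
```

The value is not in the harness itself but in being able to rerun the same twenty or fifty cases after every prompt or tool change, instead of eyeballing two or three examples by hand.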

At 1, you ship. That is the first real contact with customers.

From 1 to 2, eval is still important, but the role changes. You now have some real user behavior, but not enough volume or stability to rely entirely on production data. The product is still evolving quickly. You are still changing prompts, tools, policies, UI flows, and backend logic. In this phase, eval helps you answer a practical question before or after each change: did this improve the product, or did it introduce a regression? Even after launch, simulation still matters because it gives you a controlled way to compare versions while the product is moving.

To me, 2 is the maturity point where the product has a clear user base, the core experience is relatively stable, and real-world observability becomes the primary signal. Once you have enough real traffic, enough monitoring, and enough clarity about how users actually behave, the return on further investment in simulation-based eval starts to fall off. At that point, you should not keep building eval infrastructure just because it feels rigorous. You should follow the highest-signal feedback loop, and often that means production metrics, traces, support signals, and direct user behavior.

That is why I call this post “From 0.5 to 2.” I think that is the window where eval has the highest leverage.

Why Taste Matters

There is another reason this window matters: developer taste.

Today, coding agents can help you set up an eval pipeline very quickly. A capable agent can scaffold a dataset, a runner, a judging step, and a report in fifteen to thirty minutes. That speed is useful, but it can also hide the real difficulty. The hard part is not producing an eval pipeline. The hard part is deciding what the eval should measure, which failure modes matter, what the right distribution of cases looks like, and what level of fidelity is necessary to make the result trustworthy.
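To show how little code that scaffold actually is, here is a minimal sketch of the four pieces: a dataset, a runner, a judge, and a report. Every name here is illustrative, and the "model" is a stub so the skeleton runs as-is; the point is that this part is the easy part.

```python
# Hedged sketch of a scaffolded eval pipeline: dataset -> runner -> judge -> report.
from dataclasses import dataclass

@dataclass
class Case:
    prompt: str
    expected_keyword: str  # what a good answer must mention

dataset = [
    Case("How do I reset my password?", "reset"),
    Case("What is your refund policy?", "refund"),
]

def runner(case: Case) -> str:
    # Stub standing in for a real model/agent call.
    return f"To handle this, see our {case.expected_keyword} page."

def judge(case: Case, output: str) -> bool:
    # Deterministic keyword judge; an LLM judge would slot in here instead.
    return case.expected_keyword in output.lower()

def report(results: list) -> str:
    passed = sum(ok for _, ok in results)
    lines = [f"{'PASS' if ok else 'FAIL'}: {c.prompt}" for c, ok in results]
    return f"{passed}/{len(results)} passed\n" + "\n".join(lines)

results = [(c, judge(c, runner(c))) for c in dataset]
print(report(results))
```

An agent can generate something like this in minutes. What it cannot generate is the judgment about whether `expected_keyword` is anywhere near the right thing to check.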

That is where taste comes in.

Good developers know where the product is fragile. They know which metrics are vanity metrics and which ones actually map to user value. They know when a simple deterministic check is better than an LLM judge. They know when a synthetic dataset is good enough and when they need real user traces. They know when to add more eval coverage and when to stop.
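One example of that last kind of judgment: when the contract is structural, such as "the output must be valid JSON with certain fields," a deterministic check is cheaper, faster, and more trustworthy than asking an LLM judge whether the output "looks right." The `valid_order` function and its field names below are hypothetical, just to illustrate the pattern.

```python
# Hedged illustration: a deterministic structural check in place of an LLM judge.
import json

def valid_order(output: str) -> bool:
    # Valid JSON with a string "item" and an integer "qty", nothing fuzzier.
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data.get("item"), str) and isinstance(data.get("qty"), int)

assert valid_order('{"item": "widget", "qty": 2}')
assert not valid_order("Sure! Here is your order: widget x2")
```

Reserving LLM judges for genuinely fuzzy qualities, like tone or helpfulness, keeps the eval both cheap and legible.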

Without that taste, a fast eval setup can become a very polished distraction. You get dashboards, scores, and automation, but the system is not actually helping product development. It is just measuring something convenient.

This is why I think developers should own eval design. The person closest to the product usually has the best sense of what “good” looks like and what kinds of failures actually matter. That does not mean every developer will design a perfect eval. Blind spots are real. But it is still better than outsourcing the design of the eval to someone who does not understand the product deeply enough to choose the right target.

From Synthetic to Real

The data source for eval should evolve along with the product.

Before launch, you mostly have imagination, specs, and synthetic cases. That is fine. You use what you have.

After launch, you should move steadily toward real user behavior. The progression is simple: imagination, then synthetic inputs, then real inputs. Each step gets your eval closer to the true distribution of user behavior.

And eventually, if the product succeeds, you should let real user data take over.

The Point of 2

I do not think the goal is to build the biggest or fanciest eval system. The goal is to invest in eval when it has the highest leverage, which is usually between 0.5 and 2. Before that, you are still discovering the product. After that, the market is telling you the truth more directly than any simulation can.