Post by Galileo


You optimized your agent for caching. Your eval pipeline doesn't get that discount.

When you call a judge model to evaluate a trace, the provider's cache doesn't apply. Every byte of context gets scanned at full input rate, every single turn.

At 5K tokens, the eval cost is already 13x the agent turn cost. At 200K tokens, evaluating one turn costs more than running ten turns of the model itself. The model bill grows slowly because cache reads carry most of it. The eval bill doesn't have that cushion. The 67.6K-token Claude Code turn that costs 2¢ to generate costs 51¢ per metric to evaluate at frontier judge rates.

There are three ways out:

– Build a custom evaluator that scans only fresh deltas. It works, but it's real engineering work.
– Sample 5% of traffic and accept the blind spot. Cheap, but it defeats the point of evaluating production.
– Switch to an SLM-based judge where the per-token rate is low enough that scanning full context is fine.

Read our blog on agent caching and tokenomics here: https://lnkd.in/gV-BZM6M
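To make the arithmetic concrete, here is a rough Python sketch of the cost model described above. The rates, the function names, and the 1,000-token "fresh" figure are illustrative assumptions, not real provider prices or numbers from the post, and output tokens are ignored for simplicity. It won't reproduce the exact figures quoted above; it only shows why a cached agent turn and an uncached judge call diverge.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """USD per million input tokens. All values used below are assumptions."""
    input_per_m: float       # full (uncached) input rate
    cache_read_per_m: float  # discounted rate for cache hits


def agent_turn_input_cost(context_tokens: int, fresh_tokens: int, rates: Rates) -> float:
    """Agent turn: the cached prefix is billed at the cache-read rate,
    only the fresh tokens are billed at the full input rate."""
    cached = context_tokens - fresh_tokens
    return (cached * rates.cache_read_per_m + fresh_tokens * rates.input_per_m) / 1_000_000


def judge_cost(tokens_scanned: int, rates: Rates, metrics: int = 1) -> float:
    """Judge call: no cache discount, and one full scan per metric."""
    return metrics * tokens_scanned * rates.input_per_m / 1_000_000


# Hypothetical rates, chosen only for illustration.
AGENT = Rates(input_per_m=3.00, cache_read_per_m=0.30)
FRONTIER_JUDGE = Rates(input_per_m=3.00, cache_read_per_m=0.30)
SLM_JUDGE = Rates(input_per_m=0.10, cache_read_per_m=0.10)

context, fresh = 67_600, 1_000  # the fresh-token count is an assumption

agent = agent_turn_input_cost(context, fresh, AGENT)
full_scan = judge_cost(context, FRONTIER_JUDGE)       # status quo
delta_only = judge_cost(fresh, FRONTIER_JUDGE)        # option 1: scan fresh deltas
sampled = 0.05 * full_scan                            # option 2: average cost at 5% sampling
slm = judge_cost(context, SLM_JUDGE)                  # option 3: cheaper judge, full context

for name, cost in [("agent turn", agent), ("full-scan judge", full_scan),
                   ("delta-only judge", delta_only), ("5% sampled judge", sampled),
                   ("SLM judge", slm)]:
    print(f"{name:>18}: ${cost:.4f}")
```

Plugging in your own provider's rates and a measured fresh-token count per turn gives a quick estimate of which of the three options actually closes the gap for your traffic.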
