Post by Grafana Labs

🤖 Evaluating an AI agent isn't just "does the answer look right?" A response can be well-written and still be built on the wrong query. A dashboard panel can render successfully and still tell the user nothing. The assistant can appear to follow a task while making a bad decision somewhere in the middle.

That's the problem we had to solve for Grafana Assistant. The system can touch almost every corner of Grafana Cloud: querying metrics, exploring logs and traces, building dashboards, navigating the UI. That means a single prompt tweak can clean up formatting while quietly hurting tool choice. A cost win can hide a task-quality regression. And because the system is non-deterministic, the same request can succeed once and fail the next time.

So we built an evaluation loop. Not just to score the assistant, but to make change visible and tradeoffs explicit: what improved, what regressed, what it cost in latency or dollars.

The core insight: stop treating the prompt as the source of truth. Instead, write scenarios, meaning real user workflows with explicit success criteria. Then write graders that check the full trajectory: which tools were called, what arguments were passed, how results were interpreted, what Grafana state was saved.

The final answer is only part of the behavior. The rest is in the steps that produced it.
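
To make the scenario-and-grader idea concrete, here is a minimal sketch in Python. It is not Grafana Assistant's actual evaluation code: every name in it (`Scenario`, `Trajectory`, `ToolCall`, `grade`, `pass_rate`, and tool names like `query_metrics`) is hypothetical and chosen for illustration. It shows a grader that inspects the steps rather than the final answer, plus repeated trials to account for non-determinism.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class ToolCall:
    name: str                # hypothetical tool name, e.g. "query_metrics"
    args: dict[str, Any]


@dataclass
class Trajectory:
    tool_calls: list[ToolCall]
    final_answer: str
    saved_state: dict[str, Any] = field(default_factory=dict)


@dataclass
class Scenario:
    prompt: str
    required_tools: list[str]                 # must appear in this order
    required_args: dict[str, dict[str, Any]]  # tool name -> args that must match
    required_state: dict[str, Any]            # state that must have been saved


def _in_order(required: list[str], called: list[str]) -> bool:
    # Subsequence check: each required tool must appear after the previous one.
    remaining = iter(called)
    return all(name in remaining for name in required)


def _args_match(required: dict[str, dict[str, Any]], calls: list[ToolCall]) -> bool:
    # For each tool, at least one call must carry every expected argument.
    return all(
        any(
            call.name == tool
            and all(call.args.get(k) == v for k, v in expected.items())
            for call in calls
        )
        for tool, expected in required.items()
    )


def grade(scenario: Scenario, trajectory: Trajectory) -> dict[str, bool]:
    """Return one pass/fail result per criterion, not a single score."""
    called = [call.name for call in trajectory.tool_calls]
    return {
        "tools_in_order": _in_order(scenario.required_tools, called),
        "args_match": _args_match(scenario.required_args, trajectory.tool_calls),
        "state_saved": all(
            trajectory.saved_state.get(k) == v
            for k, v in scenario.required_state.items()
        ),
    }


def pass_rate(
    scenario: Scenario,
    run_agent: Callable[[str], Trajectory],  # assumed interface to the agent
    trials: int = 5,
) -> float:
    """Run the same scenario several times, since one run proves little."""
    passed = sum(
        all(grade(scenario, run_agent(scenario.prompt)).values())
        for _ in range(trials)
    )
    return passed / trials
```

Returning a result per criterion, rather than one number, is what lets a comparison between two prompt versions say which behavior improved and which regressed. Latency and cost could be recorded alongside in the same way.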