Post by Ravit Dennis

Principal AI group Engineer at Microsoft

I’m seeing more and more “AI products” built in a day. A quick demo, a slick UI, a few prompts, and suddenly it looks like a solved problem.

And honestly? That’s part of the magic of this moment. But it’s also creating a dangerous illusion.

Because what works in a demo is very far from what works in production. The real work starts where the demo ends:

- How accurate is it across real user scenarios?
- What happens when it’s wrong?
- How do you measure quality in a way that isn’t subjective?
- Can you catch regressions before your customers do?
- What does this cost at scale when tokens aren’t cheap anymore?

In my experience, the gap comes down to three things:

1. Evals are the product
If you don’t have a structured way to measure quality (datasets, metrics, LLM-judge, comparisons), you’re not improving, you’re guessing.

2. Reliability > First Impression
A system that works 70% of the time in a demo is exciting. A system that works 95% of the time in production is hard.

3. Cost is part of the architecture
Token usage, latency, parallelism: these aren’t optimizations. They shape what’s even feasible to build.

The takeaway I keep coming back to: Anyone can build a demo in a day. Very few build systems that hold up a month later.

Curious how others are dealing with this gap between “demo velocity” and “production reality”.

#AI #LLM #Engineering #MLOps #AIInfrastructure