Post by Somya Sinha
Building AI for consulting firms | Ex-Bain
Google launched Gemini 3 yesterday to much fanfare. Yes, it beats Claude and GPT on a lot of AI benchmarks by a wide margin. But it failed my personal benchmark.

Take a look at the simple logic puzzle I threw its way about Amazon mixing up my left and right shoes (which was actually not an issue 😆). Gemini 3 got it wrong.

What does this mean for you? Benchmarks are how AI labs measurably outdo each other in the public eye and win attention and traction. But a higher score does not necessarily mean the model is better for you.

Why does this happen? Monalisa Sethi and I wrote about it in detail in our last blog post. The short version: AI labs get rewarded for high scores on standardized tests. Models that say "I don't know" score zero points. And there are no negative marks for wrong answers. So, guessing >> saying "I don't know". That's why models learn to be confidently wrong.

This incentive structure creates the gap you see between benchmark performance and real-world reasoning.

If you're trying to implement AI at work, this matters. You need strategies to work around hallucinations, not hope they'll magically disappear.

Check out the blog and subscribe for nuanced insights on AI that peel the (AI) onion.
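
To make the scoring incentive above concrete (zero points for "I don't know", no negative marks), here is a minimal sketch of the expected-score arithmetic. The function names, the one-point-per-correct-answer scale, and the `wrong_penalty` parameter are illustrative assumptions, not how any particular benchmark is actually graded.

```python
# Illustrative sketch only: assumes 1 point per correct answer and
# 0 points for abstaining ("I don't know"). Names are hypothetical.

ABSTAIN_SCORE = 0.0  # saying "I don't know" earns nothing


def expected_score(p_correct: float, wrong_penalty: float = 0.0) -> float:
    """Expected points from guessing when the guess is right with
    probability p_correct and a wrong answer costs wrong_penalty points."""
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty


def best_policy(p_correct: float, wrong_penalty: float = 0.0) -> str:
    """Return whichever choice has the higher expected score."""
    if expected_score(p_correct, wrong_penalty) > ABSTAIN_SCORE:
        return "guess"
    return "abstain"


# No negative marks: even a 10% shot beats abstaining.
print(best_policy(0.1))                      # guess
# Hypothetical 1-point penalty for wrong answers: abstaining now wins.
print(best_policy(0.1, wrong_penalty=1.0))   # abstain
```

Under these assumptions, any nonzero chance of being right makes guessing the better-scoring choice when wrong answers cost nothing, which is the "guessing >> saying I don't know" point in a single line of arithmetic.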