Post by Turing
2,101,939 followers
Top AI agents fail enterprise workflows 63% of the time.Here's why it matters. ServiceNow partnered with Turing to build EnterpriseOps-Gym, a benchmark designed around how enterprise work actually happens. Turing contributed 1,000+ prompts across HR, ITSM, CSM, Email, Calendar, Drive, Teams, and hybrid workflows. Tasks ran 7 to 30 steps with real policy constraints, evaluated by deterministic verifier scripts checking actual system state, not just output quality. The top frontier model hit only 37.4% task completion. Giving agents human-authored plans improved that by 14 to 35 percentage points, which means planning is the bottleneck, not capability. If you're deploying enterprise agents and haven't tested them on long-horizon, stateful workflows, your evals are leaving blind spots. Full case study: https://lnkd.in/gK-DBss5