Post by Samaya AI


An agent can have access to all the right context and still call the wrong tools, leading to the wrong answer.

At Samaya AI, we see this constantly. When a wide range of tools is available to an agent, choosing the best tool at every step is hard: tools can have overlapping functions, return incomplete results, have behavioral nuances, or simply have outdated documentation.

Working with researchers from USC, we set out to quantify this problem with our new benchmark, OpaqueToolsBench. We also built ToolObserver, an approach in which agents learn tool behavior from their own execution traces rather than from static documentation. The docs are updated iteratively until they reflect how the tools actually behave in practice.

On OpaqueToolsBench, ToolObserver outperformed state-of-the-art methods by 18.6% while using 3.5–7.5x fewer tokens.

Catch the presentation at ACM CAIS 2026 today (talk at 1:50 PM, poster at 5:15 PM). Blog link in the comments. Paper and code are public.

What's the worst gap you've seen between what a tool claimed to do and what it actually did?

Special thanks to: Skyler Hallinan, Thejas Venkatesh, Yuhao Zhang, Ashwin Paranjape, Sean (Xiang) Ren