Post by Turing
Benchmark scores are climbing. But are AI models actually improving scientific work?

At the International Conference on Machine Learning (ICML) 2026 in Seoul, Turing's Charlotte Tao and Tristan Tager will take on one of the most important questions in frontier AI evaluation: the growing disconnect between what benchmarks measure and what scientists actually need.

SciCode rose from 4.6% to 59% in a year. HLE went from 8% to 47%. But scientific work is not paper progress.

Through real-world examples from Turing's frontier data work, Charlotte and Tristan will examine why models can appear strong on isolated tasks while still struggling with full scientific workflows. Attendees will leave with a practical lens for distinguishing benchmark progress from workflow progress, and a clearer view of what scientific AI evaluations need to measure next.

The talk will also explore what comes next for frontier data: how the field may evolve from datasets toward modular, composable capability infrastructure.

Proud to see Turing contributing to this conversation at ICML 2026.

Advancing Frontier Scientific Capabilities, Today and Tomorrow
Charlotte Tao and Tristan Tager, Turing
ICML 2026, Seoul, July 6-11.