Post by Turing

Benchmark scores are climbing. But are AI models actually improving scientific work?

At the International Conference on Machine Learning (ICML) 2026 in Seoul, Turing's Charlotte Tao and Tristan Tager will take on one of the most important questions in frontier AI evaluation: the growing disconnect between what benchmarks measure and what scientists actually need.

SciCode rose from 4.6% to 59% in a year. HLE went from 8% to 47%. But scientific work is not paper progress.

Through real-world examples from Turing's frontier data work, Charlotte and Tristan will examine why models can appear strong on isolated tasks while still struggling with full scientific workflows. Attendees will leave with a practical lens for distinguishing benchmark progress from workflow progress and what scientific AI evaluations need to measure next.

The talk will also explore what comes next for frontier data: how the field may evolve from datasets toward modular, composable capability infrastructure.

Proud to see Turing contributing to this conversation at ICML 2026.

Advancing Frontier Scientific Capabilities, Today and Tomorrow
Charlotte Tao and Tristan Tager, Turing
ICML 2026, Seoul, July 6-11.
