Post by Turing


Benchmark scores are climbing. But are AI models actually improving scientific work?

At the International Conference on Machine Learning (ICML) 2026 in Seoul, Turing's Charlotte Tao and Tristan Tager will take on one of the most important questions in frontier AI evaluation: the growing disconnect between what benchmarks measure and what scientists actually need.

SciCode rose from 4.6% to 59% in a year. HLE went from 8% to 47%. But progress on paper is not the same as progress in scientific work.

Through real-world examples from Turing’s frontier data work, Charlotte and Tristan will examine why models can appear strong on isolated tasks while still struggling with full scientific workflows. Attendees will leave with a practical lens for distinguishing benchmark progress from workflow progress, and a clearer view of what scientific AI evaluations need to measure next.

The talk will also explore what comes next for frontier data: how the field may evolve from datasets toward modular, composable capability infrastructure.

Proud to see Turing contributing to this conversation at ICML 2026.

Advancing Frontier Scientific Capabilities, Today and Tomorrow
Charlotte Tao and Tristan Tager, Turing
ICML 2026, Seoul, July 6-11