Post by Ziqi Liu

B.S. in Computer Science @UW-Madison

Happy to share that our paper “A benchmark of expert-level academic questions to assess AI capabilities” (Humanity’s Last Exam) has been published in Nature. 🎉 I’m grateful to be a co-author as part of the HLE Contributors Consortium, where I contributed to the development of the benchmark.

HLE is a new multimodal benchmark aimed at pushing beyond increasingly saturated academic benchmarks (e.g., MMLU). It focuses on expert-authored questions with precise, verifiable solutions, designed to probe large language models at the frontier of human knowledge, well beyond what can be solved through simple web retrieval.

Despite rapid progress on many existing benchmarks, HLE remains challenging: state-of-the-art models still show low accuracy and poor calibration, with leading systems achieving under 40% accuracy on the public leaderboard.

Paper: https://lnkd.in/gDDGrcw9