Post by Proximal

Today, we're introducing Frontier SWE, an ultra-long-horizon coding benchmark. Frontier SWE tests agents on some of the hardest technical challenges, including optimizing a video rendering library, post-training a model to predict the quantum properties of molecules, and more. Despite being given 20 hours to complete each task, the models almost never succeed. This is the first benchmark that tests superhuman coding abilities.

The benchmark is broken down into 3 categories: Implementation, Performance Optimization, and Research. As an illustration of its difficulty, the models never solved a single task in the Implementation category.

We made sure these tasks reflect diverse real-world use cases by building them in collaboration with academia and industry experts. For example, we partnered with Modular to check if agents can build an inference pipeline for Wan 2.1 on MAX, and used Tinker from Thinking Machines Lab to check if agents can build an entire post-training pipeline.