Post by Turing
Benchmarks don’t fail in isolation. They fail in production.

Most evaluation systems still optimize for static performance, not real-world behavior.

At ICLR, we’ll share how teams are:
-> Capturing failure modes across agent runs
-> Measuring drift over time
-> Converting deployment friction into training signal

This is where post-training becomes a system, not a step.

See you there! #ICLR26