Post by Vassilis Vassiliadis

Research Scientist at IBM

Kueue v0.17 just came out, with a feature I contributed that lets users defer the admission of Jobs to something other than Kueue! The motivation is to support upcoming Kubeflow enhancements that will automatically configure the resources of TrainJobs based on user objectives such as maximum throughput at reasonable cost or minimum makespan. Fun fact: this reconfiguration scenario was the user story I wrote down in my Kueue Enhancement Proposal: https://lnkd.in/dp2KfAXx

For the technically curious, the AdmissionGatedBy feature in Kueue avoids two problematic scenarios with auto-configuration:

1. Kueue preempting lower-priority running jobs to make room for higher-priority jobs before those new jobs have finalized their resource requests.
2. Kueue scheduling a GPU job too early, causing it to fail when it restarts after Kubeflow updates its resources, because the checkpoint it created before restarting no longer matches the new GPU count.

Also, Kueue takes ownership of the suspend API for objects and then decides when to unsuspend jobs. This feature is the only way to create a suspended Job in a Kueue-managed namespace while still allowing the user (or controllers) to decide when the job becomes a candidate for admission.

Stay tuned to find out how Kubeflow could integrate with this Kueue feature to automatically and reliably configure the resource requirements of TrainJobs. There are even contributions by Red Hat to the Kubernetes and JobSet APIs that we can leverage in Kubeflow!

For more information on AdmissionGatedBy in Kueue, see: https://lnkd.in/dXiEdCcQ

Srikumar Venugopal, Daniele Lotito, Michael Johnston