Pierre-Yves Aquilanti

Engineering Manager — AI Infrastructure & MLOps @ NVIDIA

United States

About

Experience

  • Engineering Manager - Research Clusters at NVIDIA
    Dec 2024 - Present · 1 yr 8 mos

    I lead the AI Efficiency team at NVIDIA for our internal GPU clusters. We own GPU cluster productivity and operational excellence for NVIDIA's largest AI training infrastructure across providers. Our tools of the trade are Slurm, NCCL, InfiniBand, Lustre, K8s and a growing stack of homegrown platforms we build ourselves. Our main focus is making sure every GPU-hour counts. We built the AI Efficiency platform (runtime goodput, GPU idle-waste), which gives NVIDIA's researchers and leadership a clear picture of where time and compute go during training runs such as Nemotron, Cosmos, and others. We qualify every new cluster before researchers touch it: our test platform validates GB200 and GB300 deployments end-to-end, up to workload-level acceptance. We're also pushing into agentic AI for operations with ATLAS: autonomous workflows that detect bad nodes, optimize checkpointing, debug NCCL failures, and generate root-cause analysis without human intervention. That's our next frontier. We dive deep from the network fabric up to the training framework and work relentlessly to raise the bar on what operational excellence looks like at NVIDIA scale.

  • Amazon Web Services (AWS) (Remote)
    • Head of Frameworks ML Solutions
      May 2022 - Dec 2024 · 2 yrs 8 mos

      I managed a team of Solutions Architects and Applied Scientists specializing in self-managed Machine Learning workloads using open-source frameworks such as PyTorch, JAX and TensorFlow. Our tools of the trade were Slurm, Kubernetes, AWS Batch, Lustre and S3, in addition to open-source tools. Our main focus was building highly scalable architectures for Large Language Models and what is now GenAI. If it took thousands of GPUs, millions of cores or anything that pushes an architecture to its limits, that was our playground. My team worked closely with engineering and customers on performance modeling and optimization for large-scale training of LLMs. We worked directly with engineering and product teams to improve AWS services for Foundation Model builders, built our own tools and conducted research. We collaborated with NVIDIA and Meta's PyTorch team on bringing new technology to AWS. We dove deep from the system up to the models and worked relentlessly to raise the technical bar and scale our knowledge within AWS and externally.

    • Principal Solutions Architect, HPC Specialist
      Apr 2020 - Jun 2022 · 2 yrs 3 mos

      • SME on AWS Batch: developed architectures and metrics collection to help customers in Financial Services, Energy and Healthcare optimize for job throughput and cost. SME for Autonomous Vehicle simulation architectures (OEMs, Tier 1, Partners).
      • Led POCs and developed architectures and technical assets for Large Language Model training on AWS with Slurm and AWS Batch.
      • Led technical tutorials on HPC in the Cloud at the Supercomputing conference in 2020 and 2021. Delivered and assisted with workshops and chalk talks at AWS re:Invent 2021 and the AWS SF Summit 2022 on large-scale simulations and architectures for distributed Deep Learning training.
      • Mentored and trained teams and individuals on delivering content for HPC and large-scale simulations.
      • Led an AWS technical team that generated epidemiological simulations in early 2020; some of the work was highlighted by Werner Vogels (Amazon's CTO) on his blog (When scaling your workload is a matter of saving lives).
      • Worked with Product Management and Engineering on solving customer issues and defining new features and the roadmap. Wrote two narratives on service usability that were reused for GTMs and roadmap planning.

    • Senior Solutions Architect, HPC
      Dec 2017 - Apr 2020 · 2 yrs 5 mos

      • Developed architectures and best practices for running computational workloads on AWS (CFD, O&G and Autonomous Vehicles - AV). This involved performance testing, building prototypes and working with product management and engineers on features and issue resolution.
      • Created a large-scale simulation architecture on AWS Batch that unblocked customers in O&G, AV and Financial Services. This architecture is used for all large-scale workloads on the service (hundreds of thousands to millions of vCPUs).
      • Led, executed and advised on several large-scale workloads (1M+ vCPUs) for EDA, O&G (seismic imaging) and Autonomous Vehicle simulations (log/replay & virtual). One of the projects, in collaboration with Univa and Western Digital, is highlighted in a Jeff Barr blog post (see Media below).
      • Created HPC workshops and tutorial content for AWS, partner and external events (AWS re:Invent as well as Supercomputing and the Rice Energy HPC Conference). The workshop on AV simulations with CARLA on AWS Batch was used as a template by several AV startups, partners and automotive companies (OEM, Tier 1).
      • Initiated and led a tutorial on Best Practices for HPC in the Cloud as part of the tutorial track at Supercomputing (2019). The event had ~100 attendees (largest room) and a rating of 2.9/3.

  • HPC Software Specialist at TOTAL
    Jul 2013 - Dec 2017 · 4 yrs 6 mos

    • Led and contributed to several machine-learning-based applications and proofs of concept applied to image analysis, time series prediction and natural language processing using Python, R and Scala.
    • Developed and optimized machine learning pipelines on HPC and HPDA systems using Spark, Dask and multiprocessing.
    • Lead and architect of the Carbon project, an internal collaborative and knowledge capitalization platform for R&D and scientific developments with hundreds of users worldwide.
    • Taught courses on development workflows with Git, gitflow and JIRA.
    • Supported and optimized seismic software for HPC systems.
    • Communicated results through videos, newsletters, presentations and training.

  • Senior Software Analyst at A*STAR - Agency for Science, Technology and Research
    Mar 2012 - Jun 2013 · 1 yr 4 mos

    • Collaborated on the use of graph analytics for archaeology to find missing relationships between objects, places and individuals.
    • Taught courses on HPC tools, libraries and techniques to the Singaporean research community in collaboration with the University of Illinois at Urbana-Champaign.
    • Led and provided support through trainings and workshops in collaboration with researchers and experts in High Performance Computing.
    • Collaborated on a tender call to provision a new supercomputer for A*STAR, including benchmarking and submitting proposals.

  • Senior Software Engineer at CNRS
    Mar 2011 - Sep 2011 · 7 mos

    • Optimization and rewriting of an asynchronous distributed Krylov solver and preconditioner from Fortran to C using PETSc/SLEPc.
    • Modification of PETSc/SLEPc sources to integrate asynchronous MPI communication between groups of processes using MPI inter- and intra-communicators.