Senior Performance Engineer - Linux Kernel & HPC Infrastructure

Doghouse Recruitment

European Economic Area

Description

The organization

Our client operates one of the largest GPU infrastructures in the world across multiple global data centers. Their infrastructure doubles in size every year. We’re looking for engineers who love getting deep into Linux systems (including kernel-space), pushing hardware and software to their limits, and making the world’s fastest AI/ML workloads run even faster.

The role

You’ll join a small, senior team that works between the hardware and Linux OS layers, uncovering and solving performance problems that affect tens of thousands of GPUs. This is hands-on, high-impact engineering where microsecond gains matter and every optimization is felt at global scale.

The GPU & InfiniBand team is responsible for enhancing and optimizing the core components of the cloud platform, with a specific focus on Linux kernel space, GPU computing, InfiniBand networks, and the KVM/QEMU stack. This involves tracing, profiling, and tuning in kernel space rather than user space.

You’ll work closely with hardware virtualization and device emulation technologies, ensuring high performance and security in multi-GPU environments. The role involves analyzing, troubleshooting, and improving infrastructure to support new hardware, fine-tuning system performance, and automating fault detection and resolution in a complex system.

What you’ll do

The work centers on kernel-space profiling and tuning rather than user-space optimization, and it calls for strong low-level coding skills.

In this position, you will be responsible for:

  • Tuning the performance of clusters and high-speed networks, down to the Linux kernel level, to ensure optimal operation in HPC and GPU-based environments.
  • Analyzing and troubleshooting the root cause of issues related to GPUs and InfiniBand networks, and proposing corrective actions.
  • Integrating new hardware into the existing infrastructure, including support for new GPU hardware, through software stacks like Kubernetes, QEMU, and KVM.
  • Enhancing automation systems that proactively monitor, detect, and resolve issues in GPU and InfiniBand environments.
  • Configuring and managing GPU devices and InfiniBand fabrics, ensuring efficient and reliable operation.

Your profile

We expect you to have:

  • 4+ years of professional experience in system-level engineering (focused on performance optimization and low-level programming).
  • 3+ years of hands-on experience with Linux systems (administration, troubleshooting, performance tuning). This should include kernel-space expertise.
  • Proficiency with one or more relevant "tools of the trade": perf, (bp)ftrace, (e)BPF, kdb/kgdb, SystemTap, LTTng, blktrace, netstat, tuna, ethtool, sysctl, etc.
  • In-depth understanding of server architecture, including PCIe devices, NICs, and the Linux OS/kernel.
  • Strong proficiency in one or more performance-oriented programming languages (C/C++, Go, Python). This includes an excellent grasp of data structures & algorithms.

You also have experience with one or more of the following:

  • GPU end-to-end testing in a cluster environment using InfiniBand networking.
  • Analyzing and optimizing the performance of HPC workloads (e.g., simulations, data analysis, AI/ML workloads).
  • RDMA, RoCE, and InfiniBand protocols for high-performance communication.
  • Software-defined networking (SDN) and HPC cluster networking.
  • QEMU/KVM virtualization and managing virtualized environments.
  • Deep learning frameworks such as PyTorch and TensorFlow, and their integration with HPC systems.
  • Collective communication libraries like MPI and NCCL for distributed computing.

This is for you if you

  • Love solving deep technical challenges, care about performance down to the microsecond, and want to work on infrastructure that pushes the limits of what’s possible.
  • Get enthusiastic about joining a rapidly scaling organization and the opportunities this offers to take ownership and end-to-end responsibility.

What’s offered

  • Salary: up to 200k OTE.
  • Flexible working arrangements.
  • A dynamic and collaborative work environment that values initiative and innovation.
  • Location: Amsterdam (hybrid) or full-remote within Europe.