Bengaluru, Karnataka, India
I build and lead high-performance engineering teams that optimize deep learning workloads across heterogeneous compute platforms: GPU, CPU, and custom AI accelerators. My expertise spans end-to-end system software: MLIR-based compilers, kernel fusion, scheduling, distributed inference runtimes, and scalable data pipelines for multi-trillion-token pretraining. I’ve delivered production-grade systems for AI training and inference, including inference serving platforms, PyTorch MLIR backend infrastructure, and 2T+ token data factories. My career includes 20+ years across Cerebras and Intel, driving performance breakthroughs (1.3×–4× kernel speedups, 2× BERT roofline performance, 3× data throughput improvements) and building organizations that innovate at the boundary of hardware and AI. I thrive at the intersection of AI models, compilers, kernels, and semiconductor hardware, translating architecture constraints into scalable, efficient system software. My goal: accelerate the next generation of GenAI workloads through world-class engineering leadership and full-stack optimization.
High-Performance Inference: Directing kernel, runtime, and compiler
Scalable Infrastructure: Developed a GPU-accelerated "Data Factory" pipeline for agentic RL corpus generation and optimized distributed data pipelines for SOTA pretraining.
Kernel Delivery: Led a global engineering organization delivering 50+ deep learning math kernels for Intel GPU and Gaudi hardware, achieving up to 4× speedups for LLaMA and other GenAI models.
Compiler Optimization: Drove adoption of MLIR-based compilers and designed pattern-rewriting subsystems to optimize PyTorch operator lowering and fusion efficiency.
Performance Benchmarking: Resolved critical CPU/GPU pipeline bottlenecks to ensure competitive MLPerf results and improved end-to-end throughput by 30–50% across key AI workloads.
Organization Scaling: Built Intel India’s Deep Learning Kernel R&D organization from the ground up, doubling the team size and establishing a sustainable pipeline of technical leadership.
State-of-the-Art Performance: Delivered a 2× speedup for BERT on Intel Nervana (NNP) architecture by pioneering GEMM stacking and kernel fusion techniques that hit the theoretical roofline limit.
• Responsible for delivering and certifying various SRS post-processing modules for the TV550 platform.
• Study and implementation of dual decoding and rendering for Dolby MS10 audio codecs on the TV550 platform.
• Optimization of modules such as the FFT of the Dolby Pulse audio decoder.
• Leading a team of engineers to resolve issues arising from system integration and testing.
• Design and implementation of various audio codec and post-processing modules, such as a downmixer and a LOAS parser.
• Proposal of a comprehensive testing framework and test specification, along with a certification strategy, for various audio codecs.
• Analysis of user requirements for, and implementation of, automatic format detection and switching of audio decoders based on the incoming audio bitstream (HEAAC/AC3Plus/MPEG1L2) in a digital broadcast scenario.
• Onsite support for system integration and fixing issues at the customer site.
• Patent granted: APPARATUS FOR RECEIVING AND RENDERING AUDIO-VIDEO STREAMS AND TRANSMITTING A FURTHER STREAM TO AN EXTERNAL DEVICE
• Porting of the audio equalizer to the tm3260 platform for the Media processing Tool Kit.
• Involved in defining and implementing the interfaces for various audio post-processing components such as virtualization, equalization, and audio routing.
• Design and implementation of a reference player to demonstrate MP3 file playback using two different cores (MIPS and Trimedia DSP): the MIPS core for control and I/O operations such as file read and file write, and the Trimedia core for the actual MP3 decoding and rendering.
• Extensive onsite customer support to take the platform into production.
• Redesign and refactoring of the audio subsystem for a mainstream hybrid DVB LCD TV (TV520).
• Defining the test strategy for emulation testing and unit testing of drivers during the early stage of system-on-chip development.
• Involved in system-level integration of the audio subsystem and fixing system issues arising from integration.
• Implementation of NHAPI interfaces for SPDIF in and SPDIF out components.
• Implementation of I2C drivers for the Motorola Star12 processor to control on-board peripherals such as the RTC and EEPROM.
• Implementation of a timer unit to function as a UART.