Vijay Srinivas

Software Leader

Bengaluru, Karnataka, India

About

I build and lead high-performance engineering teams that optimize deep learning workloads across heterogeneous compute platforms: GPU, CPU, and custom AI accelerators. My expertise spans end-to-end system software: MLIR-based compilers, kernel fusion, scheduling, distributed inference runtimes, and scalable data pipelines for multi-trillion-token pretraining.

I’ve delivered production-grade systems for AI training and inference, including inference serving platforms, PyTorch MLIR backend infrastructure, and 2T+ token data factories. My career includes 20+ years across Cerebras and Intel, driving performance breakthroughs (1.3×–4× kernel speedups, 2× BERT roofline performance, 3× data throughput improvements) and building organizations that innovate at the boundary of hardware and AI.

I thrive at the intersection of AI models, compilers, kernels, and semiconductor hardware, translating architecture constraints into scalable, efficient system software. My goal: accelerate the next generation of GenAI workloads through world-class engineering leadership and full-stack optimization.

Experience

  • Software Leader at Cerebras Systems
    Dec 2024 - Present · 1 yr 8 mos

    • High-Performance Inference: Directing kernel, runtime, and compiler
    • Scalable Infrastructure: Developed a GPU-accelerated "Data Factory" pipeline for agentic RL corpus generation and optimized distributed data pipelines for SOTA pretraining.

  • Intel Corporation (Full-time · 12 yrs 8 mos)
    • Senior Engineering Manager – AI Kernels, Compiler & Performance
      Mar 2018 - Nov 2024 · 6 yrs 9 mos

      • Kernel Delivery: Led a global engineering organization delivering 50+ deep learning math kernels for Intel GPU and Gaudi hardware, achieving up to 4× speedups for LLaMA and other GenAI models.
      • Compiler Optimization: Drove adoption of MLIR-based compilers and designed pattern-rewriting subsystems to optimize PyTorch operator lowering and fusion efficiency.
      • Performance Benchmarking: Resolved critical CPU/GPU pipeline bottlenecks to ensure competitive MLPerf results and improved end-to-end throughput by 30-50% across key AI workloads.
      • Organization Scaling: Built Intel India’s Deep Learning Kernel R&D organization from the ground up, doubling the team size and establishing a sustainable pipeline of technical leadership.
      • State-of-the-Art Performance: Delivered a 2× speedup for BERT on Intel Nervana (NNP) architecture by pioneering GEMM stacking and kernel fusion techniques that hit the theoretical roofline limit.

    • Firmware Design Engineer
      Apr 2012 - Mar 2018 · 6 yrs

  • Senior Technical Leader at Trident Microsystems
    Feb 2010 - Aug 2012 · 2 yrs 7 mos

    • Responsible for delivering and certifying various SRS post-processing modules for the TV550 platform.
    • Study and implementation of dual decoding and rendering for Dolby MS10 audio codecs on the TV550 platform.
    • Optimization of modules such as the FFT of the Dolby Pulse audio decoder.
    • Led a team of engineers resolving issues arising from system integration and testing.

  • NXP Semiconductors (5 yrs 1 mo)
    • Technical Leader
      2007 - 2009 · 2 yrs

      • Design and implementation of various audio codec and post-processing modules, such as a downmixer and a LOAS parser.
      • Proposed a comprehensive testing framework and test specification, along with a certification strategy, for various audio codecs.
      • Analyzed user requirements for, and implemented, AutoFormat detection and switching of audio decoders based on the incoming audio bit stream (HEAAC/AC3Plus/MPEG1L2) in a digital broadcast scenario.
      • Onsite support for system integration and issue fixing at the customer site.
      • Patent granted: APPARATUS FOR RECEIVING AND RENDERING AUDIO-VIDEO STREAMS AND TRANSMITTING A FURTHER STREAM TO AN EXTERNAL DEVICE

    • Senior Software Engineer
      2006 - 2007 · 1 yr

      • Porting of the Audio Equalizer to the tm3260 platform for the Media processing Tool Kit.
      • Involved in defining and implementing the interfaces for various audio post-processing components such as virtualization, equalization, and audio routing.
      • Design and implementation of a reference player to demonstrate MP3 file playback using two different cores (MIPS and Trimedia DSP): the MIPS core for control and I/O operations such as file read and file write, and the Trimedia core for the actual MP3 decoding and rendering.
      • Extensive onsite customer support to take the platform into production.

    • Software Engineer
      2004 - 2006 · 2 yrs

      • Redesign and refactoring of the audio subsystem for a mainstream hybrid DVB LCD TV (TV520).
      • Defined the test strategy for emulation testing and unit testing of drivers during the early stage of system-on-chip development.
      • Involved in system-level integration of the audio subsystem and fixing system issues arising from integration.
      • Implementation of NHAPI interfaces for SPDIF in and SPDIF out components.

  • Graduate Engineer Trainee at Lnt EmSys
    2003 - 2004 · 1 yr

    • Implementation of I2C drivers for the Motorola Star12 processor to control on-board peripherals such as RTC and EEPROM.
    • Implementation of a timer unit to work as a UART.