Santa Clara, California, United States
I am a Ph.D. student in Computer Science at the University of Utah, where I also work as a research assistant under Prof. Sadayappan's supervision. My research focuses on developing and optimizing high-performance computing (HPC) and artificial intelligence (AI) kernels for various platforms and frameworks, such as Kokkos, SYCL, and the Cerebras CS-2 AI accelerator. I have over five years of experience conducting research and collaborating with academic and industry partners, including IBM Research, Intel AI, Berkeley Lab, AMD Research, and Ohio State University. I have published and presented my work at conferences and have received multiple honors and scholarships for academic achievement. I am passionate about solving challenging problems in HPC and AI and contributing to the advancement of scientific discovery and innovation.
Developing HIP kernels for ML libraries (MIOpen, CK) on the ROCm platform.
I explore novel hardware/software co-design techniques. The primary goal is to improve GPU performance and power efficiency, focusing on key computational kernels in two areas: Large Language Models (LLMs) and High-Performance Computing (HPC).
As a Co-Op Engineer intern, I design and optimize HIP kernels for the MIOpen framework and research machine learning runtimes to improve efficiency and reliability.
Contributing to research on optimizing tensor contractions for a Lattice QCD application, using a tensor contraction tree scheduling methodology to distribute tasks across multiple GPUs, under the supervision of Aydin Buluc and Oguz Selvitopi at Passion LAB.
• Project 1: "Tensor Contraction Kernel Development on GPUs for the Kokkos and SYCL Frameworks." This research is developing an optimized KokkosTensor API that supports tensor transposes and tensor contractions, as well as the optimization of tensor expressions that combine tensor contractions with other element-wise tensor operators. I am mainly responsible for developing loop-based tensor contraction implementations on GPUs in CUDA/HIP, and I have been designing an effective tensor contraction kernel for Kokkos, including architecture-aware tuned implementations (e.g., for NVIDIA, AMD, and Intel accelerators).
• Project 2: "ML Kernel Development for the Cerebras CS-2 AI Accelerator." My current focus is implementing ML kernels for the Cerebras CS-2 architecture using their private SDK and data-flow language, CSL, which lets us design data-flow programs across the PEs of the CS-2 accelerator. The CS-2 is a 2D mesh-based AI accelerator consisting of 850,000 PEs (compute units). I aim to develop a kernel for Transformer models in NLP on the CS-2 that minimizes data movement between the device and the host.
• Project 3: "Compressing Transformers with Tensor Factorizations." In this project, we seek answers to two questions: Can we effectively compress Transformers with tensorized components? How do we choose a good, or the best, tensor factorization without sacrificing the accuracy of the Transformer model? We aim to develop new compressed, decomposed Transformer models using several methods (tensor networks, tensor train decomposition, etc.).
Working on the optimization of tensor contractions for a nuclear physics application by distributing tensor contractions across multiple GPUs using partitioning, under the supervision of Prof. Aydin Buluc at Passion LAB (https://passion.lbl.gov).
I worked on the BERT model for the Intel AI TensorFlow team under the supervision of Wei Wang. Our work was presented and published at SC20 as a research poster. http://sc20.supercomputing.org/proceedings/tech_poster/poster_files/rpost111s2-file3.pdf https://sc20.supercomputing.org/proceedings/tech_poster/tech_poster_pages/rpost111.html