Taiwan
I am an M.S. student in Computer Science and Engineering at NYCU, focusing on efficient LLM inference systems. My work centers on profiling-driven optimization of LLM inference pipelines, with a particular emphasis on:

- Speculative decoding
- KV-cache efficiency
- Long-context generation
- Memory and bandwidth bottlenecks in large-scale inference

I am a co-first author of Dustin (ICML 2026), which introduces sparse verification for long-context speculative decoding, achieving up to 9.17× decode-stage speedup by reducing KV-cache loading overhead. I am also an equal-contribution second author of SubSpec (NeurIPS 2025), a training-free and lossless speculative decoding framework for offloaded LLMs, achieving up to 12.5× end-to-end acceleration.

I am interested in:

- LLM inference systems
- GPU performance optimization
- Efficient model serving (multi-batch, long-context, memory-constrained settings)

Feel free to reach out for collaboration or opportunities in ML systems and AI infrastructure.
Served as Teaching Assistant for the graduate-level course Edge AI (CSIC30166), assisting over 80 students in the first semester and 150 students in the second semester.

- Designed and graded programming assignments covering model compression (quantization and pruning), parallel inference of large language models (LLMs) across multiple hardware devices, and Triton kernel design.
- Developed and deployed optimized neural models on Raspberry Pi and other edge devices.
- Assisted students in implementing and optimizing AI algorithms and deployment pipelines.
- Surveyed MLIR compiler infrastructure and studied its applicability to LLM deployment workflows on the NPU platform.
- Explored the feasibility of mapping LLM workloads onto the NPU under hardware-specific constraints.