Jian-Jia Chen

M.S. @ NYCU | LLM Inference Systems & Optimization | ICML 2026, NeurIPS 2025

Taiwan

About

I am an M.S. student in Computer Science and Engineering at NYCU, focusing on efficient LLM inference systems. My work centers on profiling-driven optimization of LLM inference pipelines, with a particular emphasis on:
- Speculative decoding
- KV-cache efficiency
- Long-context generation
- Memory and bandwidth bottlenecks in large-scale inference

I am a co-first author of Dustin (ICML 2026), which introduces sparse verification for long-context speculative decoding, achieving up to 9.17× decode-stage speedup by reducing KV-cache loading overhead. I am also an equal-contribution second author of SubSpec (NeurIPS 2025), a training-free and lossless speculative decoding framework for offloaded LLMs, achieving up to 12.5× end-to-end acceleration.

I am interested in:
- LLM inference systems
- GPU performance optimization
- Efficient model serving (multi-batch, long-context, memory-constrained settings)

Feel free to reach out for collaboration or opportunities in ML systems and AI infrastructure.

Experience

  • Teaching Assistant at National Yang Ming Chiao Tung University
    Feb 2024 - Jul 2025 · 1 yr 6 mos

    Served as Teaching Assistant for the graduate-level course Edge AI (CSIC30166), assisting over 80 students in the first semester and 150 students in the second semester.
    - Designed and graded programming assignments covering model compression (quantization and pruning), parallel inference of large language models (LLMs) across multiple hardware devices, and Triton kernel design.
    - Developed and deployed optimized neural models on Raspberry Pi and other edge devices.
    - Assisted students in implementing and optimizing AI algorithms and deployment pipelines.

  • Software Engineer at Neuchips
    Apr 2025 - May 2025 · 2 mos

    - Surveyed MLIR compiler infrastructure and studied its applicability to LLM deployment workflows on the NPU platform.
    - Explored the feasibility of mapping LLM workloads onto the NPU under hardware-specific constraints.