Prague, Prague, Czechia
I am a seasoned researcher in speech recognition and natural language processing, with over 20 years of experience spanning industry and academia. For the past four years, my focus has been on automatic speech recognition (ASR) and speech synthesis (TTS) for non-standard speech, including speech from individuals with ALS, TBI, and CP. This work involved designing and developing data collection and processing pipelines, as well as applying and evaluating AI techniques to improve recognition accuracy for atypical speech patterns. Previously, I worked as a Machine Learning Scientist at Apple, contributing to Siri's server-side and on-device (private federated learning) models. I have also collaborated with several startups, modernizing their ASR pipelines using state-of-the-art AI technologies. In academia, I served as an Assistant Professor at Charles University in Prague, where I led a research group of 4 PhD and numerous MSc students, many of whom received national awards and secured internships or positions at leading AI companies such as Apple, Google, and Microsoft. Earlier in my career, I held postdoctoral positions at Cambridge University (UK) and was a Visiting Scholar at Johns Hopkins University (USA).
Working on agentic solutions in the legal space.
I focused on developing cutting-edge speech recognition solutions and dialogue systems to enhance user interaction.
• Utilized Kaldi/K2 to improve accuracy and efficiency in speech recognition applications.
• Implemented algorithms in C++, Python, and PyTorch for both server-side and on-device processing.
• Optimized large language models for effective data processing and conversational agent functionality.
Developed and trained server-side streaming Automatic Speech Recognition (ASR) systems (K2 / Zipformer2). Engineered on-device (in-browser) streaming ASR (K2 / Zipformer2, TypeScript, ONNX, TensorFlow.js, WebNN), collaborating with Intel’s WebNN team to leverage the latest AI capabilities of Intel’s NPU and GPU technologies for laptops. Implemented data augmentation for dysarthric speakers through speech synthesis, using TTS models such as PIPER (VITS) to train personalized Text-to-Speech (TTS) systems. Reported directly to the CEO and company leadership, providing updates and strategic insights (BI Redash / SQL). Presented research and in-house speech recognition technology to potential investors, communicating technical advancements and business potential. Managed a 3-person team responsible for continuous speech annotation, ensuring quality and throughput. Oversaw the transition of the AI research compute cluster to AWS with the Slurm workload manager, optimizing performance and scalability.
Managed a team of 4 speech researchers, setting priorities and reporting progress to management. Monitored ASR performance indicators both in-lab and in production, ensuring optimal system performance. Led hiring efforts for new team members, overseeing recruitment and selection processes. Developed and implemented a DNN-based wake word detector (Kaldi TDNN-like architecture). Enhanced speech recognition accuracy by tuning model hyperparameters (size, learning rates, ...). Created and deployed a DNN-based speech synthesis backend server, supporting advanced speech processing (PIPER; voices trained on public datasets, e.g. VCTK, LibriTTS). Deployed a prototype speech data collection platform based on CommonVoice software, facilitating the gathering of high-quality speech data for training. Developed in-house guidelines for continuous speech annotation, improving consistency and efficiency in the process. Prepared requirements for a continuous speech annotation tool, aligning technical specifications with team needs.
Deployed and managed Azure CycleCloud with the SGE scheduler, improving scalability and significantly increasing team efficiency in running large-scale ASR experiment workloads. Developed and implemented on-device (iOS) Voice Activity Detection (VAD) models (CNN/TDNN-like) and supporting code, enhancing real-time audio processing capabilities.
Developed on-device private federated learning systems for Siri, enabling privacy-preserving model training across distributed user devices. Conducted research on natural language generation (NLG) for Siri, focusing on improving the quality and responsiveness of conversational interactions.
Mainly worked with Omilia. Developed KALDI-based acoustic models for Automatic Speech Recognition (ASR), enhancing recognition accuracy and robustness. Integrated KALDI’s online decoder into proprietary ASR pipelines, enabling efficient real-time speech recognition. Designed methods for on-the-fly composition of acoustic models and statistical language model (LM) decoding grammars, optimizing runtime flexibility. Implemented techniques for on-the-fly interpolation of decoding grammars, improving adaptability to dynamic speech contexts. Implemented a Voice Activity Detector (VAD) using KALDI's NNET3 framework, enabling efficient speech segmentation. Integrated KALDI’s speaker recognition tools into proprietary systems, extending capabilities in speaker verification and identification.