Markus Toman

Head of Veritone Labs

Altenmarkt im Pongau, Salzburg, Austria

About

Principal AI Architect and Research Lead living at the intersection of deep learning and systems engineering. My journey started with typing initially random incantations at the DOS prompt of a trusty 386 and has since taken me through embedded systems, medical computer vision, 3D graphics, a PhD in Computer Science, and almost a decade in speech synthesis research, into multimodal systems and cognitive agent architectures. I am equally at home dissecting latent space representations and reading GPU flamegraphs: the science and the metal. That combination of research depth and systems instinct is what I bring to every problem, whether forcing a research-grade model onto a constrained edge device or architecting a production-scale multimodal pipeline. For the past several years I have directed R&D across US and European time zones, leading teams and setting technical direction while staying close to the code, the models, and the infrastructure that make things actually work.

Experience

  • Veritone (Full-time · 4 yrs 1 mo)
    • Head of Veritone Labs
      Apr 2024 - Present · 2 yrs 3 mos

      • Directed applied R&D for multimodal systems architectures, driving the transition from SaaS dependencies to in-house, hardware-aware AI pipelines.
      • Engineered a highly reliable, POMDP-based analytical agent, replacing black-box LLMs with a strict state machine and a custom, token-efficient DSL. Built a Polars-backed hierarchical memory system (L1-L3) for deterministic data lineage and dynamic agent context management.
      • Architected a low-latency federated multimodal search engine (videos, images, audio, documents), using custom fine-tuned SLMs for zero-shot query analysis. Leveraged SGLang and constrained decoding to guarantee deterministic routing across semantic (embedding) indices and legacy sparse search systems.
      • Optimized aiWARE Vision-Language Model (VLM) inference pipelines for large-scale media processing. Maximized GPU throughput via smart KV caching mechanisms and flamegraph analysis, accelerating spatial-temporal telemetry extraction across broadcast, police, and gaming environments.

    • Principal Applied Scientist
      Oct 2023 - Apr 2024 · 7 mos

      • Architected the core video semantic search engine for large-scale media (news, entertainment, sports, law enforcement). Dissected foundational vision embedding models and analyzed latent spaces to engineer semantic scene boundary detection, mitigate high-similarity noise edge cases, and optimize indexing tradeoffs.
      • Engineered specialized ML media processing pipelines, including neural speech anonymization models for witness protection. Debugged and refactored underlying C++ video decoding libraries to solve data-loading issues during embedding extraction.
      • Designed and deployed "Veri", the Veritone-internal agent available in Slack and through a dedicated Web UI. Built a comprehensive ETL pipeline featuring incremental indexing, hierarchical document chunking, automated summarization, and strict access-control enforcement, with an automated feedback loop to continuously curate and improve internal documentation.

    • Head of Voice Engineering
      Jun 2022 - Oct 2023 · 1 yr 5 mos

      • Led R&D and engineering for highly expressive, personalized Text-to-Speech systems, bridging foundational deep learning with commercial digital human and assistive applications.
      • Architected custom transformer- and diffusion-based TTS models with explicit control parameters for highly nuanced, personalized voice cloning for assistive technology, celebrities and commercial avatars (e.g. Cameo Kids).
      • Engineered automated data-cleaning and MLOps pipelines, scaling the infrastructure to systematically annotate datasets and train, evaluate, debug, and deploy thousands of custom voice models in production.

  • Head of Research and Development at VocaliD (acquired by Veritone)
    Apr 2016 - Jun 2022 · 6 yrs 3 mos

    • Drove the core AI R&D behind VocaliD's mission: giving people who cannot speak a voice that sounds like them. Built on a crowdsourced voicebank of 60,000+ donors and deployed to AAC devices for users with ALS, cerebral palsy, and severe dysarthria.
    • Engineered personalized TTS models across the full deep learning arc, from HMM baselines through DNN acoustic models, CNN/RNN sequence models, and Transformer architectures to GAN- and diffusion-based acoustic/prosodic models and vocoders. Strong focus on explicit prosody, duration, and phonetic control to preserve each user's residual vocal identity.
    • Deployed production TTS inference in C++ across Android, iOS, Windows, and embedded AAC hardware of partner companies, forcing research-grade architectures into the memory and latency budgets of early mobile and assistive devices.

  • External Lecturer at Fachhochschule Wiener Neustadt
    Sep 2018 - Feb 2022 · 3 yrs 6 mos

    Taught "Operating Systems and Networks", grounding students in hardware-software interaction, resource allocation, and concurrency from first principles.

  • Co-Owner at MyGEWO.at
    Jul 2012 - Dec 2016 · 4 yrs 6 mos

    Architected and launched the Austrian housing search engine mygewo.at, scaling community infrastructure to 30,000+ active users.

  • Researcher at Telecommunications Research Center Vienna (ftw.)
    Apr 2012 - Jan 2016 · 3 yrs 10 mos

    • Published 15 peer-reviewed papers on statistical parametric speech synthesis, with a focus on accessibility for blind and visually impaired users and unsupervised dialect conversion, work that directly seeded the PhD thesis and later VocaliD research.
    • Built open-source C++ inference engines that shipped statistical TTS models to Android, iOS, and BlackBerry under the memory and compute constraints of 2012-era mobile hardware.