San Francisco, California, United States
Working with the Voyage AI team to scale their production inference stack so it can handle third-party traffic and the large, global traffic stream of MongoDB auto-embedding.
Scaled SambaCloud (https://cloud.sambanova.ai/) over its first year.
- Led the implementation of a queuing system in Rust to meet customer SLAs and achieve high hardware utilization.
- Rewrote the billing system to support thousands of monthly customers, integrating AWS and Stripe.
- Ensured system reliability as SambaCloud scaled to billions of tokens served each day by thousands of accelerator chips across three data centers around the world.
Led SambaCloud’s repackaging into several product lines to meet a diverse set of customer needs.
- Streamlined SambaCloud administration so that external users could operate it.
- Enabled a hosted deployment with customer administration and SambaNova-managed hardware.
- Enabled an on-prem deployment with customer administration and customer-owned hardware.
- Enabled traffic sharing between customers and SambaNova to increase reliability and profitability.
Architected SambaCloud (https://cloud.sambanova.ai/).
- Designed the cloud architecture and worked with a high-performance team to launch it in just two months.
- Implemented the services responsible for handling API requests and routing them to hardware (using Go, Python, and Redis Lua scripts).
- Wrote the Helm chart that orchestrated the components across the production cluster.
Designed runtime APIs to enable new product lines.
- Built consensus with key stakeholders on the requirements for the new APIs.
- Led a team to implement the new design quickly and efficiently.
- Coordinated with several teams across time zones to release the product on schedule.
Created cross-cutting performance improvements in the inference stack.
- Directly optimized the C++ runtime code to reduce host overhead from 100 ms per invocation to 0.5 ms.
- Worked with many teams to introduce a new inference flow that amortizes host overhead across multiple tokens.
- Drove the technical implementation of continuous batching across five teams, vastly increasing serving efficiency.
Designed a new JIT integration for PyTorch to run on SambaNova’s hardware.
- Proposed the project (inspired by torch_xla) and created a prototype.
- Integrated with the PyTorch dispatcher to create a seamless user experience.
- Led a team to create ATen to SambaNova MLIR lowerings.
- Developed the first dynamic memory management capabilities for SambaNova hardware.
Created an augmented reality remote-assist application for the HoloLens that interfaced with Internet of Things devices over Bluetooth and used machine learning for predictive analytics.