San Francisco Bay Area
Site Reliability Engineer / Software Engineer interested in diagnosing, designing, implementing, and troubleshooting large-scale Cloud platforms, with a focus on Machine Learning infrastructure. Proven record in leading multi-stakeholder projects that require highly efficient and practical solutions. Experience with rapidly learning new codebases/systems and effectively leading engineering teams to achieve independence.
Core competencies:
- Kubernetes, Go, Python, Java, Linux
- Automation, observability, infrastructure deployment
- Incident management and on-call
- Technical project and people management
- Team lead for two machine learning systems: an ML inference and an ML training Kubernetes cluster, plus associated API, web frontend, service discovery, monitoring, testing, and deployment infrastructure.
- Brought a 9-member team of new US hires (junior and senior) to 100% operational independence from the original non-US service owners, and kickstarted a nucleus of software engineering expertise on these systems within the US team.
- Designed and implemented projects to improve service redundancy, GPU utilization, ease of debugging for client teams, and deployment reliability via improved end-to-end testing.
- Lead on new data center turnup for the team's services.
- Mentor team members on individual pull requests as well as larger-scale designs.
- Plan near- and long-term (1yr+) team goals.
- Foster collaboration within the team, and between the team and both US and non-US leadership.
Applied Machine Learning
Apps Core Infrastructure manages the internal settings database used by users of all Google Workspace (i.e., Google Apps for Enterprise) applications (>1M QPS) as well as the Groups infrastructure used by all internal and many external users (>100k QPS).
- Settings database: Lead on the effort to migrate internal clients of the Google Workspace customer settings database (>10k QPS) from a 10-year-old implementation to a much more performant datastore to enable the continued scaling of the settings database.
- Groups: Took advantage of new infrastructure to measure service level agreement (SLA) metrics for 20+ individual RPCs, enabling us to achieve a higher SLA for the service by identifying regressions that impacted it.
- Groups: Implemented server-level throttling that enabled the server to shed bursts of traffic that were reducing its availability, unblocking a critical feature launch.
- Hosted a summer intern, enabling them to deliver two projects requiring Google engineer-level design and implementation work by the end of the internship.
- As lead, migrated services implementing Google Vault (used for legal discovery) to the microservice-based standard listed below. Multi-quarter project requiring deep knowledge of the services and their dependencies to avoid disrupting this critical revenue-generating service.
- Lead on development of a tool to automate setup and migration of 100+ large-scale user-facing services to a microservice-based standard, in order to allow engineers to successfully manage these services with less need for idiosyncratic knowledge.
- Automated the migration of Google Workspace services between data centers to take advantage of new capacity.
- Maintained high availability of the distributed systems behind Google Workspace, including developing monitoring and resolving major incidents.
- Performed >100 interviews for software and site reliability engineering candidates, including for coding and system design.
* Google Calendar and Sites
- Used the Go programming language to develop a significant system (~10K LOC) to automate a previously manual infrastructure maintenance process.
- Analyzed the code architecture of a long-lived user-facing service in order to quantify its current 30-day active users and perform maintenance (using Java).
- Wrote code to verify backup and restore process integrity for Google Sites.
- Maintained the high availability of the distributed systems behind Google Calendar and Sites, including developing monitoring and quickly resolving pages and other customer-facing incidents.
Learned how to manage the InterSystems Caché database system on several Unix platforms, including Linux and AIX. Created and implemented tests for our customer-released Perl administrative scripts.
Created and implemented a DFT hybrid functional in the Gaussian computational chemistry program.
Atomic symmetry calculations, DFT functionals in the Gaussian computational chemistry program