Greater Seattle Area
Team: Metadata
Team: Cloud Capacity Planning and Management
* Led multiple 12+ month programs automating capacity management across GCE, BigQuery, Persistent Disk, and other GCP Long Tail Services. These systems proactively reallocate capacity based on dynamic supply and demand, ensuring optimal resource utilization.
* Led the program that streamlines onboarding new capacity (turning up new clusters and adding new machines to existing clusters) and managing existing capacity.
* Contributed to making GCP profitable with $XXX Million in annual efficiency gains, reducing idle capacity, safety stock, and manual toil.
* Served as a TL and worked with management to grow the team from 5 to 35+ SWEs. Assisted new hires at various levels in ramping up, defined quarterly OKRs and roadmaps for the team, assigned and tracked work, and facilitated multiple SWEs' promotions to the next level.
Team: Cloud Capacity Planning and Management
* Led efforts to transition capacity management tools (libraries, CLIs, cron jobs, playbooks) into services with high reliability, resulting in an 87% reduction in production incidents.
* Reduced the blast radius from global to zonal, and designed/implemented monitoring and alerting systems for proactive incident detection.
* Contributed to building tools that change machine/resource (CPU/RAM/SSD) assignments between various GCP capacity pools. Created user guides enabling operational teams to move capacity on demand with ease.
* Improved Apache Spark’s shuffle performance by up to 2.7x by migrating the shuffle layer to a customized RDMA-based solution.
* Improved Apache Spark’s RDD caching performance by integrating with IBM’s Databroker to offload RDDs, increasing memory utilization from 78% to 95% and reducing garbage collection overhead by 83%.