Boulder, Colorado, United States
* Planet-scale, highly reliable systems: experience running these systems and maintaining their reliability.
* Large-scale distributed system design: designing robust and highly reliable distributed systems. Able to identify potential weak points, SPOFs and hot spots, and to perform cost-benefit/tradeoff analysis.
* Debugging and problem solving: root-cause analysis of problems, damage control, incident response and management.
* Production monitoring: experience designing and maintaining monitoring systems for large-scale distributed systems. Able to design and implement a robust monitoring system with a low rate of noisy alerts.
* Infrastructure engineering: building and maintaining cloud-based and distributed software systems, dealing with the full lifecycle (feature engineering, testing and validation, builds and deployment).
Member of the infrastructure team. Initiated and completed the migration of the entire production system to k8s, improved monitoring, handled numerous outages, took part in the post-mortem culture, and implemented numerous performance and diagnostic features in a complex distributed system. Always focused on reducing engineering friction.
I walked ~2,500 miles of the Pacific Crest Trail (PCT) from the Mexican border to the Canadian border in 140 days. I marveled at the beauty of the wild places, slept in the woods and enjoyed life to the fullest.
Search Infrastructure. Responsible for building and maintaining a scalable, user-facing content recommendation system. Other responsibilities included incident response, monitoring, elimination and retirement of complex legacy systems, managing cross-team refactoring efforts, designing and implementing features for improved debugging and robustness, design reviews, and promotion of knowledge sharing.
Member of the Abuse SRE team in San Francisco. Responsible for running, maintaining and scaling the systems that detect malicious content and activity on Google properties. Responsibilities included performance and scalability testing, weak point identification, maintaining monitoring, and operations automation.
Member of the Search Ads Serving SRE team, keeping critical infrastructure up and running. Responsible for the scalability and reliability of large-scale, low-latency distributed systems. Work included capacity planning, performance testing, and developing tools and automation for various tasks.
Design and implementation of the "IPCorder" surveillance system running on embedded ARM and PowerPC devices.