Eric Shamow

Reliability and Engineering Leadership

Portland, Oregon, United States

About

Profile: Leader, engineer and coach focused on increasing communication, transparency and collaboration across technology organizations, from a cloud/infrastructure perspective. Extensive experience facilitating culture and technology change in companies of all sizes in all verticals. Well-regarded writer, educator and presenter.

Skills: Ansible, Puppet, Systems Operations (Linux/UNIX/Windows), Infrastructure Architecture/Design, Ruby, Python, Continuous Integration/Delivery, DevOps, Product

Experience

Sourcegraph (Remote)
- Engineering Manager, Services Organization
  Sep 2024 - Present · 1 yr 10 mos
  Engineering Manager for Cloud Operations, Core Services and Cody Prime teams
- Engineering Manager, Cloud Operations
  Nov 2023 - Dec 2024 · 1 yr 2 mos
Director - Reliability, Operations and Support at HeadSpin
Apr 2023 - Nov 2023 · 8 mos
Engineering Director and subject matter expert leading a startup-to-enterprise transformation at a global mobile testing company. Responsible for unifying strategy across operational and customer-facing teams and connecting feedback to backend and platform organizations.
* Initiated and executed an effort to containerize the entire 30-service application platform.
* Developed, piloted and implemented a document-driven decision-making process and wrote several Architecture Decision Records for key initiatives, including Infrastructure as Code and containerization.
* Wrote extensive policy, including a revamped hiring process and incident management policy.
* Reorganized and consolidated the India-based support team, establishing an escalation team responsible for evaluating customer issues, developing policy and runbooks for Tier I support, and identifying target areas for automation and platform improvements.
SRE Manager at Twitter
May 2021 - Nov 2022 · 1 yr 7 mos
First full-time manager for the SRE team attached to the Compute Platform team, responsible for all Mesos and Kubernetes workloads and approximately 40% (250K hosts) of the overall fleet.
* Led development of a tool to improve and automate compliance and hardware operations, enabling service guarantees and reducing the time to update the Mesos fleet from multiple quarters to within 90 days.
* Oversaw the implementation of colocation and host optimization efforts, resulting in a CAPEX reduction of more than $160M in 2021-2022.
* Served on a three-manager team tasked with writing the 2022 Platform Reliability Strategy.
* Introduced unified metrics, developed new work and work intake processes, created an SRE backlog, and worked with Product and Platform Engineering leadership to prioritize that backlog, enabling the team to use SLOs to align with the Platform Reliability Strategy.
* Oversaw the handoff of Role Swap Automation, a tool developed within Compute SRE to enable the Hardware Operations team to self-serve moving hosts between teams.
BlueOwl, LLC (Portland, Oregon Metropolitan Area)
- Manager, Platform and Reliability
  Apr 2019 - May 2021 · 2 yrs 2 mos
  Acting technical director for a 40-person engineering team; manager of a 2-person platform team and a 6-person Site Reliability team.
  * Conceived and led a company-wide product function reorganization into stream-aligned, platform and supporting team structures, enabling rapid development of parallel work streams and a product-driven technical roadmap while scaling from 2,000 to 10,000 customers and preparing for national scale.
  * Conceived and led the engineering effort to establish Continuous Integration and Delivery across the engineering organization.
  * Established a work initiation process and architecture team to enable product-based prioritization of engineering work.
  * Served as Product Owner for the platform team, while managing a team of embedded SREs, to deliver a PaaS to backend, front-end and Data Engineering/Data Science teams.
  * Architected and oversaw the platform rewrite from application-specific logic for a third-party application core on DC/OS to a generic PaaS abstraction layer over Kubernetes, providing an on-demand API and web interface for instantiating dynamic application environments; oversaw the DC/OS-to-Kubernetes platform migration with zero product downtime.
- Site Reliability Engineer
  Jun 2018 - Apr 2019 · 11 mos
  First SRE at a cloud startup acquired by a Fortune 50 regulated insurer.
  * Developed and built Kubernetes- and Terraform-driven infrastructure supporting a web/mobile application and a complex data engineering/event processing pipeline.
  * Piloted split of Infrastructure team into separate Operations team, Platform and Reliability teams, migrating company infrastructure from ad hoc to a Kubernetes-based platform, while hiring, training and leading the new 8-person Platform & Reliability team to prepare for national scaling.
  * Developed the technical plan for preparing infrastructure management for scale.
  * Encoded complex regulatory workflows into modular IaC (a mix of Terraform and Python running in AWS Lambda), allowing engineers to reuse and combine them.
  * Provided operations and support for the platform and related infrastructure components, including a Kafka/Cassandra-driven Data Science pipeline, with on-call responsibilities and distributed systems troubleshooting.
Lead Platform Engineer at The Standard
Apr 2017 - Jun 2018 · 1 yr 3 mos
Spearheaded a new engineering team implementing infrastructure as code as part of a full-organization DevOps/Agile transformation in a regulated industry.
* Technical Lead for the new team, transforming traditional/legacy infrastructure into a Phoenix server, pull request-driven, cloud native environment.
* Designed, developed and enforced coding standards, testing frameworks, and integration tools bridging custom enterprise software with COTS tools such as vCenter, InfoBlox, ServiceNow and Venafi, utilizing collaboration tools including Jenkins, JIRA, Slack and git, tied together with our own toolsets in Ruby, Packer, Ansible, Python and Groovy.
* Managed the backlog and coached the organization on the transformation to a product model; ran cross-functional grooming, planning and story-writing sessions and developed PRDs for new products and services.
* Served as internal Product Owner for the Automation Community of Practice, building out the delivery pipeline and bringing server delivery time from 120 days to 10 minutes.