Robert Irelan

Making compute capacity available for AGI

San Francisco Bay Area

About

Site Reliability Engineer / Software Engineer interested in diagnosing, designing, implementing, and troubleshooting large-scale Cloud platforms, with a focus on Machine Learning infrastructure. Proven record in leading multi-stakeholder projects that require highly efficient and practical solutions. Experience with rapidly learning new codebases/systems and effectively leading engineering teams to achieve independence.

Core competencies:
- Kubernetes, Go, Python, Java, Linux
- Automation, observability, infrastructure deployment
- Incident management and on-call
- Technical project and people management

Experience

  • Member of Technical Staff at OpenAI
    Oct 2024 - Present · 1 yr 9 mos

  • TikTok (Full-time · 2 yrs 10 mos)
    • Team Lead, Applied Machine Learning SRE
      Feb 2022 - Oct 2024 · 2 yrs 9 mos

      - Team lead for two machine learning systems: an ML inference and an ML training Kubernetes cluster, plus associated API, web frontend, service discovery, monitoring, testing, and deployment infrastructure.
      - Brought a 9-member team of new US hires (junior and senior) to 100% operational independence from the original non-US service owners, and kickstarted a nucleus of software engineering expertise on these systems within the US team.
      - Designed and implemented projects to improve service redundancy, GPU utilization, ease of debugging for client teams, and deployment reliability via improved end-to-end testing.
      - Lead on new data center turnup for the team's services.
      - Mentor team members on individual pull requests as well as larger-scale designs.
      - Plan near- and long-term (1yr+) team goals.
      - Foster collaboration within the team and between my team and both US and non-US leadership.

    • Site Reliability Engineer
      Jan 2022 - Feb 2022 · 2 mos

      Applied Machine Learning

  • Google (San Francisco Bay Area)
    • Site Reliability Engineer
      Sep 2015 - Jan 2022 · 6 yrs 5 mos

      Apps Core Infrastructure manages the internal settings database used by users of all Google Workspace (i.e., Google Apps for Enterprise) applications (>1M QPS) as well as the Groups infrastructure used by all internal and many external users (>100k QPS).

      - Settings database: Lead on the effort to migrate internal clients of the Google Workspace customer settings database (>10k QPS) from a 10-year-old implementation to a much more performant datastore, enabling the settings database to continue scaling.
      - Groups: Took advantage of new infrastructure to measure service level agreement (SLA) metrics for 20+ individual RPCs, which let us identify regressions impacting the SLA and achieve a higher SLA for the service.
      - Groups: Implemented server-level throttling that enabled the server to shed bursts of traffic that were reducing its availability, unblocking a critical feature launch.
      - Hosted a summer intern, enabling them to deliver two projects requiring Google engineer-level design and implementation work by the end of the internship.
      - As lead, migrated the services implementing Google Vault (used for legal discovery) to the microservice-based standard described in the next item. This was a multi-quarter project requiring deep knowledge of the services and their dependencies to avoid disrupting this critical revenue-generating service.
      - Lead on development of a tool to automate setup and migration of 100+ large-scale user-facing services to a microservice-based standard, allowing engineers to manage these services with less need for idiosyncratic knowledge.
      - Automated the migration of Google Workspace services between data centers to take advantage of new capacity.
      - Maintained high availability of the distributed systems behind Google Workspace, including developing monitoring and resolving major incidents.
      - Performed >100 interviews for software and site reliability engineering candidates, including coding and system design interviews.

    • Site Reliability Engineer
      Dec 2013 - Sep 2015 · 1 yr 10 mos

      * Google Calendar and Sites
        - Used the Go programming language to develop a significant system (~10K LOC) to automate a previously manual infrastructure maintenance process.
        - Analyzed the code architecture of a long-lived user-facing service in order to quantify its current 30-day active users and perform maintenance (using Java).
        - Wrote code to verify backup and restore process integrity for Google Sites.
        - Maintained the high availability of the distributed systems behind Google Calendar and Sites, including developing monitoring and quickly resolving pages and other customer-facing incidents.

  • Server Systems Technical Services at Epic
    Jul 2012 - Nov 2013 · 1 yr 5 mos

    Learned how to manage the InterSystems Caché database system on several Unix platforms, including Linux and AIX. Created and implemented tests for our customer-released Perl administrative scripts.

  • Rice University (3 yrs 6 mos)
    • Graduate Research Assistant
      May 2010 - May 2012 · 2 yrs 1 mo

      Created and implemented a DFT hybrid functional in the Gaussian computational chemistry program

    • Undergraduate Research Assistant
      Dec 2008 - May 2010 · 1 yr 6 mos

      Atomic symmetry calculations, DFT functionals in the Gaussian computational chemistry program