Ashish Kumar, PMP®, ITIL®

Vice President – Production Engineering & Site Reliability (SRE) | Capital Markets Technology | Cloud (AWS & Azure) | Driving Reliability, Resilience & Service Transformation at Scale

Greater Toronto Area, Canada

About

Vice President with 15+ years of experience leading Production Engineering, Site Reliability Engineering (SRE), and Service Operations for mission-critical Capital Markets platforms. Currently leading 24x7 production engineering and reliability support for Equities Risk applications at Citi, ensuring high availability, resilience, and performance of business-critical trading systems across multi-cloud (AWS, Azure) and on-prem infrastructure.

Core strengths include:
* Production Engineering & SRE transformation
* Major Incident & Crisis Management (P1/P2)
* Service reliability and MTTR reduction through automation
* Observability strategy (AppDynamics, Splunk, ELK, Datadog, Dynatrace)
* Cloud operations across distributed systems
* ITIL-led governance (Incident, Problem, Change Management)

Proven track record of improving service stability, reducing recurring incidents through Root Cause Analysis (RCA), and enabling proactive monitoring and system observability. Experienced in leading cross-functional collaboration across Development, Infrastructure, Architecture, and Business stakeholders to deliver resilient, scalable, and high-performing systems in high-pressure, real-time financial environments. Recognized for strong stakeholder communication, leadership during critical incidents, and driving continuous service improvement initiatives.

Experience

  • Citi (5 yrs 9 mos)
    • Vice President
      Jun 2022 - Present · 4 yrs 1 mo

      Lead Production Engineering for Capital Markets (Equities Risk) platforms, ensuring high availability, operational resilience, and service stability for mission-critical trading systems.
      * Lead and mentor a high-performing team delivering 24x7 production support for critical Equities Risk applications within Capital Markets.
      * Own major incident management (P1/P2), leading crisis bridges, stakeholder communication, and rapid service restoration in high-impact environments.
      * Drive adoption of Site Reliability Engineering (SRE) principles to improve system reliability, scalability, and operational efficiency.
      * Implement and enhance observability frameworks using AppDynamics, Splunk, ELK, Datadog, Dynatrace, and Geneos for proactive monitoring.
      * Perform advanced troubleshooting and Root Cause Analysis (RCA) across distributed cloud and on-prem systems.
      * Drive service improvement initiatives through automation, monitoring optimization, and process enhancements.
      * Collaborate with Development, Infrastructure, and Business teams to ensure service stability during live market operations.
      * Provide regular service performance reporting and operational insights to senior stakeholders.
      * Lead Disaster Recovery (DR/COB) strategy and testing for multiple applications, ensuring business continuity.
      * Govern Change Management processes, ensuring release readiness and minimal production risk.
      * Maintain high service standards aligned with ITIL frameworks across Incident, Problem, and Change Management.
      * Manage vendor dependencies and ensure operational continuity across third-party services.

    • Assistant Vice President
      Oct 2020 - Jul 2022 · 1 yr 10 mos

      Delivered production support and reliability engineering for distributed applications across cloud and on-prem environments.
      * Managed end-to-end Incident Management lifecycle including triage, impact assessment, and resolution.
      * Led Major Incident Management (MIM) bridges to ensure timely resolution of critical production issues.
      * Implemented monitoring and observability using industry-standard tools to improve system visibility.
      * Supported cloud migration and onboarding of applications to AWS environments.
      * Improved operational efficiency through automation and knowledge management.

  • Information Technology Analyst at Tata Consultancy Services
    Mar 2019 - Sep 2020 · 1 yr 7 mos

    Delivered production engineering support for high-availability applications in the telecom domain.
    * Led Incident Management and Root Cause Analysis for production-critical systems.
    * Acted as Incident Commander for major incidents, ensuring rapid resolution and minimal business impact.
    * Implemented observability using Dynatrace, ELK, AWS CloudWatch, and Azure Monitor.
    * Supported containerized applications using Docker and Kubernetes (AKS/EKS).
    * Drove automation and process optimization initiatives to improve operational efficiency.
    * Produced service performance reports to track system reliability and incident trends.

  • Technology Analyst at Infosys
    Jul 2011 - Jan 2019 · 7 yrs 7 mos

    Supported middleware and enterprise applications within banking systems.
    * Managed IBM WebSphere and Oracle WebLogic environments, ensuring high availability.
    * Performed application deployments, configuration, and integration across distributed systems.
    * Implemented security and compliance configurations (LDAP, SSL).
    * Conducted performance tuning and capacity planning for enterprise applications.
    * Supported Incident, Problem, and Change Management using ITIL processes.
    * Executed Disaster Recovery planning and testing to ensure system resiliency.