Richard Chacon

Sr. Site Reliability Engineer/Release Engineer

Pittsburg, California, United States

About

Results-driven Site Reliability Engineer with extensive experience in designing, implementing, and maintaining robust cloud-native systems. Committed to enhancing operational efficiency and ensuring high availability for complex architectures supporting millions of users. Proficient in AWS, Kubernetes, and CI/CD pipeline development, with a proven track record in automation, security remediations, and incident management. Eager to contribute expertise in optimizing infrastructure and driving innovation in a mission-oriented environment focused on operational excellence.

Experience

  • Cloud Engineer at Chacon Architecture
    Dec 2024 - Present · 1 yr 8 mos

    Implemented compile, publish, and release CI workflows using GitHub Actions. Design, deploy, and manage AWS infrastructure components including VPCs, subnets, route tables, Internet Gateways, NAT Gateways, and security groups, following best practices for network segmentation and security. Provision and configure EC2 instances (Windows) optimized for specific architectural workloads, including high-CPU/GPU instances for Revit rendering and model manipulation. Analyze and troubleshoot performance bottlenecks related to network I/O, compute, and storage, especially those impacting Revit and other CAD applications. Manage remote access solutions (e.g., AWS Client VPN, Direct Connect, or third-party gateways) to ensure seamless and secure access for architects to Revit files and centralized resources.

  • Sr. Site Reliability Engineer at Broadridge
    Nov 2021 - Dec 2024 · 3 yrs 2 mos

    Designed and implemented a new monitoring and alerting system using Datadog and CloudWatch, resulting in a **30% reduction in mean time to detection (MTTD)** for critical incidents. Provided on-call support using PagerDuty's incident response process, in addition to writing post-incident root cause analyses and leading no-blame meetings. Maintained Red Hat 9 Linux servers on EC2 and developed a process for in-place upgrades of end-of-life operating systems, CentOS 7 to Rocky 8. Implemented Datadog SLIs/SLOs on key services to regulate deployment velocity based on error budgets. Improved AWS workload security by implementing AWS Inspector with Service Manager, resulting in a 30% vulnerability reduction. Optimized application performance through capacity planning, load balancing, and performance tuning, resulting in a **10% improvement in application response time**. Supported deployments to an Amazon Elastic Kubernetes Service cluster using Jenkins, Terraform, eks-blueprints-addons, GitHub Actions, and Helm. Troubleshot EKS pod, service, and deployment problems using kubectl debug and K9s. Reduced build failures by 25% by troubleshooting complex nightly CI/CD pipelines using Jenkins, Terraform, and GitHub Actions. Reduced AWS operational costs by 15% through right-sizing EC2 instances based on optimizer results. Worked with Splunk to troubleshoot application errors in logs supporting IBM MQ and Kafka. Developed a Python Lambda function for handling tasks on EC2 instances, such as stopping/starting instances and checking configurations. Migrated legacy AWS infrastructure deployed manually through the console to Terraform. Troubleshot Linux system-level problems on Kafka running on Red Hat 9.x. Configured Datadog agents to integrate with Kafka, Postgres, and AWS custom alerts and dashboards. Automation scripting projects in Python and Bash to reduce toil and execute auto-remediation jobs to fix problems detected from

  • Site Reliability Engineer (Contractor) at Pacific Gas and Electric Company
    Nov 2019 - Nov 2021 · 2 yrs 1 mo

    Supported PG&E MRAD on-prem, running Couchbase, Sync Gateway, Node.js, Nginx, and MongoDB with AWS integrations. Worked on the MRAD AWS migration team responsible for migrating on-prem services to AWS using Agile methodology. Resolved Prisma security vulnerability alerts found in AWS by applying the fixes in Terraform. Imported private PG&E certs into AWS Certificate Manager. Automated Couchbase backups using Bash and cbbackupmgr, supporting full/incremental backups with uploads to AWS S3. Converted Nginx automated deployments from Capistrano to Ansible, with the Jinja templating engine, Python, GitHub, Jenkins, and YAML files, simplifying cutover of services to AWS. Wrote an automation script for stopping/starting, checking status, and validating Couchbase Server, Sync Gateway, PM2, Node.js services, Nginx, and MongoDB before and after automated patching. Maintained Nginx servers, including upgrades and performance tuning, recommended best practices, and automated deployments using Ansible and Jenkins. Monitored RDS and EC2 using CloudWatch. Provided on-call support using PagerDuty's incident response process, in addition to writing post-incident root cause analyses.

  • Site Reliability Engineer at Cornerstone OnDemand
    Jun 2018 - Nov 2019 · 1 yr 6 mos

    Member of the DevOps team supporting the Grovo Microlearning framework. Supported public cloud infrastructure (AWS): IAM, EC2, ELB, VPC, RDS, S3, Route53, ElastiCache. Automated cleanup jobs using Python with the Amazon API to control instance operations. Worked with Terraform to add/delete IAM users and deploy web services through AWS. Maintained availability and performance of Amazon Elastic Compute Cloud (Amazon EC2) instances. Implemented monitoring for AWS resources and Mesos/Docker microservices using Datadog and CloudWatch. Created Dockerfiles for building/deploying microservices. Supported infrastructure as code with Chef and Terraform. Created a procedure for cleaning up disk space on Elasticsearch. PagerDuty on-call rotation.

  • LinkedIn (On-site)
    • Site Reliability Engineer
      Aug 2014 - Jun 2018 · 3 yrs 11 mos

      Filled the mission-critical role of ensuring that our web-scale systems were healthy, monitored, automated, and designed to scale. Worked closely with development teams from the early stages of design through identifying and resolving production issues. Automated the creation and release of YAML alert templates with Python and Jinja. Maintained services once they were live by measuring and monitoring availability, latency, and overall system health. Practiced sustainable incident response and blameless postmortems. On-call production rotation. Horizontal initiative: assisted in the rollout of Java 7 and Java 8 across non-production and production. Horizontal initiative: assisted with JVM container upgrades. Worked closely with the Tools team on internal tools migration. Assisted with data center rollouts. Worked on General Data Protection Regulation (GDPR) and Data Protection Directive compliance. Worked on Red Hat 6 to Red Hat 7 migrations for GDPR.

    • Release Engineer
      Jun 2010 - Aug 2014 · 4 yrs 3 mos

      Worked closely with site operations, engineering, and QA to plan and execute efficient and reliable procedures for deploying newly developed code to a rapidly growing Java/J2EE application infrastructure. Reviewed the network services and code to be deployed with each release, and assisted the team with developing an overall release plan. Developed tools to automate the configuration and deployment of J2EE-based services within a large-scale Solaris environment. Validated that newly deployed services were correctly configured and functioning properly per engineering and operations specifications. Contributed to the documentation of a complex, changing environment. Responsible for creating the release plan for deployment of daily content and hotfixes to production on short notice. Responsible for maintaining the integrity and correct versions of hundreds of Java applications on Beta, Alpha, and Production environments using internally deployed tools.