Nikhil ..

Principal Site Reliability Engineer - Cloud Operations at Barracuda

Atlanta, Georgia, United States

About

Experienced, competent cloud administrator with the ability to undertake complex assignments, meet deadlines, and deliver superior performance. 9 years of experience in the IT industry covering all aspects of the software configuration management (SCM) process, DevOps, and build/release management. Well versed in configuration management, version control systems, build and deployment automation tools, continuous integration and delivery, management of application servers, and release processes. Proficient in designing build and release automation frameworks, continuous integration and continuous delivery, build and release planning, procedures, scripting, and automation. Good at documenting and implementing procedures related to build, deployment, and release. Good at infrastructure provisioning, configuration management, and integration with Ansible. Good experience in life cycle automation of build, deploy, and release for products built on Java, using technologies such as Ant, Maven, Gradle, Hudson, Jenkins, Tomcat, JBoss, WebLogic, WebSphere, Postgres, Rundeck, Amazon products (EC2, CloudWatch, SNS, IAM, S3), and shell scripts. Strong understanding of AWS technologies (EC2, RDS, ELB, EBS, S3, VPC, Route 53, CloudWatch, SQS). Experience with migration to Amazon Web Services (AWS). Experience with CI (continuous integration) and CD (continuous deployment) methodologies using Jenkins and Rundeck. Experienced in build tools such as Apache Ant, Maven, and Gradle. Strong hands-on development and configuration experience with software provisioning tools like Ansible. Highly organized and detail-oriented; able to plan, prioritize work, and work well under tight deadlines. Able to work directly with all levels of management to gather user requirements.

Experience

  • Barracuda (Full-time · 5 yrs 6 mos)
    • Principal Site Reliability Engineer
      Nov 2022 - Present · 3 yrs 9 mos

      Building a Kubernetes as a Service (KaaS) platform. All things Kubernetes: worldwide Kubernetes cluster architecture and multi-tenant Kubernetes architecture.

    • Staff Site Reliability Engineer
      Feb 2021 - Nov 2022 · 1 yr 10 mos

      All things CNCF, Kubernetes architecture, multi-tenant architecture, Terraform. Designing simplified solutions.

  • Senior Site Reliability Engineer at Steady
    Nov 2018 - Jan 2021 · 2 yrs 3 mos

    • Project management: standups, 1:1s, planning, tech debt, grooming, team goals, retros, documentation.
    • kops cluster: supported the end-to-end production kops cluster, later migrated to EKS clusters.
    • EKS migrations: architected, deployed, and maintained production EKS clusters migrated from kops clusters.
    • AWS stack: architected and maintained the underlying AWS resources supporting the entire Steady tech stack.
    • Kubernetes maintenance: a constant effort to stay on cutting-edge technology, including PDB, HPA, taints, optimizing apps for specific nodes, autoscaler, and rollout strategies.
    • Service mesh: service discovery, load balancing, circuit breaking, traffic shadowing, telemetry.
    • Heterogeneous EKS cluster: ran diversified AWS node groups (Dedicated, Reserved, and Spot Instances).
    • CI/CD: migrated from in-cluster Drone and Spinnaker to a dedicated Jenkins to manage multiple k8s environments.
    • Kafka implementation: migrated from in-cluster Confluent Kafka to MSK; deployed and maintained MSK in different environments, streaming billions of messages every day.
    • Data streaming pipeline: architected, implemented, and maintained a Kafka data streaming pipeline that streams billions of messages daily (MSK, K-Connect, K-Streams, K-REST, Kafka UI).
    • Intrusion detection system: implemented the Falco rules engine in k8s based on the MITRE ATT&CK framework.
    • Incident response: VictorOps incident model, blameless postmortems, root cause, resolution/fix, feedback.
    • Security: AWS Security Hub, Aqua tools, K8s Chaos, Polaris, Popeye, kube-score.
    • GitOps: established a strong policy that every change goes through a code change and can be traced to its origin and audited.
    • Monitoring stack: migrated from Prometheus Operator to New Relic, with Istio Prometheus as a fallback.
    • Infrastructure and monitoring as code: as the team grew, set a policy that every infrastructure configuration follows the infrastructure as code (IAC) and monitoring as code (MOC) approach.
    • Service desk: implemented a service desk for each team, which helped reduce request routing from ops by 40%.

  • Site Reliability Engineer at Cardlytics
    Jun 2018 - Nov 2018 · 6 mos

    • Ran the POC and implemented Slack end to end, helping the company migrate from HipChat to Slack.
    • Ran the POC and implemented the Opsgenie incident management system end to end for Engineering.
    • Implemented Opsgenie integrations and routed alerts to the proper on-call teams in their respective Slack channels.
    • Tightly integrated Opsgenie with Slack and created individual on-call channels for better visibility.
    • Worked closely with the production support team to push the right Datadog alerts to the respective Opsgenie Slack channels for proper incident management.
    • Worked with the production support team to clean up Datadog alerts.
    • Created monitors and screenboards for sprint teams to give better visibility into applications.
    • Revamped Datadog alerts by implementing static alert triggers (TCP, SSL, DNS, NTP, disk, HTTP, ...).
    • Worked with Hubot to build automation bots that gave sprint teams a faster workflow.
    • Worked with Kubernetes and Rancher.
    • Used Rancher to help sprint teams migrate from Docker Swarm to k8s.
    • Made Docker easier for teams to work with by introducing Portainer and helping them maintain local Docker infrastructure.
    • Worked with other sprint teams and created solutions for a faster workflow.
    • Implemented ChatOps from scratch and helped the company grow toward a ChatOps culture.
    • Wrote Ansible playbooks, using AWX Tower for Ansible workflows.
    • Worked with cross-region (London) teams to help coordinate and build out the entire AWS infrastructure.
    • Designed, built, and deployed the entire AWS infrastructure using Terraform.
    • Wrote Terraform scripts from scratch, taking a modular approach for a highly pluggable infrastructure.
    • Dockerized the Rasa Core and Rasa NLU AI tools, moving them from on-prem servers to Docker.
    • Worked with senior management to strengthen AWS security using an infrastructure-as-code approach.
    • Deployed AWS Security Monkey and Scout2 to visualize discrepancies in the AWS infrastructure.
    • Pushed the company to adopt proper documentation.

  • Site Reliability Engineer at Verinon
    Jan 2017 - May 2018 · 1 yr 5 mos

    Responsibilities:
    • Actively involved in the architecture of the DevOps platform and cloud solutions.
    • Responsible for regular build jobs initiated through the continuous integration tool Jenkins.
    • Created proper documentation for new server setups and existing servers.
    • Maintained a farm of EC2 instances, ELBs, and RDS.
    • Handled daily tasks such as monitoring, Route 53 DNS issues, and connectivity checks.
    • Provisioned Amazon EC2 instances using Ansible.
    • Used AWS IAM policy simulator tools for AWS security and scanning.
    • Managed logs and wrote scripts to move them to a central repository.
    • Configured and managed the Apache Tomcat web server.
    • Helped customers migrate their applications to AWS and configure monitoring metrics.
    • As part of the build engineer role, sent uptime and downtime notifications about server status to teams when deploying EAR and WAR packages in the Tomcat admin console.
    • Wrote playbooks to manage the private cloud environment using Ansible.
    • Managed and optimized continuous delivery tools such as Rundeck.
    • Automated continuous build and deploy scripts for the Hudson/Jenkins continuous integration tool.
    • Developed custom scripts for uptime monitoring, log management, disk read/write IOPS, and CPU utilization.
    • Built, configured, and supported application team environments.
    • Implemented rapid provisioning and life cycle management for Ubuntu Linux using Amazon EC2, Ansible, and custom YAML scripts.
    • Managed the artifact repository using AWS S3 and used it to share snapshots and releases of internal projects.
    • Monitored critical Amazon instances through CloudWatch and pushed notifications through the SNS service.

  • Student Developer at University of Central Missouri
    Jan 2016 - Jul 2016 · 7 mos

    • Worked closely with a professor to deploy a multi-node cluster environment.
    • Deployed an HAProxy load balancer to serve HTTP requests effectively and redirect traffic to separate nodes.
    • Used the HAProxy load balancer to implement round-robin serving of HTTP requests, directing them to the server with the fewest connections.
    • Implemented Ansible for faster deployments to the application server, pushing updates through playbooks.
    • Effectively managed the Linux servers.
    • Implemented "zero downtime deployments" to the production servers using Ansible, without ever bringing the application down.
    • Created EC2 web servers, VPCs, and RDS databases.
    • Performed scheduled snapshots of EC2 instances.
    • Installed and maintained Tomcat servers.
    • Monitored the infrastructure using Nagios.