Berlin Metropolitan Area
Senior Site Reliability Engineer with over 17 years of experience specializing in observability, infrastructure reliability, and incident management. I have a proven track record leading strategic migrations and enhancing observability platforms through robust tooling and distributed tracing solutions. My approach is data-driven, leveraging metrics and SLIs/SLOs to drive improvements in system reliability, performance, and cost efficiency. As a technical leader, I've consistently set high standards, mentored teams to excellence, fostered strong cross-team relationships, and proactively identified and resolved complex problems. My passion is creating scalable, secure, and observable platforms that enable developers to operate efficiently and reliably in fast-paced environments.
1KOMMA5° is a German CleanTech startup focused on carbon-neutral energy solutions, accelerating renewable energy adoption for homes and businesses, with operations across multiple countries. Role: Leading infrastructure modernization through Infrastructure as Code (IaC), optimizing observability and refining incident management processes. Responsible for significant cost-reduction initiatives, mentoring teams, and enhancing overall reliability practices.
Preply is an online language learning platform connecting learners globally with private tutors, with around 500k active learners. Role: Driving company-wide SRE culture, managing reliability, financial optimization (FinOps), pipeline efficiency, and developer experience. Responsible for defining and managing SLIs/SLOs, incident management, and developing solutions to maintain system resilience.
Software engineer focused on resilience, mainly through architecture reviews. I was the tech lead for the zero-downtime migration of the company's RabbitMQ services, which handled around 400M messages per day. Because all asynchronous communication in the company ran through this service, the migration touched 70% of P1 and P2 services and affected many product teams. I reviewed the architecture and worked closely with the teams to close gaps in infrastructure, observability (Datadog), and provisioning. I planned and executed the migration from a hosted RabbitMQ to Amazon MQ (lift and shift). The project reviewed the whole infrastructure, improved reliability, strengthened security with HashiCorp Vault, and used Terraform for end-to-end provisioning. I was also a member of the Kubernetes tiger team that designed, planned, and executed the migration from Nomad to Kubernetes; this integration ships with monitoring and health checks already implemented in Datadog. I worked with product teams to identify opportunities to improve cost, reliability, and observability. Based on the data collected, a golden path was created using infrastructure as code to deliver those benefits out of the box at the right level of abstraction, making it easier to consume and accelerating adoption company-wide.
Software engineer focused on improving observability. I advocated for OpenTelemetry and drew on my experience at HelloFresh to build better connections between teams. My main work was delivering a new distributed tracing implementation for the company's Java applications, using OpenTelemetry, Prometheus, Kubernetes, and Grafana. I also worked on incident management and on a project to create Terraform modules that deploy AKS clusters with best practices baked in. I mentored SRE team members and product teams on best practices for observability and incident handling.
Applied best practices across reliability, observability, monitoring, containerisation, performance, security, and related areas. Built infrastructure automation at scale. Owned the solution-wide alerting strategy in the tech organization. Optimized the incident management systems, policies, and procedures. Ensured the engineering organization had self-service observability tools, and advocated for observability best practices. Drove positive change in MTTD, MTTR, and MTBF metrics. Guided and educated the engineering organization about operations and reliability. Removed toil in infrastructure through automation. Paired with both my squad members and engineers in the greater organization to spread SRE knowledge and best practices. Undertook measured, methodical troubleshooting of complicated systems. Implemented SLOs, SLIs, and error budgets. Technical Stack: Monitoring: Prometheus, Grafana, Alertmanager, Thanos. Logging: Graylog. Tracing: Jaeger, OpenCensus, Honeycomb. Automation: Terraform, Ansible. Code: Python, Go. Container Runtime: Kubernetes, Docker. Cloud Runtime: AWS.