Daniel Cole

Lead Site Reliability Engineer at MongoDB | 10 Years Scaling Observability Infrastructure | 3B Time Series, 100TB/day

New York City Metropolitan Area

About

I'm a Lead Site Reliability Engineer at MongoDB, where I've spent the last 8 years scaling our observability infrastructure to handle 2 billion active time series and process 100TB+ of telemetry data daily across AWS, GCP, and Azure. Most recently, I architected a unified telemetry pipeline that reduced our observability infrastructure costs by 80% while significantly improving data quality and reliability. I've built and mentored engineering teams, growing our observability group from 4 to 8 engineers and championing multiple promotions to Staff Engineer level.

Before MongoDB, I spent two years managing Columbia University's 30-rack private data center, where I first implemented large-scale observability using the ELK Stack across 200+ servers. That experience taught me the fundamentals of infrastructure operations and sparked my focus on making complex distributed systems observable and reliable.

TECHNICAL EXPERTISE
- Observability: Prometheus, VictoriaMetrics, Thanos, Grafana, ELK Stack, OpenTelemetry, Jaeger, distributed tracing
- Infrastructure: Kubernetes, Docker, Terraform, AWS/GCP/Azure, Linux systems, Kafka, Envoy Proxy
- Languages: Go, Python, C, Bash
- SRE Practices: SLI/SLO frameworks, incident management, on-call rotation design, telemetry pipeline architecture

Let's connect if you're building something interesting in the observability or infrastructure space.

Experience

MongoDB (On-site)
- Senior Lead Site Reliability Engineer
  Aug 2022 - Present · 3 yrs 11 mos
  Lead for the Observability team; interim lead for the Fabric (networking) team.
- Site Reliability Engineer
  Jan 2018 - Aug 2022 · 4 yrs 8 mos
  Site Reliability Engineer with a focus on observability and edge load balancing.
System Administrator at Columbia University in the City of New York
Jun 2016 - Dec 2017 · 1 yr 7 mos