Post by Aviral Aman
Aspiring Product Manager | Scalable Systems, AI/ML & Cloud | Bridging Engineering & Product | EMBA in Analytics IIM Kashipur ’28
“Zero downtime deployments” are not a deployment strategy. They’re a management discipline. In distributed microservice systems (where workflows are often shaped like DAGs at a high level), a single breaking change in a downstream service can ripple across the entire system. Now, imagine what this architecture may look like at a slightly lower level on AWS: A centralized networking/orchestration layer (for example, something like a Transit Gateway) sits in the middle, routing and enabling communication between multiple services across VPCs. This abstraction simplifies connectivity—but also introduces a subtle risk: → It can amplify the blast radius of a breaking change. The uncomfortable reality Techniques like: • Blue-Green deployments • Side-by-side versioning • Maintenance windows …are tools, not solutions. And used in isolation, they fail. - Blue-Green can introduce data inconsistencies - Side-by-side breaks when DB schemas evolve - Maintenance windows don’t scale for always-on systems The real problem isn’t deployment—it’s coordination In a distributed system: - Some services move faster than others - Contracts evolve at different speeds - Gateways amplify failures instead of isolating them A “breaking change” is rarely just technical—it’s a failure of system-wide coordination What actually works (in practice) There isn’t a rigid playbook here. Zero-downtime systems are built in a grey area, where judgment matters more than doctrine—and being a little scrappy often beats being theoretically perfect. Some patterns consistently help: 1. Compatibility over correctness (initially) If your system can’t handle mixed versions, it’s not ready for continuous deployment. 2. Expand → Migrate → Contract (especially for DB changes) Instead of “changing” systems, we evolve them: Expand: Add new schema without breaking old Migrate: Dual writes + gradual traffic shift Contract: Remove legacy only when safe 3. Progressive exposure, not big-bang releases This turns deployments from risky events → controlled experiments 4. Observability + fast rollback If you can’t detect issues in seconds, your rollout strategy doesn’t matter. The technical management layer: None of the above works without strong technical management discipline: - Enforcing compatibility standards across teams - Aligning release cadence across dependent services - Defining clear API and data evolution policies - Investing in tooling (flags, tracing, contract testing) - Driving a culture of safe change over fast change Because here’s the truth: Most outages during deployments are not caused by lack of tools—they’re caused by lack of alignment. Final thought: Zero downtime isn’t about avoiding failure. It’s about ensuring that when failure happens, users don’t notice. #SystemDesign #Microservices #DistributedSystems #TechLeadership #DevOps #Scalability #AWS #ProductManagemnt