Post by trivago


Sometimes a system looks perfectly healthy and still can't keep up. 🤔

That's the puzzle Shen ZhongLi, Backend Software Engineer at trivago, spent weeks trying to solve. CPU low, no blocked threads, no errors: every indicator said everything was fine. But our Kafka consumer kept falling behind, critical alerts kept firing, and the usual fix of throwing more pods at it had stopped working. 🧩

Turns out there wasn't one thing wrong. There were three, and they'd been quietly compounding for years. Finding all three, and understanding how they interacted, cut our infrastructure costs by 83% and ended a long streak of incidents. Not by scaling up, but by having the courage to simplify.

Shen ZhongLi wrote up the full investigation: the false starts, the debugging process, and the lesson about problems that only show up in combination.

Read the full story on the trivago tech blog: https://bit.ly/4a4CxHQ 🔗
