Real-World Engineering Challenges #3
Circular dependencies in the Roblox outage, being a principal engineer at Amazon, fraud detection utilizing JIRA, and more.
👋 Hi, this is Gergely with a bonus free issue of the Pragmatic Engineer Newsletter.
Real-World Engineering Challenges is a bonus column. Each month, I share a handful of interesting engineering or management approaches in depth: you can learn something new by reading them and by diving deeper into the concepts they mention.
Four failed attempts to determine why Roblox was down for three days
Roblox was down for 73 hours in October, with 50M daily users unable to access the service. The company has published a long and transparent postmortem. Reading it made me feel like I was on the ground, trying to debug why on earth the cluster of virtual machines went down.
Roblox uses the HashiStack, which consists of the following components:
Nomad: orchestrates containers, schedules work, and runs batch jobs. It decides which containers run on which nodes and on which ports they’re accessible, and validates container health.
Consul: a service discovery and service mesh solution. It maintains a registry of application endpoints and their health. Consul uses the Raft consensus protocol to replicate state across the cluster.
Vault: a secrets management service, used by most services within Roblox. Roblox also uses Consul as a key-value store for Vault. (A minimal sketch of what talking to Consul looks like follows this list.)
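The postmortem assumes some familiarity with these tools. To make Consul's role more concrete, here is a minimal sketch of a service registering itself, writing a config value to the key-value store, and discovering a healthy dependency through a local Consul agent's HTTP API. The service names, port, and key are made up for illustration; this is in no way Roblox's actual setup:

```python
# Minimal sketch: talking to a local Consul agent over its HTTP API.
# Service names, the port, and the KV key below are made up for illustration.
import requests

CONSUL = "http://127.0.0.1:8500"  # default local agent address

# Register a service so other services can discover it through Consul.
requests.put(f"{CONSUL}/v1/agent/service/register", json={
    "Name": "game-session",          # hypothetical service name
    "Port": 9000,
    "Check": {                        # Consul polls this endpoint for health
        "HTTP": "http://127.0.0.1:9000/health",
        "Interval": "10s",
    },
})

# Use Consul's key-value store (the same store Vault can sit on top of).
requests.put(f"{CONSUL}/v1/kv/config/game-session/max-players", data=b"50")

# Discover healthy instances of another (hypothetical) service.
resp = requests.get(f"{CONSUL}/v1/health/service/matchmaking", params={"passing": "true"})
for entry in resp.json():
    svc = entry["Service"]
    print(f'{svc["Service"]} available at {svc["Address"]}:{svc["Port"]}')
```

When Consul becomes unhealthy, every call like the ones above starts failing, which is exactly why Nomad and Vault went down with it.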
The outage started with Consul becoming unhealthy. At Roblox, both Nomad and Vault depended on the Consul cluster, so those services also went down.
The rest of the incident report describes the various attempts the Roblox team made - from trying to restart the cluster, to restoring it to an earlier, presumably healthy state - and how all of these attempts failed. Strangely, the Consul cluster leader kept being out of sync with the other cluster voters.
Roblox worked together with HashiCorp engineers and discovered two root causes of the outage:
A Consul streaming feature resulted in blocking writes when the cluster was under high read or write load.
BoltDB - the persistence library storing the Raft log in Consul - had an issue where deleting old log entries did not reclaim the disk pages they used: the freed pages piled up in an ever-growing freelist, causing a performance problem that only surfaced at Roblox's scale. The toy model below illustrates why this hurts.
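The postmortem doesn't include code, but here is a toy cost model - emphatically not BoltDB's real implementation - showing why an ever-growing freelist makes every subsequent write more expensive:

```python
# Toy cost model of the failure mode described above - not BoltDB's actual code.
# Deleted Raft log entries free disk pages; those pages go onto a freelist that is
# never given back to the OS and is re-serialized on every commit.
used_pages: dict[int, bytes] = {}
freelist: list[int] = []

def commit_cost() -> int:
    # Assumption: each freelist entry costs 8 bytes to serialize on every commit.
    return 8 * len(freelist)

def append_entry(page_id: int, payload: bytes) -> None:
    used_pages[page_id] = payload

def truncate_entries_before(page_id: int) -> None:
    for old in [p for p in used_pages if p < page_id]:
        del used_pages[old]
        freelist.append(old)   # freed, but the file never shrinks

# Simulate a long-running cluster: append many log entries, then truncate old ones.
for i in range(100_000):
    append_entry(i, b"raft log entry")
truncate_entries_before(99_000)

print(f"live entries: {len(used_pages)}, freelist: {len(freelist)} pages")
print(f"every commit now re-serializes ~{commit_cost():,} bytes of freelist")
```

The data set stays small, yet the bookkeeping for pages that were freed long ago keeps taxing every new write.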
One of the biggest learnings I took from the outage is that your telemetry should not depend on the infrastructure it is monitoring. All of Roblox’s telemetry was built in-house and depended on Consul. This meant the engineers trying to debug what went wrong were flying blind, which explains why it took almost two days to find the root causes.
Update: in an earlier version of the article I incorrectly said Roblox is using Kubernetes. They use Nomad.
Being a Principal Engineer at Amazon
“For the most part, Amazon operates like a chaotic collection of startups with their own agenda, yet it somehow behaves cohesively. PEs are the glue that enables this.”
Carlos Arguelles is currently a Senior Staff Engineer at Google and was a Principal Engineer at Amazon for six years before that. In this article, he shares his learnings and insights.
About 3% of engineers at Amazon are at Principal level or above. Most engineers fail their first attempt at promotion to Principal - just like Carlos did. In the article, he describes the Principal promotion process and how it differs from promotions to Senior and below.
“Principal Engineers have a ridiculous amount of power” - Carlos shares, and compares how Amazon’s philosophy differs from that of Google, Facebook, and Microsoft. Whether it’s possible to tell who is at what level at these companies also varies - I was surprised to learn that at Google, engineers can choose to hide their level.
Carlos writes in detail about the Principal Engineering Community - one that I have heard several people refer to as unique, even among Big Tech. PEs hosted the Principals of Amazon Talks and provided a company-wide Design Review Process that any engineer could utilize. I’m researching more on how Amazon’s PE community works - if you have insights, please hit reply.
Going deep within Uber’s payments fraud detection system
Uber details how its payments fraud system was built and how it operates. Having worked on Uber’s Payments team for years and observed millions being lost to fraudulent attacks - lost by the company, not by Uber’s customers - this topic is close to home for me.
How do you stay on top of millions of transactions per day, every one of which might be fraudulent? Neither a manual review nor a fully automated solution makes sense, as Uber themselves admit:
“Explainability is of paramount importance when it comes to fraud detection. (...) Fraud cannot be stopped by some black-box AI algorithms.”
The pragmatic approach - unsurprisingly - is a hybrid model where AI monitors patterns and gets humans involved when new, potentially fraudulent patterns emerge.
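The article stays at the architecture level, but the hybrid idea is easy to sketch: a model scores every transaction, clear-cut cases are actioned automatically, and ambiguous ones are queued for a human analyst. The thresholds and the scoring function below are placeholders I made up, not Uber's actual logic:

```python
# Hedged sketch of a hybrid fraud check: thresholds, the "model", and the review
# queue are illustrative placeholders, not Uber's actual RADAR implementation.
from dataclasses import dataclass

AUTO_BLOCK_THRESHOLD = 0.95   # assumed: block without human involvement
HUMAN_REVIEW_THRESHOLD = 0.60 # assumed: anything in between goes to an analyst

@dataclass
class Transaction:
    id: str
    amount: float
    country: str

def fraud_score(tx: Transaction) -> float:
    """Stand-in for a trained model; in reality this would score the transaction
    against known fraud patterns."""
    return min(1.0, tx.amount / 10_000)

def handle(tx: Transaction, review_queue: list) -> str:
    score = fraud_score(tx)
    if score >= AUTO_BLOCK_THRESHOLD:
        return "blocked"                      # high-confidence fraud: automated action
    if score >= HUMAN_REVIEW_THRESHOLD:
        review_queue.append((tx, score))      # ambiguous: escalate to a human analyst
        return "pending_review"
    return "approved"                         # looks legitimate

queue: list = []
print(handle(Transaction("tx-1", 9_900.0, "US"), queue))   # blocked
print(handle(Transaction("tx-2", 7_000.0, "US"), queue))   # pending_review
print(handle(Transaction("tx-3", 120.0, "US"), queue))     # approved
```

The interesting engineering work is in keeping the middle band - the cases humans see - small enough to review, and explainable enough to act on.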
Uber calls their fraud detection and prevention system RADAR. It’s built on top of Apache Spark (a large-scale data processing platform) and Peloton (Uber’s cluster management solution, built on top of Docker containers managed by the Mesos distributed systems kernel).
RADAR uses Kafka event streams and analyzes them with AthenaX, Uber’s streaming analytics platform. All the data coming out of RADAR is ingested via Marmaray, another Uber-built platform, this one built on top of the distributed data processing framework Apache Hadoop.
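None of this is shown as code in Uber's post, but to make the stack a bit more tangible, here is what a small stream-analysis job over a hypothetical payments topic could look like. I'm using plain Spark Structured Streaming as a stand-in for AthenaX, and the broker address, topic name, and event schema are all assumptions:

```python
# Minimal sketch: reading payment events from Kafka with Spark Structured Streaming
# and counting chargebacks per merchant per minute. The broker address, topic name,
# and JSON schema are illustrative, not Uber's actual setup.
# (The Kafka source also needs the spark-sql-kafka connector package on the classpath.)
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("payments-stream-sketch").getOrCreate()

schema = StructType([
    StructField("merchant_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_type", StringType()),    # e.g. "charge", "chargeback"
    StructField("event_time", TimestampType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")   # placeholder broker
    .option("subscribe", "payment-events")                  # hypothetical topic
    .load()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

# Chargebacks per merchant over 1-minute windows: a sudden spike here is the kind
# of pattern a fraud system would flag for closer inspection.
chargebacks = (
    events.filter(F.col("event_type") == "chargeback")
    .withWatermark("event_time", "5 minutes")
    .groupBy(F.window("event_time", "1 minute"), "merchant_id")
    .count()
)

query = chargebacks.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```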
The article further details how Uber trains models to detect fraud attacks, how they analyze segments, and how they prioritize the most important ones to review.
A surprisingly practical touch is how RADAR generates JIRA tickets for Uber’s fraud analysts to review; JIRA conveniently serves as an audit trail of the human decisions.
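The post doesn't show what those tickets look like, but filing one through Jira's public REST API is roughly this simple. The base URL, project key, and credentials below are placeholders, and Uber's internal integration will certainly differ:

```python
# Sketch of filing a review ticket via Jira's REST API. The instance URL, project
# key, and credentials are placeholders; this is not Uber's internal integration.
import requests

JIRA_URL = "https://example.atlassian.net"          # placeholder Jira instance
AUTH = ("fraud-bot@example.com", "api-token-here")  # placeholder credentials

def open_review_ticket(merchant_id: str, window: str, chargeback_count: int) -> str:
    payload = {
        "fields": {
            "project": {"key": "FRAUD"},             # hypothetical project key
            "issuetype": {"name": "Task"},
            "summary": f"Suspicious chargeback spike for merchant {merchant_id}",
            "description": (
                f"Detected {chargeback_count} chargebacks in window {window}. "
                "Please review and confirm or dismiss the fraud pattern."
            ),
        }
    }
    resp = requests.post(f"{JIRA_URL}/rest/api/2/issue", json=payload, auth=AUTH)
    resp.raise_for_status()
    return resp.json()["key"]   # the ticket key becomes part of the audit trail

ticket = open_review_ticket("merchant-42", "2021-11-01 14:00-14:01", 37)
print(f"Opened {ticket} for a human analyst to review")
```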
What is fascinating about this summary is both the complexity of building such a system and how many in-house tools Uber utilizes - even if many of these are open-sourced. Peloton, AthenaX, and Marmaray are all Uber’s in-house products, built and maintained by one of Uber’s platform teams.
The key metrics when detecting and removing abusive content on LinkedIn
At least twelve teams work together to prevent abusive content from making it onto the LinkedIn feed. These teams include Content Quality AI, Trust Infrastructure, Multimedia AI, Trust and Safety, Trust Product, Legal, Public Policy, Content Policy, Trust Data Science, Content Experience, and Feed AI. While the number of teams might be surprising to those who have not worked in a similarly large organization, for a company the size of LinkedIn this setup is normal, with each team having a clear mission and charter, and owning a specific part of the product.
When content is posted to LinkedIn, it goes through an automatic filter - AI models decide within 300ms whether to filter it - then humans review the content the AI flagged, and finally, humans review content that members report.
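Here is a strawman of that three-layer flow - my illustration, not LinkedIn's code - with a fast classifier on a strict time budget, a queue of AI-flagged posts for human reviewers, and a separate queue for member reports. The thresholds, the classifier, and the over-budget handling are all assumptions:

```python
# Strawman of a three-layer moderation flow: an automated filter with a time budget,
# a human review queue for AI-flagged posts, and a queue for member reports.
# Thresholds, the classifier, and the 300ms budget handling are illustrative only.
import time
from collections import deque

BLOCK_THRESHOLD = 0.9
FLAG_THRESHOLD = 0.5
TIME_BUDGET_SECONDS = 0.3      # the "decide within 300ms" constraint

ai_flagged_queue: deque = deque()     # reviewed by humans (layer 2)
member_report_queue: deque = deque()  # reviewed by humans (layer 3)

def classify(text: str) -> float:
    """Stand-in for the real AI models; returns an 'abusive' probability."""
    return 0.95 if "spam" in text.lower() else 0.1

def on_post_created(post_id: str, text: str) -> str:
    start = time.monotonic()
    score = classify(text)
    if time.monotonic() - start > TIME_BUDGET_SECONDS:
        # Assumed behavior when over budget: publish, but queue for async review.
        ai_flagged_queue.append(post_id)
        return "published"
    if score >= BLOCK_THRESHOLD:
        return "filtered"                    # layer 1: automatic removal
    if score >= FLAG_THRESHOLD:
        ai_flagged_queue.append(post_id)     # layer 2: humans review AI-flagged posts
    return "published"

def on_member_report(post_id: str) -> None:
    member_report_queue.append(post_id)      # layer 3: humans review member reports

print(on_post_created("p1", "Buy spam coins now"))            # filtered
print(on_post_created("p2", "Excited to start a new role!"))  # published
on_member_report("p2")
```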
I liked this article not because it’s technical - it doesn’t go deep into engineering concepts - but because it’s a reminder that defining the key metrics you are building your system for is even more important than how you build the system. If you have the right key metrics and goals, you’ll build the right systems.
LinkedIn sets radically different key metrics - the things it measures - at each layer of its content filtering.
Software engineering learnings from two years at Gojek
Gojek software engineer Nishanth Shetty shares things he learned during his two years at Gojek, publishing them right as he was leaving the company. A few quotes from him:
“Building a system at such a scale is easy if you know how to do it right, but that’s the beauty of our industry: We don’t know how to do it right. 🤷♂️”
“Gojek follows test driven development; the entire org depends on it. Each code the developer writes has to be unit tested, then functional, integration, and a whole other bunch of testing processes. That’s not enough, there is a QA doing extensive testing on the feature you are developing. There are automated test suites running in the CI looking for issues.”
“Writing simple, readable, testable code is what good programmer skills are.”
“We have dedicated channels where people share their findings, RCA session where teams go about the issue they face, fixes, learnings with different teams.”
This post is published on the official Gojek blog. What sets it apart from similar articles where people talk about their experience is that it was written by an engineer who had already handed in their notice, which is less usual. At the end of the article, Nishanth talks about how he felt burned out and that it was time for a change - a story that will sound all too familiar to many software engineers these days.
The weekly release steps at GitHub Mobile
A short and sweet summary of the several steps it takes to release GitHub’s mobile app. From the sounds of it, the GitHub mobile team is not a large one. Still, regardless of team size, you cannot skip steps like testing, version bumping, compiling the list of changes, and manual testing.
The thing that caught my attention in this article is the neat visualization of which release steps are automated and which ones are done by a person on the team. It’s a visualization I’d suggest all mobile teams copy - and then see if you can reduce the number of non-automated steps.
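For the steps that can be automated, even a small script goes a long way. As a hedged sketch - not GitHub's actual release tooling - here is what automating two of the steps mentioned above, bumping the version and compiling the list of changes from git history, could look like:

```python
# Sketch of automating two release steps: bumping the app version and compiling the
# list of changes from git history. The file path, version format, and tag name are
# assumptions, not GitHub's actual release tooling.
import subprocess
from pathlib import Path

VERSION_FILE = Path("version.txt")   # hypothetical single source of truth, e.g. "1.42.0"

def bump_minor_version() -> str:
    major, minor, _patch = map(int, VERSION_FILE.read_text().strip().split("."))
    new_version = f"{major}.{minor + 1}.0"
    VERSION_FILE.write_text(new_version + "\n")
    return new_version

def compile_changes(since_tag: str) -> str:
    # One line per change merged since the last release tag.
    log = subprocess.run(
        ["git", "log", f"{since_tag}..HEAD", "--pretty=format:- %s"],
        capture_output=True, text=True, check=True,
    )
    return log.stdout

if __name__ == "__main__":
    version = bump_minor_version()
    changelog = compile_changes(since_tag="release-last")   # placeholder tag name
    print(f"Preparing release {version}\n{changelog}")
```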
How would you rate this issue? 🤔