How Uber Built its Observability Platform
Uber built an in-house observability platform called M3 to cope with huge quantities of data as the business grew at breakneck speed. Martin Mao, former manager of the team, shares the story.
The ride-hailing service’s open-source, large-scale metrics engine, M3, is an impressive piece of software. When I worked at Uber, my team – Rider Payments – was among the first to onboard onto this new, internal platform in 2017. We needed to monitor payments business metrics across hundreds of major cities, and to get alerted when any payment method saw regressions on either iOS or Android. M3 was built to support exactly these use cases.
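For a flavor of what such metrics look like in code, here’s a minimal, hypothetical Go sketch using uber-go/tally, the open-source metrics library from Uber that ships an M3 reporter. The scope setup, tag names, and no-op reporter below are simplified for illustration and aren’t taken from Uber’s codebase.

```go
package main

import (
	"time"

	"github.com/uber-go/tally"
)

func main() {
	// A no-op reporter keeps this sketch self-contained; in production the
	// scope would be wired to an M3 reporter instead. Names are made up.
	scope, closer := tally.NewRootScope(tally.ScopeOptions{
		Prefix:   "rider_payments",
		Reporter: tally.NullStatsReporter,
	}, time.Second)
	defer closer.Close()

	// One counter per (city, payment method, platform) combination, so alerts
	// can fire when a single slice of the business regresses.
	scope.Tagged(map[string]string{
		"city":     "san_francisco",
		"method":   "credit_card",
		"platform": "ios",
	}).Counter("payment_success").Inc(1)
}
```

Each unique combination of tags becomes its own time series, which hints at how quickly data volumes can grow at this kind of scale.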
Martin Mao headed up M3’s engineering team at the time, from the first line of code until the platform was rolled out across the organization. He’s since left Uber to co-found Chronosphere, an efficient observability platform. I caught up with Martin, who believes the current production deployment of M3 is probably still one of the largest in the world in terms of scale. In 2018, it processed 600 million data points per second (!!), and this has only grown since.
I asked Martin how they built M3, and what he learned from leading that effort. Today, we cover:
Why Uber needed a new observability platform
How M3 was born
Rolling out M3 across the company
Engineering challenges
The evolution of M3, 2016-2019
Becoming manager of the observability team
Learnings from building an observability product and team from scratch
With that, it’s over to Martin, who narrates the rest of this article.
1. Why Uber needed a new observability platform
In 2015, when I started at Uber, the observability platform was a Graphite, Carbon, and Whisper stack, which had several issues:
Not horizontally scalable: it was not possible to add capacity to the system just by adding more machines
No replicas: if a node died, the company lost data. There was only a handful of nodes, so one node dying meant the loss of around 1/8th of all Uber’s data!
Adding capacity took too long: the system needed to be taken offline for a week or more whenever capacity was added
During my first week oncall, all I did was delete data from the backend to free up much-needed space and keep the observability stack up and running.
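For context on how that stack fits together: services push data points to Carbon over a simple plaintext line protocol, and Carbon writes them into fixed-size Whisper files on local disk. The Go sketch below shows roughly what an emitter looks like; the hostname and metric name are made up for illustration.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// Carbon's plaintext line receiver usually listens on TCP port 2003 and
	// accepts one "<metric.path> <value> <unix timestamp>" line per data point.
	// The hostname and metric path here are illustrative.
	conn, err := net.Dial("tcp", "carbon.internal:2003")
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	line := fmt.Sprintf("payments.rider.success_count %d %d\n", 42, time.Now().Unix())
	if _, err := conn.Write([]byte(line)); err != nil {
		panic(err)
	}
}
```

Because each series lives in its own fixed-size file on a single node’s local disk, capacity is tied to individual machines, which is a big part of the scaling and data-loss problems above.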
To try to fix these issues, we built a better observability platform from scratch in around six months, using these technologies (there’s a short sketch of the write path after this list):
Cassandra: for the time series database
ElasticSearch: for the metrics index
Go: for all the custom code we wrote
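To make the new architecture concrete, here is a rough Go sketch of what a write path into a Cassandra-backed time series store can look like, using the gocql driver. The keyspace, table, schema, and hostnames below are illustrative assumptions, not Uber’s actual data model.

```go
package main

import (
	"time"

	"github.com/gocql/gocql"
)

// writePoint stores one data point, keyed by the metric's identity plus a
// time bucket, with the timestamp as a clustering column. The table and
// column names are illustrative.
func writePoint(session *gocql.Session, metricID string, ts time.Time, value float64) error {
	bucket := ts.Truncate(2 * time.Hour) // group rows into two-hour partitions
	return session.Query(
		`INSERT INTO points (metric_id, bucket, ts, value) VALUES (?, ?, ?, ?)`,
		metricID, bucket, ts, value,
	).Exec()
}

func main() {
	cluster := gocql.NewCluster("cassandra-1.internal", "cassandra-2.internal") // hypothetical hosts
	cluster.Keyspace = "metrics"
	cluster.Consistency = gocql.Quorum

	session, err := cluster.CreateSession()
	if err != nil {
		panic(err)
	}
	defer session.Close()

	if err := writePoint(session, "payments.success_count{city=sf}", time.Now(), 1); err != nil {
		panic(err)
	}
}
```

In a setup like this, ElasticSearch typically holds the index from metric names and tags to series IDs, so queries know which rows to fetch from Cassandra.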
We managed to stand up the new system in time for Halloween 2015, which was Uber’s second-largest peak load event of the year, behind New Year’s Eve. That year was the first time Uber’s observability system did not have an outage during the Halloween peak!
At the time, Uber was growing 10x year-on-year in data volume. We could barely keep up, and kept adding Cassandra and ElasticSearch clusters, pushing both technologies to their limits. At one point, we operated around 700 Cassandra nodes! As a team, we knew we couldn’t keep adding node upon node long-term; we needed a more scalable approach.
We pitched a project to leadership to rewrite the observability stack, arguing that the Cassandra-based stack we’d just put in place was only a quick workaround. The only sensible option, we argued, was to build a new one if Uber wanted to continue having visibility into:
Infrastructure
Microservices
Realtime business metrics, like the number of rides and the volume of payments processed
Buying was not an option: at the time, no vendor could handle the scale of data Uber needed to process.
2. How M3 was born
Split-brain issues were a major reason why we proposed a rewrite. Split-brain occurs when two nodes lose the ability to synchronize with each other, typically after a network partition, and end up holding diverging, inconsistent copies of the data. Back then, we used Cassandra as a time-series database, even though it was built as a key-value store. In hindsight, we didn’t choose the right tool for the job, and this was made worse by how frequently we suffered networking issues at the data centers we operated: