The Pragmatic Engineer

How Uber Built its Observability Platform

Uber built an in-house observability platform called M3 to cope with huge quantities of data as the business grew at breakneck speed. Martin Mao, former manager of the team, shares the story.

Gergely Orosz
Sep 12, 2023

The ride-hailing service’s open-source, large-scale metrics engine, M3, is an impressive piece of software. When I worked at Uber, my team – Rider Payments – was among the first to onboard this new internal platform in 2017. We needed to monitor payments business metrics across hundreds of major cities, and to get alerted when any payment method saw regressions on either iOS or Android; M3 was built to support exactly such use cases.
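To give a feel for what these business metrics look like on the emitting side, here is a minimal sketch using tally, Uber’s open-source Go metrics library that reports into M3. The service name, metric name, and tag names are hypothetical, and the reporter is a no-op so the example stays self-contained; this is an illustration, not Uber’s actual instrumentation.

```go
package main

import (
	"time"

	"github.com/uber-go/tally"
)

func main() {
	// Root scope for a hypothetical rider-payments service. In production,
	// the Reporter would ship metrics towards M3; NullStatsReporter keeps
	// this sketch self-contained.
	scope, closer := tally.NewRootScope(tally.ScopeOptions{
		Prefix:   "rider_payments",
		Tags:     map[string]string{"service": "rider-payments"},
		Reporter: tally.NullStatsReporter,
	}, time.Second)
	defer closer.Close()

	// Emit a business metric tagged by the dimensions we alerted on:
	// city, payment method, and mobile platform (tag names are made up).
	scope.Tagged(map[string]string{
		"city":           "amsterdam",
		"payment_method": "credit_card",
		"platform":       "ios",
	}).Counter("payment_failures").Inc(1)
}
```

An alert can then fire whenever the failure counter for any (city, payment_method, platform) combination regresses against its baseline.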

Martin Mao headed up M3’s engineering team at the time, from the first line of code until it was rolled out across the organization. He’s since left Uber to co-found Chronosphere, an efficient observability platform. I caught up with Martin, who believes the current production version of M3 is probably still one of the largest deployments of its kind in the world. In 2018, it processed 600 million data points per second (!!), and this has only grown since.

I asked Martin how they built M3, and what he learned from leading that effort. Today, we cover:

  1. Why Uber needed a new observability platform

  2. How M3 was born

  3. Rolling out M3 across the company

  4. Engineering challenges

  5. The evolution of M3, 2016-2019

  6. Becoming manager of the observability team

  7. Learnings from building an observability product and team from scratch

With that, it’s over to Martin, who narrates the rest of this article.

1. Why Uber needed a new observability platform

In 2015, when I started at Uber, the observability platform was a Graphite, Carbon, and WhisperDB stack, which had several issues:

  • Not horizontally scalable: it was not possible to add capacity to the system just by adding more machines

  • No replicas: if a node died, the company lost data. There was only a handful of nodes, so one node dying meant the loss of around 1/8th of all Uber’s data!

  • Adding capacity took too long: the system needed to be taken offline for a week or more, to add capacity

During my first week on call, all I did was delete data from the backend to free up much-needed space and keep the observability stack up and running.

To fix these issues, we built a better observability platform from scratch in around six months, using these technologies (a rough sketch of the write path follows the list):

  • Cassandra: for the time series database

  • ElasticSearch: for the metrics index

  • Go: for all the custom code we wrote
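Here is that write-path sketch in Go: store each data point as a row in a Cassandra time-series table (via gocql), and index the series’ name and tags in ElasticSearch over its REST API so the series can be found again at query time. The keyspace, table, index names, and schema are all assumptions for illustration; this is not the actual code we wrote.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"

	"github.com/gocql/gocql"
)

// DataPoint is one sample of one time series (schema is illustrative).
type DataPoint struct {
	MetricID  string
	Timestamp time.Time
	Value     float64
}

func main() {
	// 1) Write the sample into Cassandra, keyed by series ID and timestamp.
	cluster := gocql.NewCluster("cassandra-host")
	cluster.Keyspace = "metrics"
	session, err := cluster.CreateSession()
	if err != nil {
		panic(err)
	}
	defer session.Close()

	dp := DataPoint{MetricID: "rider_payments.payment_failures", Timestamp: time.Now(), Value: 1}
	if err := session.Query(
		`INSERT INTO datapoints (metric_id, ts, value) VALUES (?, ?, ?)`,
		dp.MetricID, dp.Timestamp, dp.Value,
	).Exec(); err != nil {
		panic(err)
	}

	// 2) Index the series' name and tags in ElasticSearch, so a query like
	//    "all payment_failures series in city=amsterdam" can resolve the
	//    series IDs to read back from Cassandra.
	doc, _ := json.Marshal(map[string]interface{}{
		"metric": "payment_failures",
		"tags":   map[string]string{"city": "amsterdam", "platform": "ios"},
	})
	url := fmt.Sprintf("http://elasticsearch-host:9200/metrics/series/%s", dp.MetricID)
	req, _ := http.NewRequest(http.MethodPut, url, bytes.NewReader(doc))
	req.Header.Set("Content-Type", "application/json")
	if _, err := http.DefaultClient.Do(req); err != nil {
		panic(err)
	}
}
```

Reads would go the other way round: the ElasticSearch index resolves a query’s tags to a set of series IDs, and Cassandra returns the data points for those series over the requested time range.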

We managed to stand up the new system in time for Halloween, 2015, which was Uber’s second largest peak load event of the year, behind New Year’s Eve. That year was the first time Uber’s observability system did not have an outage during the Halloween peak!

At the time, Uber was growing 10x year-on-year in data volume. We could barely keep up, and kept adding Cassandra and ElasticSearch clusters, pushing both technologies to their limits. At one point, we operated around 700 Cassandra nodes! As a team, we knew we couldn’t keep adding node upon node long-term; we needed a more scalable approach.

We pitched a project to leadership to rewrite the observability stack, arguing that the Cassandra-based stack we’d just put in place was only a quick workaround, and that the only sensible option was to build a new one if Uber wanted to continue having visibility into:

  • Infrastructure

  • Microservices

  • Realtime business metrics like number of rides, amount of payments processed

Buying was not an option: at the time, no vendor could handle Uber’s scale in terms of the volume of data to be processed.

2. How M3 was born

Split-brain issues were a major reason why we proposed a rewrite. Split-brain occurs when two nodes lose the ability to synchronize, usually because of a network partition, and each keeps operating on its own view of the data, which leads to corruption and other inconsistencies. Back then, we used Cassandra as a time-series database, even though it was built as a key-value store. In hindsight, we didn’t choose the right tool for the job, and this was made worse by how frequently we suffered networking issues at the data centers we operated:
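As a rough, deliberately simplified illustration of the split-brain failure mode described above (my own sketch, not anything from Uber’s codebase), consider two replicas that both keep accepting writes for the same series while they cannot reach each other:

```go
package main

import "fmt"

// replica is a trivially simplified stand-in for a database node: it holds
// the latest value it has accepted for a single time series.
type replica struct {
	name  string
	value float64
}

func main() {
	a := &replica{name: "node-a"}
	b := &replica{name: "node-b"}

	// Normal operation: writes reach both replicas, so they agree.
	for _, r := range []*replica{a, b} {
		r.value = 100
	}

	// Network partition: each side keeps accepting writes on its own,
	// and neither node can tell the other what it has seen.
	a.value = 120 // writes routed to node-a during the partition
	b.value = 95  // writes routed to node-b during the partition

	// When the partition heals, the replicas disagree and there is no
	// authoritative answer: this is the split-brain condition.
	if a.value != b.value {
		fmt.Printf("split brain: %s=%v, %s=%v\n", a.name, a.value, b.name, b.value)
	}
}
```

Once the partition heals, neither node’s value is authoritative, and this is exactly the kind of inconsistency that motivated the rewrite.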
