
Observability: the present and future, with Charity Majors

In today's episode of The Pragmatic Engineer, I'm joined by Charity Majors, a well-known observability expert – as well as someone with strong and grounded opinions.

Stream the Latest Episode

Available now on YouTube, Apple and Spotify. See the episode transcript at the top of this page, and a summary at the bottom.

Brought to You By

Sonar — Trust your developers – verify your AI-generated code.

Vanta — Automate compliance and simplify security with Vanta.

In This Episode

In today's episode of The Pragmatic Engineer, I'm joined by Charity Majors, a well-known observability expert – as well as someone with strong and grounded opinions. Charity is the co-author of "Observability Engineering" and brings extensive experience as an operations and database engineer and an engineering manager. She is the cofounder and CTO of observability scaleup Honeycomb.

Our conversation explores the ever-changing world of observability, covering these topics:

• What is observability? Charity’s take

• What is “Observability 2.0”?

• Why Charity is a fan of platform teams

• Why DevOps is an overloaded term, and probably no longer relevant

• What is cardinality? And why does it impact the cost of observability so much?

• How OpenTelemetry solves for vendor lock-in

• Why Honeycomb wrote its own database

• Why having good observability should be a prerequisite to adding AI code or using AI agents

• And more!

Takeaways

My biggest takeaways from this episode:

1. The DevOps movement feels like it’s in its final days, having served its purpose. As Charity put it:

“It’s no longer considered a good thing to split up a dev team and an ops team to then collaborate, right? Increasingly, there are only engineers who write code and own their code in production. And I think this is really exciting. We can understand why Dev versus Ops evolved, but it was always kind of a crazy idea that half your engineers could build the software and the other half would understand and operate it.”

Indeed, I cannot name a single team at a startup or a large tech company that still has a dedicated Ops team. Such companies surely exist in small pockets – think of more traditional businesses operating in highly regulated industries like finance or healthcare – but this setup feels like the exception rather than the norm.

2. Lots of people get dashboards wrong! Charity doesn’t think that static dashboards are helpful to engineering teams at all. In fact, misusing dashboards is one of the most common observability mistakes she sees:

“Unless your dashboard is dynamic and allows you to ask questions, I feel like it's a really poor view into your software. You want to be interacting with your data. If all you're doing is looking at static dashboards, I think it limits your ability to really develop a rich mental model of your software. And this means that there are things that you won’t see; because you did not graph it on your dashboard!”

3. Observability will be especially important for AI use cases in these ways:

  • o11y for LLMs: to get data on how they behave and to be able to debug behaviors. This is relevant for teams building and operating AI models.

  • o11y for code generated by AI: the generated code should have the right amount of observability in place. Once the code is deployed to production, developers need to be able to get a sense of how the code is behaving there!

GenAI means that a lot more code will be generated via LLMs – and all this code needs observability!

The Pragmatic Engineer deepdives relevant for this episode

How Uber Built its Observability Platform

Building an Observability Startup

How to debug large distributed systems

Shipping to production

Timestamps

(00:00) Intro

(04:20) Charity’s inspiration for writing Observability Engineering

(08:20) An overview of Scuba at Facebook

(09:16) A software engineer’s definition of observability

(13:15) Observability basics

(15:10) The three pillars model

(17:09) Observability 2.0 and the shift to unified storage

(22:50) Who owns observability and the advantage of platform teams

(25:05) Why DevOps is becoming unnecessary

(27:01) The difficulty of observability

(29:01) Why observability is so expensive

(30:49) An explanation of cardinality and its impact on cost

(34:26) How to manage cost with tools that use structured data

(38:35) The common worry of vendor lock-in

(40:01) An explanation of OpenTelemetry

(43:45) What developers get wrong about observability

(45:40) A case for using SLOs and how they help you avoid micromanagement

(48:25) Why Honeycomb had to write their database

(51:56) Companies who have thrived despite ignoring conventional wisdom

(53:35) Observability and AI

(59:20) Vendors vs. open source

(1:00:45) What metrics are good for

(1:02:31) RUM (Real User Monitoring)

(1:03:40) The challenges of mobile observability

(1:05:51) When to implement observability at your startup

(1:07:49) Rapid fire round

A summary of the conversation

For those of you more interested in reading a summary of the conversation – or skimming it – see below. The takeaways are listed above, after the episode overview.

Observability (o11y) basics

  • Observability is about understanding software, specifically the intersection of code, systems, and users.

  • It is not just about errors, bugs and outages; it is also about understanding the impact of code.

  • Observability is critical for development feedback loops; it is not just an operational tool.

  • The goal of good o11y is to help engineers understand their software in the language of the business.

  • Engineers should be able to tie their work back to top-level goals, and explain how their work translates to the business.

  • Sampling is an important lever, contrary to the idea that every log is sacred.

  • ‘metrics’ vs ‘Metrics’

    • We need to distinguish between metrics (small 'm') as a generic term for telemetry, and Metrics (capital 'M') as a specific data type: a number with tags appended.

    • The Metric data type is limited because it doesn't store any contextual relationship data (see the sketch after this list).

  • The Three Pillars Model

    • The three pillars model of observability is this: metrics, logs and traces.

    • Many vendors sell products for each of these pillars – as well as for all of them.

    • The problem with the Three Pillars Model is that every request that enters a system is stored multiple times, in different tools (metrics, logs, traces, profiling, analytics).

    • There is little to connect the data points; engineers are left to manually correlate the data.

    • The cost of following this model is high, because the same data is stored in multiple tools and databases.
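
To make the small-'m' vs capital-'M' distinction concrete, here is a minimal sketch in Python. The field names and values are hypothetical; the point is that a Metric is just a number with a few tags, while a wide structured event keeps the full context of the request.

```python
# A capital-"M" Metric: one number plus a handful of low-cardinality tags.
# All context about the specific request is gone at write time.
metric = {
    "name": "http_request_duration_ms",
    "value": 312,
    "tags": {"service": "checkout", "region": "us-east-1", "status": "500"},
}

# A wide structured event: one record per request, with arbitrarily many
# fields, including high-cardinality ones (user_id, request_id).
event = {
    "timestamp": "2024-06-01T02:13:07Z",
    "service": "checkout",
    "region": "us-east-1",
    "http.status_code": 500,
    "duration_ms": 312,
    "user_id": "u_88213",          # high cardinality
    "request_id": "req_9f2c01ab",  # highest possible cardinality
    "cart_items": 7,
    "feature_flag.new_pricing": True,
    "error": "payment gateway timeout",
}
```

With the event form, the same duration metric can be derived later by aggregating over events, but the data can also be sliced by user_id or feature flag – something the pre-aggregated Metric can no longer give you.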

What is Observability 2.0?

  • Observability 2.0 moves away from multiple sources of truth to unified storage.

  • With unified storage, there are no dead ends: engineers can click on a log, turn it into a trace, visualize it over time, and derive metrics and SLOs from it. They can then see which events are violating SLOs (see the sketch at the end of this section).

  • Good observability powers developer feedback loops. It allows engineers to visualize the CI/CD pipeline as a trace and see where tests are breaking. The goal is to keep the time between building code and seeing it in production as small as possible.

  • Observability is shifting from being an ops tool focused on errors and downtime, to something that supports the entire development cycle.

  • Modern engineering practices + good observability is where the real value is.

    • Modern engineering practices such as feature flags, progressive deployment, and canary releases, along with observability, give engineers confidence to move quickly and safely.

    • Observability acts as a translation layer, enabling engineers to reason about their work and tie it back to top-level business goals.

    • The dream goal? To be able to explain and understand our work in the same language as everyone else: how much financial value is this piece of code generating?
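
As a rough illustration of the "no dead ends" idea, here is a hypothetical sketch that derives a latency metric and an SLO signal from the same pool of wide events (building on the event shape sketched earlier). The field names and thresholds are assumptions for illustration, not Honeycomb's actual schema.

```python
from statistics import quantiles

# Hypothetical wide events, one per request, all in a single store.
events = [
    {"route": "/checkout", "duration_ms": 120,  "http.status_code": 200},
    {"route": "/checkout", "duration_ms": 340,  "http.status_code": 200},
    {"route": "/checkout", "duration_ms": 2900, "http.status_code": 500},
    {"route": "/search",   "duration_ms": 45,   "http.status_code": 200},
]

# Derive a "metric" on demand: p99 latency for the checkout route.
checkout = [e for e in events if e["route"] == "/checkout"]
p99 = quantiles([e["duration_ms"] for e in checkout], n=100)[98]

# Derive an SLO signal from the same data: the fraction of good events,
# where "good" means fast and successful.
good = [e for e in checkout if e["duration_ms"] < 1000 and e["http.status_code"] < 500]
availability = len(good) / len(checkout)

print(f"p99 latency: {p99:.0f} ms, SLO compliance: {availability:.1%}")
# Because metrics and SLOs are derived from raw events, any violation can
# be traced back to the exact events (and traces) that caused it.
```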

Why is observability hard, anyway?

  • Engineers have to think ahead about what they might need to understand in the future – like during an incident at 2:00 AM!

  • Software is hard. Observability is the first line of defense.

  • Tools have historically required engineers to be masters of multiple disciplines – e.g., mentally translating their code into physical resources such as CPU and RAM usage.

  • Cost of Observability: why is it so expensive?

  • One reason observability is expensive is the multiplier effect: the same data is stored multiple times. This is a common criticism of the Three Pillars model.

  • Cardinality: another factor that can make observability a lot more expensive (see the sketch after this list)

    • Cardinality refers to the number of unique items in a set. Unique IDs, such as request IDs, have the highest possible cardinality.

    • Big 'M' Metrics tools (Observability 1.0 tools) are designed to handle low-cardinality data.

    • Adding high-cardinality data to Metrics tools makes them very expensive.

    • These days, world-class observability teams spend the majority of their time governing cardinality!

    • The more unique the data, the more valuable it is for debugging – but that also means it costs more.

    • To solve this, the industry has to move away from tools backed by big 'M' metrics, to those using structured data where high cardinality can be stored.

    • The wider the logs (the more context attached to each event), the better the ability to identify outliers and correlate data.

  • Is Observability 1.0 getting in the way of building what engineering needs – at a lower cost?

    • The data model of traditional observability tools does not fit what engineers actually need.

    • Metrics were optimized for a world where resources were very expensive; now that storage and compute are cheaper, it's possible to store more data and slice and dice it in real time.

    • A column-based data store is needed to use flexible structured data without having to define indexes and schemas in advance.
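
To make the cardinality cost concrete, here is a hypothetical back-of-the-envelope sketch. In a Metrics-based (Observability 1.0) system, each unique combination of tag values becomes its own time series, so the series count – and the cost – grows multiplicatively; the tag names and counts below are assumptions for illustration.

```python
# Hypothetical tags on a single request-latency Metric and how many
# unique values each tag has (its cardinality).
tag_cardinality = {
    "service": 50,
    "endpoint": 200,
    "region": 10,
    "status_code": 20,
}

# In a Metrics store, each unique tag-value combination is a separate
# time series: the cardinalities multiply.
series = 1
for count in tag_cardinality.values():
    series *= count
print(f"time series: {series:,}")  # 2,000,000 series for one metric name

# Add one high-cardinality field such as user_id (say, 1M users) and the
# series count explodes - which is why Observability 1.0 tools forbid it.
print(f"with user_id: {series * 1_000_000:,}")
```

A column-based event store sidesteps this: each request is one row no matter how many high-cardinality fields it carries, so cost scales with traffic rather than with the product of tag cardinalities.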

OpenTelemetry

  • What is OpenTelemetry (OTel)?

    • A collection of APIs, SDKs and tools to make telemetry portable and effective.

    • It provides a framework for consistent telemetry with consistent naming and semantic conventions, allowing vendors to do more with the data.

    • OTel overtook Kubernetes as the number one project in the CNCF.

  • The goal of OTel is to let engineers instrument their code once, then send the data to whichever vendor they choose (a minimal sketch follows this list).

  • OTel forces vendors to compete on the basis of their excellence and responsiveness.

  • Using OpenTelemetry is a safe bet for companies to enable portability of data between vendors.

  • It also gives the option of negotiating with vendors, because of the ability to switch!
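
Here is a minimal sketch of what "instrument once, choose your backend later" looks like with the OpenTelemetry Python SDK. The span and attribute names are hypothetical; swapping vendors is mostly a matter of pointing the exporter at a different endpoint.

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Set up a tracer provider once, at process startup. Swapping the
# ConsoleSpanExporter for an OTLP exporter pointed at any vendor's
# endpoint is the only change needed to move backends.
provider = TracerProvider(resource=Resource.create({"service.name": "checkout"}))
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

def handle_checkout(user_id: str, cart_items: int) -> None:
    # One wide span per unit of work, with high-cardinality context attached.
    with tracer.start_as_current_span("checkout") as span:
        span.set_attribute("user.id", user_id)        # hypothetical attribute
        span.set_attribute("cart.items", cart_items)  # hypothetical attribute
        ...  # business logic goes here

handle_checkout("u_88213", 7)
```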

Common mistakes with observability

  • Introducing it too late. Engineers feel like they don't need observability until they are in production and things start breaking.

  • Using dashboards wrong.

    • Engineers can get too attached to dashboards.

    • Dashboards, unless they are dynamic and allow you to ask questions, are a poor view into software.

  • Not using SLOs and error budgets enough.

    • SLOs (Service Level Objectives) should be the entry point, not dashboards.

    • SLOs are the APIs for engineering teams.

    • SLOs provide a budget for teams to run chaos engineering experiments.

    • SLOs are a hedge against micromanagement, because when teams meet their SLOs, the way they spend their time is not important.

    • SLOs allow teams to negotiate for reliability work if they are not meeting their obligations.

    • SLOs should be derived from the same data that is used for debugging (a worked error-budget example follows below).
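
As a worked example of how SLOs create a budget that teams can spend, here is a small sketch of error-budget arithmetic; the target and traffic numbers are made up for illustration.

```python
# A 99.9% availability SLO over a 30-day window.
slo_target = 0.999
window_days = 30
requests_in_window = 100_000_000

# The error budget is whatever the SLO leaves on the table.
error_budget_fraction = 1 - slo_target                        # 0.1%
budget_minutes = window_days * 24 * 60 * error_budget_fraction
budget_requests = requests_in_window * error_budget_fraction

print(f"downtime budget: {budget_minutes:.1f} minutes per {window_days} days")
print(f"failed-request budget: {budget_requests:,.0f} requests")

# If the team has burned only 20% of the budget, the remaining 80% can be
# spent on deploys, chaos experiments, or risky migrations - which is why
# SLOs work as a hedge against micromanaging how teams spend their time.
budget_burned = 0.20
print(f"budget remaining: {(1 - budget_burned) * budget_requests:,.0f} requests")
```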

Other topics

  • Why did Honeycomb build their own database?

    • At Honeycomb, Charity decided to build their own database despite the common wisdom to never do so. ClickHouse wasn't an option back then: had it been, perhaps they would not have built their own.

    • The database, called Retriever, is a column-based store. The query planner runs using Lambda jobs. Data is aged out to S3 after being written to SSDs.

    • It’s been a win, looking back now. The data model is custom, and being able to iterate on it has been a force multiplier.

  • Observability and AI

    • AI intersects with observability in three areas:

      • 1. When building and training models

      • 2. When developing with LLMs

      • 3. When dealing with code of unknown origin produced by AI

    • Good AI observability can't exist in isolation; it must be embedded in good software observability.

    • The inputs for AI models come from different services, data, and humans, and this creates a trace-shaped problem.

  • Build vs Buy vs Open Source

    • The main trend across the industry is consolidation, as companies try to control their bills.

    • Most companies use vendors and don't want to deal with observability tools breaking at 2am.

    • Metrics still have a place, but most companies need to move from 80% metrics/20% structured data to the reverse.

  • Frontend and mobile observability

    • Silos are created when different teams use different tools.

    • A unified view from mobile/browser to database is powerful.

    • Mobile is different because the build pipeline is different, and it is harder to fold mobile into software development best practices.

Resources & Mentions

Where to find Charity Majors:

• X: https://x.com/mipsytipsy

• LinkedIn: https://www.linkedin.com/in/charity-majors/

• Blog: https://charity.wtf/

Mentions during the episode:

• Honeycomb: https://www.honeycomb.io/

• Parse: https://parseplatform.org/

• Ruby on Rails: https://rubyonrails.org/

• Christine Yen on LinkedIn: https://www.linkedin.com/in/christineyen/

• Scuba: Diving into Data at Facebook: https://research.facebook.com/publications/scuba-diving-into-data-at-facebook/

• Three pillars: https://charity.wtf/tag/three-pillars/

• Unified storage: https://charity.wtf/tag/unified-storage/

• “Every Sperm is Sacred”:

• Peter Borgan on LinkedIn: https://www.linkedin.com/in/peterborgan/

• Datadog: https://www.datadoghq.com/

• Vertica: https://en.wikipedia.org/wiki/Vertica

• Ben Hartshorne on LinkedIn: https://www.linkedin.com/in/benhartshorne/

• Cardinality: https://en.wikipedia.org/wiki/Cardinality_(data_modeling)

• COBOL: https://en.wikipedia.org/wiki/COBOL

• Ben Sigelman on LinkedIn: https://www.linkedin.com/in/bensigelman/

• OpenTelemetry: https://opentelemetry.io/

• Kubernetes: https://www.cncf.io/projects/kubernetes/

• SLOs: https://docs.honeycomb.io/notify/alert/slos/

• ClickHouse: https://clickhouse.com/

• "Why We Built Our Own Distributed Column Store" by Sam Stokes: https://www.honeycomb.io/resources/why-we-built-our-own-distributed-column-store

• "How we used serverless to speed up our servers" by Jessica Kerr and Ian Wilkes:

• Inside Figma’s Engineering Culture: https://newsletter.pragmaticengineer.com/p/inside-figmas-engineering-culture

• How to debug large, distributed systems: Antithesis: https://newsletter.pragmaticengineer.com/p/antithesis

• Observability in the Age of AI: https://www.honeycomb.io/blog/observability-age-of-ai

• Grafana: https://grafana.com/

• Prometheus: https://prometheus.io/

• What Is Real User Monitoring (RUM)?: https://www.honeycomb.io/getting-started/real-user-monitoring

• Crashlytics: https://en.wikipedia.org/wiki/Crashlytics

• Square wheels comic: https://alexewerlof.medium.com/on-reinventing-the-wheel-201148f74642

• WhistlePig Whiskey: https://www.whistlepigwhiskey.com/

• George T. Stagg bourbon: https://www.buffalotracedistillery.com/our-brands/stagg.html

• Stagg Jr.: https://newportwinespirits.com/products/stago-jr-ksbw

• Fluke: Chance, Chaos, and Why Everything We Do Matters: https://www.amazon.com/Fluke-Chance-Chaos-Everything-Matters/dp/1668006529

Production and marketing by https://penname.co/. For inquiries about sponsoring the podcast, email podcast@pragmaticengineer.com.
