What is a Principal Engineer at Amazon? With Steve Huynh

The Pragmatic Engineer

Former Amazon Principal Engineer Steve Huynh shares what it takes to reach the Principal level, why the jump to Principal is so tough at Amazon, and how Amazon’s scale and culture shaped his career.

Gergely Orosz

Jul 09, 2025

Stream the Latest Episode

Listen and watch now on YouTube, Spotify and Apple. See the episode transcript at the top of this page, and timestamps for the episode at the bottom.

Brought to You By

Statsig ⁠ — ⁠ The unified platform for flags, analytics, experiments, and more.
Graphite — The AI developer productivity platform.
Augment Code — AI coding assistant that pro engineering teams love.

—

In This Episode

Steve Huynh spent 17 years at Amazon, including four as a Principal Engineer. While in Seattle, I stopped by Steve’s studio to record this episode of The Pragmatic Engineer. We went into what the Principal role involves at Amazon, why the path from Senior to Principal is so tough, and how even strong engineers can get stuck: not because they’re unqualified, but because the bar is exceptionally high.

We discuss what’s expected at the Principal level, the kind of work that matters most, and the trade-offs that come with the title. Steve also shares how Amazon’s internal policies shaped his trajectory, and what made the Principal Engineer community one of the most rewarding parts of his time at the company.

We also go into:

Why being promoted from Senior to Principal at Amazon is one of the hardest jumps in tech
How Amazon’s freedom of movement policy helped Steve work across multiple teams, from Kindle to Prime Video
The scale of Amazon: handling 10k–100k+ requests per second and what that means for engineering
Why latency became a company-wide obsession at Amazon, and the research that tied it directly to revenue
Why companies should start with a monolith, and what led Amazon to adopt microservices
What makes the Principal Engineering community so special
Amazon’s culture of learning from its mistakes, including COEs (correction of errors)
The pros and cons of the Principal Engineer role
What Steve loves about the leadership principles at Amazon
Amazon’s intense writing culture and 6-pager format
Why Amazon patents software and what that process looks like
… and much more!

An interesting topic: “brownouts” at Amazon

“Brownout” is internal Amazon lingo. At Amazon’s scale, service failures are frequent, and cascading failures can happen when callers keep dumping load onto services that are in a “brownout” state. Steve explained what this means, and why it was important at the e-commerce giant:

Gergely (at 11:56): What does “brownout” mean?
Steve: I'm using some jargon. Suppose you are DDoS’ing a service or sending a lot of requests over to them: you can just take them down! That would be a blackout. With a blackout, you send a request, you can't establish a connection, and it immediately comes back as failed.
But there's a type of outage where they ‘brown out’. So the service is reachable, they might accept the connection, but they'll time out or they might return partial results or bad results. Or perhaps the only thing that they do return is a 500 for some percentage of requests.
So now we start talking about availability and resilience in the face of all of this DDoSing that you're doing to yourself. Let’s say your service is a dependency of some of the process that's going on.
If there's a failure for a primary dependency and that dependency comes back up: how do you make sure you don't just inundate it with a bunch of requests as it's trying to recover? And so now you have all of these sort of odd dynamics that occur. I used a brownout as something that is a recurring problem. There might be some increased latency that may cause a chain reaction of a dependency going down. And then one of these sort of middle tier services would brown out. So you're an owner of the services for your team. And so then it's like, okay, what do we do in those situations?
How do we know that they're browning out? What do we do in the face of a dependency outage? And then critically, if there is an outage and then the service comes back up: how do we make sure that we give it enough space so that [the service] can ‘breathe’? So that as they're trying to recover from some sort of outage, we don't just take them down immediately again.
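
The two questions Steve raises (how do we know a dependency is browning out, and how do we give it room to recover) can be sketched from the caller's side. What follows is a minimal, hypothetical Python sketch and not Amazon's implementation: the class name `BrownoutGuard`, the window size, the failure threshold and the ramp steps are all assumptions picked for illustration. The idea is to track recent outcomes of calls to one dependency, shed most traffic when the failure rate looks like a brownout, and then raise the allowed share of traffic one step at a time while calls keep succeeding.

```python
from collections import deque


class BrownoutGuard:
    """Caller-side guard for one dependency (illustrative sketch only)."""

    def __init__(self, window=20, failure_threshold=0.5,
                 ramp_steps=(0.1, 0.25, 0.5, 1.0)):
        # Recent call outcomes: True = success, False = error or timeout.
        self.results = deque(maxlen=window)
        self.failure_threshold = failure_threshold
        # Fractions of traffic we allow through while the dependency recovers.
        self.ramp_steps = ramp_steps
        # Start at the last step, which lets all traffic through.
        self.step = len(ramp_steps) - 1

    def allowed_fraction(self):
        """Share of requests currently allowed to reach the dependency."""
        return self.ramp_steps[self.step]

    def allow(self, roll):
        """Decide on one request; `roll` is a number in [0, 1),
        for example the result of random.random()."""
        return roll < self.allowed_fraction()

    def record(self, ok):
        """Record the outcome of a call and adjust the allowed fraction."""
        self.results.append(ok)
        if len(self.results) < self.results.maxlen:
            return  # not enough samples yet to judge
        failure_rate = self.results.count(False) / len(self.results)
        if failure_rate >= self.failure_threshold:
            # Looks like a brownout: drop to the smallest fraction.
            self.step = 0
            self.results.clear()
        elif failure_rate == 0 and self.step < len(self.ramp_steps) - 1:
            # A full window of successes: let a bit more traffic through.
            self.step += 1
            self.results.clear()
```

A caller would check `guard.allow(random.random())` before each request, fail fast or serve a fallback when it returns `False`, and call `guard.record(...)` with the outcome of every request it does send. Treating timeouts as failures matters here, since a browning-out service often accepts the connection and then never answers in time.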

What Steve describes reminded me of what the Cursor engineering team described as the “Cold start problem at scale” in the deep dive How Cursor is built:

An unappreciated challenge is how hard it is to do a “cold start” for a massive service. As Sualeh [Cursor cofounder] explains:
“Imagine you’re doing 100,000 requests per second and suddenly, all your nodes die. When restarting your system, your nodes usually go up one after the other. Say you’ve managed to restart 10 nodes from a fleet of 1,000. If you don’t prohibit people from making requests, these 10 nodes will get smashed by all the incoming requests. Before these 10 nodes could have become healthy, you’ve just overloaded those nodes!
This has bitten us many times in the past. Whenever you have a bad incident that needs a cold start, you need to figure out how to do it well.
Many of the large providers you probably use have various ‘tricks’ to kill traffic while they perform a cold start. We ended up doing a setup where we either fully halt traffic until the cold start is complete, or prioritize a small subset of our users during a cold start, until the service is back at being healthy.”
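
The two options in the quote above (fully halting traffic, or letting through only a small prioritized subset of users until the fleet is healthy) can be expressed as a small admission check in front of the service. This is a hypothetical Python sketch, not Cursor's actual setup: the names `ColdStartMode` and `admit_request`, the node counts as inputs, and the set of priority users are all assumptions made for illustration.

```python
from enum import Enum


class ColdStartMode(Enum):
    HALT = "halt"              # reject everything until the fleet is healthy
    PRIORITIZE = "prioritize"  # admit only a small set of priority users


def admit_request(user_id, healthy_nodes, total_nodes, priority_users,
                  mode=ColdStartMode.HALT):
    """Return True if a request should be admitted right now."""
    if healthy_nodes >= total_nodes:
        return True  # cold start is complete: normal operation
    if mode is ColdStartMode.HALT:
        return False  # keep all traffic away from the restarting nodes
    # PRIORITIZE: only priority users, and only once some node is healthy.
    return healthy_nodes > 0 and user_id in priority_users
```

With 10 healthy nodes out of 1,000, as in the example from the quote, `HALT` rejects every request, while `PRIORITIZE` admits only users in `priority_users`. Once all nodes report healthy, both modes admit everyone.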