Peek inside Netflix’s engineering culture with CTO Elizabeth Stone, as she shares how the company has no formal performance reviews, learns from failures, and builds at a global scale.
I found this topic really interesting being on the infra / DevOps side but I think there's a fine line between "Netflix has no process" and "Netflix is so relaxed".
The live stories (Paul/Tyson, NFL, WWE) read more to me like "operational discipline using cultural language" than "complete autonomy". A 40-50 page if/then launch document, hardline laptops, VPN backup, an "informed captain" (ad-hoc war room), and a whole new dashboard set created specifically for that night... that is NOT "no process," that is SRE/command center under another name. They simply allowed their senior engineers to create it themselves from the bottom up rather than requiring a one-size-fits-all playbook developed by a centralized group. I think most companies will adopt the "we do not have formal performance reviews" aspect and completely miss that you need extremely high talent density and extremely opinionated operational preparation to implement such an environment.
Same thing with "local engineer judgment" and "only a few global rules." This can only work because eventually they did implement tiering, quiet periods, and clear expectations for Tier 0 & Tier 1 systems when live started to show up. In other words, the more business critical the path, the more structure will appear. On the outside it seems like complete freedom; in the ops world it seems like very carefully segmented risk with a cultural wrapper that keeps people from calling it "process." I am sure that is the correct trade-off for them but extremely difficult to transfer to a mid-talent, highly regulated environment without it imploding.
On the AI side, I actually find it refreshing that they seem to be conservative about AI. The current industry narrative is "AI code generation agent for every problem". Netflix is saying "use Gen-AI for the places it can make a huge impact (prototype, migration, anomaly detection, documentation)", "don't pretend it's a silver bullet", and "don't recreate something the market already does well unless you are truly creating a differentiator". I think that is a far more boring approach to AI than "AGI will write all our services", but from a DevOps standpoint, boring + observable + clear ownership typically scales much better than "vibes".
I understand that you have to do sponsored posts and your questions had to be sent in advance and aligned with multiple stakeholders... but man, this interview lacks life. Don't get me wrong, the CTO of Netflix is a wonderful speaker. I like how she conveys technical concepts in a clear way, but these concepts are very basic and her narrative is so refined, like there's an invisible lawyer standing behind her back and giving a thumbs up every time she talks. "We took learnings", "team is so empowered they messaged me the next day after Tyson Paul before I woke up".
Omg, given how big the miss was, it's no wonder they did. Why not talk about what exactly went wrong?
Please give us speakers who are comfortable going into technical details.
Take your "learnings", Gergely, and thanks for your work :)
I kinda get your point, but I believe that you're making an assumption about "Gergely pulling his punches."
You've essentially asked the CTO of a publicly traded company to perform a live autopsy and state specifically what failed, on a semi-sponsored, recorded podcast. In my opinion, there is no way that the Legal Department, Public Relations Department, and NFL/WWE Partners would ever agree to that. I believe that she's going through a "safe" story line, but she still provided a lot more information than many executive officers typically provide (roughly what concurrencies existed, how crazy the Paul/Tyson fight was, the 40-50 page if/then document, Tier 0-1 thinking, how they needed to put guard rails around the operation after the first live event, etc.). While the discussion isn't particularly deep technically, it shows how they think about risks when operating at scale.
If you truly want "this is the specific failure mode," "this is the graph," "this is the deployment schedule," I'd argue those are the kinds of talks that typically come from a member of engineering and/or operations staff presenting at SRECon or posting on the Netflix Tech Blog. I'd enjoy that type of information, but I don't believe it is reasonable to expect that type of candor from their CTO in this setting. From my perspective, the best part of the interview was reading between the lines to see how much operational discipline and process creep back into a company that positions itself as having a high degree of autonomy and very few rules.
Regarding AI, I also appreciate the fact that they seem to be quite conservative in their approach. The current industry trend is to assume "there will be AI coding agents everywhere for everything." Netflix seems to be stating: use genAI wherever it makes sense as a multiplier (e.g., prototyping, migrations, anomaly detection, and documentation), don't overestimate its capabilities, and don't rebuild something the market has been doing well for years unless there is a unique differentiator. While that is far less exciting than "AGI will be writing all of our services," I personally believe that boring, coupled with the ability to monitor and observe what is happening, along with clear ownership, tends to scale a lot better than "vibes".