A startup on hard mode: Oxide, Part 2. Software & Culture
Oxide is both a hardware and a software startup: it assembles the hardware for its Cloud Computer and builds the software stack from the ground up. A deep dive into the company’s tech stack & culture.
👋 Hi, this is Gergely with a subscriber-only issue of the Pragmatic Engineer Newsletter. In every issue, I cover challenges at Big Tech and startups through the lens of engineering managers and senior engineers. To get articles like this in your inbox, every week, subscribe:
Before we start: we are running research on bug management and “keep the lights on” (KTLO). This is an area many engineering teams struggle with, and we’d love to hear what works for you and your organization. You can share details here with us – with Gergely and Elin, that is. Thank you!
Hardware companies are usually considered startups on “hard mode” because hardware needs more capital and has lower margins than software – a challenge reflected in the fact that there are far fewer hardware startup success stories than software ones. And Oxide is not only building novel hardware – a new type of server named “the cloud computer” – but also producing the software stack from scratch.
I visited the company’s headquarters in Emeryville (a few minutes by car across the Bay Bridge from San Francisco) to learn more about how Oxide operates, with cofounder and CTO Bryan Cantrill.
In Part 1 of this mini-series, we covered the hardware side of the business: building a networking switch, using “proto boards” to iterate quickly on hardware, the hardware manufacturing process, and related topics. Today, we wrap up with:
Evolution of “state-of-the-art” server-side computing. Mainframes were popular in the 1960s-70s, and since the 2000s, PC-like servers have taken over data centers, while hyperscalers like Google and Meta build their own custom server hardware.
Software stack. Built from the ground up with Rust: an open source operating system, debugger, and utilities, plus a hypervisor based on bhyve, TypeScript, CockroachDB, and other technologies.
Compensation & benefits. Nearly everyone makes the same base salary of $201,227, except salespeople with incentives. It’s a rare compensation strategy that may not work forever, but does now!
Hiring process. A writing-heavy process that showcases how important effective writing and analysis are. Interestingly, everyone sees each other’s “work sample” packages.
Engineering culture. Remote-first, writing-heavy, RFDs, recorded meetings, no performance reviews, and more.
Software and hardware engineering collaboration. At most companies, software engineers have to accept that hardware is unchangeable, and hardware engineers accept the same about software. But when software and hardware engineers truly communicate, they realize neither is true, and they can change anything and everything – as Oxide has done.
Impact of Apple and Sun. Apple is the best-known consumer tech company that makes its own hardware and software, while Sun was the last major server maker of this type. Bryan worked at Sun for 14 years, and Oxide follows a playbook similar to the one that made Sun successful in the 1990s.
As always, these deep dives into tech companies are fully independent, and I have no commercial affiliation with them. I choose businesses to cover based on interest from readers and software professionals, and also when it’s an interesting company or domain. If you have suggestions for interesting tech businesses to cover in the future, please share!
1. Evolution of “state-of-the-art” server-side computing
In Part 1, we looked at why Oxide is building a new type of server – and why now, in 2024. After all, building and selling a large, relatively expensive cloud computer as big as a server rack seems a bit of a throwback to the bygone mainframe computing era.
The question is a good opportunity to look at how servers have evolved over 70 years. In a 2020 talk at Stanford University, Bryan gave an interesting overview. Excerpts below:
1961: IBM 709. This machine was one of the first to qualify as a “mainframe,” as it was large enough to run time-shared computing. It was a vacuum tube computer, weighed 33,000 pounds (15 tonnes), and occupied 1,900 square feet (180 sqm), consuming 205 kW. Today, a full rack consumes around 10-25 kW. Add to this the required air conditioning, which added another 50% in weight, space, and energy usage!
1975: PDP-11/70. Machines were getting smaller and more powerful.
1999: Sun E10K. Many websites used Sun servers in the late 1990s, when the E10K looked state-of-the-art. eBay famously started off with a 2-processor Sun machine, eventually using a 64-processor, 64GB Sun E10K version to operate the website.
2009: x86 machines. Within a decade, Intel’s x86 processor family won the server battle on value for money, offering the same amount of compute for a fraction of the price charged by vendors like Sun. Around 2009, HP’s DL380 was a common choice.
Initially, x86 servers had display ports and CD-ROM drives, which was odd on a server. The reason was that these machines were architecturally personal computers, despite being rack-mounted. They were popular because of the standout price-for-performance of the x86 processor.
2009: hyperscale computing begins at Google. Tech giants believed they could have better servers by custom-building their own server architecture from scratch, instead of using what was effectively a PC.
Google aimed to build the cheapest-possible server for its needs, and optimized all parts of the early design for this. This server got rid of unneeded things like the CD-drive and several ports, leaving a motherboard, CPUs, memory, hard drives, and a power unit. Google kept iterating on the design.
2017: hyperscale computing accelerates. It wasn’t just Google that found vendors on the market didn’t cater for increasingly large computing needs. Other large tech companies decided to design their own servers for their data centers, including Facebook:
By then, hyperscale compute had evolved into compute sleds with no integrated power supply. Instead, they plugged into a DC bus bar. Most hyperscalers realized that optimizing power consumption was crucial for building efficient, large-scale compute. Bryan says:
“When your goal is to improve your power usage effectiveness, you want to be as efficient as possible and have all of your power go to your computing, and as little as possible to heating the room.”
2020: server-side computing still resembles 2009. While hyperscale computing went through a major evolution in a decade, a popular off-the-shelf server in 2020 was still the HPE DL560:
It remains a PC design and ships with a DVD drive and display ports. Bryan’s observation is that most companies lack the “infrastructure privilege” of hyperscalers such as Google and Meta, which build custom solutions and have greatly innovated in server-side efficiency.
Why has there been no innovation in modernizing the server, so that companies can buy an improved server for large-scale use cases? Bryan says:
“Actually, there have been many, many attempts at innovating hardware within the cloud. Attempts occurred at established companies, like Intel’s attempt with the Intel Rack Scale Design (2017) or HP’s HPE Moonshot (2013). Startups like Nebula (2011-2015) and Skyport (2013-2018, acquired by Cisco) also tried to solve this problem.
Each attempt fell short for its own reasons, but the common theme I see is that they were either insufficiently ambitious, or insufficiently comprehensive – and sometimes both.
Solving the problem of building a new type of cloud computing building block requires both hardware and software, and they must be co-designed. For established players, doing this is simply too disruptive. They would rather reuse their existing hardware and software stack! And for startups, it is too capital-intensive, as the cost of building both hardware and software from scratch is just too large.”
2. Software stack
Rust is Oxide’s language of choice for the operating system and backend services. Software engineer Steve Klabnik was previously on Rust’s core team, and joined Oxide as one of the first software engineers. On the DevTools.fm podcast, he outlined reasons why a developer would choose Rust over C or C++ for systems programming:
“Rust allows you to do everything C and C++ does, but it helps you do those tasks significantly more. If you're doing low-level work, you have to use very sharp tools and sharp tools can sometimes cut you. And there's like a weird balance there.
Additionally this low-level space hasn't really seen a lot of new programming languages in a long time. So these other languages tend to be much more old-school – therefore harder to use – if you weren’t doing them since the 90s.
Rust brings a 2010s-era development experience to a space that is pretty solidly stuck in the 70s and 80s. There’s a lot of people who don’t really care about the Rust versus C++ language, but that there is a developer experience that is more familiar to them, makes Rust worthwhile.”
Interestingly, Bryan reveals that going all-in on Rust greatly helped with hiring, probably thanks to a combination of the Rust community being relatively small and Oxide being open about its commitment to the language. Initially, it was more challenging to find qualified hardware engineers than software engineers, perhaps because software engineers who were into Rust had already heard about Oxide.
Open source is a deliberate strategy for Oxide’s builds and software releases, and another differentiator from other hardware vendors who ship custom hardware with custom, closed source software.
The embedded operating system running on the microcontrollers in Oxide’s hardware is called Hubris. (Note that this is not the operating system running on the AMD CPUs: that operating system is Helios, as discussed below.) Hubris is all-Rust and open source. Characteristics:
Microkernel-based: it uses the near-minimum amount of software to implement an operating system.
A memory-protected system: tasks, the kernel, and drivers all live in disjoint protection domains. Separation is important even when using a memory-safe language like Rust.
A highly debuggable operating system, thanks to a dedicated debugger called Humility.
Static in application execution and application payload. Many operating systems create tasks dynamically at runtime, but Hubris was designed so that the tasks for a given application are specified at build time. Bryan says:
“This is the best of both worlds: it is at once dynamic and general purpose with respect to what the system can run, but also entirely static in terms of the binary payload of a particular application — and broadly static in terms of its execution. Dynamic resource exhaustion is the root of many problems in embedded systems; having the system know a priori all of the tasks that it will ever see, liberates it from not just a major source of dynamic allocation, but also from the concomitant failure modes.”
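To make that contrast concrete, here is a short, purely illustrative Rust sketch of the idea. This is not Hubris’s actual code (Hubris declares tasks in a build-time application manifest); it only shows what a fixed, build-time task table looks like compared to spawning tasks at runtime:

```rust
// Illustrative only – not Hubris's real data structures.
// The point: the complete set of tasks is a compile-time constant,
// so the system can size every table up front and never needs to
// allocate (or fail to allocate) a task at runtime.

struct TaskDescriptor {
    name: &'static str,
    priority: u8,
    stack_size: usize,
}

// Known a priori at build time; no task is ever created or destroyed later.
const TASKS: &[TaskDescriptor] = &[
    TaskDescriptor { name: "supervisor", priority: 0, stack_size: 1024 },
    TaskDescriptor { name: "uart-driver", priority: 1, stack_size: 2048 },
    TaskDescriptor { name: "app", priority: 2, stack_size: 4096 },
];

fn main() {
    for task in TASKS {
        println!(
            "{} (priority {}, {} B stack)",
            task.name, task.priority, task.stack_size
        );
    }
}
```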
If you want to run Hubris on actual hardware and debug it with Humility, you can do so by ordering a board that costs around $30; the ST Nucleo-H753ZI evaluation board is suitable:
The hypervisor. A hypervisor is important software in cloud computing. Also known as a “Virtual Machine Monitor (VMM),” the hypervisor creates and runs virtual machines on top of physical machines. Server hardware is usually powerful enough to warrant dividing one physical server into multiple virtual machines, or at least being able to do this.
Oxide uses a hypervisor solution built on the open source bhyve, which is itself built into illumos, a Unix operating system. Oxide maintains its own illumos distribution called Helios and builds its own, Rust-based VMM userspace, called Propolis. Oxide shares more about the hypervisor’s capabilities in online documentation.
Oxide has also open sourced many other pieces of software purpose-built for its own stack, along with other neat tools:
Omicron: Oxide’s rack control plane. Read more about its architecture.
Crucible: Oxide’s distributed storage service
Bootleby: a minimal, general bootloader
Design-system: base frontend components used across Oxide clients
OPTE: Oxide’s packet transformation engine
Dropshot: a library for exposing REST APIs from a Rust application (see a short example after this list)
Typify: a compiler from JSON Schema to Rust types
Console: the Oxide web console
…and many others!
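As a small taste of building on this stack, below is a minimal sketch of a REST endpoint exposed with Dropshot. It follows the shape of Dropshot’s public examples, but exact types and signatures vary between Dropshot versions, and the endpoint itself (a made-up GET /status) is purely for illustration:

```rust
// Minimal Dropshot sketch (assumes the `dropshot` and `tokio` crates).
// Signatures differ between Dropshot versions; treat this as a shape,
// not a copy-paste reference.
use dropshot::{
    endpoint, ApiDescription, ConfigDropshot, ConfigLogging, ConfigLoggingLevel,
    HttpError, HttpResponseOk, HttpServerStarter, RequestContext,
};

// A hypothetical endpoint: GET /status returns a plain string.
#[endpoint {
    method = GET,
    path = "/status",
}]
async fn get_status(
    _rqctx: RequestContext<()>,
) -> Result<HttpResponseOk<String>, HttpError> {
    Ok(HttpResponseOk("ok".to_string()))
}

#[tokio::main]
async fn main() -> Result<(), String> {
    // Log to stderr; Dropshot uses slog under the hood.
    let log = ConfigLogging::StderrTerminal { level: ConfigLoggingLevel::Info }
        .to_logger("status-server")
        .map_err(|e| e.to_string())?;

    // Register the endpoint and start an HTTP server with an empty context.
    let mut api = ApiDescription::new();
    api.register(get_status).map_err(|e| e.to_string())?;
    let server = HttpServerStarter::new(&ConfigDropshot::default(), api, (), &log)
        .map_err(|e| e.to_string())?
        .start();
    server.await
}
```

A nice property of this declarative style is that Dropshot can generate an OpenAPI description of the API from the same endpoint definitions.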
Other technologies Oxide uses:
TypeScript: the language of choice for everything frontend. The Oxide web console, design assets, and RFD site all use this language.
CockroachDB: the distributed database used for the control plane data storage system.
ClickHouse: the open source column-oriented database management system used to collect and store telemetry data for the Oxide Rack.
Tailwind CSS: a utility-first CSS framework for specifying styles in markup, used on websites built by the Oxide team.
Terraform: Oxide’s requests for discussion (RFD) site uses Terraform to describe its underlying infrastructure as code, including the Google Cloud zone the site runs in. This is more of an internal infrastructure choice – and a rather simple one – but I find it interesting.
Figma: used for design mockups, and Oxide’s design system library syncs with Figma. Check out a deep dive into Figma’s engineering culture.
3. Compensation & benefits
Oxide chose a radically different compensation approach from most companies, with almost everyone earning an identical base salary of $201,227. The only exception is some salespeople on a lower base salary, but with commission.
How did this unusual setup emerge? Bryan shares that the founders brainstormed to find an equitable compensation approach which worked across different geographies. Ultimately, it came down to simplicity, he says:
“We decided to do something outlandishly simple. Take the salary that Steve, Jess, and I were going to pay ourselves, and pay that to everyone. The three of us live in the San Francisco Bay Area, and Steve and I each have three kids; we knew that the dollar figure that would allow us to live without financial distress – which we put at $175,000 a year – would be at least universally adequate for the team we wanted to build. And we mean everyone: as of this writing we have 23 employees, and that’s what we all make.”
This unusual approach supports company values:
Teamwork: “The need to quantify performance in order to justify changes to compensation is at the root of much of what’s wrong in the tech industry; instead of incentivizing people to achieve together as a team, they are incentivized to advance themselves.”
Equitability: the founders treat people as they wish to be treated, and identical salaries mean no negotiations.
Transparency: colleagues know what each other earns, so a potentially tricky topic is neutered.
The company updates the base salary annually to track inflation: in 2024, everyone makes $201,227. Bryan acknowledged this model may not scale if Oxide employs large numbers of people in the future, but he hopes the spirit of this comp approach would remain.
Other benefits. Oxide offers benefits on top of salary – mostly health insurance, which is very important in the US:
Medical, dental, and vision insurance in the US, 100% paid for employees and dependents.
Optional FSA plan for out-of-pocket healthcare and dependent care expenses.
Reimbursement of up to $17,000 annually for various surgery expenses
Retirement plan (401(k))
Medical coverage for non-US remote folks
In another example of transparency, the policy documentation for these benefits was made public in 2022 in a blog post by systems engineer, iliana etaoin.
4. Heavyweight hiring process
Oxide’s hiring process is unlike anything I’ve seen, and we discussed it with the team during a podcast recording at their office.