Startups on hard mode: Oxide. Part 1: Hardware
What is tougher than building a software-only or hardware-only startup? Building a combined hardware and software startup. This is what Oxide is doing as they build a “cloud computer.” A deep dive.
👋 Hi, this is Gergely with a subscriber-only issue of the Pragmatic Engineer Newsletter. In every issue, I cover challenges at Big Tech and startups through the lens of engineering managers and senior engineers. To get articles like this in your inbox every week, subscribe:
What does an early-stage startup look like? Usually, there’s an office with devs working on laptops, a whiteboard with ideas, and lots of writing of code. Sometimes, there’s no physical office because team members work remotely, sharing ideas on virtual whiteboards.
But what about companies which aren’t “pure” software startups, and which focus on hardware? I recently got a glimpse of one in San Francisco, at the offices of Oxide Computer Company, which is a startup building a new type of server. I was blown away by the working environment and the energy radiating from it. This is their office:
Some things are hard to describe without experiencing them, and this includes being at a hardware+software startup right as the first product is being finished, with the team already iterating on it. In today’s issue, we cover hardware at Oxide:
Why build a new type of cloud computer?
Building a networking switch from scratch
Using “proto boards” to build faster
A remote-first hardware company
Custom hardware manufacturing process
The importance of electrical engineers (EE)
Working closely with hardware vendors
In Part 2 of this mini-series, we wrap up with:
Evolution of “state-of-the-art” server-side computing
Software stack
Compensation & benefits
Hiring process
Engineering culture
Software and hardware engineering collaboration
Impact of Apple and Sun
1. Why build a new type of cloud computer?
If you want to build an application or service for millions of users, there are two main options for the infrastructure:
Cloud provider. AWS, GCP, Azure, Oracle, etc. Elsewhere, Uber is making the move from on-prem to GCP and Oracle, as previously covered.
On-premises (“on-prem”). Operate your own servers, or more commonly rent rack space at a data center for them. This approach is sometimes called a “private cloud.” We’ve covered how and why booking platform Agoda is staying on-prem and how its private cloud is built. Social media site Bluesky has also used its own data centers since leaving AWS.
In data centers, the unit of measurement is “one rack.” A rack is a standardized frame that holds a few dozen servers; these thin machines are often called “pizza box servers” because of their shape, while thicker ones are called “rack servers.” Side note: an alternative to pizza box servers is blade servers, which slot into blade enclosures as building blocks in data centers.
Here’s a “pizza box server” that online travel booking platform Agoda utilized heavily:
And here are some commodity servers in Oxide’s office:
The rack in the image above was operating during my visit. It was loud, generated a lot of heat, and had lots of cables, as expected. It’s messy to look at and also to operate: the proprietary PC-era firmware causes security, reliability, and performance issues. “PC-era” refers to the period from the 1980s to the early 2000s, before x86 64-bit machines became the servers of choice.
Elsewhere, Big Tech companies have manufactured their own highly optimized racks and servers, but these aren’t for sale. The likes of Meta, Google, and Amazon no longer use traditional racks, and have “hyper-scaled” their servers to be highly energy efficient, easier to maintain, and with few cables.
Joe Kava, VP of Google's Data Center Operations, described these racks back in 2017:
“Many of our racks don’t look like traditional server racks. They are custom designed and built for Google, so we can optimize the servers for hyper-efficiency and high-performance computing.”
Back to Oxide, whose vision is to build a cloud computer that incorporates the technological advances of Big Tech’s cloud racks, but makes them available to all. What if smaller tech companies could purchase energy-efficient servers like those that Meta, Amazon, Google and Microsoft have designed for themselves, and which customers of the big cloud providers like AWS, GCP, and Azure use – but without being locked in?
This is what the Oxide Computer offers, and I’ve seen one of its first racks. It appears similar in size to a traditional rack, but the company says it occupies 33% less space while offering the same performance. It’s also much quieter than everyday commodity servers; by comparison, the Gigabyte and Tyan machines in the office are ear-splitting. And there are hardly any cables compared to a typical rack.
The benefits of the Oxide computer compared to traditional racks:
Faster installation. Installing a traditional rack typically takes weeks or months, says Bryan Cantrill, Oxide’s cofounder and CTO, because the servers need to be put in, wired up, and then tested. The Oxide rack arrives fully assembled; it just needs to be slotted in at a data center.
Space and power efficiency. The rack uses less power and occupies less space. The lower noise reflects this: heat is channeled more effectively, so the fans don’t need to work as hard, which in turn saves power.
Comes with integrated software to manage elastic infrastructure. With traditional rack-mounted servers, it’s necessary to select software to manage virtual machines, such as VMware, Metal as a Service (MAAS), Proxmox Virtual Environment, or OpenStack Ironic. The Oxide cloud computer includes built-in virtualization for storage (an equivalent of AWS’s Elastic Block Store) and networking (an alternative to virtual private clouds).
Oxide’s target customer is anyone running large-scale infrastructure on-prem for regulatory, security, latency, or economic reasons. The Oxide rack comes with 2,048 CPU cores (64 cores per “sled,” where a sled is Oxide’s version of a rackmount server), 16-32TB of memory (512GB or 1TB per sled), and 1PB (petabyte) of storage (32TB per sled). See full specification.
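As a back-of-the-envelope check, the rack-level totals follow from the per-sled figures if we assume a fully populated rack of 32 sleds (the sled count is inferred from the totals above, not quoted from Oxide). A minimal sketch:

```rust
// Scaling the per-sled spec to a full rack. The sled count of 32 is an
// assumption that is consistent with the published rack-level totals.
fn main() {
    let sleds = 32u64;
    let cores_per_sled = 64u64;
    let min_memory_gb_per_sled = 512u64;  // 512GB configuration
    let max_memory_gb_per_sled = 1024u64; // 1TB configuration
    let storage_tb_per_sled = 32u64;

    println!("CPU cores: {}", sleds * cores_per_sled); // 2048
    println!(
        "Memory: {}-{} TB",
        sleds * min_memory_gb_per_sled / 1024, // 16 TB
        sleds * max_memory_gb_per_sled / 1024  // 32 TB
    );
    println!("Storage: {} TB (~1 PB)", sleds * storage_tb_per_sled); // 1024 TB
}
```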
This kind of setup makes sense for companies that already operate thousands of CPU cores. For example, we previously covered how Agoda operated 300,000 CPU cores in its data centers in 2023; at such scale investing in racks like Oxide’s could make sense. Companies in the business of selling virtual machines as a service might also find this rack an interesting investment to save money on operations, compared to traditional racks.
An interesting category of potential customer is companies running thousands of CPU cores in the public cloud, but which are frustrated by network latencies. There’s a growing sense that multi-tenancy in public clouds, where one networking switch serves several racks and customers, causes worse latency that customers cannot debug or improve. In contrast, an Oxide rack offers dedicated rack space in a data center. Using these servers can also considerably reduce network latencies, because the customer can choose the data center based on their own regional needs. Customers also get full control over their networking and hardware stack – something that’s not possible when using a cloud provider.
Oxide doesn’t target smaller startups that only need a few hundred CPU cores. For these businesses, using cloud providers, or buying/renting and operating smaller bare metal servers is the sensible solution.
2. Building a networking switch from scratch
In server manufacturing, where does innovation come from? I asked Bryan:
“Companies like Google, Meta and similar companies producing their custom hardware and software to build better servers, could bring competition to the market. However, it’s highly unlikely that these companies would release their servers as a commercial product. It’s not their business model.
So, no, the next big server innovation will not come from Google or a similar company. It will come from a startup. And we want to be that startup.”
Oxide had to design two pieces of hardware from scratch: the switch and the server.
Why build a switch instead of integrating a third-party switch?
Oxide’s mission is to build their own cloud computer. Building a custom server usually means taking off-the-shelf components and integrating them: the server chassis, a reference design system board, and a separately developed network switch. A “reference design” is a blueprint for a system, with comprehensive guidance on where to place its elements, that has been certified to work as intended: it should not overheat or cause unexpected interference.
However, Oxide also needed to build their own networking switch, as well as build a custom server – which is quite an undertaking! This need came from the constraint that Oxide wanted to control the entire hardware stack, end-to-end. A networking switch is a “mini-computer” in itself. So in practice, they designed and built two computers, not just one.
Producing a separate switch meant integrating a switching application-specific integrated circuit (ASIC), management CPU, power supplies, and physical network ports.
Oxide’s goals for this switch were:
Highly available operation. Each Oxide rack has two networking switches that operate simultaneously, for high availability. If links to one switch have issues, or a switch needs to be serviced, the servers can still reach the network via the other switch, ensuring more reliable operation than a single-switch setup (a toy sketch of this failover idea follows this list).
Integrated with control plane software. The single most important factor in Oxide’s decision was the desire to deliver a quality end-to-end experience for multi-tenant elastic infrastructure. The team knew from their experience of deploying public cloud infrastructure that the switch is often a nexus of reliability and performance issues.
Use the same “core” hardware as the server. The switch must use the same regulators and power controller as the Oxide servers.
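To make the dual-switch goal concrete, here is a toy sketch of a dual-homed server picking a healthy uplink. It illustrates the failover concept only; the types and health flags are made up for the example and are not Oxide’s control plane logic:

```rust
// Toy model of dual-homed networking: a server has one uplink per switch
// and keeps sending traffic as long as at least one uplink is healthy.
// All names here are illustrative, not taken from Oxide's software.
#[derive(Debug)]
struct Uplink {
    switch_name: &'static str,
    healthy: bool,
}

/// Prefer the first healthy uplink; return None only if both switches are down.
fn pick_uplink<'a>(uplinks: &'a [Uplink; 2]) -> Option<&'a Uplink> {
    uplinks.iter().find(|u| u.healthy)
}

fn main() {
    let uplinks = [
        Uplink { switch_name: "switch-0", healthy: false }, // e.g. being serviced
        Uplink { switch_name: "switch-1", healthy: true },
    ];
    match pick_uplink(&uplinks) {
        Some(u) => println!("sending traffic via {}", u.switch_name),
        None => println!("no healthy uplink: rack is unreachable"),
    }
}
```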
Building the custom networking switch took around two years, from the start of design in March 2020 to the first unit being assembled in January 2022.
Building custom hardware almost always comes with unexpected challenges. In the case of the networking switch, the team had to work around an incorrect voltage regulator on the board, marked with yellow tape in the image above.
3. Using proto boards to build hardware faster
“Proto boards” is short for “prototype printed circuit boards”: boards that help the company test small components and ensure they work independently. Once validated, those components can be used as building blocks.
Starting out, the company took inspiration from robotics. Founding engineer Cliff L. Biffle shares:

“When we set out to build a server from scratch, we didn’t want to go straight to building the server motherboard. However, when we started we had zero full-time electrical engineers!
There’s a process I learned from robotics people in a previous job, called a ‘roadkill build.’ You get all the parts that will end up in the thing you eventually build, but instead of being integrated, they are all spread out across a bench with nasty cables between them, so that you can probe them, poke them, and replace them. We thought it would be a good idea to do this for the servers.”
First prototype board
For the initial prototype printed circuit board, the team started with the service processor. This was a well-understood and critical part of both the server and the switch. The team decided to build it from two off-the-shelf microcontrollers, and designed a prototype board around them:
The team found that it’s possible to bite off more than you can chew, even with prototype circuit boards. The first board was a success in that it worked: upon start-up, the hardware and software “came up” (in electrical engineering, “coming up” refers to a board successfully powering on and initializing). But the board took longer to develop than the team wanted.
It turns out this prototype board was too highly integrated, with too many moving parts and too many I/O pins. There was simply too much on the board for it to be productive. The team learned that progress would be faster with multiple, simpler boards acting as pluggable hardware modules, instead of one complicated board with lots of fixed functionality. As engineer Rick Altherr – who worked on the board – noted:
“We put too many things on. The concept we ran into was that an x86 server, with all of its management, is way too complicated. Let’s slim it down to just the management subsystems. This board is intended to be the service processor. But it turns out that even that’s too much.
By having so many things be on one board, instead of pluggable modules, it meant that we committed to a lot of design choices. For example, the two ethernet jacks were never actually used because we changed our philosophy on how that was going to work, before we got the boards back from manufacturing.”
A more modular approach
Before building the service processor board, the team split the “root of trust” (RoT) functionality onto its own board. The RoT hardware is the foundational base upon which the security and trust of the entire system are built. A RoT has “first instruction integrity,” guaranteeing exactly which instructions run at startup. The RoT sets up secure boot and permanently locks the device to ensure ongoing secure operation. Below is the prototype of Oxide’s RoT module:
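As a rough mental model of what such a module does at power-on, here is a minimal, hypothetical sketch of the secure-boot handoff. It illustrates the general pattern of “first instruction integrity” rather than Oxide’s firmware; the boot stage and signature-check helpers are stand-ins:

```rust
// A minimal, hypothetical sketch of what a root of trust does at power-on.
// This illustrates the general secure-boot pattern, not Oxide's firmware:
// the signature-check helper below is a stand-in.

struct BootStage {
    name: &'static str,
    image: Vec<u8>,     // the next stage's code, read from flash
    signature: Vec<u8>, // signature over that image, produced at build time
}

/// Stand-in for a real signature check against a key baked into the RoT.
fn signature_is_valid(image: &[u8], signature: &[u8]) -> bool {
    // In a real RoT this would be a public-key verification in immutable ROM.
    !image.is_empty() && !signature.is_empty()
}

/// "First instruction integrity": the RoT runs first, out of immutable ROM,
/// and refuses to hand control to the next stage unless it verifies.
fn boot(next: &BootStage) -> Result<(), String> {
    if signature_is_valid(&next.image, &next.signature) {
        println!("RoT: {} verified, handing off control", next.name);
        Ok(())
    } else {
        Err(format!("RoT: {} failed verification, halting boot", next.name))
    }
}

fn main() {
    let stage = BootStage {
        name: "service-processor firmware",
        image: vec![0x90, 0x90],
        signature: vec![0x01],
    };
    match boot(&stage) {
        Ok(()) => println!("system is up"),
        Err(e) => eprintln!("{e}"),
    }
}
```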
Other modules the Oxide team built include a power module and a multiplexer (a device that selects which of its multiple input signals to send to the output):
Over time, the team found the right balance of how much functionality a prototype board needs. Below is an evolved version of the service processor prototype board:
The Oxide team calls this board a “workhorse,” because they can plug so many modules into it and use it for a great deal of testing and hardware and software development. Here’s an example:
A prototype board is unit testing for hardware. In software development, unit tests ensure that components continue to work correctly as the system is modified. Oxide found that prototype boards come pretty close to this approach, and allow the team to iterate much faster on hardware design than by manufacturing and validating complete test devices.
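As a software-side analogy (not Oxide’s code), this is the role a proto board plays: a small module with its own tests, validated in isolation so that later changes to the rest of the system can’t silently break it. The sensor-conversion function below is made up for illustration:

```rust
// Software analogy for a proto board: a small, isolated module with its own
// tests. The module and values here are invented for illustration.

/// Convert a raw ADC reading from a (hypothetical) temperature sensor into
/// degrees Celsius – the kind of small, self-contained function a proto board
/// lets you validate in isolation on the hardware side.
pub fn adc_to_celsius(raw: u16) -> f32 {
    // Assume a 12-bit ADC mapped linearly onto a 0-100 °C range.
    (raw as f32 / 4095.0) * 100.0
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn zero_reading_is_zero_degrees() {
        assert!(adc_to_celsius(0).abs() < 1e-6);
    }

    #[test]
    fn full_scale_reading_is_100_degrees() {
        assert!((adc_to_celsius(4095) - 100.0).abs() < 1e-3);
    }
}
```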