Engineering Planning with RFCs, Design Documents and ADRs
What are some successful planning approaches engineering teams use as they grow?
Q: As our engineering team grows, we feel the need to do more written planning. Which approaches do tech companies use and how do they work?
The question of whether to document engineering planning is an evergreen one. This issue walks through examples of what certain tech companies do and it attempts to showcase some popular approaches. The newsletter closes with advice on how to decide which formats to choose.
Topics:
Uber’s evolution of planning processes
RFCs and Design Documents
Reviewing RFCs
Architecture Documents
Sourcegraph and RFCs
Stedi: RFCs and Decision Records
Design Docs at Google
Examples of RFCs, and companies that use an RFC-like process. See these in a separate article here.
For the rest of this article, I use the term RFC (Request for Comment) to refer to any type of engineering design document, for simplicity.
For those arriving at this article from The Software Engineer’s Guidebook, please see this full article on RFCs, ERDs and Design Docs. It’s the article I meant to link - sorry for the confusion!
1. Uber’s evolution of planning processes
Uber is a good example of how engineering planning can evolve as a company grows from a few engineers, through to a few hundred, to well over 2,000 software engineers, in less than ten years.
The company has managed to keep an engineering culture in which it can still ship in a reasonably quick and nimble way, even with thousands of engineers. This is key for the company because both the regulatory environment of the gig economy and the competitive landscape change frequently, so players in this field need to respond fast, often in a matter of weeks or months. For example, in 2017 Uber shipped tipping functionality in around two months, in response to competitor Lyft launching this feature. This change touched dozens of systems and involved well over twenty engineering teams.
Uber was founded in 2010 and made its first full-time engineering hire in 2011.
Early on, in around 2012-13, when the company had fewer than 50 engineers, Uber engineers decided to document new services. This decision came from the engineering team; it wasn’t mandated from above.
An engineer created and distributed a Google document called DUCK, referring to the “rubber ducking” of their thinking: using simple language to describe proposals, challenges or problems. The approach caught on, and soon the team proposed that all new services should have a summary document called a DUCK. In the internal wiki, this page was created:
A few months later, the philosophy of this design document was written down and shared, as this:
DUCK is the new standard format for our service proposal process. The acronym doesn’t stand for anything.
When starting a new service, it’s important to document and vet its architecture. We use a peer review process designed to:
Institute a strong, org-wide practice of information sharing
Provide transparency and good cross-team communication about upcoming systems
Provide historical documentation for our design motivations
Catch potential architectural stumbling blocks before they get to production
Ensure that sufficient thought and time are given to resource allocation before the service needs to go to production
As context, Uber invested heavily in microservices and teams were encouraged to build their own microservices. By 2016, the company had more than 1,000 microservices. However, every new service came with dependencies: either it depended on other services, or other services depended on it.
DUCKs were sent to all engineers subscribed to a mailing list dedicated to DUCKs. Initially, relatively few such documents were sent, and these DUCKs helped teams discover dependencies – often before starting to code. As Uber’s engineering team grew to hundreds of engineers, the approach of using DUCKs – designed initially for new services – started to show cracks, with non-backend teams also looking to use this format.
Uber evolved the DUCK format into the RFC, a Request for Comment document. Engineering groups like Backend, Web and Mobile started to create segmented mailing lists to which these documents could be sent, to reduce overall noise as their quantity increased.
The process was also tweaked so that more complex proposals had “approver” fields. This was added to make sure key people read the document and could “sign” if they were happy or add objections if not.
Templates were created to make it easier for engineers to remember key information: for example, service SLAs for services, third party libraries for mobile applications, and so on.
Here are two example templates with suggestions the company used, at one time:
Services:
List of approvers
Abstract (what is the project about?)
Architecture changes
Service SLAs
Service dependencies
Load & performance testing
Multi data-center concerns
Security considerations
Testing & rollout
Metrics & monitoring
Customer support considerations
Mobile:
Abstract (what is the project about?)
UI & UX
Architecture changes
Network interactions detailed
Library dependencies
Security concerns
Testing & rollout
Analytics
Customer support considerations
Accessibility
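To make the idea of a template concrete, here is a minimal sketch in Python that treats the two templates above as plain data and turns one into an empty document skeleton. The section names come from the lists above; the names TEMPLATES and render_skeleton, and the layout of the output, are my own assumptions for illustration and not something Uber used.

```python
# Hypothetical sketch: section names are taken from the two templates above.
# The dictionary name, function name and output layout are assumptions.
TEMPLATES = {
    "services": [
        "List of approvers",
        "Abstract (what is the project about?)",
        "Architecture changes",
        "Service SLAs",
        "Service dependencies",
        "Load & performance testing",
        "Multi data-center concerns",
        "Security considerations",
        "Testing & rollout",
        "Metrics & monitoring",
        "Customer support considerations",
    ],
    "mobile": [
        "Abstract (what is the project about?)",
        "UI & UX",
        "Architecture changes",
        "Network interactions detailed",
        "Library dependencies",
        "Security concerns",
        "Testing & rollout",
        "Analytics",
        "Customer support considerations",
        "Accessibility",
    ],
}


def render_skeleton(kind: str, title: str) -> str:
    """Return a plain-text RFC skeleton with one numbered, empty section per heading."""
    sections = TEMPLATES[kind.lower()]  # raises KeyError for an unknown template
    lines = [f"RFC: {title}", ""]
    for number, section in enumerate(sections, start=1):
        lines.append(f"{number}. {section}")
        lines.append("TODO")  # placeholder the author replaces or deletes
        lines.append("")
    return "\n".join(lines)
```

Calling render_skeleton("mobile", "Some project") returns a document with ten numbered headings to fill in. Sections that do not apply can simply be deleted, which matches how these templates were meant as suggestions.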
As teams grew and learned from failures, some groups added extra checkpoints to avoid past mistakes. For example, when changing the functionality of Payments, regional and legal considerations needed to be checked. When touching systems storing data, GDPR needed to be considered after the regulation was introduced.
A common question was: when should an RFC be written? My take on the question was this:
For small changes: don’t bother. Just make the change.
For changes that are non-trivial and have dependencies: consider writing one.
The effort to write an RFC should be proportionate to the complexity of the task. If the work is moderately complex, you should be able to finish the RFC quickly, and many sections might not apply. For complex projects with many dependencies and many risks, you’ll have to spend more time on this.
As Uber grew to well over 2,000 engineers, the RFC process started to cause friction. The biggest problems were these:
Noise. By this time, hundreds of RFCs went out weekly, most of them to large mailing lists, and engineers also sent them directly to each other. While this was great for visibility, it overwhelmed more experienced engineers who were asked to look at many RFCs.
Ambiguity on which work needed an RFC. Every team had autonomy to decide how and when to write an RFC. With hundreds of teams, this meant many different interpretations.
Discoverability. RFCs were stored as Google Docs, and finding these documents was not easy. Some teams stored them in Google Drive folders, others linked to them in wiki pages.
Led by one of Uber’s principal engineers, the company overhauled this process and rolled out a tiered Engineering Planning Process. The changes were these:
Tooling. Uber built a tool where all engineering planning documents and Product Requirements Documents (PRDs) were stored. This tool was searchable, with approvers clearly marked, and approval tasks integrated with Uber’s task systems, Phabricator and JIRA.
Simplifying for smaller changes, more formal for larger changes. The company introduced lightweight templates for changes limited to team scopes, to help with the problem of overly long documents for relatively simple changes. More heavyweight templates were created for large-scale changes with organization or company-wide impacts.
A tiered approach. The company defined a process to decide how “critical” a change was. For the most critical changes, formal reviews were put in place where experienced engineers in an organization would do reviews on a weekly basis, on top of the asynchronous reviews. For less critical changes, teams had the freedom to adopt the approach they wanted; they could do one of these reviews but were encouraged to keep working as they did so.
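The last two changes can be read as a small decision table: the scope of a change picks the template, and its criticality picks the review style. Below is a rough sketch of that table in Python. It is only an illustration under my own assumptions: the names Scope, Plan and plan_for are hypothetical, and criticality is reduced to a yes/no input because how Uber actually scored it is not described here.

```python
from dataclasses import dataclass
from enum import Enum


class Scope(Enum):
    """How far the impact of a change reaches (hypothetical categories)."""
    TEAM = "team"
    ORGANIZATION = "organization"
    COMPANY = "company"


@dataclass(frozen=True)
class Plan:
    template: str       # "lightweight" or "heavyweight"
    formal_review: str  # "required" or "optional"


def plan_for(scope: Scope, is_critical: bool) -> Plan:
    """Map a change to a template and a review style, following the tiers above."""
    # Team-scoped changes get the lightweight template; changes with
    # organization-wide or company-wide impact get the heavyweight one.
    template = "lightweight" if scope is Scope.TEAM else "heavyweight"
    # The most critical changes go through the weekly formal review, on top of
    # asynchronous review. For the rest, the formal review is the team's choice.
    formal_review = "required" if is_critical else "optional"
    return Plan(template=template, formal_review=formal_review)
```

For example, plan_for(Scope.TEAM, is_critical=False) yields a lightweight template with an optional formal review, which is the case where a team keeps working while any review happens.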
2. RFCs and Design Documents
Many Big Tech companies and high-growth startups have ended up with a process similar to Uber’s, and have settled on a form of the Design Document or RFC process.
Here’s how it usually works: