Real-World Engineering Challenges Roundup

C++ code sharing, accessible colors, migration patterns and selling more milkshakes. Issue 2.

This is a bonus issue of the Pragmatic Engineer Newsletter. In every issue, I cover challenges at Big Tech and high-growth startups through the lens of engineering managers and senior engineers.

Real-World Engineering Challenges is a bonus column on top of the weekly articles. Each month, I share a handful of the most interesting real-world engineering challenges I’ve come across. Each article covers its engineering approach in depth, and you can learn something new by reading it and digging deeper into the concepts it mentions.

Snap improved an open source tool that Dropbox started, but no longer maintains

Snap uses Djinni, a tool that generates bridging code between C++ and Java or Objective-C. The tool was built and open sourced by Dropbox, which used it to share C++ code between its Android and iOS applications. In 2018, Dropbox decided to move away from shared C++ code, so they stopped maintaining Djinni, and wrote an interesting post on why they no longer share C++ code for their mobile apps. The short of it is that there was too much overhead.

Meanwhile, Snap had started using Djinni in several of their mobile apps, including Snapchat, Lens Studio and Snap Camera. They saw plenty of improvement opportunities. With Dropbox no longer maintaining the library, they forked it, and they maintain this fork to this day.

This article summarizes a few improvements they made to the library:

  • Better string marshalling for large strings, which improved performance 3-10x. Though the gains are measured in nanoseconds, they add up at scale.

  • Zero-copy buffers using references over binary types. Their explanation of how they pass buffer data between C++ and Java is educational.

  • Eliminating Java finalizers in the generated code and reducing crashes caused by these finalizers. The explanation showcases a non-deterministic crash that happens because of how the Android garbage collector works.
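To make the zero-copy idea concrete, here is a minimal Python sketch. It is not Djinni's generated code or Snap's implementation. It only illustrates the general difference between copying a buffer and handing out a view over the same memory, using Python's built-in bytearray and memoryview. The function names copy_slice and view_slice are hypothetical.

```python
def copy_slice(buf: bytearray, start: int, end: int) -> bytes:
    """Return an independent copy of buf[start:end] (allocates new memory)."""
    return bytes(buf[start:end])


def view_slice(buf: bytearray, start: int, end: int) -> memoryview:
    """Return a view that shares memory with buf; no bytes are copied."""
    return memoryview(buf)[start:end]


data = bytearray(b"hello world")
copied = copy_slice(data, 0, 5)
viewed = view_slice(data, 0, 5)

# Mutating the original is visible through the view, but not through the copy.
data[0] = ord("J")
print(copied)         # b'hello'
print(bytes(viewed))  # b'Jello'
```

The view avoids the allocation and the copy, at the cost of having to keep the underlying buffer alive and unchanged for as long as the view is in use. That lifetime question is the kind of thing a bridging layer has to get right.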

I particularly like this article as it showcases how one company open sourcing their internal library can help other companies, which can then take over maintenance and give back to the broader engineering community. Nice work, Dropbox and Snap!

Read the full article

Stripe built a tool to easily design accessible color systems

When I was at Uber, our mobile team relied entirely on our designers to set colors and fonts. While engineers would advocate for iOS and Android accessibility, I personally never paid much attention to colors, or realized the important accessibility role they play for visually impaired users.

Stripe wanted to create a tool that gives real-time feedback on accessibility, rather than hand-picking or muting colors, which is what most teams rely on.

The article covers RGB color spaces and the difference between display color spaces and perceptually uniform color spaces, both of which are important to understand when working with colors. Stripe built a web-based tool that allowed them to manipulate colors in perceptually uniform color spaces, and come up with accessible colors that also look great.
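For a feel of what accessibility feedback on a color pair can look like, here is a small Python sketch. It is not Stripe's tool, and I am assuming the WCAG 2.0 contrast ratio as the accessibility check, which the summary above does not specify. It computes WCAG relative luminance, the contrast ratio between two sRGB colors, and CIELAB lightness (L*) as one example of a perceptually uniform measure. All function names are my own.

```python
from typing import Tuple

RGB = Tuple[int, int, int]


def srgb_to_linear(channel: int) -> float:
    """Undo the sRGB transfer curve for one 8-bit channel.

    The 0.03928 threshold is the value published in WCAG 2.0.
    """
    c = channel / 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb: RGB) -> float:
    """WCAG relative luminance: 0.0 for black, 1.0 for white."""
    r, g, b = (srgb_to_linear(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(fg: RGB, bg: RGB) -> float:
    """WCAG contrast ratio, from 1:1 (identical) up to 21:1 (black on white)."""
    l1, l2 = relative_luminance(fg), relative_luminance(bg)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)


def perceptual_lightness(rgb: RGB) -> float:
    """CIELAB L* (0 to 100), which tracks perceived lightness roughly linearly."""
    y = relative_luminance(rgb)
    if y > 216 / 24389:
        return 116 * y ** (1 / 3) - 16
    return (24389 / 27) * y


def passes_aa_normal_text(fg: RGB, bg: RGB) -> bool:
    """WCAG AA asks for at least 4.5:1 for normal-size text."""
    return contrast_ratio(fg, bg) >= 4.5


black, white = (0, 0, 0), (255, 255, 255)
print(round(contrast_ratio(black, white), 2))  # 21.0
print(round(perceptual_lightness(white), 2))   # 100.0
print(passes_aa_normal_text((118, 118, 118), white))
```

Equal steps in raw RGB values do not look like equal steps to the eye, which is why a tool that works in a perceptually uniform space makes it easier to predict how a lightness change affects contrast.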

I love the approach of building a tool to iterate faster, instead of just manually hand-picking colors. I now wish Stripe had opened up this tool for others to use. I do see how having this in-house could give a competitive advantage in iterating faster on designs. Still, one can hope.

Read the full article

Zalando is moving off their monolith with the parallel run migration pattern 

E-commerce company Zalando has outgrown their now legacy monolith application. As they were building new functionality for customer returns, they wanted to extract all business logic related to customer returns from this monolith, and move it to a standalone service.

The idea behind the migration is this:

“Wouldn't it be nice if we could verify that each request handled by the new system would be handled in exactly the same way as for the system currently running in production? The parallel run pattern does exactly that.”

The article describes the pattern in more depth. I like how it details the way they monitored and reported on the consistency results, and how it covers not only the rollout, but also how they closed the process with a cleanup. The cleanup step is something I’ve observed most teams don’t plan for, which is often why leftovers linger across the codebase.
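Here is a minimal Python sketch of the parallel run idea, under my own assumptions rather than as Zalando's implementation. Every request goes to both the legacy and the new handler, the caller always gets the legacy result, and mismatches are counted and logged for later review. The handler names, the request fields and the stats keys are all hypothetical.

```python
import logging
from collections import Counter
from typing import Any, Callable, Dict

logger = logging.getLogger("parallel_run")

Request = Dict[str, Any]
Handler = Callable[[Request], Any]


def parallel_run(legacy: Handler, candidate: Handler,
                 request: Request, stats: Counter) -> Any:
    """Serve from the legacy handler and compare the candidate on the side."""
    expected = legacy(request)  # the legacy system stays the source of truth
    stats["total"] += 1
    try:
        actual = candidate(request)
    except Exception:
        # A failing candidate must never affect the production response.
        stats["errors"] += 1
        logger.exception("candidate failed for request %r", request)
        return expected
    if actual != expected:
        stats["mismatches"] += 1
        logger.warning("mismatch for %r: legacy=%r candidate=%r",
                       request, expected, actual)
    return expected


# Hypothetical handlers: the new one forgets to subtract a restocking fee.
def legacy_refund(req: Request) -> int:
    return req["price_cents"] * req["quantity"] - req.get("restocking_fee_cents", 0)


def new_refund(req: Request) -> int:
    return req["price_cents"] * req["quantity"]


stats: Counter = Counter()
requests = [
    {"price_cents": 1000, "quantity": 2},
    {"price_cents": 500, "quantity": 1, "restocking_fee_cents": 100},
]
results = [parallel_run(legacy_refund, new_refund, r, stats) for r in requests]
print(results)      # [2000, 400]
print(dict(stats))  # {'total': 2, 'mismatches': 1}
```

This sketch only works as-is for handlers without side effects. If the new service writes to a database or calls other systems, the parallel call needs to be isolated so the same action is not performed twice.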

Read the full article

The evolution of developer environments at Lyft from EC2 to Kubernetes

The Lyft backend started off as a PHP monolith. In 2015, with around 100 engineers, the first standalone microservices started to emerge.

In the early days, the platform team would manually provision an EC2 instance for developers. But this was tedious.

As the next step, Lyft built the Devbox development environment. Devbox managed a virtual machine for the developer, and set up the developer environment with databases, Envoy proxy sidecars and all the other bells and whistles. A few minutes after starting it, the Devbox was ready to work on.

The biggest problem with Devbox was that these environments were not long-lived and could not be shared with other engineers or designers. As the next step, Onebox was born; basically, a Devbox living on an extra-large EC2 instance with 16 vCPUs and 122GB of memory.

By 2020, several downsides of the Devbox / Onebox approach had started to appear: scalability, maintenance, ownership and bloated tests were all issues. As Lyft moved development environments over to Kubernetes at this time, they revamped their approach to local development, which is where this article ends.

I enjoyed this article as it shows how much effort Lyft has invested in the backend developer experience, from the early days. It’s a good reminder that if you’re at 100 engineers in your organization and don’t have a dedicated team looking at developer environments, you’re likely behind where Lyft was in engineering efficiency in 2015.

Read the full article

Pinterest increased recommendation relevance by producing query embeddings

Embeddings (representations of discrete entities as vectors of numbers) are important signals to classify, retrieve and rank relevant content. Pinterest built SearchSage, a search query embedding model, to increase the relevance of recommendations and engagement across 15 search products.

Before SearchSage, the Pinterest team would identify the cover Pins, fetch their embeddings, cluster them and issue approximate nearest neighbor (ANN) queries against a hierarchical navigable small world (HNSW) index. If, like me, you didn’t grasp much of this on the first read-through, the article visualizes it nicely. This approach handled query polysemy, but the system was not learned end-to-end.
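If nearest neighbor search over embeddings is new to you, this tiny Python sketch may help. It is a deliberate simplification and not Pinterest's system: it does an exact, brute-force search with cosine similarity, where a production setup would use an approximate index such as HNSW to avoid scanning every vector. The three-dimensional vectors and item names are made up for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def nearest_neighbors(query: Sequence[float],
                      index: Dict[str, Sequence[float]],
                      k: int) -> List[Tuple[str, float]]:
    """Exact top-k search: score every item, then keep the k most similar."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up item embeddings; real ones have hundreds of dimensions.
index = {
    "running shoes": [0.9, 0.1, 0.0],
    "trail sneakers": [0.8, 0.2, 0.1],
    "chocolate cake": [0.0, 0.1, 0.9],
}
query_embedding = [1.0, 0.0, 0.0]  # pretend this came from a query model

top = nearest_neighbors(query_embedding, index, k=2)
print([name for name, _ in top])  # ['running shoes', 'trail sneakers']
```

Brute force costs one similarity computation per indexed item, which is why large catalogs trade a little accuracy for speed with approximate indexes.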

The rest of the article goes in-depth on how Pinterest used a two-tower model to build a new approach, how they trained their model with pairs, and how they evaluated the results. They go through the model, and how they served the model with a framework built on top of TensorFlow Serving.

This approach increased both product-only search relevance and overall relevance for shopping-related queries.

I am no expert in neural modeling, but if you, like me, want to get a sense of where this domain is at, I recommend not just reading the article, but also taking the time to research and understand the concepts in it.

Read the full article

Grab builds better products the same way McDonald’s sells more milkshakes

Grab decided to use the Jobs to Be Done (JTBD) framework. This framework was popularized by Harvard Business School professor Clayton Christensen in his book Competing Against Luck. In the 1990s, McDonald’s used this framework to figure out how to sell more milkshakes, and it worked.

Grab conducted a study similar to McDonald’s, in order to figure out which features to build for GrabFood. After the exercise, they found that the biggest difference they could make for both customers and vendors was to add bundles functionality, which they did.

I like how GrabFood looked to an example from a familiar but separate domain, the food industry, experimented with an approach that worked 30 years ago for McDonald’s, and made it work for them. It’s a good reminder that technology might change rapidly, but proven methods to innovate can work just as well today as they did back in the day.

Read the full article

A recommendation for my editor. Writing is an undervalued engineering and leadership skill. Several people have asked me what has helped me become a better writer. These have been, in order:

  1. Write regularly

  2. Hire an editor to give you unfiltered feedback and suggestions

  3. Ask for feedback on your writing from peers

  4. Read books on writing better and use tools that help with your writing

What I wish I had done earlier to grow my writing skills is hire an editor. My editor is former journalist Dominic Gover, who not only helps make each newsletter issue more pleasant to read, but also helps me become a more efficient writer with every edit. You can contact him through his website or on Twitter. Here’s an example of the difference his editing makes.

Read something that would be relevant for this column? Share it with me.