What is ML Engineering?
A broad overview of the field, how it compares to software engineering, its relationship to AI, and a deep dive into how an ML-powered app works.
Heads up: I’m on my spring break, as per my holiday schedule. This means that this week and next week I’m publishing issues only on Tuesdays, and there is no Thursday The Scoop issue. I’ll be back, reflecting on recent events with renewed energy, after the break!
Q: “As a software engineer, I’m interested in machine learning (ML). Could you give an overview of this field, and some basics worth knowing about?”
Machine learning is a hot topic and the popularity of this field is only growing, especially with the recent focus on large language models (LLMs) and the huge buzz about AI. It was just last week that we covered The productivity impact of AI coding tools.
For an overview of what machine learning is, I turned to Vicki Boykis, a longtime machine learning engineer, who’s been in the machine learning/data space for over a decade. She is currently a Senior Machine Learning Engineer at Duo Security, and was previously at Automattic (Tumblr, WordPress) and has worked as an ML consultant. Vicki writes a tech newsletter, a blog, and has organized a non-traditional and very interesting tech conference called Normcore Tech Conference, which took place last December.
In this issue, Vicki covers:
Her background
What is machine learning?
A brief history of machine learning
How do ML projects work?
Machine learning and “Artificial Intelligence” (AI): how do they relate?
A deep dive into how an ML-powered app works
What does the ML landscape look like, today?
With that, it’s over to Vicki:
1. My background
Today, I work on building out machine learning infrastructure at Duo Security. My previous job was at Automattic, the parent company of WordPress.com, where I worked on search and recommendations problems both on WordPress.com and on Tumblr. Tumblr is the social media platform Automattic acquired back in 2019.
Industrial machine learning is a fairly young and evolving field, and like many in the industry, I didn’t start out doing ML.
I graduated with a B.S. in economics and a few years later completed an MBA. My first jobs out of college were heavy on data analysis; I started out as an economic consultant, implementing econometric models for international trade with statistical packages including SAS (data management and analytics software), R (the programming language), and Eviews (a statistical data package for Windows). I did all my data analysis work locally, using Microsoft Access, with lots of Excel spreadsheet manipulation for client deliverables. While I learned the fundamentals of what later became known as data science, I was still very far from doing engineering.
My next job as a data analyst at a major American telecom was all of a sudden in an environment where I had access to a lot of data. We had so much data I couldn’t work with it locally anymore; everything was in Oracle and I needed to learn to write clean, optimized SQL queries which didn’t hang or compete for resources with those written by fellow analysts. I learned the ins and outs of SQL, how to query and analyze data, and the pain of dealing with null and missing values.
As a data analyst, I still worked mostly in SQL, but I was now part of an engineering team, which meant learning the cadence of engineering releases, sprints, and being exposed to production best practices. Sometime into my tenure, the organization decided to go all-in on what was then the latest “hotness,” a technology with the funny name, “Hadoop.” It allowed us to process volumes of streaming log data that our relational database infrastructure couldn’t handle.
When I wanted to learn more than just the outputs of the system presented to me as piles of text files in HDFS, I moved upstream from writing Apache Pig scripts into the guts of Hadoop itself: HDFS and Storm topologies. I started working less with SQL and more with shell scripting and Python. This was at the dawn of the “Big Data” era, and working in it meant learning concepts like lambda architectures, and writing convoluted MapReduce logic on the Java Virtual Machine (JVM).
Before we continue, here’s an explainer of some terms mentioned above:
Hadoop: a framework to process large, distributed data sets across a network of computers. It consists of a storage part – known as Hadoop Distributed File System (HDFS) – and a processing part using the MapReduce programming model. Simply put, Hadoop splits files into large blocks, distributes them across nodes in a cluster, then transfers code packaged as Java archives (JAR archives) to process large amounts of data in parallel.
Apache Pig: a platform for analyzing large data sets. It was built to make it easier to work with Hadoop’s MapReduce framework, allowing developers to focus more on analyzing bulk data sets and spend less time writing Map-Reduce programs.
Apache Storm: a distributed stream processing computation framework. A Storm application is designed as a “topology,” similar to a graph. The edges of the graph are called streams, and they direct data from one node to another, describing a real-time data transformation pipeline.
Lambda architecture: a way to process massive quantities of data (“Big Data”) by taking advantage of both batch and stream-processing methods. This approach balances latency, throughput and fault tolerance.
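To make the MapReduce model concrete, here is a toy word count, the classic MapReduce example, in plain Python. This is a sketch of the programming model only (real Hadoop jobs run the map and reduce phases on many machines, with the framework handling the shuffle between them); the function names are mine, not Hadoop’s.

```python
from collections import defaultdict

def map_phase(lines):
    # "Map": emit a (key, value) pair — here (word, 1) — for every
    # word in every input line.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # "Shuffle": group values by key, as the framework does between
    # the map and reduce phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped.items()

def reduce_phase(grouped):
    # "Reduce": aggregate the grouped values — here, sum the counts.
    return {key: sum(values) for key, values in grouped}

lines = ["big data big pipelines", "big clusters"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
# counts == {"big": 3, "data": 1, "pipelines": 1, "clusters": 1}
```

In a real cluster, each phase runs in parallel across nodes, which is what made this model useful for log volumes a single relational database couldn’t handle.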
After being exposed to all these toolsets, I moved from pure data analysis into the realm of software engineering. Having previously worked reactively with data presented to me by upstream systems over which I had no control, I was finally empowered to write my own code to get my own data.
Being in control of a system end-to-end and writing imperative code, versus declarative-only SQL, appealed to me. At my next job as a data scientist at a SaaS product in financial services, I used the engineering skills I’d learned to build a warehouse for data modeling, and also in parallel, went deeper on ML theory, working with Monte Carlo simulations to model investments and building decision tree models to predict customer churn.
I then went into consulting, which exposed me to a wide variety of production-grade ML systems at Fortune 500 companies. Just as important, I learned what it means to have a product that works end-to-end, starting with defining customer needs. One major project was working on recommendations for a streaming platform similar to Netflix, which tracked user activity and showed relevant, personalized recommendations based on previous activity. At that point, I became very interested in both end-to-end ML systems and systems where the main machine learning task was information retrieval, i.e. trying to filter a lot of information to serve only relevant items to end-users.
With several data jobs under my belt, I realized three critical patterns of ML system design, to which I return over and again in the course of my work:
It’s impossible to build an ML system if you don’t have clean data.
It’s also not possible to build an ML system if you don’t have the infrastructure and organizational support to do actual ML.
A machine learning model that doesn’t go into production is not a valuable model, but you might not reach production on the first try. Good ML engineering practices rely very heavily on having time and space in a product release cycle to experiment and test assumptions.
2. What is Machine Learning?
Over the course of my career, I’ve worked at a range of companies, from tiny startups to corporations with over 100,000 employees. At these places I tackled a great variety of projects, including:
Building recommendation systems for millions of users
ML platforms in security
Data transformations in healthcare
Textual predictive modeling in patent law
Kubernetes in quick-service restaurants
If you’re reading these job descriptions and thinking: “that’s just data analysis!” “that’s just data science!” or “that’s just engineering,” then you’re right; they do all fall under the umbrella of MLE (machine learning engineering) work.
Machine learning engineering is a rapidly growing field encompassing backend software engineering, machine learning algorithms, and analytics and statistics. It combines traditional software engineering practices with domain-specific skills: those required to develop, deploy, and maintain ML models on ML platforms. That’s the general shape of the field, but the job looks very different depending on which organization you’re in, the business area of expertise, and the maturity of the tech stack.
What do we mean by ML platforms? They’re a collection of software components that create a systematic, automated way to take raw data, transform it, learn a model from it, and show results which support decision-making for internal or external customers.
For example, Netflix’s recommendations are part of an ML platform. When you call an Uber or Lyft, or book an Airbnb, the chances are you’re being served software components developed by those companies’ internal ML organizations. These are very visible, consumer-facing uses of machine learning, but ML is also at work in less visible places, such as fraud detection at Stripe and at banks, and in content moderation on social media platforms.
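The data flow an ML platform automates — raw data in, transformation, model training, predictions out to support decisions — can be sketched end to end in a few lines. Everything here is invented for illustration (the event fields, the engagement feature, the least-squares model); a real platform would run each stage as its own scheduled, monitored service.

```python
# Ingest: raw records, as they might arrive in event logs.
raw_events = [
    {"user": "a", "minutes_watched": 10, "liked": 0},
    {"user": "b", "minutes_watched": 60, "liked": 1},
    {"user": "c", "minutes_watched": 45, "liked": 1},
    {"user": "d", "minutes_watched": 5,  "liked": 0},
]

# Transform: extract a numeric feature and a label per record.
xs = [e["minutes_watched"] for e in raw_events]
ys = [e["liked"] for e in raw_events]

# Learn: fit y = a*x + b by ordinary least squares (closed form).
n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
     / sum((x - mean_x) ** 2 for x in xs))
b = mean_y - a * mean_x

# Serve: score new users to support a recommendation decision.
def likely_to_like(minutes, threshold=0.5):
    return a * minutes + b >= threshold

likely_to_like(50)  # True: high engagement
likely_to_like(3)   # False: low engagement
```

The model itself is deliberately trivial; the point is the pipeline shape, which stays the same whether the “learn” step is a linear fit or a deep neural network.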
MLE requires a range of cross-disciplinary skills that encompass the full spectrum, from prototyping models to delivering reports. However, holistically, MLE is a job which sits firmly in the software engineering family.
3. A brief history of machine learning
Just as my career has moved from analytics to deploying complex distributed systems, the ML industry has grown from solving problems at the scale of generating reports, to being put into production at scale in the largest companies in tech today.
The early history of computing – and of machine learning as a subset – was built on mainframes: enormous computers that processed and kept track of the data they generated, in isolation from one another. In the 1970s, university researchers started configuring these standalone machines to talk to one another over ARPANET, which can be seen as the predecessor of the Internet. Here’s what researchers at the University of Utah wrote in 1970, after connecting the university’s computers to ARPANET:
"We have found that, in the process of connecting machines and operating systems together, a great deal of rapport has been established between personnel at the various network node sites. The resulting mixture of ideas, discussions, disagreements and resolutions, has been highly refreshing and beneficial to all involved, and we regard human interaction as a valuable by-product of the main effect."
When compute moved from machines to the network, the creators of these systems started keeping track of data movement and logging it, to get a consolidated view of data flows across their systems. Companies were already retaining analytical data needed to run critical business operations in relational databases, but access to that data was structured and processed in batch increments on a daily or weekly basis. This new logfile data moved quickly, and with a level of variety absent from traditional databases.
Capturing log data at scale kicked off the Big Data era, characterized by great variety, velocity, and volume of data movement. The Apache web server’s logfile was one of the greatest enablers of Big Data. The rise in data volumes coincided with data storage becoming much cheaper, enabling companies to store everything they collected on racks of commodity hardware.
When the data science boom hit in the early 2010s, the initial focus was on collecting as much data as possible, understanding a company’s datasets and creating models from them.
Once companies realized they sat on potential goldmines of unstructured data – none of which was actionable – they started hiring data scientists to make sense of it all. The practice of modern data science arose from statisticians who observed that the amount of data being generated and processed required methods beyond the scope of academic statistics, and beyond what could be processed on single machines.
And so data scientists specialized in building models and making sense of piles of data. However, when it came to creating model artifacts that could run continuously and at scale, data scientists relied on production engineers to rewrite models in production-facing languages. At this stage, there was a sharp split in roles:
Data engineers who built systems to process and manage data at scale. Read more about what data engineering is in this article.
Data scientists who created the models.
This split became known as the ‘A/B divide in data science’: ‘A’ for analysts, the people who analyzed the data and built models, and ‘B’ for builders, those who built systems around those models. As companies began to understand and expand their use of ML, the split between A and B data scientists became more refined, and the data engineer and data scientist roles subdivided into several new, even more specialized roles:
Data engineer: builds the pipelines that move data, enforces schemas, and handles streaming data
Data analyst/data scientist: analyzes data for organizational needs, builds dashboards, and presents analyses
Analytics engineer: sits somewhere between data engineering and analysis; focuses on managing data movement between analytics tools and relational databases
Researcher: builds conceptual models and tests them; responsible for research and scientific discovery
Machine learning engineer: focuses on building models and productionizing them, regardless of model type: anything from simple linear regression to ChatGPT
At the same time, model artifacts and deliverables moved from one-off analyses living on individuals’ laptops, to production software such as the services powering Amazon’s and Netflix’s recommendations, risk scoring for fraud, and medical diagnostic tools. This software required models to be portable, low-latency, and managed in a central place, which meant building systems and platforms on which to manage them. The term “MLOps” arose to define the boundaries of model management and operationalization.
It became clear that in order to successfully traverse this landscape, a data scientist needed to be a “person who is better at statistics than any software engineer and better at software engineering than any statistician.”
To solve the complex problems of coordination, analysis, and engineering that machine learning in production involves, many companies and teams worked on in-house platform approaches which were open-sourced as top-level Apache projects, or turned into tools built by vendors.
Examples in the ML ecosystem are:
Apache Spark: a distributed batch computation platform for processing large quantities of data, still widely used today
Apache Airflow: a scheduler for data engineering pipelines and data orchestration
Apache Lucene: a search engine library. Elasticsearch is one of the most popular distributed, horizontally scalable search frameworks built on top of Apache Lucene.
Today, the ML landscape is enormously wide and deep in terms of tooling and approaches, and it’s the MLE’s job to make sense of them all in order to deploy production-grade models.
4. How do ML projects work?
Machine learning projects differ from general software development in their workflows. In a typical SaaS product, the user-facing app is made up of product features written in code. Product features are a broad concept, but are generally things that change the functionality of an app, like the “home timeline” or “feed of recommendations,” or the “ability to post images”.
When we think of a typical software development lifecycle, it’s roughly this: