What is Data Engineering?
A broad overview of the data engineering field by former Facebook data engineer Benjamin Rogojan.
Q: I’m hearing more about data engineering. As a software engineer, why is it important, what’s worth knowing about this field, and could it be worth transitioning into this area?
This is an important question, as data engineering is a field that is, without doubt, on fire. In November of last year, I wrote about what seemed to be a data engineer shortage in the issue More follow-up on the tech hiring market:
“Data usage is exploding, and companies need to make more use of their large datasets than ever. Talking with hiring managers, the past 18 months have been a turning point for many organizations, where they are doubling down on their ability to extract real-time insights from their large data sets. (...)
What makes hiring for data engineers challenging is the many languages, technologies and different types of data work different organizations have.”
To answer this question, I pulled in Benjamin Rogojan, who goes by Seattle Data Guy on his popular data engineering blog and YouTube channel.
Ben has been living and breathing data engineering for more than 7 years. He worked for 3 years at Facebook as a Data Engineer and went independent following his work there. He now works with both large and small companies to build out data warehousing, develop and implement models, and take on just about any data pipeline challenge.
Ben also writes the SeattleDataGuy newsletter on Substack, a publication about end-to-end data flows, Data Engineering, MLOps, and Data Science. Subscribe here.
In this issue, Ben covers:
What do data engineers do?
Data engineering terms.
Why data engineering is becoming more important.
Data engineering tools: an overview.
Where is data engineering headed?
Getting into data engineering as a software engineer.
Non-full subscribers can read Part 1 of this article without a paywall here.
With that, over to Ben:
For close to a decade, I have worked in the data world. Like many, in 2012 I was exposed to HBR’s Data Scientist: The Sexiest Job of the 21st Century. But also like many, I found data science wasn’t the exact field for me. Instead, after working with a few data scientists for a while, I quickly realized I enjoyed building data infrastructure far more than creating Jupyter Notebooks.
Initially, I didn’t really know what this role was that I had stumbled into. I called myself an automation engineer, a BI Engineer, and other titles I have long forgotten. Even when I was looking for jobs online I would just search for a mix of “SQL”, “Automation” and “Big Data,” instead of a specific job title.
Eventually, I found a role called “data engineer” and it stuck. Recently, the role itself has been gaining a little more traction, to the point where data engineering is growing more rapidly than data science roles. Also, companies like Airbnb have started initiatives to hire more data engineers to increase their data quality.
But what is a data engineer and what do data engineers do for a company? In this article, we dive into data engineering, some of its key concepts and the role it plays within companies.

1. What do data engineers do?
How do you define data engineering? Here’s how data engineer Joe Reis specifies this term in his recently released book, Fundamentals of Data Engineering:
"Data engineering is the development, implementation, and maintenance of systems and processes that take in raw data and produce high-quality, consistent information that supports downstream use cases, such as analysis and machine learning.
Data engineering is the intersection of security, data management, DataOps, data architecture, orchestration, and software engineering. A data engineer manages the data engineering lifecycle, beginning with getting data from source systems and ending with serving data for use cases, such as analysis or machine learning."
In short, data engineers play an important role in creating core data infrastructure that allows analysts and end-users to interact with data which is often locked up in operational systems.
For example, at Facebook there is often either a data engineer or a data engineering team which supports a feature or business domain. Teams that support features and products are focused on helping define which information should be tracked, and then they translate that data into easy-to-understand core data sets.
A core data set represents the most granular breakdown of the transactions and entities you are tracking from the application side. From there, some teams have different levels of denormalization they might want to implement. For example, they might denormalize by removing any form of nested columns, to save analysts from having to do so themselves.
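To make the idea concrete, here is a minimal sketch of denormalizing nested records before publishing a data set. The record shape and field names are hypothetical, not from any specific pipeline:

```python
# A minimal sketch: flatten nested records into single-level rows,
# so analysts don't have to unnest columns themselves.
# Field names here are illustrative.

def flatten(record, prefix=""):
    """Recursively flatten nested dicts into one flat row."""
    row = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Nested column: pull its fields up with a prefixed name.
            row.update(flatten(value, prefix=f"{name}_"))
        else:
            row[name] = value
    return row

raw = {"order_id": 42, "customer": {"id": 7, "region": "EMEA"}}
print(flatten(raw))
# {'order_id': 42, 'customer_id': 7, 'customer_region': 'EMEA'}
```

In a real warehouse this step usually happens in SQL or a transformation framework, but the principle is the same: the nesting is resolved once, centrally, rather than by every analyst query.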
Also, many teams will set standards on naming conventions in order to allow anyone who views the data set to quickly understand what data type a field is. The basic example I always use is the “is_” or “has_” prefix denoting a boolean. The purpose of these changes is to treat data as a product; one that data analysts and data scientists can then build their models and research from.
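A convention like this is easy to check automatically before a data set is published. The sketch below is hypothetical, assuming a schema represented as a simple column-to-type mapping:

```python
# Hypothetical check: enforce the "is_"/"has_" prefix convention
# for boolean columns before a data set ships.

BOOLEAN_PREFIXES = ("is_", "has_")

def violating_columns(schema):
    """Return boolean columns whose names lack an is_/has_ prefix."""
    return [
        name
        for name, dtype in schema.items()
        if dtype == "boolean" and not name.startswith(BOOLEAN_PREFIXES)
    ]

schema = {"is_active": "boolean", "deleted": "boolean", "signup_date": "date"}
print(violating_columns(schema))  # ['deleted']
```

Checks like this are often wired into CI or a data-quality tool, so the convention is enforced rather than merely documented.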
Our team at Facebook produced several core data sets that represented recruiting and People data. These core data sets allowed analysts and data scientists to see the key entities, relationships and actions occurring inside those business domains. We followed a core set of principles which made it clear how data should be integrated, even though it was being pulled from multiple data sources, including internally developed products and SaaS solutions.
What are the goals of a core data set? Here are the three main ones.
1. Easy to work with. Datasets should be easy to work with for analysts, data scientists and product managers. This means creating data sets that can be approached without a high level of technical expertise to extract value from said data. In addition, these data sets standardize data so that users don’t have to constantly implement the same logic over and over again.
2. Provide historical perspective. Many applications store only the current state of entities. For example, they store where a customer lives or what title an employee has, but not the history of changes to those values. In turn, data engineers must create data sets that capture this history.
The traditional way to track historical changes in data was to use what we call Slowly Changing Dimensions (SCD). There are several different types of SCD, but one of the simplest to implement is SCD type 2 which has a start and end date, as well as an “is_current” flag.
An example of an SCD is a customer changing their address when they move home. Instead of just updating the current row which stores the address for said customer, you will:
Insert a new row with the new information.
Update the old row, so it is no longer marked current.
Ensure the end date represents the last date when the information was accurate.
This way, when someone asks, “how many customers did we have per region over the last 3 years,” you can answer accurately.
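The three steps above can be sketched in a minimal, in-memory form. This is an illustration of the SCD type 2 pattern, not a production implementation; table and field names are illustrative:

```python
from datetime import date

# A minimal in-memory sketch of an SCD type 2 update: close out the
# current row, then insert a new current row. In practice this would
# be a warehouse MERGE/UPDATE+INSERT; names here are illustrative.

def apply_scd2_change(history, customer_id, new_address, change_date):
    """Record an address change while preserving history."""
    for row in history:
        if row["customer_id"] == customer_id and row["is_current"]:
            row["is_current"] = False          # old row no longer current
            row["end_date"] = change_date      # last date info was accurate
    history.append({
        "customer_id": customer_id,
        "address": new_address,
        "start_date": change_date,
        "end_date": None,                      # open-ended current row
        "is_current": True,
    })

history = [{"customer_id": 1, "address": "Seattle",
            "start_date": date(2020, 1, 1), "end_date": None,
            "is_current": True}]
apply_scd2_change(history, 1, "Denver", date(2022, 6, 1))
```

After the change, the table holds both rows: the Seattle row bounded by its start and end dates, and a current Denver row. A query for “customers per region as of 2021” can filter on the date range and get the historically correct answer.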
3. Integrated. Data in companies comes from multiple sources. Often, in order to get value from said data, analysts and data scientists need to mesh all the data together, somehow. Data engineers help by adding IDs and methods for end-users to integrate data.
At Facebook, most data had consistent IDs. This made it feel like we were being spoiled, as consistent IDs made it very easy to work with data across different sets.
When data is easy to integrate across entities and source systems, it allows analysts and data scientists to easily ask questions across multiple domains, without having to create complex – and likely difficult to maintain – logic to match data sets. Maintaining custom and complex logic to match data sets is expensive in terms of time, and its accuracy is often dubious. Rarely have I seen anyone create a clean match across data that’s poorly integrated.
One great example I heard recently was from Chad Sanderson of Convoy. Chad explained how a data scientist had to create a system to mesh email and outcome data together; it was costly and relied on fuzzy logic which probably wasn’t as accurate as it could have been.
At Facebook, even systems like Salesforce, Workday and our custom internal tools all shared these consistent IDs. Some used Salesforce as the main provider and others used internal reporting IDs. But it was always clear which ID was acting as the unique ID to integrate across tables.
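Here is a hypothetical sketch of why consistent IDs matter. With a shared key, integrating two source systems is a simple lookup instead of fuzzy matching; the systems and fields below are invented for illustration:

```python
# Hypothetical example: two source systems sharing an employee_id.
# With a consistent ID, joining them is a dictionary lookup rather
# than brittle fuzzy matching on names or emails.

recruiting = [{"employee_id": 101, "source": "referral"}]
hr = [{"employee_id": 101, "title": "Data Engineer"}]

hr_by_id = {row["employee_id"]: row for row in hr}
joined = [
    {**r, **hr_by_id[r["employee_id"]]}   # merge the two records
    for r in recruiting
    if r["employee_id"] in hr_by_id
]
print(joined)
# [{'employee_id': 101, 'source': 'referral', 'title': 'Data Engineer'}]
```

Without the shared ID, this join would need name- or email-based matching logic of the kind described above: expensive to maintain and rarely fully accurate.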
But how can data engineers create core data sets which are easy to use?
Now we have discussed the goals, let’s outline some of the terms you’ll hear data engineers use as they make data more approachable.
2. Data engineering terms
Let’s explain some commonly used data engineering terms.