Migrations Done Well: Part 2
Executing migrations: preparing for them, pre-migration steps, migration strategies, after the migration and the long-tail.
👋 Hi, this is Gergely with a bonus, free issue of the Pragmatic Engineer Newsletter. In every issue, I cover challenges at big tech and high-growth startups through the lens of engineering managers and senior engineers.
This is Part 2 in the 3-part series of Migrations Done Well.
Part 1: Typical migrations (the previous article)
The stories of four different migrations
Types of migrations
Part 2: Executing a migration (this article)
After the migration
The migration’s long-tail
Part 3: The people and the business side of migrations (the upcoming article)
The people aspect of migrations
Selling migrations to the business
Closing advice for migrations
1. Preparing for migrations
Migrations are risky and when they go wrong, they can cause all kinds of significant damage. However, if you do some groundwork before starting the migration, you’ll reduce risk, gain confidence and understand the scope of the migration better. Here are a few things you should do:
Understand the reason for the migration. Why is this migration needed? What is wrong with the old solution? What is the impact of the old solution being unsuitable? What would happen if this migration did not complete?
What were the constraints in the past? When a migration is needed, it's because the current solution fails to meet a particular requirement. This could be availability, performance, or something else. Outline what these capability constraints are. You’ll need to make sure the new system solves for these past constraints.
Map the differences between the old and the new. Outline the old system vs the new system. What is changing and why, and how do these systems work? Are external interfaces changing at all? Are you adding or removing functionality for the new system?
Consumers and producers needing migration. Which customers of the old system need to be migrated to the new one? Map customers of the old system. Will all consumers move over, or will some remain on the old system?
Write and share a migration plan. Write up how the migration will happen. What are the migration steps? Who will do each step? For example, who will build the new service, who will migrate each of the consumers, and who will monitor the progress of the migration? Put this plan in writing and share it with consumers whom the migration will impact.
Capacity planning. What load is the current system under? What load will the new service need to handle? For load planning, don’t only look to the time of the migration, but look ahead into the future. How will you ensure the new system has the capacity to handle this load? How will you confirm it can handle the load?
Error budgets. For complex and large migrations you can expect some things to go wrong. You might also have migrations where you expect a small percentage of traffic to not work as expected. Instead of just accepting things going wrong, budget for it. What is the acceptable limit for errors during the migration? This could be defined as latency degradation, % of data that is not consistent for a certain time, percentage of application errors during the migration, and so on. By defining error budgets you’ll also have to measure them to keep yourself honest.
Edge cases. What are potential tricky situations you need to take into account? Write these up, then share them with the team working on the migration and all consumers impacted, so they can also add their edge cases. Common edge cases you should worry about include:
What happens with data producers during migration? For example, say you have processes writing to the database. What happens when the migration happens? Will these writes continue? Will you stop them? Will they be queued?
How will service consumers behave during migration? Will there be edge cases when clients behave in unexpected ways, or experience data consistency issues as they interact with a system undergoing migration?
Security audit of migrations. What are the attack vectors during and after the migration? Are there vulnerabilities a bad actor could take advantage of as the migration happens? Vulnerabilities might not be strictly related to the code or data; they could include phishing attacks.
This is exactly what happened to NFT marketplace OpenSea, in February of this year. The company announced a week-long migration to customers. A bad actor realized this migration was an opportunity for an attack and sent out emails that looked like they were from OpenSea, asking customers to take action for the migration to happen. Customers then were tricked into authorizing a smart contract they believed was from OpenSea, and dozens of unlucky customers lost over $1.7M in NFT assets.
Had the company predicted such phishing and sought to counter it, they could have shortened the migration window, communicated the phishing risks, or monitored suspicious smart contracts authorized during the migration period and warned users against executing them.
Run a migration pre-mortem. As engineering director Nathan Gould suggests:
1.1 Preparing for data migrations
Data migrations bring additional complexities of their own and tend to require more thorough planning. Some additional complexities you might need to consider are:
Too much data to migrate in one go. In the case of service replacements, this means that the new service might need to still use the old service to look up older data. In this scenario, smart approaches could be used. For example, when modifying a record, the new service might create a copy from the old system, and stop using the old system for that record.
How will data producers be migrated? What happens with data-producing processes, for example, processes writing to the database at the time of the migration? Will they be paused? The writes queued? The writes dropped?
Active-active challenges. When using an active-active data storage setup – distributing data across several clusters for resilience – you’ll have additional challenges when migrating. If migrating with downtime, you might be able to migrate all clusters during this downtime. However, if running a zero-downtime migration, consumers will see inconsistent data. How will they deal with this inconsistency?
Rollback. If the migration has issues, how will you revert to the previous state? When utilizing downtime to perform the migration, the rollback might be as simple as not switching over to the newly migrated data. However, with zero downtime migration, you might need tooling to write back data committed to the migrated database as you roll back to the old database.
Disaster recovery considerations. In the event of a major issue like data corruption or a ransomware attack, how easy is it to roll back to a pre-migration state of the system? Are there sufficient backups? Should you make a last snapshot save before the migration proceeds?
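The "too much data to migrate in one go" case above can be sketched as a copy-on-modify wrapper. This is a minimal illustration, assuming both systems can be modeled as in-memory dictionaries; `LazyMigratingStore` and its method names are hypothetical, not part of any real library.

```python
class LazyMigratingStore:
    """Reads fall back to the old store; a record moves to the new store when modified."""

    def __init__(self, old_store: dict, new_store: dict):
        self.old = old_store  # hypothetical stand-in for the old system
        self.new = new_store  # hypothetical stand-in for the new system

    def get(self, key):
        # Prefer the new store; fall back to the old one for unmigrated records.
        if key in self.new:
            return self.new[key]
        return self.old.get(key)

    def update(self, key, **changes):
        # Copy the record over on first modification, then only touch the new store.
        if key not in self.new:
            self.new[key] = dict(self.old.get(key, {}))
        self.new[key].update(changes)
        return self.new[key]


old_records = {
    "u1": {"name": "Ada", "plan": "free"},
    "u2": {"name": "Bob", "plan": "pro"},
}
lazy_store = LazyMigratingStore(old_records, {})
lazy_store.update("u1", plan="pro")  # u1 is copied and modified in the new store
```

After the update, `u1` lives in the new store while `u2` is still served from the old one. A real implementation would also need to decide when the remaining, never-modified records get moved.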
2. Pre-migration steps
You have a plan in place and are confident you have the edge cases covered. Can you start the migration? Almost certainly not yet! First, you’ll need to ensure you have monitoring in place so you can track the status of the migration and detect problems. You’ll also want to validate that the migration will work with shadowing, dry runs and other processes.
Monitoring the migration is the single most important action for making a migration successful and for detecting one that is going wrong. In my experience, the lack of dedicated migration monitoring is the most common reason migrations cause outages.
Define what needs to be monitored. How can you tell if the migration goes well? What do you need to measure to know that customers are not being impacted and the service is healthy? For data migrations, how can you measure that the data is not corrupted?
Build the monitoring systems. Put in place the graphs, tools and alerts that tell you if the migration is healthy or if it has issues.
Throwaway monitoring is to be expected! Many engineering teams are hesitant to build monitoring as they’re used to building graphs and alerts for permanent features. Break away from this thinking! You need to build monitoring and yes, it will be removed once the migration is done.
Do you have monitoring for all error budgets? Remember how you defined your error budgets? Make sure there’s monitoring for all of those metrics.
If it’s painful to build one-off monitoring, take note for later. At companies which have invested heavily in platform teams, building monitoring for migrations should be a breeze. If it’s not, then you should incentivize making it easier to do. Do this by contributing to tooling, or by pushing management to invest in tools, be it via internal platform teams or contracting with vendors.
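Checking that every error budget has a metric behind it can be automated. The sketch below assumes budgets are simple upper limits and that current metric values arrive as a dictionary; the `ErrorBudget` type, the metric names and the thresholds are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    name: str     # metric name (hypothetical)
    limit: float  # maximum acceptable value of the metric


def check_budgets(budgets, observed):
    """Return the names of budgets that are exceeded or have no metric at all."""
    breaches = []
    for budget in budgets:
        value = observed.get(budget.name)
        # A budget without a metric counts as a breach: nobody is watching it.
        if value is None or value > budget.limit:
            breaches.append(budget.name)
    return breaches


migration_budgets = [
    ErrorBudget("error_rate", 0.01),
    ErrorBudget("p99_latency_ms", 500),
    ErrorBudget("inconsistent_rows_ratio", 0.001),
]
observed_metrics = {"error_rate": 0.004, "p99_latency_ms": 620}
breaches = check_budgets(migration_budgets, observed_metrics)
```

Here the latency budget is exceeded and the consistency budget has no metric at all, so both are reported. Treating a missing metric as a breach is a design choice that forces monitoring to exist before the migration starts.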
Validate that the migration will work as expected. Once you’ve completed most of the development work for the migration, you’ll want to hold off rolling out in production. Instead, you should validate the migration with production data and traffic. Here are some common ways to do this.
Shadowing, also commonly referred to as ‘parallel run’ or ‘shadow loading.’ This is a common approach, both with service replacements and service integrations. This approach means sending all production traffic to the new system or new integration as well, and monitoring that it works as expected.
Shadowing has several benefits:
Migration issues are caught early.
It’s as close to pre-production testing as it gets.
It serves as load testing. If your shadowed system can handle production load, you can be confident that it won’t have issues when switching over.
There are a few caveats with shadowing:
Shadowing validation. You’ll likely need to build shadowing validation tooling to confirm the new system works as expected.
Mocking might be needed to avoid side effects. The new system might need to mock certain functionality. For example, if you are migrating an old system which also sends emails in certain scenarios, you’ll want to stop the new system from sending emails while in shadowing mode. Otherwise, emails would be sent twice!
Not always practical. There may be times when shadowing is not an approach that’s pragmatic. This is the case when you would have to mock much of the shadowed system’s capability. For example, if you are migrating the email layer in an application, shadowing email sending without sending emails might be meaningless.
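A shadowing setup with validation and a mocked side effect might look roughly like the following. It is a simplified sketch: `old_handler`, `new_handler` and `RecordingMailer` are hypothetical, and a real setup would typically run the shadow call asynchronously so it cannot slow down the primary path.

```python
class RecordingMailer:
    """Mock mailer: records emails instead of sending them."""

    def __init__(self):
        self.outbox = []

    def send(self, to, body):
        self.outbox.append((to, body))


def old_handler(request, mailer):
    total = sum(request["items"])
    mailer.send(request["user"], f"Total: {total}")
    return {"total": total}


def new_handler(request, mailer):
    # Hypothetical rewrite of old_handler that should behave identically.
    total = sum(request["items"])
    mailer.send(request["user"], f"Total: {total}")
    return {"total": total}


def handle_with_shadow(request, real_mailer, mismatches):
    """Serve from the old system; run the new one in the shadow and compare."""
    primary = old_handler(request, real_mailer)
    try:
        # The shadow gets a mock mailer so customers do not get emailed twice.
        shadow = new_handler(request, RecordingMailer())
        if shadow != primary:
            mismatches.append((request, primary, shadow))
    except Exception as exc:
        # A shadow failure must never affect the response served to the customer.
        mismatches.append((request, primary, repr(exc)))
    return primary


mismatches = []
production_mailer = RecordingMailer()  # stand-in for the real mailer in this sketch
response = handle_with_shadow(
    {"user": "ada@example.com", "items": [5, 7]}, production_mailer, mismatches
)
```

The `mismatches` list is the shadowing validation: in practice it would feed a dashboard or alert rather than an in-memory list.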
Load testing is a common approach to confirm that the new system has the capacity to handle future, increased loads. While shadowing can give a sense of current load-handling capability, the goal of load testing is to simulate extreme loads.
Common ways to do load testing are these:
Use mocked or generated data. Generate test data and execute a load test. The benefit of this approach is that you can tweak the load test characteristics easily. The downside is that the test data will not exercise edge cases which only real-world data has.
Use production data, but in bulk. A more common load testing approach is to sidestep test data and use real data. Collect production data for a while, then use this production data for load testing.
Combine shadowing with a bulk release of production data. We used this approach at Uber with a system called Hailstorm. It buffered all production requests coming in for a defined time and replayed them in a shorter time window. For example, we released 10 hours’ worth of production data in one hour.
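The arithmetic behind replaying buffered traffic in a shorter window is simple: keep the order of requests and divide the gaps between them by a speedup factor. The sketch below illustrates only that idea; it does not describe how Hailstorm works, and `compress_schedule` is a hypothetical helper.

```python
def compress_schedule(requests, speedup):
    """Map recorded (timestamp_seconds, payload) pairs to replay offsets in seconds."""
    if not requests:
        return []
    ordered = sorted(requests, key=lambda item: item[0])
    start = ordered[0][0]
    # Dividing each offset by the speedup squeezes the same traffic into less time.
    return [((ts - start) / speedup, payload) for ts, payload in ordered]


# Three requests recorded across 10 hours (36,000 seconds).
recorded = [(0, "a"), (18_000, "b"), (36_000, "c")]
schedule = compress_schedule(recorded, speedup=10)
```

With a speedup of 10, the request recorded at the 10-hour mark is replayed one hour (3,600 seconds) after the replay starts, so the target system sees roughly ten times the original request rate.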
Performance sampling is an approach worth doing for high-load systems, especially when you are changing technologies like frameworks or programming languages. It’s easiest to do when combined with shadowing. What are the performance characteristics of the new system, compared to the old? How has latency changed, and has resource usage, like CPU utilization and memory usage, decreased or increased?
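One way to compare latency between the two systems is to collect samples from both during shadowing and compare a high percentile. The sketch below uses the nearest-rank percentile definition and invented sample values; `compare_latency` is a hypothetical helper.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def compare_latency(old_ms, new_ms, pct=95):
    """Report the chosen percentile for both systems and the relative change."""
    old_p, new_p = percentile(old_ms, pct), percentile(new_ms, pct)
    change = round((new_p - old_p) / old_p * 100, 1)
    return {"old": old_p, "new": new_p, "change_pct": change}


# Invented latency samples in milliseconds.
old_latencies = [100, 110, 120, 130, 140, 150, 160, 170, 180, 200]
new_latencies = [90, 95, 100, 105, 110, 115, 120, 125, 130, 150]
report = compare_latency(old_latencies, new_latencies)
```

A negative `change_pct` means the new system is faster at that percentile. The same comparison can be applied to CPU or memory samples.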
Do a dry-run of the full migration. Why wait to do the migration in production, when things may go wrong that cause outages? If possible, do a dry-run migration and inspect whether things work as expected.
You can combine a dry-run migration with shadowing. Dry-runs are common to do with data migrations, where this might be the best way to confirm that data moves as expected.
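For a data migration dry run, a basic check is that row counts and content match between source and target. The following is a minimal sketch, assuming rows are JSON-serializable dictionaries and the migration is meant to move data unchanged; `fingerprint` and `dry_run` are hypothetical helpers.

```python
import hashlib
import json


def fingerprint(rows):
    """Row count plus an order-independent digest of the row contents."""
    digests = sorted(
        hashlib.sha256(json.dumps(row, sort_keys=True).encode()).hexdigest()
        for row in rows
    )
    combined = hashlib.sha256("".join(digests).encode()).hexdigest()
    return len(rows), combined


def dry_run(source_rows, migrate):
    """Migrate copies of the rows into a scratch target and compare fingerprints."""
    scratch_target = [migrate(dict(row)) for row in source_rows]
    return fingerprint(source_rows) == fingerprint(scratch_target)


source_rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "b@example.com"},
]
faithful_ok = dry_run(source_rows, lambda row: row)
lossy_ok = dry_run(source_rows, lambda row: {"id": row["id"]})  # drops a field
```

The lossy migration is caught because its digest differs even though the row count matches. If the migration intentionally transforms data, the comparison would need to run on a normalized form instead.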
Test for events that will only happen in the future. Your migrated system might have to deal with edge cases that only happen on certain future dates. For example, if migrating a billing system that sends out bills at the end of the month, you’ll want to test that this functionality works.
One option is to do shadowing for long enough so that these future events occur, and you can validate them in real time. However, you might not have the luxury of waiting, especially if these events happen once every few months, or once a year. In this case, you’ll have to simulate these events and validate that the migrated system handles them as expected. A good example is testing billing systems like this:
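Simulating a future date usually comes down to making the clock injectable. The sketch below is purely illustrative and not taken from any particular billing system; `BillingJob` and its `today_fn` parameter are hypothetical.

```python
import calendar
from datetime import date


def is_last_day_of_month(day: date) -> bool:
    return day.day == calendar.monthrange(day.year, day.month)[1]


class BillingJob:
    """Sends bills on the last day of the month, using an injectable clock."""

    def __init__(self, today_fn=date.today):
        self.today_fn = today_fn  # replaced with a fake clock in tests
        self.sent = []

    def run(self, customers):
        today = self.today_fn()
        if is_last_day_of_month(today):
            self.sent.extend((customer, today.isoformat()) for customer in customers)
        return len(self.sent)


# Simulate a leap-year month end without waiting for it to happen.
leap_day_job = BillingJob(today_fn=lambda: date(2024, 2, 29))
leap_day_job.run(["acme"])

day_before_job = BillingJob(today_fn=lambda: date(2024, 2, 28))
day_before_job.run(["acme"])
```

The first job bills the customer because 29 February 2024 is the last day of the month; the second sends nothing. The same fake clock can be pointed at year ends, daylight saving changes or other rare dates.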
3. The Migration
We’ve planned the migration, done pre-migration validation and perhaps even shadowed the new system, assuming it’s practical to do. Time to start the migration!
The most important decision in any migration should be whether or not to have downtime for consumers of the system being migrated. This decision will determine strategies you can use for the migration itself.
Zero-downtime migrations are when customers notice nothing of the migration. These are the ones that need the most preparation to achieve.
Doing zero downtime migrations is often a steep learning curve and involves more upfront work, when done the first time. However, at companies where zero downtime is the norm, the amount of additional work drops over time as people get better at it, and at building or utilizing tools which aid the process.
There will be many cases when zero downtime is too expensive to do, in terms of time spent building tooling. There will also be cases where it’s not possible, for example, with infrastructure migrations. However, if you start by considering what it takes to do zero downtime migrations, then you can make a more informed choice.
I would encourage teams to do at least one zero downtime migration, if they’ve never done one before. Teams that have never performed such a migration often overestimate their complexity, and underestimate the benefit of not having to work outside business hours when doing zero downtime migrations.
Migration with downtime affecting few to no customers is when you take the system down, but customers don’t notice anything happened. How is this possible?
For one, your business might be intentionally offline during certain hours. For example, if you have systems servicing the stock market which only operate during business hours, customers won’t notice migrations done outside these hours.
If your migration involves non time-sensitive functionality, you can also get away without customers noticing. In these cases you might be able to take the old system offline, queue incoming requests, do the migration, then replay those requests for the new system to process. For example, if you are migrating a system that sends out marketing emails, customers won’t notice – or care – that some marketing emails arrive an hour after they usually do, thanks to the migration.
Migration with reduced functionality is an approach where you don’t introduce downtime to a system, but some functionality will go offline. A common case is to turn a system to ‘read only’ mode during a data migration; all read operations still function, but new data cannot be written. That data will either be queued for later writing, or it might be discarded until the migration is complete.
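A read-only mode with both write policies can be sketched as below. It is a minimal in-memory illustration; `MigratableStore`, `ReadOnlyError` and the `queue_writes` flag are hypothetical names.

```python
from collections import deque


class ReadOnlyError(Exception):
    """Raised when a write is rejected while the store is in read-only mode."""


class MigratableStore:
    def __init__(self, queue_writes=True):
        self.data = {}
        self.read_only = False
        self.queue_writes = queue_writes  # False means writes are rejected instead
        self.pending = deque()

    def read(self, key):
        # Reads keep working throughout the migration.
        return self.data.get(key)

    def write(self, key, value):
        if not self.read_only:
            self.data[key] = value
        elif self.queue_writes:
            self.pending.append((key, value))  # applied after the migration
        else:
            raise ReadOnlyError("writes are disabled during the migration")

    def finish_migration(self):
        # Leave read-only mode and apply queued writes in arrival order.
        self.read_only = False
        while self.pending:
            key, value = self.pending.popleft()
            self.data[key] = value


settings_store = MigratableStore()
settings_store.write("theme", "light")
settings_store.read_only = True
settings_store.write("theme", "dark")            # queued, not yet visible
value_during_migration = settings_store.read("theme")
settings_store.finish_migration()
value_after_migration = settings_store.read("theme")
```

Queuing keeps the data but means reads are stale until the migration finishes; rejecting writes is simpler but pushes the problem to the caller.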
Migration with planned downtime is an approach where you take the current system fully offline, perform the migration, then bring the migrated system online. You are able to accurately estimate the downtime needed and can communicate this downtime, ahead of time. You typically choose outside peak hours to perform the migration, usually in the middle of the night for most customers, or at the weekend.
Migration with downtime has the benefit that you don’t need to worry about edge cases of consumers accessing a system that is not fully migrated.
The downside of this approach – of taking the system offline – is that it adds pressure; if something goes wrong with the migration, there’s not much time to fix it without exceeding the communicated downtime window.
Migrations utilizing regular downtime periods are common at more traditional businesses which have been doing migrations for years, or decades. Several banks fall into this category: they commonly allow for downtime over the weekend or during bank holidays.
Having regular downtime periods that can be used for migrations is convenient for engineering teams. They can use these often long timeframes and not have to worry about a migration going wrong as they have ample time to revert it, and to get it right in the next period.
I personally find regular downtime periods tempt engineering teams to take the easy route of not preparing well for migrations, and disincentivize zero-downtime approaches. On top of this, regular downtime periods incentivize working outside business hours like at weekends or late at night.
How will you perform the migration? Here are the most common migration strategies:
Switch over. With a flip of a switch, or more often a configuration change, you route all traffic to the new system. This is usually done after extensive shadowing.
There are migrations where a switch over is the only sensible strategy. Code migrations are one of them, and data migrations might be, too. Migrations that utilize downtime are almost always ones that use a switch over method once the migration is complete.
Staged rollout. This means rolling out the migration gradually to parts of the system, or to a certain group of consumers. As the rollout proceeds, the team monitors the system to make sure it works as expected, and pauses or reverses the rollout when they see issues.
Staged rollouts are popular when releasing new features, gradually rolling them out to all users. This means that many teams already have access to the tools – like feature flags – to use for staged rollouts.
Several types of migrations can easily be done as staged rollouts. However, more complex ones like data migrations or infrastructure migrations often require extra complexity to be added, in order to use a staged rollout approach. This is because in those migrations, both data migration and code changes often need to be tied together. A staged rollout with a data migration might mean writing more code to keep the migrated data and the code executed in sync; this new code is yet another source of bugs.
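A common way to implement a staged rollout is to hash each consumer into a stable bucket and route the buckets below the current rollout percentage to the new system. This is a minimal sketch of that idea, not the API of any feature flag product; `in_rollout` and `route` are hypothetical.

```python
import hashlib


def in_rollout(consumer_id: str, percent: int) -> bool:
    """Deterministically place a consumer in one of 100 buckets."""
    digest = hashlib.sha256(consumer_id.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    # Buckets below the threshold use the new system.
    return bucket < percent


def route(consumer_id: str, percent: int) -> str:
    return "new" if in_rollout(consumer_id, percent) else "old"


consumers = [f"consumer-{n}" for n in range(1000)]
on_new_at_10 = {c for c in consumers if in_rollout(c, 10)}
on_new_at_50 = {c for c in consumers if in_rollout(c, 50)}
```

Because the bucket depends only on the consumer ID, a consumer moved to the new system at 10% stays there at 50%, and lowering the percentage reverses the rollout for the most recently added buckets first.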
Writeback to the old system is a common approach with both service replacement migrations and some data migrations. In cases where the existing service has many internal consumers, the migration to the new system does not move these consumers over to the new system.
Instead, the new system writes data back to the old system, allowing for consumers to operate without changes. Now, the migration is complete in the sense that the new system operates as primary. However, there will be a long-tail migration effort to move all consumers of the old system to use the new system.
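The writeback approach can be sketched as the new system saving to its own store first, then mirroring each record to the old store in the old format. The record shapes, `to_legacy_format` and `NewSystem` below are all hypothetical.

```python
def to_legacy_format(record):
    """Convert a new-style record into the shape old consumers expect."""
    return {"name": f"{record['first_name']} {record['last_name']}"}


class NewSystem:
    def __init__(self, legacy_store: dict):
        self.store = {}                   # the new system is the primary
        self.legacy_store = legacy_store  # kept in sync for unmigrated consumers
        self.failed_writebacks = []

    def save(self, key, record):
        self.store[key] = record
        try:
            self.legacy_store[key] = to_legacy_format(record)
        except Exception as exc:
            # Record the failure for a later retry instead of failing the write.
            self.failed_writebacks.append((key, repr(exc)))


legacy_store = {}
new_system = NewSystem(legacy_store)
new_system.save("u1", {"first_name": "Ada", "last_name": "Lovelace"})
new_system.save("u2", {"first_name": "Bob"})  # missing field: writeback fails
```

Consumers that still read `legacy_store` see `u1` without any changes on their side. The failed writeback for `u2` shows why this approach needs its own monitoring: the two stores can silently drift apart.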
Let’s get to the migration! Here are approaches to consider using: