Seamless software migrations

Authors: @raghavendra-sc, Charugarg, Anuja Barve

Change is constant, and migration is the journey towards that change...!!

With the rapid pace of advancement and the constant emergence of new technologies, we are seeing a lot of existing technologies get deprecated within very short time frames.

It's not easy to switch…

At the same time, it is not easy to adopt a new technology and replace the existing one. There are many reasons why someone would hesitate to do so:

  1. There are many integrations with the existing technology, and it is not easy to move all of them, especially platform services/offerings.
  2. Differences between the contracts of the existing and new technologies.
  3. Lack of feature parity: some existing features may not be supported.
  4. A lot of data has been created with the existing contracts, and complex conversions are required to map it to the new contracts.

End up creating a clone

Most of the time, instead of touching the existing stack, we start writing a new stack from scratch with the new technology, and we end up creating a clone that serves the same purpose with a few additional benefits coming from the new technology. We then start building new integrations with the new stack. But there are downsides to doing this:

  1. Additional maintenance, resource, infra, and support costs to keep both the existing (old) and new stacks running.
  2. No support for new feature development on the old stack for existing customers.
  3. Customers are asked to move to the new stack, which requires changes to their integrations, and that is not easy.
  4. We may lose some old features that are not supported in the new stack.

Bridge and switch

While building the clone, we should also start building the components necessary to bridge the old and new stacks, so that the new stack can serve the old integrations seamlessly.

The components necessary to bridge the old integrations could be:

  1. Contract mappers (both forward and reverse) between the old and new stacks.
  2. Controllers with the same APIs, if you are building a service.
  3. Additional components required for feature parity with the old stack on the new system.
  4. Data migration scripts.

Validate the bridge

It is very important to validate that the components built to bridge the old and new stacks behave exactly as they did before. You should additionally build components for validation. These can be:

  1. Enabling live diff or dual read between the old and new stacks
  2. Data validation jobs between the old and new stacks

Functional Parity

One of the most important things to consider in a transparent migration from one stack to another is that the interactions supported by the new system should be exactly the same as those of the old system. The following steps help achieve this final state.

100% Functional parity

The new system should maintain 100% functional parity with the old system's behavior until customers have completely migrated their integrations to the new system.

Request/Response Mapping

  1. The new system should expose the exact same API paths with the same request/response signatures.
  2. If the new system is built on a new stack and has evolved API signatures, implement request/response adapters in the new system to convert between the old and new signatures (see the sketch after this list).
  3. Even for unsuccessful calls, API responses should match 100% between the old and new systems: the same response structure, including status codes, error codes, error messages, etc.
  4. Never assume anything about customer integrations; even a single extra or missing attribute can lead to catastrophic failure.
  5. Ensure that the SLAs (latency, load) of the new system are the same as or better than the old system's.
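A minimal sketch of such an adapter, assuming a hypothetical old contract with flat fields and a new contract that nests them (all field names here are illustrative, not taken from any real API):

```python
# Hypothetical contracts: the old API uses flat fields, the new API nests
# customer details. Field names are illustrative only.

def old_request_to_new(old_req: dict) -> dict:
    """Forward adapter: translate an old-style request into the new contract."""
    return {
        "customer": {
            "id": old_req["customer_id"],
            "name": old_req["customer_name"],
        },
        "items": old_req.get("items", []),
    }

def new_response_to_old(new_resp: dict) -> dict:
    """Reverse adapter: translate a new-style response back into the old contract,
    preserving status codes and error shapes so existing callers see no difference."""
    if "error" in new_resp:
        return {
            "status": new_resp["status"],            # same status code as before
            "error_code": new_resp["error"]["code"],
            "error_message": new_resp["error"]["message"],
        }
    return {
        "status": new_resp["status"],
        "order_id": new_resp["order"]["id"],
        "created_at": new_resp["order"]["created_at"],
    }
```

The reverse adapter matters as much as the forward one: it is what keeps error responses byte-for-byte compatible for existing callers.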

LiveDiff a.k.a Dual Read

When you are ready to move traffic from the old system onto the new system, you must measure the success rate of the new implementation by making live diff calls before doing the real cutover.

What is a live diff?

A live diff is a dual read: for the same request, calls are made to both the old and new systems, and the responses (both successful and unsuccessful) are compared to verify that they are the same. There are a few ways you can do a live diff:

  1. Gateway interceptor: Intercept calls to the old system at a middleware component such as a gateway, save the response, make another call to the new system with exactly the same request parameters, then compare and log the responses.
  2. Async calls from the old system: Have a call interceptor in the old system capture the response, make an async call to the new system, get its response, then compare and log the results.
  3. Fire and forget: Have a call interceptor in the old system capture the response and make an async call to the new system, passing the response along; the new system compares and logs the results. This keeps the live diff calls lighter, with less load on the old system (sketched after this section).

In all of the above approaches, the live diff calls should not impact latency, load, etc. while the old system is serving traffic.
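Here is a minimal sketch of the fire-and-forget variant, assuming hypothetical call_old_system / call_new_system client functions (shown as placeholders) and a background thread so the serving path is untouched:

```python
import logging
import threading

log = logging.getLogger("livediff")

def call_old_system(request: dict) -> dict:
    """Placeholder for the real client of the old system."""
    return {"status": 200, "body": "old"}

def call_new_system(request: dict) -> dict:
    """Placeholder for the real client of the new system."""
    return {"status": 200, "body": "old"}

def live_diff(request: dict, old_response: dict) -> None:
    try:
        new_response = call_new_system(request)
        if new_response == old_response:
            log.info("live-diff match for %s", request)
        else:
            log.warning("live-diff mismatch for %s: old=%s new=%s",
                        request, old_response, new_response)
    except Exception:
        # The diff path must never affect live traffic; log and move on.
        log.exception("live-diff call to the new system failed")

def handle_request(request: dict) -> dict:
    old_response = call_old_system(request)
    # Fire and forget: run the diff off the serving path so latency is unchanged.
    threading.Thread(target=live_diff, args=(request, old_response), daemon=True).start()
    return old_response  # callers still get the old system's response
```

In a real setup the match/mismatch results would typically feed a metrics pipeline rather than plain log lines, so the success rate can be tracked on a dashboard.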

100% vs random sampling

Depending on the requirements, one can choose to live diff 100% of the calls or use random sampling to validate different access patterns. 100% sampling may be required in systems where each call is a unique request, whereas random sampling works well where you have a fixed set of call patterns. Algorithms like the leaky bucket can be very handy for random sampling (see the sketch below).
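A minimal sketch of sampled live diffing with a leaky-bucket cap; the sampling rate and bucket parameters are illustrative values, not recommendations:

```python
import random
import time

class LeakyBucket:
    """Minimal leaky-bucket limiter: allows at most `rate` diff calls per second,
    so sampled live diffs cannot overload either system."""
    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.level = 0.0
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Leak out whatever has drained since the last call, then try to add one.
        self.level = max(0.0, self.level - (now - self.last) * self.rate)
        self.last = now
        if self.level + 1 <= self.capacity:
            self.level += 1
            return True
        return False

SAMPLE_RATE = 0.05                           # diff roughly 5% of calls
bucket = LeakyBucket(rate=10, capacity=20)   # but never more than ~10 diffs/sec

def should_live_diff() -> bool:
    return random.random() < SAMPLE_RATE and bucket.allow()
```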

Data Migration

Data migration is an important part of any migration, so that everything that worked on the old system continues to work the same as it did before. If the data models of the old and new systems are different, then data migration is not going to be straightforward.

Model Mapping

Data model mapping between the old and new systems is the first step towards data migration. We should map every attribute in the old model to the new model. High-level entity attributes are easy to map, but as we go deeper into the data model hierarchy things get complicated, especially when there are features that are unsupported in the new system; you still need to find ways to support such features to allow the migration.

Data converters

Implement data converters for model mapping between the old and new systems. These mappers act as the bridge for to-and-fro conversion between the two systems. There are many object mapping frameworks that enable data conversions using simple attribute mapping configuration.
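A minimal sketch of hand-written converters, assuming hypothetical OldCustomer and NewCustomer models (the fields and the split of name into first/last are purely illustrative):

```python
from dataclasses import dataclass

@dataclass
class OldCustomer:
    id: str
    name: str
    active: int        # the old system used 0/1 flags

@dataclass
class NewCustomer:
    id: str
    first_name: str
    last_name: str
    is_active: bool

def to_new(old: OldCustomer) -> NewCustomer:
    """Forward converter: old model -> new model."""
    first, _, last = old.name.partition(" ")
    return NewCustomer(id=old.id, first_name=first, last_name=last,
                       is_active=bool(old.active))

def to_old(new: NewCustomer) -> OldCustomer:
    """Reverse converter: new model -> old model, for serving old integrations."""
    return OldCustomer(id=new.id,
                       name=f"{new.first_name} {new.last_name}".strip(),
                       active=int(new.is_active))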

Data migrators

Implement data migration scripts to migrate data from the old system to the new one. These scripts can run as offline jobs that pass the data through the data converters and store it in the new model. Offline jobs can talk directly to the databases to extract and store data, or make calls to the old and new systems' read/write/create APIs, or you can even build a hybrid approach.
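A minimal sketch of such an offline job, reusing to_new() from the converter sketch above; old_db and new_db and their methods are hypothetical data-access objects, not a specific library:

```python
def migrate_customers(old_db, new_db, batch_size: int = 500) -> None:
    """Offline migration job: read old records in batches, convert them,
    and write them into the new store."""
    offset = 0
    migrated = 0
    while True:
        batch = old_db.fetch_customers(limit=batch_size, offset=offset)
        if not batch:
            break
        new_records = [to_new(old) for old in batch]
        new_db.upsert_customers(new_records)   # idempotent writes allow safe re-runs
        migrated += len(new_records)
        offset += batch_size
    print(f"migrated {migrated} customers")
```

Keeping the writes idempotent (upserts keyed by id) lets the job be re-run safely if it fails partway through.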

Data validation

Data validation is one of the most important tasks of any migration, as it measures the correctness of the migration. Implement data validation jobs that read and compare data between both systems and publish the correctness of the data. Data validation can be of many types:

  1. High-level validation that just compares the counts of records, columns, types, and categories of data between the old and new systems. This is the fastest validation and can be run anytime.
  2. Full validation, comparing every field between the old and new systems. This ensures a high level of data accuracy but is highly time consuming; a full run may take a few hours to days. Run this occasionally.
  3. Full validation with random sampling. This approach also ensures a high level of data accuracy between the old and new systems and can be run anytime. It takes much less time than #2, as it randomly picks data samples and compares them (a sketch follows this list).
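A minimal sketch of type #3, again assuming the hypothetical old_db/new_db accessors and the to_new() converter from the earlier sketches:

```python
import random

def validate_sample(old_db, new_db, sample_size: int = 1000) -> float:
    """Sampled full validation: pick random record ids, fetch the record from
    both systems, convert the old one forward, and compare field by field."""
    ids = old_db.all_customer_ids()
    sample = random.sample(ids, min(sample_size, len(ids)))
    if not sample:
        return 1.0
    mismatches = 0
    for record_id in sample:
        expected = to_new(old_db.get_customer(record_id))
        actual = new_db.get_customer(record_id)
        if actual != expected:
            mismatches += 1
            print(f"mismatch for id={record_id}: expected={expected} actual={actual}")
    accuracy = 1 - mismatches / len(sample)
    print(f"sampled accuracy: {accuracy:.2%} over {len(sample)} records")
    return accuracy
```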

Dual Write

In systems with continuous traffic, we cannot just pick a day and migrate the data up to that day, because data is being created and modified continuously. We should build dual write capabilities to keep the data in sync between the old and new systems. Dual write should ensure that every operation performed on the old system is replicated in the new system as well, so the two stay in sync.
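A minimal sketch of a dual write, assuming the old system remains the source of truth during migration; old_db/new_db and to_new() are the same hypothetical pieces used in the earlier sketches:

```python
import logging

log = logging.getLogger("dualwrite")

def update_customer(old_db, new_db, customer_id: str, changes: dict) -> None:
    """Write to the old system first, then replicate to the new system.
    A real setup might push the replicated write through an outbox/queue
    so failures can be retried instead of calling the new system inline."""
    old_db.update_customer(customer_id, changes)          # primary write

    try:
        new_record = to_new(old_db.get_customer(customer_id))
        new_db.upsert_customer(new_record)                # replicated write
    except Exception:
        # Never fail the customer's request because replication failed;
        # record it so the sync can be repaired (e.g. by a backfill job).
        log.exception("dual write to new system failed for %s", customer_id)
```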

Transparent Cutover

Migrations seem like a very hefty operation to consumers, particularly because a lot of change is expected at their end. If there are many consumers using the old system, it is not going to be easy to coordinate with all of them to get them migrated over to the new system. The answer is "transparent migration": consumers make zero changes to their integrations, and they are seamlessly migrated to the new system.

A few things to keep in mind while doing a transparent migration are:

  1. Look for an approach that is backward compatible, so that in case of any issue we can fail fast and do an immediate rollback.
  2. Understand the existing routing and see if there is a way to plug in the new change without disturbing any of the other routes.
  3. Do the cut over in a controlled fashion so that the blast radius is minimized.

Phased approach vs all-at-once traffic

When rolling out any change, it is very important to keep in mind how many consumers will be impacted by it. It is always a good idea to list the consumers who will be impacted and the percentage of traffic each contributes. Once you have these percentages, decisions should be made in a data-driven way.

That is, one should start the migration with the consumer with the least traffic. This way, even if something goes wrong, the blast radius is limited and other consumers are not disturbed.

Below is a sample transparent cut-over strategy to selectively dial traffic to the new system.
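A minimal sketch, assuming a hypothetical per-consumer dial configuration (the consumer ids and percentages are illustrative):

```python
import hashlib

# Hypothetical per-consumer dial-up: percentage of each consumer's traffic
# that should be routed to the new system.
CUTOVER_DIAL = {
    "consumer-low-traffic": 100,   # fully cut over first (smallest blast radius)
    "consumer-medium": 25,         # being dialed up gradually
    "consumer-high-traffic": 0,    # still on the old system
}

def route_to_new_system(consumer_id: str, request_id: str) -> bool:
    """Deterministically route a percentage of each consumer's requests to the
    new system. Hashing the request id keeps routing stable for retries, and
    rollback is as simple as setting the dial back to 0."""
    dial = CUTOVER_DIAL.get(consumer_id, 0)
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return bucket < dial
```

The serving path would call the old or new system client depending on the returned flag; dialing a consumer down to 0 immediately sends all of its traffic back to the old system.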

Alerts and Monitoring

Build the right alerts and cut over monitoring dashboards to measure the success of the migration.

Monitoring dashboards for:

  1. Cut over logs
  2. Throughput per second (TPS)
  3. Latency numbers
  4. Successful vs unsuccessful calls
  5. Stats by consumer and API
  6. Functional logs

Alerts

  1. Load throttles
  2. Unsuccessful request percentages
  3. High latency
  4. Functional inconsistencies
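A minimal sketch of recording these per-consumer, per-API stats, with in-process counters standing in for whatever metrics backend (StatsD, Prometheus, etc.) you actually use; the latency threshold is an illustrative value:

```python
import logging
from collections import defaultdict

log = logging.getLogger("cutover.metrics")

# In-process counters as a stand-in for a real metrics backend.
counters = defaultdict(int)

def record_call(consumer_id: str, api: str, system: str,
                status: int, latency_ms: float) -> None:
    """Record per-consumer, per-API stats for the cut-over dashboards and
    emit a warning that an alert rule could be attached to."""
    outcome = "success" if status < 400 else "failure"
    counters[(consumer_id, api, system, outcome)] += 1
    log.info("cutover call consumer=%s api=%s system=%s status=%s latency_ms=%.1f",
             consumer_id, api, system, status, latency_ms)
    if latency_ms > 500:                         # illustrative alert threshold
        log.warning("high latency on %s %s via %s: %.1f ms",
                    consumer_id, api, system, latency_ms)
```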

Cutover validation

Before we do a real cut over for consumers, we must ensure that all the required validations are complete and we are good to take real traffic.

  1. Functional validation: Validate that the new system meets all the functional requirements of the old system.
  2. Operational validation: Ensure the new system meets or performs better on operational metrics such as performance numbers, serving load, etc.
  3. Infrastructural validation: Ensure the transparent cut over infra is working as expected without disturbing the other components of the system.

Signoff

To call a migration successful, it is important to take timely sign-off from the stakeholders. One approach is to take incremental sign-offs: after each consumer is transparently cut over and validated successfully, the relevant stakeholders are informed and asked to sign off.

This helps ensure that none of the sign-offs are missed and avoids any discrepancy from either side later in the cycle.

Migrations are important; they help you stay up to date with the latest tech. Enjoy your migrations and continue to delight your customers.
