Quick Verification: Operation
Overview
Operation refers to deploying software packages that meet quality requirements into the production environment, or delivering them to users, so that they can provide services. Each time a new version of the software is deployed, we do not want to disrupt users' normal use. How to let users receive software upgrades and updates without noticing them has therefore become one of the most important questions of the internet era. Even the most traditional industries, such as finance, are constantly seeking better solutions that optimize user experience while ensuring transaction security and data consistency.
This phase is where the most conflict arises between the development team and the operations team, and it is also one of the phases with the most repetitive manual operations. Teams should spare no effort to improve and optimize it, freeing people from repetitive manual labor.
High-Frequency Releases
Trends
Currently, leading internet companies around the world are updating their software products in a "frequent release" model. For example, as early as May 2011, Amazon's monthly statistics indicated that software deployment operations were triggered on average every 11.6 seconds, with the highest deployment frequency that month reaching as much as 1079 times per hour. An average of 10,000 servers would simultaneously receive deployment requests, with a peak of 30,000 servers executing a deployment operation at the same time.
In 2017, Facebook deployed to its website multiple times a day, while its mobile client was pushed to the app stores once a week; in addition, the latest internal build was released to all employees every day, and Alpha and Beta versions were pushed to hundreds of thousands and millions of users, respectively.
Of course, not all of these releases are feature releases; they also include defect fixes. As deployment frequency, total code volume, and commit counts have grown, the number of severe defects has not increased correspondingly. Although the number of medium- and low-severity defects has risen, considering the growth in the number of developers, the complexity of the website's systems, and the company's attitude towards software quality, this remains an acceptable level.
Benefits and Costs
In a high-frequency release model, the amount of content released each time is usually less than that in low-frequency releases (obviously, the features that can be completed in one day are fewer than those that can be completed in ten days). Why are so many companies moving towards "high-frequency releases"? The benefits of high-frequency releases include the following:
- More opportunities to interact with real users, allowing for quick decisions or adjustments to the product's direction.
- Since each change is smaller in scale, the software system does not undergo drastic changes, thereby reducing deployment risks.
- The cost of a single deployment is reduced and tends to be constant. Frequent deployment operations can be painful, creating motivation to build many automation facilities, thus lowering costs and effort.
- Problems are easier to locate and fix, and corrections can be made quickly.
Compared with low-performing teams, high-performing teams:
- release code 46 times more frequently;
- shorten the time from code commit to deployment to 1/440;
- cut the mean time to recover from failures to 1/96;
- lower the change failure rate to 1/5.
These benefits come from mature, automated deployment and release operations. If a team clings to the R&D management methods of a low-frequency release model while forcing high-frequency releases, iteration costs will rise. For example, suppose a team that originally released manually once a month decides to release once a week. Without examining how the quality verification cost of each version would change, and assuming the original manual mode is kept, the monthly workload becomes four times the original. Moreover, once the release cycle is shortened, activities that cost little in the old mode (such as compilation time and testing workload) become prominent bottlenecks.
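To make the arithmetic concrete, here is a toy cost model. The person-day figures are invented for illustration; only the fourfold relationship comes from the example above.

```python
def monthly_effort(releases_per_month: int, cost_per_release: float) -> float:
    """Total release effort per month, in person-days (illustrative units)."""
    return releases_per_month * cost_per_release

manual_cost = 5.0                      # assumed cost of one manual release
print(monthly_effort(1, manual_cost))  # monthly releases: 5 person-days/month
print(monthly_effort(4, manual_cost))  # weekly releases, same manual process:
                                       # 20 person-days/month, four times the original
print(monthly_effort(4, 0.5))          # weekly releases once automation has driven
                                       # the per-release cost down to a small constant
```

The point of the model is that automation changes the slope: once per-release cost tends toward a constant, raising the release frequency no longer multiplies the workload.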
However, no matter what we do, we can never eliminate release risks completely. What we need to do is continuously seek ways to reduce them.
Reducing Release Risks
Blue-Green Deployment
Prepare two completely identical operating environments. One serves as the official production environment, providing the software service; the other serves as the pre-production environment for the new version, where the new version is deployed and acceptance testing is carried out. Once it is confirmed that there are no issues, traffic is routed to the environment running the new version, which becomes the official production environment, while the old version's environment is kept unchanged until the new version is confirmed to be problem-free. The environment still running the old version then serves as the pre-production environment for the next new version.
Of course, this is a very idealized picture. In reality, the time cost of database replication is high, and the space cost is also significant. Therefore, many blue-green deployment solutions use the same database service and deploy only the software in two different environments, as shown in Figure 12-7. In this case, the data storage format must be compatible with both the old and new software versions, so that both can operate on the data simultaneously.
There is one more issue that blue-green deployment needs to address: when the switch happens in the middle of a user's business operation and transaction processing is involved, how is data consistency maintained? In general, the switch is not completed instantaneously. During the switching process, new requests are directed to the new version's environment and are no longer allowed to reach the old one; old requests that have not yet returned results at the time of the switch are allowed to complete in the old environment, which from then on accepts no new requests.
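The switching-and-draining behavior described above can be sketched at the routing layer. The following is a minimal illustration, assuming a single router process in front of the two environments; the class and method names (BlueGreenRouter, route, switch, drained) are hypothetical, not taken from any particular load balancer.

```python
import threading

class BlueGreenRouter:
    """Routes new requests to the active environment and lets in-flight
    requests in the old environment finish before it is reused."""

    def __init__(self, blue_addr: str, green_addr: str) -> None:
        self._addrs = {"blue": blue_addr, "green": green_addr}
        self._active = "blue"                       # environment serving users
        self._in_flight = {"blue": 0, "green": 0}   # requests not yet finished
        self._lock = threading.Lock()

    def route(self) -> tuple[str, str]:
        """Assign a NEW request: it always goes to the active environment."""
        with self._lock:
            color = self._active
            self._in_flight[color] += 1
            return color, self._addrs[color]

    def done(self, color: str) -> None:
        """Report that a request sent to `color` has returned its result."""
        with self._lock:
            self._in_flight[color] -= 1

    def switch(self) -> str:
        """Point all new traffic at the standby environment; returns the old
        color so the caller can wait for it to drain."""
        with self._lock:
            old = self._active
            self._active = "green" if old == "blue" else "blue"
            return old

    def drained(self, color: str) -> bool:
        """True once the old environment has no in-flight requests left and
        can safely become the next pre-production environment."""
        with self._lock:
            return self._in_flight[color] == 0
```

After calling switch(), the operator polls drained(old) before redeploying onto the old environment, mirroring the rule above: in-flight requests are allowed to complete while no new ones are admitted.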
Rolling Deployment
Select one or more service instances from the service cluster, take them out of service, update them to the new version, and then reintroduce them; repeat this process until all service instances in the cluster run the new version. Compared to blue-green deployment, this method is more resource-efficient because it does not require two identical service operating environments, effectively halving the server cost.
When issues arise with the new version, rolling deployment cannot simply switch back to the old environment through the front-end load balancer the way blue-green deployment can; instead, the servers that have already received the new version must be rolled back. Another approach is to fix the issue quickly, produce a third version V3, and immediately start a V3 rolling deployment. At that point, three versions V1, V2, and V3 may coexist in the service cluster.
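The loop itself is easy to sketch. In the illustration below, the helper functions (remove_from_lb, deploy_version, health_check, add_to_lb) are hypothetical stand-ins for whatever load-balancer and deployment tooling a team actually uses, and the batch size is an arbitrary assumption.

```python
BATCH_SIZE = 2  # instances updated at a time; assumed, tune to cluster size

# Stand-ins for real tooling, so the sketch runs as-is.
def remove_from_lb(host: str) -> None: print(f"drain {host}")
def deploy_version(host: str, version: str) -> None: print(f"{version} -> {host}")
def health_check(host: str) -> bool: return True  # real code probes readiness
def add_to_lb(host: str) -> None: print(f"restore {host}")

def rolling_deploy(instances: list[str], version: str) -> None:
    for i in range(0, len(instances), BATCH_SIZE):
        batch = instances[i:i + BATCH_SIZE]
        for host in batch:
            remove_from_lb(host)        # stop routing traffic to this instance
            deploy_version(host, version)
        for host in batch:
            # Verify each updated instance before touching the next batch,
            # so a bad version never spreads through the whole cluster.
            if not health_check(host):
                raise RuntimeError(f"{host} unhealthy on {version}; "
                                   "roll back, or roll forward with a V3")
            add_to_lb(host)             # reintroduce the updated instance

rolling_deploy(["app-1", "app-2", "app-3", "app-4"], "V2")
```

Batching is the key design choice: the smaller the batch, the longer the rollout takes but the fewer instances a bad version can reach before the health check stops it.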
Canary Releases and Gray Releases
This generally refers to a release method that allows a small number of users to use the new version first to discover potential software issues in advance, thus avoiding harm to a larger number of users. Since only a small number of users are involved, the impact is also relatively small.
The term "canary release" comes from an old practice of miners going underground. In the 17th century, British miners discovered that canaries are very sensitive to gas. At that time, to ensure their safety, miners would bring a canary each time they went underground. If there were harmful gases underground, the canary would die from the gas before humans could detect it. At this point, the miners would know there were toxic gases underground and would immediately stop working and return to the surface.
"Gray release" refers to dividing the release into different stages, with the number of users increasing gradually at each stage. If no issues are found in the current stage with the new version, the user count is expanded to the next stage until it reaches all users. It is an extension of the canary release, and one could say that canary release is the initial level of gray release. The number of stages and the number of users in each stage should be defined based on the product's status.
Dark Deployment
Before officially releasing a function or feature, the first version is deployed to the production environment so that the team can test it and identify potential errors before providing the feature to end users. The "dark" in "dark deployment" refers to the aspect of "user unawareness," which can be achieved through switch technology.
For example, consider the following scenario: an internet company has redeveloped an online news recommendation algorithm, hoping to recommend more excellent content to users. However, due to the complexity of the algorithm, the company wants to know how the algorithm performs under the access of a large number of real users. How should this be done?
We can put this algorithm behind a switch and deploy it to the production environment. When the switch is turned on, traffic flows into the new algorithm, yet users cannot tell whether they are using the old algorithm or the new one. If the new algorithm's performance turns out to be unsatisfactory, we can turn the switch off immediately, so that users go back to the original algorithm.
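A minimal sketch of such a switch is shown below, assuming a process-local flag and hypothetical recommend_old/recommend_new functions; in practice the flag would live in a configuration service so it can be flipped without a redeploy.

```python
# Flip to False to send all traffic back to the original algorithm.
NEW_ALGORITHM_ENABLED = True

def recommend_old(user_id: str) -> list[str]:
    return ["news-1", "news-2"]   # stand-in for the existing algorithm

def recommend_new(user_id: str) -> list[str]:
    return ["news-9", "news-7"]   # stand-in for the redeveloped algorithm

def recommend(user_id: str) -> list[str]:
    # Users call the same entry point either way, so which algorithm served
    # the request is invisible to them -- the "dark" in dark deployment.
    if NEW_ALGORITHM_ENABLED:
        return recommend_new(user_id)
    return recommend_old(user_id)

print(recommend("user-42"))
```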
Summary
In some business scenarios, we indeed cannot release software to users frequently. However, if we can continuously deploy and release to the pre-production environment using the methods introduced in this chapter, we can obtain quality feedback on the software as early as possible, thereby reducing risk after the official release. And if we can drive the average cost of each release low enough, it will directly change the team's product R&D process.