Quick Verification and Monitoring

Overview

The monitoring phase collects data, summarizes results, promptly surfaces problems in the production system, and detects abnormal fluctuations in business metrics so that the team can respond appropriately. It is also one of the most important inputs to the team's decision-making: only when the data is reliable and error-free can the team make sound analyses and judgments based on it. If data is lost or wrong, the team is likely to make incorrect decisions and miss business opportunities. In addition, to identify stability issues and hidden risks in the production environment promptly, the system's operational monitoring data must be collected and analyzed quickly so that problems can be located.

To ensure the required data is collected in time, the team must discuss and settle its data requirements at the start of the verification process: define data requirement specifications early, establish logging standards, create metadata for the data logs, and implement them alongside the corresponding functional requirements. Otherwise, even after the relevant features are launched, the corresponding data cannot be obtained, and decision-making is delayed.
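
In practice, a logging standard often takes the form of a structured event schema that every service agrees to emit. The sketch below is a minimal, hypothetical illustration in Python; the field names (event_name, trace_id, and so on) are assumptions, not a prescribed standard:

    import json
    import time
    import uuid

    # Hypothetical logging standard: every business event is emitted as a
    # structured JSON record with a fixed set of required fields, so that
    # downstream collection and analysis can rely on a known schema.
    def make_event(event_name: str, user_id: str, payload: dict) -> str:
        record = {
            "event_name": event_name,       # what happened, e.g. "order_created"
            "event_time": time.time(),      # Unix timestamp of the event
            "trace_id": str(uuid.uuid4()),  # correlates logs across services
            "user_id": user_id,             # who triggered the event
            "payload": payload,             # event-specific fields
            "schema_version": 1,            # lets the schema evolve safely
        }
        return json.dumps(record)

    # Example: the event emitted when an order is placed.
    print(make_event("order_created", "u-42", {"order_id": "o-1001", "amount": 99.5}))

Agreeing on such a schema before launch is what allows the data to be collected from day one rather than retrofitted afterwards.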

Monitoring Scope

The monitoring scope of the production environment includes three levels: basic monitoring, application monitoring, and business monitoring. Although the methods of data collection vary according to the characteristics of each level, the processing flow is fundamentally consistent. Each monitoring system includes several stages: data collection, reporting, organization, analysis, presentation, and decision-making.
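
To make the shared flow concrete, the sketch below strings the stages together as plain Python functions over an in-memory store; every name here is an illustrative assumption rather than part of a real monitoring framework:

    # Minimal sketch of the shared flow: collect -> report -> organize ->
    # analyze; presentation and decision-making sit on top of the output.

    def collect() -> dict:
        # Gather one raw measurement from the monitored target.
        return {"cpu_load": 0.73}

    def report(sample: dict, store: list) -> None:
        # Ship the sample to a central store (an in-memory list here).
        store.append(sample)

    def organize(store: list) -> dict:
        # Aggregate raw samples into a summarized view.
        loads = [s["cpu_load"] for s in store]
        return {"avg_cpu_load": sum(loads) / len(loads)} if loads else {}

    def analyze(summary: dict) -> list:
        # Turn the summary into findings that can drive a decision.
        return ["sustained high CPU"] if summary.get("avg_cpu_load", 0.0) > 0.7 else []

    store: list = []
    report(collect(), store)
    print(analyze(organize(store)))  # -> ['sustained high CPU']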

Backend Service Monitoring

Backend service monitoring includes three levels: basic monitoring, application monitoring, and business monitoring.

  1. Basic monitoring assesses the health of the system infrastructure, including monitoring network and server nodes. The monitored content includes network connectivity and congestion status, CPU load, memory usage, and external storage space usage, among others (a minimal probe is sketched after this list).
  2. Application monitoring assesses the operational health of applications, such as whether application processes exist, whether they can provide external services normally, whether there are functional defects, whether they can connect to the database normally, whether there are timeout issues, whether exceptions and alerts are thrown by the service, and whether timely scaling can be performed to handle sudden increases in requests.
  3. Business monitoring assesses the health of business metrics. For example, for an e-commerce website, this should include, but is not limited to, real-time user traffic, page views, conversion rates, order volumes, and transaction amounts.
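
As a concrete example of the first level, a basic-monitoring probe for a single server node might look like the following. This is a minimal sketch assuming the third-party psutil package is installed; the thresholds are arbitrary illustrations, not recommended values:

    import psutil  # third-party package: pip install psutil

    def basic_health_sample() -> dict:
        # One sample of infrastructure-level health for this node.
        return {
            "cpu_percent": psutil.cpu_percent(interval=1),      # CPU load
            "memory_percent": psutil.virtual_memory().percent,  # RAM usage
            "disk_percent": psutil.disk_usage("/").percent,     # storage usage
        }

    sample = basic_health_sample()
    # Arbitrary illustrative thresholds; real limits depend on the system.
    if sample["cpu_percent"] > 90 or sample["memory_percent"] > 90:
        print("WARN: node under pressure:", sample)
    else:
        print("OK:", sample)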

Monitoring of Distributed Software

Since distributed software, meaning software that is installed and runs on users' devices, operates in an environment we do not control, monitoring it is subject to practical constraints. Generally, with the user's authorization, we collect the operational status of our software and of the host device, and periodically send the collected data to the backend server for analysis and presentation. If the host device is offline, the software must cache the data locally and upload it once the device is back online.
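
The cache-then-upload behavior described above can be sketched as follows. This is a minimal illustration: the in-memory cache, the is_online check, and the upload callback are assumptions standing in for real on-device storage and network code:

    from collections import deque

    class TelemetryReporter:
        """Caches samples locally and flushes them when the device is online."""

        def __init__(self, is_online, upload):
            self._cache = deque()        # stands in for durable on-device storage
            self._is_online = is_online  # callable: () -> bool
            self._upload = upload        # callable: (list) -> None

        def record(self, sample: dict) -> None:
            # Always cache first so nothing is lost if the device is offline.
            self._cache.append(sample)
            self.flush()

        def flush(self) -> None:
            # Upload everything cached, but only when connectivity exists.
            if not self._is_online() or not self._cache:
                return
            batch = list(self._cache)
            self._upload(batch)
            self._cache.clear()

    # Example: simulate going offline and coming back online.
    online = {"up": False}
    sent = []
    r = TelemetryReporter(lambda: online["up"], sent.extend)
    r.record({"event": "crash", "model": "PhoneX"})  # cached, not sent
    online["up"] = True
    r.flush()                                        # now uploaded
    print(sent)

A real client would persist the cache to disk so that data survives the application being killed while still offline.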

Similar to backend service monitoring, monitoring of distributed software also includes three monitoring levels. Basic monitoring assesses the operational status of the foundational environment in which the software runs (such as mobile device models, operating systems, memory, etc.) and its connection to the server. Application monitoring assesses the health status of the software application itself (such as memory usage, program crashes, unresponsiveness, communication with the backend server, etc.). Business monitoring assesses user usage data, such as the current page, time spent, and user actions.

Of course, monitoring of distributed software is not limited to reports from the software itself. Companies should also track information on the internet, such as ratings in mobile app stores and user reviews of the software on various social media platforms, so as to gather information from broader channels, promptly fix defects and vulnerabilities, and give users the best possible experience.

Issue Handling

Once a data monitoring system is established, the next step is to identify problems from the monitoring data and resolve them quickly. There are typically two ways to identify issues: manual judgment and automatic machine discovery. Given the large volume of monitoring data, relying entirely on manual processing is unrealistic. Therefore, machines usually first make judgments based on various rules to automatically discover as many suspected issues in production as possible. When automatic processing is not feasible, it is treated as an "alert," generating a work order sent to the designated issue receiver.
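
Such rule-based discovery can be reduced to the following sketch; the rule set, the work-order shape, and the auto-remediation hook are all illustrative assumptions:

    # Minimal sketch of rule-based issue discovery: each rule inspects a
    # metrics sample; matches the machine cannot handle on its own become
    # "alert" work orders routed to a designated receiver.

    RULES = [
        # (rule name, predicate over the sample, can the machine fix it alone?)
        ("disk almost full", lambda m: m["disk_percent"] > 95, False),
        ("stale cache",      lambda m: m["cache_age_s"] > 600, True),
    ]

    def evaluate(sample: dict, receiver: str) -> list:
        tickets = []
        for name, predicate, auto_fixable in RULES:
            if not predicate(sample):
                continue
            if auto_fixable:
                print(f"auto-remediated: {name}")  # e.g. trigger a cache refresh
            else:
                tickets.append({"alert": name, "assignee": receiver, "data": sample})
        return tickets

    print(evaluate({"disk_percent": 97, "cache_age_s": 700}, "oncall@example.com"))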

Alert Overload and Intelligent Management

At one internet company, an operations engineer receives over 6,000 alert messages a day. If each alert had to be reviewed, that would mean checking more than 4 alerts per minute (6,000 / 1,440 minutes ≈ 4.2), around the clock. Yet despite the overwhelming volume, after any production incident the review typically produces two action items: first, sort through the current log monitoring and alert points and make sure all relevant personnel are on the notification lists so that no one is missed; second, add more monitoring points and alerts.

On one hand, there are too many alerts and a desire to reduce them; on the other, there is a fear of missing alerts when incidents occur, which leads to adding even more. The latter impulse usually wins. In reality, a significant portion of alert messages is ignored by recipients, who may never even glance at them, for two main reasons:

  • The recipient is not the first responder for the alert.
  • The alert is only an early warning and does not require immediate action.

Of course, this is not to question the accuracy or authenticity of many alerts, but alert quality needs to improve along two further dimensions. The first is timeliness, which needs no explanation. The second is actionability: on receiving an alert, the recipient should be able to take a corresponding action; otherwise the alert is no better than spam, which should be filtered out because it reduces work efficiency. Worse, if genuine alerts are drowned in a sea of unnecessary "false" alarms, production incidents become more likely. The "alert overload" problem can be alleviated from four aspects:

  • Use correlation analysis to place monitoring points closer to where problems actually occur.
  • Set reasonable alert levels through dynamic thresholds (sketched after this list).
  • Regularly review alert settings and remove unnecessary alerts.
  • Use artificial intelligence to triage and resolve alerts dynamically.
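
Of these, dynamic thresholds are the easiest to illustrate: rather than a fixed limit, the alert level adapts to the metric's recent behavior. The sketch below flags a point that deviates by more than k standard deviations from the mean of a rolling window; this is one common approach, and the window contents and k value are arbitrary assumptions:

    from statistics import mean, stdev

    def is_anomalous(history: list, value: float, k: float = 3.0) -> bool:
        """Dynamic threshold: alert when `value` is more than k standard
        deviations away from the mean of the recent `history` window."""
        if len(history) < 2:
            return False  # not enough data to estimate a baseline
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            return value != mu
        return abs(value - mu) > k * sigma

    # Example: steady traffic around 100 req/s, then a sudden spike.
    window = [98, 102, 99, 101, 100, 97, 103, 100]
    print(is_anomalous(window, 101))  # False: within normal variation
    print(is_anomalous(window, 160))  # True: likely worth an alert

Because the threshold tracks the metric itself, a level that is normal at peak hours does not fire spuriously at night, and vice versa.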

Through continuous effort, we can hold the number of alerts at a manageable level, but eliminating them entirely is difficult. For common alerts we may already have countermeasures in place; the real challenge lies in the unusual alerts that have never appeared before, because they are the most likely to indicate genuine production issues.

Issue Handling as a Learning Process

Handling "production issues" is also an important part of the product development management process. Without a good handling process, more management problems are likely to arise.

When the team is small, this process is executed manually, mainly through email, IM tools, or simply calling out across the room. Once the team grows and the system becomes complex, manual handling becomes very time-consuming. For example, because problems are often localized inaccurately, issues have to be handed off between teams, and the context of the problem is frequently lost in the handover. To improve efficiency, as much of the manual work as possible should be automated: automatic tracking of issue tickets, recording of supplementary information, measurement of the timeliness of the whole handling process, timely notification mechanisms, and escalation mechanisms for issue feedback. This requires a ticketing system for support. Moreover, when issues are reviewed afterwards, such a system provides timely and accurate records of the process.
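
The tracking and escalation mechanisms mentioned above can be sketched minimally as follows; the Ticket fields, SLA value, and escalation chain are hypothetical:

    import time
    from dataclasses import dataclass, field

    @dataclass
    class Ticket:
        """A production-issue work order with handover tracking and escalation."""
        summary: str
        assignee: str
        opened_at: float = field(default_factory=time.time)
        history: list = field(default_factory=list)  # preserves handover context

        def hand_over(self, new_assignee: str, note: str) -> None:
            # Record context at every handover so it is not lost between teams.
            self.history.append((self.assignee, new_assignee, note))
            self.assignee = new_assignee

        def escalate_if_stale(self, sla_seconds: float, manager: str) -> None:
            # Escalation: a ticket unresolved past its SLA goes up a level.
            if time.time() - self.opened_at >= sla_seconds:
                self.hand_over(manager, "SLA exceeded, escalating")

    t = Ticket("checkout latency spike", "oncall@example.com")
    t.hand_over("payments-team@example.com", "traced to the payment gateway")
    t.escalate_if_stale(sla_seconds=0, manager="lead@example.com")  # 0 forces escalation here
    print(t.assignee, t.history)

Because every handover is recorded on the ticket itself, the review afterwards can reconstruct exactly who held the issue, when, and why it moved.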

In many teams, issue reviews are often seen as "blame meetings," creating a tense atmosphere. In a learning organization, issue reviews are a valuable learning opportunity and one of the most effective learning methods. During the review, all relevant personnel come together to compare results, revisit processes, and conduct analyses of gains and losses, summarizing patterns. This is the best mutual learning process, providing everyone with an opportunity for improvement. The patterns summarized from the review serve as a "recipe" for future handling of similar issues, as well as the best knowledge transfer for the organization, maximizing the potential for progress for future team members.

The most important prerequisites for a review are a detailed record of the issue-handling process and full participation by everyone involved in that process (including those who merely transferred the issue). For any point of doubt raised during the review, the scenario should even be reproduced a second time in order to arrive at better preventive measures and root-cause solutions.