Cache Avalanche
Problem Description
The third classic problem is the cache avalanche. During system operation, a cache avalanche is a very serious problem: it refers to the situation where the failure of some cache nodes makes the entire cache system, and even the service system, unavailable. Cache avalanches fall into two cases, depending on whether the cache performs rehash (that is, whether requests drift to other nodes):
1. Avalanches that occur when the cache does not support rehash
2. Avalanches that occur when the cache supports rehash
Cause Analysis
In the first case, where the cache does not perform rehash, the avalanche is generally triggered by many cache nodes becoming unavailable at once: requests penetrate to the DB, the DB becomes overloaded and unavailable, and finally the entire system avalanches. In the second case, where the cache supports rehash, the avalanche is mostly related to a traffic peak: when the peak arrives, some cache nodes are overloaded and crash, the failure then spreads to other cache nodes through rehash, and eventually the whole cache system becomes abnormal.
The first case is easy to understand. The cache nodes do not support rehash, so when many of them become unavailable, a large number of cache accesses fail. According to the cache read-write model, these requests then go to the DB, but the DB can carry far less traffic than the cache. With the request volume this large, the DB is easily overloaded, slow queries pile up, and it eventually blocks or even crashes, causing service exceptions.
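To make that read path concrete, below is a minimal cache-aside read sketch; cache_client, db_query, and the key format are hypothetical stand-ins rather than a specific library. On a miss or a cache-node failure, the request falls through to the DB, which is why losing many cache nodes shifts the full request volume onto the DB.

```python
# Minimal cache-aside read sketch; cache_client and db_query are hypothetical
# stand-ins for a real cache client and DB access layer.
def read_user_profile(cache_client, db_query, user_id):
    key = f"user:profile:{user_id}"
    try:
        value = cache_client.get(key)              # normal path: served from cache
        if value is not None:
            return value
    except ConnectionError:
        pass                                       # cache node down: fall through to the DB
    # Cache miss or cache failure: the DB takes the hit.
    value = db_query("SELECT * FROM user_profile WHERE id = %s", user_id)
    try:
        cache_client.set(key, value, 300)          # repopulate with a short TTL
    except ConnectionError:
        pass                                       # cache still down; serve from the DB only
    return value
```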
What about the second case? In cache distribution design, many engineers choose consistent hashing and, when some nodes fail, adopt a rehash strategy: requests to the failed nodes are redistributed evenly to the other cache nodes. In general, consistent hashing plus rehash works well. But when a large traffic peak arrives and the high-traffic keys happen to be concentrated on one or two cache nodes, it is easy to overload the memory and network cards of those nodes and crash them. Once these abnormal nodes go offline, their high-traffic keys are rehashed to other cache nodes, which in turn become overloaded and crash. The failure keeps spreading, and in the end the entire cache system is abnormal and can no longer serve traffic.
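A toy consistent-hash ring makes the drift visible; the node names and the hot key are invented for illustration, and real cache proxies implement this differently. When the overloaded node is removed, its keys, including the hot ones, land on a surviving node, which is how the overload propagates.

```python
import bisect
import hashlib

# Toy consistent-hash ring with rehash-on-failure; purely illustrative.
def _hash(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=100):
        self._points = sorted((_hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))

    def node_for(self, key: str) -> str:
        idx = bisect.bisect([p for p, _ in self._points], _hash(key)) % len(self._points)
        return self._points[idx][1]

    def remove(self, node: str) -> None:
        # Rehash: the failed node's keys drift to the surviving nodes.
        self._points = [(p, n) for p, n in self._points if n != node]

ring = Ring(["cache-1", "cache-2", "cache-3"])
hot_key = "hot:trending-topic"
first = ring.node_for(hot_key)              # the node currently carrying the hot key
ring.remove(first)                          # that node is overloaded and crashes
print(first, "->", ring.node_for(hot_key))  # the hot traffic now lands on another node
```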
Business Scenario
The business scenario of a cache avalanche is not uncommon; systems such as Weibo and Twitter ran into it many times in their first few years of operation. For example, in Weibo's early days many business caches used the consistent hash + rehash strategy. When a sudden flood of traffic arrived, some cache nodes were overloaded and crashed, the requests of these abnormal nodes were then transferred to other cache nodes, which overloaded them in turn, and eventually the entire cache pool was overloaded. In another incident, a rack power failure took down multiple nodes of a business cache, and a large number of requests went directly to the DB, which became overloaded and blocked, leaving the whole system abnormal. Only after the cache machines were powered back on, the DB was restarted, and the data was gradually warmed up did the system slowly return to normal.
Solution
To prevent cache avalanches, here are three solutions.
Option 1: Add read and write switches to business DB access. When DB requests become slow or blocked and the slow requests exceed a threshold, turn off the read switch so that some or all requests to read the DB fail fast and return immediately; turn the read switch back on after the DB recovers, as shown in the figure below.
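A sketch of such a read switch is below; the thresholds, window size, and db_query helper are assumptions for illustration, not values from the text. It tracks the ratio of slow DB reads over a sliding window, fails fast while the switch is off, and probes the DB again after a cooldown.

```python
import time
from collections import deque

# Sketch of a DB read switch; slow_ms, slow_ratio, window, and cooldown_s are
# illustrative defaults, not recommended production values.
class DbReadSwitch:
    def __init__(self, slow_ms=200, slow_ratio=0.5, window=100, cooldown_s=30):
        self.slow_ms, self.slow_ratio, self.cooldown_s = slow_ms, slow_ratio, cooldown_s
        self.samples = deque(maxlen=window)   # 1 = slow request, 0 = fast request
        self.opened_at = None                 # None means the read switch is on

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.time() - self.opened_at > self.cooldown_s:
            self.opened_at = None             # cooldown over: probe the DB again
            self.samples.clear()
            return True
        return False                          # switch is off: fail fast

    def record(self, elapsed_ms: float) -> None:
        self.samples.append(1 if elapsed_ms > self.slow_ms else 0)
        if (len(self.samples) == self.samples.maxlen
                and sum(self.samples) / len(self.samples) > self.slow_ratio):
            self.opened_at = time.time()      # too many slow reads: turn the switch off

def read_db(switch, db_query, sql, *args):
    if not switch.allow():
        raise RuntimeError("DB read switch is off, failing fast")
    start = time.time()
    try:
        return db_query(sql, *args)
    finally:
        switch.record((time.time() - start) * 1000)
```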
Option 2: Add multiple copies to the cache. When a cache copy is abnormal or a request misses, read another cache copy, and deploy the copies in different racks as much as possible, so that the cache system can keep serving traffic under any circumstances.
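Below is a rough sketch of reading and writing multiple cache copies; each copy client is assumed to expose get/set like a typical key-value client, and this is not any specific product's API. Reads fall back from one copy to the next, and writes go to every copy so a failover read still finds warm data.

```python
import random

# Rough sketch of multi-copy cache access; each element of `copies` is assumed
# to be a key-value client with get/set, deployed in a different rack.
def read_from_copies(copies, key):
    for copy in random.sample(copies, len(copies)):   # random start spreads read load
        try:
            value = copy.get(key)
            if value is not None:
                return value
        except ConnectionError:
            continue                                  # this copy (or its rack) is down
    return None                                       # all copies failed/missed: read the DB

def write_to_copies(copies, key, value, ttl=300):
    for copy in copies:
        try:
            copy.set(key, value, ttl)                 # keep every copy warm
        except ConnectionError:
            pass                                      # best effort; repair/backfill later
```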
Option 3: Monitor the cache system in real time. When the slow-request ratio exceeds a threshold, alarm promptly and recover in time by replacing machines or services. You can also automatically close abnormal interfaces, stop edge services, and disable some non-core features to ensure that core functions keep running in extreme scenarios.
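As a final sketch, the slow-ratio check and degradation hooks might look like the following; alert_ops, disable_feature, the threshold, and the feature names are placeholders for real monitoring and configuration tooling.

```python
# Placeholder monitoring hook: alert_ops, disable_feature, the threshold, and
# the feature names are illustrative, not part of any real system here.
def check_cache_health(total_requests: int, slow_requests: int,
                       slow_ratio_threshold: float = 0.3,
                       alert_ops=print,
                       disable_feature=lambda name: None) -> None:
    if total_requests == 0:
        return
    slow_ratio = slow_requests / total_requests
    if slow_ratio > slow_ratio_threshold:
        alert_ops(f"cache slow ratio {slow_ratio:.0%} exceeds threshold, paging on-call")
        # Shed load from non-core features so the core read/write path survives.
        for name in ("recommend_feed", "hot_topics", "edge_statistics"):
            disable_feature(name)
```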