Cache penetration
Problem Description
The second classic problem is cache penetration. Cache penetration is an interesting problem. Because the probability of cache penetration is very low, it is generally difficult to detect. But once you find out, and it’s not a small amount, you can have a busy night in no time. Because for normal access, even if the accessed data is not in the cache, it can be loaded back into the cache through DB. Cache penetration means that a special visitor is querying a non-existent key, so that each query will penetrate to the DB. If this special visitor controls a batch of broiler machines and continues to access keys that do not exist in your system, It will put a lot of pressure on the DB, thus affecting the normal service.
Cause Analysis
The reason why cache penetration exists is because when we design the system, we give more consideration to normal access paths, and we lack consideration for special access paths and abnormal access paths.
The normal path of the cache access design is to access the cache first, check the DB after the cache miss, and return to the cache after the DB query obtains the result. This is no problem for normal key access, but if the user accesses a non-existing key, and the query DB returns null (ie, a NULL), the null will not be written back to the cache. After that, no matter how many times this non-existent key is queried, there will be cache misses and DB will be queried. The entire system will degenerate into a “front-end + DB” system. Since the throughput of the DB is only below 1%~2% of the cache, if there are special visitors who access these nonexistent keys in large numbers, the performance of the system will be serious. Degraded, affecting normal user access.
Business scene
There are many business scenarios for cache penetration, such as accessing users through a non-existent UID, and viewing ticket information through a non-existent train ID. User input errors, occasionally a few such requests are not a problem, but if it is a large number of such requests, it will have a great impact on the system.
solution
So how to solve this problem?
The first solution is to check the DB for the first time when querying these non-existent data. Although the result is not returned NULL, the key is still recorded in the cache, but the value corresponding to the key is a specially set value.
The second solution is to build a BloomFilter cache filter to record the full amount of data, so that when accessing the data, you can directly determine whether the key exists through the BloomFilter. If it does not exist, you can directly return it, without checking the cache and DB.
However, these two schemes still have some pits to pay attention to in the design.
For scheme 1, if special visitors continue to access a large number of non-existing keys, even if these keys only have a simple default value, they will occupy a large amount of cache space, resulting in a decrease in the hit rate of normal keys. Therefore, a further improvement measure is to store these non-existing keys for a short period of time and let them expire as soon as possible; or store these non-existing keys in an independent public cache, and when looking up from the cache, first check the normal cache components , if it is missing, check the cache of the public illegal key, if the latter hits, return directly, otherwise penetrate the DB, if it is found to be empty, it will go back to the illegal key cache, otherwise it will go back to the normal cache.
For scheme 2, BloomFilter needs to cache the full amount of keys, which requires that the number of full keys is not large, and it is best to have less than 1 billion pieces of data, because 1 billion pieces of data will occupy about 1.2GB of memory. You can also use BloomFilter to cache illegal keys. Every time a key is found to be an illegal key that does not exist, it will be recorded in BloomFilter. This recording scheme will cause the key stored in BloomFilter to continue to grow rapidly. In order to avoid recording too many keys and causing errors The judgment rate increases and needs to be cleared periodically.
BloomFilter
BloomFilter is a very interesting data structure. It can not only block illegal key attacks, but also judge massive data with low cost and high performance. For example, a system with hundreds of millions of users and tens of billions of news feeds can use BloomFilter to Determine if a user has read a news feed. Let’s make an analysis of the BloomFilter data structure, as shown in the following figure.
The purpose of BloomFilter is to detect whether an element exists in a collection. Its principle is to use a bit data group to represent a set, and perform multiple different Hash tests on a key. If the bits corresponding to all Hash are 1, it indicates that the key exists with a high probability, and the average single record occupies 1.2 words. Sections can reach 99%, as long as the bit corresponding to Hash is 0 once, it means that this key definitely does not exist in this set.
The algorithm of BloomFilter is to first allocate a memory space as a bit array, and the initial value of the bit bits of the array is all set to 0. When adding elements, k independent Hash functions are used to calculate, and then all K positions of the element Hash map are set. is 1. When detecting the key, the k Hash functions are still used to calculate k positions. If the positions are all 1, it indicates that the key exists, otherwise it does not exist.
The advantage of BloomFilter is that it operates in full memory and has high performance. In addition, the space efficiency is very high. To achieve a false positive rate of 1%, an average single record occupies 1.2 bytes. Moreover, every 0.6 byte increase in the average single record can also make the false positive rate continue to be 1/10 of the previous one, that is, the average single record occupies 1.8 bytes, and the false positive rate can reach 1/1000; the average single record occupies 2.4 words section, the false positive rate can reach 1/10000, and so on. The false positive rate here refers to the probability that BloomFilter judges that a key exists, but it does not actually exist, because it stores the hash value of the key, not the value of the key, so there is a probability that such a key exists, and their contents are different , but the Hash value after multiple Hash is the same. For the key that BloomFilter judges does not exist, it is 100% non-existent. Contradictory method, if this key exists, the corresponding Hash value position after each Hash must be 1, not 0.