HyperLogLog

HyperLogLog: A Randomized Algorithm for Counting Distinct Items


In computer science, algorithm design techniques are commonly grouped into a handful of categories: brute force, greedy, divide and conquer, dynamic programming, and randomized algorithms. Most problems can be solved using the first four, but a randomized approach can sometimes solve the same problem far more efficiently. HyperLogLog is a randomized algorithm for counting distinct items. It is useful when the cardinality can be very large and other approaches would take too much time or too much space. Popularized at scale by companies such as Facebook, it is now a standard algorithm for this kind of problem. The count-distinct problem appears frequently in the real world: social media sites need to track the number of unique visitors to their pages, and in machine learning it can be used to estimate highly connected clusters by estimating the number of edges in a graph.

<h2><center>Brute Force Approach</center></h2>

The problem we are trying to solve is counting the number of distinct items in a list in which some items may be repeated. The most straightforward approach is to keep a stack of the items seen so far. We take the first item and push it onto the empty stack. For every subsequent item, we scan the whole stack and push the item only if it does not already appear there. The number of items in the stack is then the number of unique items. This approach is simple and works perfectly when the number of items is small. Analyzing the runtime, every insertion may require scanning everything seen so far, so <b>building the stack takes O(n^2) time</b>. Counting the items in the stack takes another O(n) (unless the size is tracked in a separate variable), so the overall runtime is O(n^2). As you might guess, this is far too slow for large inputs.

<h2><center>Hashing Technique</center></h2>

A much faster and more efficient method is to use a hash table. So, how does hashing work? In short, when we pass an item through a special function called a <b>hash function</b>, we get an output that depends only on that item: the same item always produces the same result, and a good hash function makes it unlikely (though not impossible) that two different items produce the same result. We can use the hash value to map each item (the input) to an index and record it there. Each insertion or lookup then takes O(1) time on average, so processing a list of n items takes O(n). However, <b>this approach does not scale well</b>. It works best when the number of distinct items is known beforehand and is not very large, which is rarely the case. The hash table can take up a lot of space and also suffers from collisions; better hash functions reduce the probability of collisions, but cannot eliminate them entirely. A short sketch of both exact approaches follows just below.
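As a point of reference for the estimators discussed next, here is a minimal Python sketch of the two exact approaches just described: the quadratic scan-and-push method and the hash-based method (Python's `set` is backed by a hash table). The function names and the sample data are illustrative choices, not from the original article.

```python
def count_distinct_bruteforce(items):
    """O(n^2): keep a stack of unique items, scanning it for every new item."""
    stack = []
    for item in items:
        found = False
        for existing in stack:      # linear scan of everything seen so far
            if existing == item:
                found = True
                break
        if not found:
            stack.append(item)      # push the new unique item
    return len(stack)


def count_distinct_hashing(items):
    """O(n) expected time overall: a hash table (set) of the items seen."""
    seen = set()
    for item in items:
        seen.add(item)              # O(1) on average per insertion
    return len(seen)


if __name__ == "__main__":
    data = ["a", "b", "a", "c", "b", "a"]
    print(count_distinct_bruteforce(data))  # 3
    print(count_distinct_hashing(data))     # 3
```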
<h2><center>Randomized Algorithms</center></h2>

Randomized algorithms rely on probability for their efficiency. This means a randomized algorithm does not always guarantee a fast result (and in some cases, as with HyperLogLog, not an exact result either). Still, <strong>their runtime analysis shows that the worst case comes up only rarely</strong>. One of the most famous randomized algorithms is randomized quicksort, considered one of the fastest general-purpose sorting algorithms in practice. HyperLogLog is a randomized algorithm.

In the use cases mentioned in the introduction, and in general, we rarely need an exact count. A reasonable estimate of the number of active users works just as well for a social media company. The HyperLogLog (HLL) algorithm does exactly this: it does not measure the number of distinct elements exactly, but produces an estimate. HLL is still an active field of research, and many improved variations have appeared over the years. The algorithm can estimate cardinalities beyond 10^9 with a standard error of about 2% while using only about 1.5 KB of memory.

<h2><center>LogLog</center></h2>

In 1984, <i>Philippe Flajolet and G. Nigel Martin</i> introduced a fast and storage-efficient algorithm for counting the unique items in a list, known as probabilistic counting (the Flajolet-Martin algorithm). Durand and Flajolet later refined this idea into LogLog (2003), which in turn evolved into HyperLogLog. The core idea is this: when an item is hashed into binary form, the length of the longest run of leading 0s observed across the hash values gives an estimate of the number of unique items in the list. As a simple example, take the eight three-digit binary numbers. The probability of picking the one made up of three 0s (000) out of those 8 distinct numbers is 1/2^3. Turning that around, if we have seen the value 000 (three 0s), a reasonable guess for the number of distinct values encountered is 2^3. So when the first item arrives, we hash it into binary form (a string of 0s and 1s) and look at its longest run of leading 0s. We assume the number of unique items encountered so far is 2<sup>n</sup>, where n is the length of the longest run of 0s; whenever a later item produces a longer run, the estimate is updated with the new value. However, using 2<sup>n</sup> directly adds bias and gives a result with a higher margin of error. Hence, based on statistical analysis, a <strong>correction factor (&#632;) of 0.77351</strong> is applied, and the estimate becomes 2^n/&#632;. The method works because it is hard to find a long run of leading 0s early in the stream, but as more new items are tried, a longer run becomes increasingly likely.

But what if an item happens to hash to a value with a long run of 0s right at the start? Since the run length depends on the hash function, we can use multiple hash functions, find the longest run of leading 0s under each, and average them. If we use m different hash functions and denote the longest runs of leading 0s by L<sub>1</sub>, L<sub>2</sub>, ..., L<sub>m</sub>, the final estimate becomes m * 2<sup>((L1+...+Lm)/m)</sup>/&#632;. It was observed that the more independent hash functions are used, the better the accuracy: <b>the error rate varies with m as 1.3/&#8730;m</b>. A minimal sketch of the single-hash building block is shown below.
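To make the single-hash estimate concrete, here is a minimal Python sketch, assuming a SHA-256 hash truncated to 32 bits (that choice of hash and the helper names are illustrative, not taken from the original papers). It tracks the longest run of leading 0s seen so far and returns 2<sup>n</sup>/&#632;; as noted above, a single hash function gives only a rough, noisy estimate, which is exactly why the article averages over multiple hash functions.

```python
import hashlib

PHI = 0.77351      # correction factor quoted in the article
HASH_BITS = 32     # keep only the first 32 bits of each hash (illustrative choice)


def hash_bits(item):
    """Hash an item to a 32-bit integer (SHA-256, truncated)."""
    digest = hashlib.sha256(str(item).encode()).digest()
    return int.from_bytes(digest[:4], "big")


def leading_zeros(value, bits=HASH_BITS):
    """Length of the run of leading 0s in the bits-wide binary form of value."""
    return bits - value.bit_length()


def single_hash_estimate(items):
    """Track the longest run of leading 0s seen so far; estimate is 2^n / phi."""
    longest = 0
    for item in items:
        longest = max(longest, leading_zeros(hash_bits(item)))
    return 2 ** longest / PHI


if __name__ == "__main__":
    stream = [f"user-{i % 1000}" for i in range(5000)]   # 1000 distinct items
    # Prints a rough estimate of the 1000 distinct items; with a single
    # hash function the result is noisy, as discussed above.
    print(f"estimate: {single_hash_estimate(stream):.0f}")
```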
<h2><center>SuperLogLog</center></h2>

In the same paper in which Durand and Flajolet introduced the LogLog algorithm, they also introduced the <b>SuperLogLog algorithm</b>. It works like LogLog with one small difference: they found that the accuracy of the estimate improves if the few largest run lengths obtained from the hash functions are ignored. Specifically, by keeping only the <i>smallest 70% of the run lengths</i> obtained from the independent hash functions and discarding the rest, the <strong>error improves from 1.3/&#8730;m to 1.05/&#8730;m</strong>. The SuperLogLog estimate therefore becomes
<center>estimated unique items = 0.7m * 2<sup>(sum of the smallest 70% of L1,...,Lm)/(0.7m)</sup>/&#632;</center>

<h2><center>HyperLogLog</center></h2>

In 2007, Philippe Flajolet and his co-authors improved the results a bit further by introducing the HyperLogLog algorithm. Instead of the geometric-style averaging used in LogLog and SuperLogLog, it uses the <b>harmonic mean</b>. A useful property of the harmonic mean in this situation is that it suppresses the effect of outliers. The estimate of the number of unique items becomes
<center>m * HM(2<sup>L<sub>i</sub></sup>)/&#632;,</center>
where HM is the harmonic mean, m is the number of hash functions, and L<sub>i</sub> is the length of the longest run of 0s in the i-th hash value. HyperLogLog achieves an error of 1.04/&#8730;m, the lowest of the three. <b>HyperLogLog has a runtime similar to the hashing technique discussed above, but takes up far less space</b>. If your task does not require an exact count of the unique items, HLL is an excellent choice.

<h2><center>Examples</center></h2>

Suppose we used four hash functions and the longest runs of 0s in the four hash values came out as 1, 2, 4, and 100. Using LogLog, the estimated number of unique items would be on the order of a few hundred million, because the single outlier of 100 dominates the average. Using SuperLogLog, the value 100 is discarded (only the smallest 70% of the run lengths are kept), and the estimate drops to roughly 18. Using HyperLogLog, the harmonic mean suppresses the outlier, and the estimated number of unique items comes out to about 25. A short sketch that plugs these numbers into the formulas is given below.
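To check these figures, here is a small Python sketch that plugs the example run lengths into the simplified LogLog and HyperLogLog formulas exactly as they are written in this article (the function names are mine; production implementations split a single hash value into buckets and use slightly different correction constants, so this is a didactic check rather than a real estimator).

```python
PHI = 0.77351  # correction factor used throughout the article


def loglog_estimate(L):
    """LogLog-style estimate: m * 2^(mean of L) / phi."""
    m = len(L)
    return m * 2 ** (sum(L) / m) / PHI


def hyperloglog_estimate(L):
    """HyperLogLog-style estimate: m * HM(2^L1, ..., 2^Lm) / phi,
    where HM is the harmonic mean."""
    m = len(L)
    harmonic_mean = m / sum(2.0 ** -x for x in L)
    return m * harmonic_mean / PHI


if __name__ == "__main__":
    L = [1, 2, 4, 100]                                    # run lengths from the example
    print(f"LogLog:      {loglog_estimate(L):.3g}")       # ~5.8e+08, blown up by the outlier
    print(f"HyperLogLog: {hyperloglog_estimate(L):.1f}")  # ~25.5, outlier suppressed
```

The single outlier of 100 inflates the LogLog estimate by many orders of magnitude, while the harmonic mean keeps the HyperLogLog estimate at about 25, matching the figure above.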

- Ojas Srivastava, 01:28 AM, 19 Jul, 2023
