Bit-Array-Based Alternatives to HyperLogLog

Svante Janson, Jérémie Lumbroso, Robert Sedgewick

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

We present a family of algorithms for the problem of estimating the number of distinct items in an input stream that are simple to implement and are appropriate for practical applications. Our algorithms are a logical extension of the series of algorithms developed by Flajolet and his coauthors starting in 1983 that culminated in the widely used HyperLogLog algorithm. These algorithms divide the input stream into M substreams and lead to a time-accuracy tradeoff where a constant number of bits per substream are saved to achieve a relative accuracy proportional to 1/√M. Our algorithms use just one or two bits per substream. Their effectiveness is demonstrated by a proof of approximate normality, with explicit expressions for standard errors that inform parameter settings and allow proper quantitative comparisons with other methods. Hypotheses about performance are validated through experiments using a realistic input stream, with the conclusion that our algorithms are more accurate than HyperLogLog when using the same amount of memory, and they use two-thirds as much memory as HyperLogLog to achieve a given accuracy.

Original languageEnglish (US)
Title of host publication35th International Conference on Probabilistic, Combinatorial and Asymptotic Methods for the Analysis of Algorithms, A of A 2024
EditorsCecile Mailler, Sebastian Wild
PublisherSchloss Dagstuhl- Leibniz-Zentrum fur Informatik GmbH, Dagstuhl Publishing
ISBN (Electronic)9783959773294
DOIs
StatePublished - Jul 2024
Event35th International Conference on Probabilistic, Combinatorial and Asymptotic Methods for the Analysis of Algorithms, A of A 2024 - Bath, United Kingdom
Duration: Jun 17 2024Jun 21 2024

Publication series

NameLeibniz International Proceedings in Informatics, LIPIcs
Volume302
ISSN (Print)1868-8969

Conference

Conference35th International Conference on Probabilistic, Combinatorial and Asymptotic Methods for the Analysis of Algorithms, A of A 2024
Country/TerritoryUnited Kingdom
CityBath
Period6/17/246/21/24

All Science Journal Classification (ASJC) codes

  • Software

Keywords

  • Cardinality estimation
  • Hyperloglog
  • sketching

Fingerprint

Dive into the research topics of 'Bit-Array-Based Alternatives to HyperLogLog'. Together they form a unique fingerprint.

Cite this