Citus Con 2023 is a wrap! 🎁 Thanks for joining the fun. Missed it? You can still watch all 37 talks online 🖥
In an earlier blog post I wrote about how breaking problems down into a MapReduce style approach can give you much better performance. We’ve seen Citus is orders of magnitude faster than single node databases when we’re able to parallelize the workload across all the cores in a cluster. And while count (*)
and avg
is easy to break into smaller parts I immediately got the question what about count distinct, or the top from a list, or median?
Exact distinct count is admittedly harder to tackle, in a large distributed setup, because it requires a lot of data shuffling between nodes. Count distinct is indeed supported within Citus, but at times can be slow when dealing with especially larger datasets. Median across any moderate to large size dataset can become completely prohibitive for end users. Fortunately for nearly all of these there are approximation algorithms which provide close enough answers and do so with impressive performance characteristics.
Keep reading