We recently redesigned our analytics API from the ground up, in order to provide near real-time analytics to our customers on billions of search queries per day. Here’s how we did it.
This is how the article starts, and I’m already curious :-). Let’s do some quick math: if we need to serve 1B queries per day against a database (ignoring that this data also has to be written first), that is more than 10k queries per second. How would you approach the problem?
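Just to make that number concrete, here is the back-of-envelope calculation (mine, not the article’s):

```python
# Back-of-envelope check: what does 1 billion queries per day mean per second?
queries_per_day = 1_000_000_000
seconds_per_day = 24 * 60 * 60           # 86,400 seconds in a day

qps = queries_per_day / seconds_per_day
print(f"{qps:,.0f} queries per second")  # ~11,574 qps, i.e. more than 10k/s
```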
Lately, I have been more interested in databases and the problem of scaling data storage. I was not that interested years ago, but in the end a lot of your company probably relies on some database, because everything you (or your customers) do is stored in one or more databases, and if that data is lost you may lose your company completely (remember the “small” issue GitLab had, where they almost wiped out their master database?).
Anyway, building a scalable database setup is often not trivial, and that’s exactly the problem they experienced at Algolia. They needed a way to store a lot of analytics data and also retrieve it with sub-millisecond queries. In this very detailed write-up they explain how they evaluated multiple databases that could support this kind of performance, but in the end they decided to rely on PostgreSQL with the Citus extension, with a specific configuration, and with very good results.
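To give a rough idea of what “PostgreSQL with Citus” looks like in practice, here is a minimal sketch of my own (the table, column names, and sharding key are hypothetical, not the actual schema or configuration from the article): Citus takes a regular PostgreSQL table, shards it across worker nodes with create_distributed_table, and then fans analytics queries out to the shards.

```python
# Minimal sketch of using PostgreSQL + Citus from Python (psycopg2).
# The schema below is a made-up example, not Algolia's actual setup.
import psycopg2

conn = psycopg2.connect("dbname=analytics host=coordinator user=postgres")
conn.autocommit = True
cur = conn.cursor()

# A plain PostgreSQL table holding raw analytics events.
cur.execute("""
    CREATE TABLE IF NOT EXISTS search_events (
        app_id     text        NOT NULL,
        query      text        NOT NULL,
        created_at timestamptz NOT NULL DEFAULT now()
    );
""")

# Citus turns it into a distributed table, sharded by app_id across workers,
# so both writes and analytics queries spread over many machines.
cur.execute("SELECT create_distributed_table('search_events', 'app_id');")

# An example analytics query: top searches for one application over one day.
cur.execute("""
    SELECT query, count(*) AS hits
    FROM search_events
    WHERE app_id = %s AND created_at >= now() - interval '1 day'
    GROUP BY query
    ORDER BY hits DESC
    LIMIT 10;
""", ("my-app",))
print(cur.fetchall())
```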
If you are interested in databases and how a company can set them up to achieve great results, this is the article for you.