I applaud every effort to create new technologies in the data space; I think much more innovation has to happen here and that we're basically in the stone ages of data.
Firebolt is a very interesting project. I spent some time digging through all the material provided by Firebolt, Snowflake, and Redshift. Firebolt basically claims to be a lot faster than Snowflake.
…
Here's my short & very simplified perspective on the Redshift-Snowflake-Firebolt trio:
The short version: Postgres, Redshift, Snowflake & Firebolt mainly differentiate themselves by focusing on different questions. Each question emerged after the previous one had been "solved". But nothing is actually stopping Redshift or Snowflake from solving Firebolt's question as well. Indeed, as far as I can tell, Snowflake has 99% of the technology Firebolt is currently using in place, with one difference: the lack of "native nested array storage".
The Redshift insight: Redshift realized, in my words, that analytical data is becoming a thing, so reading speed is essential; thus columnar storage and query result caches for databases were born.
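To make that concrete, here's a toy sketch (my own illustration, not Redshift's actual storage engine) of why a columnar layout helps when an analytical query only touches one column:

```python
# Toy illustration of row vs. columnar layout (not Redshift's real format).
rows = [
    {"user_id": i, "country": "DE", "revenue": i * 0.1, "payload": "x" * 100}
    for i in range(100_000)
]

# Row-oriented: to sum one column we still walk every full record.
total_row_oriented = sum(r["revenue"] for r in rows)

# Column-oriented: each column lives in its own contiguous array,
# so an aggregate only touches the bytes it actually needs.
columns = {
    "user_id": [r["user_id"] for r in rows],
    "revenue": [r["revenue"] for r in rows],
    "payload": [r["payload"] for r in rows],
}
total_columnar = sum(columns["revenue"])
assert total_row_oriented == total_columnar

# Result cache: an identical repeated query needs no scan at all.
result_cache = {"SELECT sum(revenue) FROM events": total_columnar}
```

Same answer either way; the columnar version just never has to drag the wide payload field through memory to get there.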
But that wasn't enough for Snowflake; they realized analytics workloads need more than the traditional model…
The Snowflake insight: Compute & Storage should scale independently, because for analytical workloads, for important stuff, we simply want to be able to throw money at the problem and make it faster, no matter the amount of data.
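One way I picture it (a mental model of my own, not Snowflake's actual architecture): storage is a dumb shared object store, compute is a pool of stateless workers, and the worker count is a per-query knob you can turn without touching the data at all.

```python
from concurrent.futures import ThreadPoolExecutor

# "Remote" storage: partitions live in a shared object store,
# not on the compute nodes themselves.
object_store = {
    f"part-{i}": list(range(i * 10_000, (i + 1) * 10_000)) for i in range(64)
}

def scan_partition(key: str) -> int:
    # A worker fetches one partition "over the network" and aggregates it locally.
    return sum(object_store[key])

def run_query(num_workers: int) -> int:
    # Compute scales independently of storage: the worker count is chosen
    # per query, and the object store neither knows nor cares.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return sum(pool.map(scan_partition, object_store))

print(run_query(num_workers=4))
print(run_query(num_workers=16))  # same data, more money thrown at compute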
The Firebolt founders again realized that now that we scale compute & storage individually, and the data isn't stored where the computation happens, something else becomes key to analytical workloads…
The Firebolt insight: With cloud and separation of compute & storage, the key problem is to reduce the amount of data moving between the distributed storage spaces and the compute instance.
Firebolt's idea is important: data and computing ability are both growing exponentially, and they will likely grow in parallel, so this problem isn't going away. So what is the key point here? If I submit a computation that needs data, the crucial step is now to determine which data is needed. That's of course not completely obvious until we've completed the computation, hence the dilemma of the compute & storage separation… So something has to travel through the network. Firebolt does a good job of reducing that with three main things:
1. Their own file system (F3) and extensive use of indexes, which basically tell the engine which data is where (see the pruning sketch below).
2. Their support for nested JSON. Basically, if you have a nested object {A:1; B:{X:Y}}, what they do is store {X:Y} in a separate table, making it much easier to again use indexes (see the flattening sketch below).
3. Lots and lots of query optimizations. Why is this so important? Because this is the step that actually makes sure less data is transmitted over the wire; the pruning sketch below is one concrete instance.
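Firebolt doesn't publish the internals of F3, so here is only the generic trick that sparse indexes of this kind rely on, sketched in Python: per-segment min/max metadata lets the engine skip whole segments, so pruned data never travels over the network at all. Take it as an illustration of the idea, not their implementation.

```python
# Generic min/max ("zone map") pruning sketch -- not Firebolt's actual F3 format.
segments = {
    "seg-0": {"min_ts": 0,   "max_ts": 99,  "rows": list(range(0, 100))},
    "seg-1": {"min_ts": 100, "max_ts": 199, "rows": list(range(100, 200))},
    "seg-2": {"min_ts": 200, "max_ts": 299, "rows": list(range(200, 300))},
}

def query(lo: int, hi: int) -> list[int]:
    hits = []
    for name, meta in segments.items():
        # Prune: skip segments whose [min, max] range cannot overlap the predicate.
        if meta["max_ts"] < lo or meta["min_ts"] > hi:
            continue
        # Only surviving segments are fetched over the network and scanned.
        hits.extend(r for r in meta["rows"] if lo <= r <= hi)
    return hits

print(len(query(150, 180)))  # touches seg-1 only; seg-0 and seg-2 stay in storage
```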
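And the nested-array point, again as my own simplified sketch rather than Firebolt's code: pull the inner object out into its own table with a reference back to the parent, and suddenly an ordinary index answers questions about the inner field without parsing any JSON.

```python
# Toy flattening of a nested document into parent + child tables
# (a simplification of the idea, not Firebolt's actual implementation).
docs = [
    {"A": 1, "B": {"X": "foo"}},
    {"A": 2, "B": {"X": "bar"}},
]

parent_table = []  # (doc_id, A)
child_table = []   # (doc_id, X)
for doc_id, doc in enumerate(docs):
    parent_table.append((doc_id, doc["A"]))
    child_table.append((doc_id, doc["B"]["X"]))

# With the child values in their own table, an ordinary index on X is enough
# to answer "which documents have X = 'bar'?" without touching the raw JSON.
index_on_x = {}
for doc_id, x in child_table:
    index_on_x.setdefault(x, []).append(doc_id)

print(index_on_x["bar"])  # -> [1]
```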
And that's it. My key takeaway after going through all the marketing material of both Firebolt & Snowflake is that these are the only three unique things they have going. (1) actually isn't unique: Snowflake seems to employ a very similar strategy here, especially with regard to partition sizes.
Summary: So where does that leave us? In my opinion, two things will happen. First, Snowflake might catch up to Firebolt in the speed comparisons; from the publicly available material, it seems they are only missing the nested JSON support and maybe some query optimization (focused on retrieving less data). Second, I don't see any open-source innovation in this space, which seems ripe for exactly that. So I'm betting on an open-source analytical database that is able to take on both Firebolt & Snowflake emerging sometime soon.