
🐰 #35 Quick and Dirty Data Platform Guide, Dbt what’s the hype? TimescaleDB on COSS business models; ThDPTh #35 🐰

Three Data Point Thursday
By Sven Balnojan  • Issue #35
How to build a data platform following Monte Carlo's guide, what the hype around dbt is really about, and exploring COSS business models with TimescaleDB.
Data will power every piece of our existence in the near future. I collect “Data Points” to help understand & shape this future.
If you want to support this, read it.

(1) The Quick and Dirty Guide to Building Your Data Platform
The team at Monte Carlo shares their six-layer setup for a data platform. We're talking about a platform that's targeted at end users, which is an important piece of information to keep in mind these days. Their setup is:
  1. data ingestion
  2. data storage & processing
  3. data transformation & modeling
  4. business intelligence & analytics
  5. data observability
  6. data discovery
They list the most popular tools in each space, which makes the guide pretty actionable. I also agree that both data observability and data discovery should be part of the basic stack. Sadly, they don't name good tools in these two spaces besides Monte Carlo itself, so I'll give a few. For data catalogs & discovery, good options are Amundsen, Atlas, and DataHub. For data observability, things get a little trickier. Your best option, in my opinion, is to combine:
  1. thorough testing on the development pipelines (i.e. unit tests with fixed data sets)
  2. with thorough testing on the value pipeline (e.g. running Great Expectations once every 20 minutes against a good set of tests)
  3. and a typical CI tool plus pipeline monitors.
This will provide you with decent observability without investing in large enterprise solutions.
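To make the first two points a bit more tangible, here is a minimal sketch in plain Python. Everything in it is a made-up illustration: the `Order` record, the `total_revenue_cents` transformation, and the `check_batch` function are hypothetical names, and the checks are a hand-rolled stand-in for what a tool like Great Expectations would do, not its actual API.

```python
from dataclasses import dataclass


@dataclass
class Order:
    # Hypothetical record type, used only for this illustration.
    order_id: int
    amount_cents: int
    currency: str


def total_revenue_cents(orders):
    """The transformation under test: sum the order amounts."""
    return sum(o.amount_cents for o in orders)


def check_batch(orders):
    """Expectation-style checks for the value pipeline.

    Returns a list of failure messages; an empty list means the batch passed.
    """
    failures = []
    if not orders:
        failures.append("batch is empty")
    ids = [o.order_id for o in orders]
    if len(ids) != len(set(ids)):
        failures.append("duplicate order_id")
    if any(o.amount_cents < 0 for o in orders):
        failures.append("negative amount_cents")
    return failures


# Point 1: development pipeline, a unit test against a fixed data set.
FIXED_ORDERS = [Order(1, 1000, "EUR"), Order(2, 250, "EUR")]


def test_total_revenue_on_fixed_data():
    assert total_revenue_cents(FIXED_ORDERS) == 1250


if __name__ == "__main__":
    test_total_revenue_on_fixed_data()

    # Point 2: value pipeline, the same kind of check run on live batches.
    # In practice a scheduler would call this every 20 minutes or so.
    incoming_batch = [Order(1, 1000, "EUR"), Order(1, -5, "EUR")]
    print(check_batch(incoming_batch))
```

The unit test would run in your CI tool on every change, while `check_batch` would run on a schedule against production data, with any non-empty result feeding your pipeline monitors.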
(2) Dbt: What's the hype about?
Oliver Molander wrote a nice analysis of Dbt and analytics engineering in general. I leave it to you to read the piece, but I cannot let one quote pass without comment:
“Bob Muglia, former CEO of Snowflake and investor & board member of Fivetran, sees that in a long-term perspective, data lakes won’t have a place in the modern data stack. Muglia underlines that you have to look at the evolution of how infrastructure changes over time to take on new capabilities. He predicts that five years from now, data is going to sit behind a SQL prompt by and large, and then over time evolve into relational. He predicts that relational will dominate and SQL data warehouses will replace data lakes.”
I think Bob Muglia is completely wrong. I think data lakes and data warehouses will converge in the future. Data lakes will keep the separation of compute and storage, which is the new and shiny thing in Redshift and Snowflake but has always been there on the data lake side. I think that in the future about 98% of data will come without structure, which means SQL will play a much less important role. It will be there to model the few key structured sets, but a lot of the emerging data is real-time and diverse in structure, so SQL as we know it will not be the key player in the future of data.
This still leaves a good spot for the analytics engineer, the one who takes on the few most important data sets and models them, but it also means that the rest has to be dealt with!
So no, I don't think Dbt will be bigger than Spark. Some new players will very likely enter this space, but Dbt, with its clear-cut vision, will stay focused on the niche it has.
🔮🔮🔮 Data Company Corner 🔮🔮🔮
Stuff that might be interesting for anyone on the front line of the data world or inside a data company, inspired by much positive feedback on my article on commercial open source software data companies.
(3) TimescaleDB on OS Business Models
TimescaleDB, being a commercial open source company, apparently spends a lot of time thinking about how to make money from open-source software. I really like the two key insights they post at the beginning of their analysis of open source business models. They are:
  1. Before commercializing, you’ll need broad adoption
  2. Before commercializing, you’ll need prime credibility
I recommend the read as a starting point. I have already based a couple of follow-up articles on parts of it, which you could read afterward (How to build the next mega open source project, How to become the next 30 billion $$$ data company, and 7 models to build & price COSS products).
🎄 In Other News & Thanks
Thanks for reading this far! I’d also love it if you shared this newsletter with people whom you think might be interested in it.
P.S.: I share things that matter, not the most recent ones. I share books, research papers, and tools. I try to provide a simple way of understanding all these things. I tend to be opinionated. You can always hit the unsubscribe button!
Sven Balnojan

Data; Business Intelligence; Machine Learning, Artificial Intelligence; Everything about what powers our future.
