
Thoughtful Friday #9: DataOps, DataMesh, The Missing Piece

Three Data Point Thursday
By Sven Balnojan • Issue #69
Why both don’t really matter if we don’t solve the underlying problems first. 

(from Shel Silverstein, all of the books are great!)
“How do you know that Microsoft isn’t just going to bundle a browser into their product?”
I said, really just to end the conversation, “Gentlemen, there’s only two ways I know of to make money: bundling and unbundling.” And said, “We’ve got an airplane to catch.” And we left, and Peter Currie, walking out the door, said, “Those people are looking at you, Barksdale, like you’re crazy. What did you just say?” And I said, “Well, best I can tell, most people spend half their time adding and other people spend half their time subtracting, so that’s what works out.”
Weirdly, some confluence happened for me last week: a lot of talk about “bundling & unbundling” happened publicly, I had a quick Twitter exchange with Meltano, and separately I got interviewed about Data Meshes by an M.Sc. student from the TU Berlin.
Comparison with DevOps
There is very likely still a piece missing right at the heart of our data work. It seems to be independent of whether we talk about Data Mesh, DataOps, or just our regular data work. 
Let’s look at the DevOps paradigm to understand this.
In DevOps, in my opinion, there is only one truly important concept: the pipeline. The linear, one-stream, batch-processing pipeline. The pipeline is almost exactly the same as a car manufacturing line.
We also have tooling that implements this concept: CI/CD tools like GitLab CI that integrate with the only “input”, which comes in the form of Git.
Finally, we have best practices that link all of these things together: version everything in Git, test locally, commit to trigger the CI/CD, promote artifacts to integration testing stages, and so on.
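To make that concrete, here is a minimal sketch of the linear pipeline concept, written as plain Python rather than a real CI configuration; the stage names and functions are illustrative, not any particular tool’s API:

```python
# Toy sketch of the linear DevOps pipeline: one input (a commit),
# a fixed sequence of stages, one artifact at the end.
# Stage names are made up for illustration, not a real CI tool.

def build(commit_sha: str) -> str:
    print(f"building artifact for {commit_sha}")
    return f"artifact-{commit_sha[:7]}"

def test(artifact: str) -> None:
    print(f"running unit tests against {artifact}")

def integration_test(artifact: str) -> None:
    print(f"promoting {artifact} to the integration stage and testing it")

def deploy(artifact: str) -> None:
    print(f"deploying {artifact} to production")

def pipeline(commit_sha: str) -> None:
    # Strictly linear: each stage only starts when the previous one passed.
    artifact = build(commit_sha)
    test(artifact)
    integration_test(artifact)
    deploy(artifact)

pipeline("3f2a9c1d")
```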
The Missing Piece in Data
In the data world, however, we’re missing this three-fold package:
1. The underlying concept of how the work flows (we have a good candidate though!)
2. The tooling to implement the concept (we don’t have a single one; no, I am not counting DataKitchen in this category: it’s too specific, not generic enough, and it’s just one tool)
3. The best practices to make this work (we’re at 5-10% maybe, nowhere near where we need to be).
In the data mesh world, this becomes very apparent once you start to think about the technology platforms. Every single platform involves a good amount of custom software engineering to stitch together separate pieces. That’s not how it is supposed to be! Even worse, in the data mesh world, candidate concepts like the ones below are not even considered part of the discussion.
Bundling & Unbundling
All of the “unbundling/bundling” debate is basically about the proper tooling. But we’re missing the right dimension of “bundling”, because in reality we don’t wanna bundle or unbundle, we want working tools! The question is rather: is there a good purpose for a special-purpose tool that happens to “bundle” wisely?
Let’s take a stab at the three questions, even though I certainly have no good answers. And besides, the point is that it’s not about me, or anyone individually. We need the whole package: practical knowledge, practices established inside the community, and all of it put into tools. That’s gonna take time.
1. What are good candidates for the concepts?
Meltano posted a tweet asking what a “DataOps OS” should look like, in a picture. That question targets exactly this: what is the foundation, the underlying concept we want to work with?
I am still a big fan of the “two orthogonal pipelines” concept, which DataKitchen uses all the time, simply because I find it captures the workflow really well.
The idea is that in data work we have two pipelines:
1. A usual manufacturing-like batch pipeline with your code, your application, your thingy
2. An orthogonal pipeline with the data. However, this is not a “usual car manufacturing pipeline” but more like a pipeline that works with bulk materials: things you cannot count individually (like sand), so you need to resort to statistical process control over all aspects of the pipeline. Additionally, these pipelines run more continuously than in batches.
When the data flows through the system to generate end-user value, the two pipelines intersect.
And that frame, these two empty but different pipelines, is exactly what I think a DataOps OS should look like. It is also what I think the missing concept could be.
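As a purely illustrative sketch of that frame (the names, numbers, and the simple control-limit rule below are all made up): one pipeline ships the transformation code as a batch, the other continuously watches aggregate statistics of the data flowing through it, and they intersect where the deployed code meets the data.

```python
import statistics

# Pipeline 1: the "car manufacturing" pipeline that ships code (batch, linear).
def ship_transformation_code(version: str):
    print(f"build/test/deploy transformation {version}")
    return lambda batch: [x * 2 for x in batch]  # the deployed data app (toy)

# Pipeline 2: the "bulk material" pipeline that watches the data continuously.
# You cannot inspect every grain of sand, so you watch aggregate statistics.
def within_control_limits(batch, mean, sigma, k=3.0):
    batch_mean = statistics.fmean(batch)
    return abs(batch_mean - mean) <= k * sigma

# The two pipelines intersect where the deployed code meets the flowing data.
transform = ship_transformation_code("v1.4.2")
expected_mean, expected_sigma = 10.0, 2.0

for batch in ([9.5, 10.2, 10.8], [10.1, 9.7, 55.0]):  # the second batch drifts
    if within_control_limits(batch, expected_mean, expected_sigma):
        print("in control ->", transform(batch))
    else:
        print("out of control -> hold the batch, alert the data team")
```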
2. What should the tooling look like?
The tooling question is hard, and I have no good answer. I just know what we’ve got right now:
1. The one pipeline is like a traditional software pipeline. So a traditional CI/CD system could work here.
2. The continuous pipeline definitely needs something different. I talked about the “fastCI” concept four weeks ago, which I think is needed here, but I haven’t seen one out there yet. Most data tools try to pick up that topic, but not yet in a coherent way; they are, however, starting to feature aspects of statistical control throughout this process.
3. I am not sure whether we should use two different tools for what is really one flow of work. Because at the end of the day our data apps only deliver value when both pieces come together, our tooling needs to be packaged in some way.
That leaves us with a whole lot of nothing to work with, it seems to me. But yes, a “DataOps OS” would be nice.
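To illustrate what “packaged in some way” could mean, here is a hypothetical sketch: a single entry point that drives both the batch code pipeline and the continuous data checks. The DataOpsRunner class and its methods are invented for illustration; no such tool exists here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class DataOpsRunner:
    code_pipeline: Callable[[str], None]           # classic CI/CD run per commit
    data_checks: Iterable[Callable[[list], bool]]  # continuous checks per batch

    def on_commit(self, commit_sha: str) -> None:
        # The batch side: triggered by Git, runs to completion, then stops.
        self.code_pipeline(commit_sha)

    def on_batch(self, batch: list) -> bool:
        # The continuous side: runs on every chunk of data, forever.
        return all(check(batch) for check in self.data_checks)

# Usage sketch
runner = DataOpsRunner(
    code_pipeline=lambda sha: print(f"CI run for {sha}"),
    data_checks=[lambda b: len(b) > 0, lambda b: all(x is not None for x in b)],
)
runner.on_commit("3f2a9c1d")
print("batch ok:", runner.on_batch([1, 2, 3]))
```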
3. What purpose should best practices serve?
The goal of establishing best practices is to:
1. Deliver high-quality data software faster
2. Deliver high-quality data into that software more quickly
Thus, delivering value through the data software faster overall. Delivering software slower but data faster, or the other way around, will not help us generate more value.
From the group around Jez Humble et al. we know that we can identify the practices that achieve this by looking at the typical manufacturing/DevOps metrics:
1. Throughput (for software: deployment frequency, for data: data throughput)
2. Lead Time (for software: lead time for changes; for data: time from data sourcing => data usage)
3. Mean Time to Restore (for software & data alike)
4. Change Failure Rate (for software & data alike)
This makes 8 metrics in total that we want to watch with our best practices.
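As a toy illustration of how the four software-side metrics could be computed (the data-side four are analogous, computed over data deliveries instead of deployments; the event records below are made up):

```python
from datetime import datetime, timedelta

# Made-up deployment records: when the change was committed, when it went
# live, and whether it caused a failure in production.
deployments = [
    {"committed": datetime(2022, 3, 1, 9),  "deployed": datetime(2022, 3, 1, 15), "failed": False},
    {"committed": datetime(2022, 3, 2, 10), "deployed": datetime(2022, 3, 3, 11), "failed": True},
    {"committed": datetime(2022, 3, 4, 8),  "deployed": datetime(2022, 3, 4, 9),  "failed": False},
]
incidents = [timedelta(hours=2)]  # time to restore, one entry per failure
days_observed = 4

deployment_frequency = len(deployments) / days_observed  # throughput
lead_time = sum((d["deployed"] - d["committed"] for d in deployments), timedelta()) / len(deployments)
change_failure_rate = sum(d["failed"] for d in deployments) / len(deployments)
mean_time_to_restore = sum(incidents, timedelta()) / len(incidents)

print(f"deployment frequency: {deployment_frequency:.2f}/day")
print(f"lead time for changes: {lead_time}")
print(f"change failure rate: {change_failure_rate:.0%}")
print(f"mean time to restore: {mean_time_to_restore}")
```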
What does that mean? It means that testing incoming data before pushing it into the production system lowers the Change Failure Rate for data, but it increases the Lead Time. So that only makes sense if we are able to run the tests really quickly.
It also means that adding tests to the data applications is important because it reduces the Change Failure Rate for software. But if testing is really hard in your current setup, then this increases the lead time for changes and lowers the throughput, both of which are really not good and let the gain of testing go to waste.
Just as with software, we’re aiming to find best practices that allow us to both decrease the mean time to restore AND decrease the lead time.
That still doesn’t leave us with concrete best practices, but I hope it serves as a good direction for finding them.
Now it’s your turn, reply if you have any good answers, tools, whatever you got. I’d love to hear about them!
And of course, leave feedback if you have a strong opinion about the newsletter!