Thoughtful Friday #7: FastCI - The Missing CI/CD Tool for the emerging Data & Software World





Three Data Point Thursday
By Sven Balnojan  • Issue #63 • View online
I’m an evangelist for DevOps practices and carry a deep passion for data.
I also think that in the future code will increasingly be written by machines, aided by humans.
But at the intersection of the two, I find myself wondering whether something is missing. In particular, it seems to me that Continuous Integration & Continuous Delivery, as currently practiced in the software world, needs an update for the data world. That update is what I call “fastCI”. And I believe that carrying fastCI back to the software world will yield enormous benefits as well, as we advance into a world where code is written more by machines than by humans, and data is integrated deeply into most pieces of software.

What is Continuous Integration & Continuous Delivery?
At the heart of the two concepts of CI/CD is the idea of putting up a pipeline with a sequence of steps. These steps assemble code and test it, possibly deploy it into a sandbox environment, integrate it into the existing code base, etc. 
The two key points of CI/CD are:
1. It’s a sequence of steps, just like an assembly line.
2. The sole goal, just like for an assembly line, is to produce the final output (the running piece of software) with a guaranteed level of quality.
Can I do CI/CD for data?
Of course, you can carry the same concepts over to data: ingest it, run it through a series of tests and deployments to make sure it works and does not break any other part of your system, and only afterward release it.
I even wrote an article about that, “How to get to zero defects for analytical datasets”.
But the key point is: this takes time! A complete pipeline needs anywhere from 1-2 minutes to 15-20 minutes to deploy your thing into production. And you’ll need that time for software and data alike.
What is Wrong With CI/CD for data?
Nothing is wrong with this approach to CI/CD for data, as long as the data shares two characteristics with current-day software, which, coincidentally, software also shares with cars:
1. A defect sucks (software = people cannot buy stuff, car = people die, data = the BI system crashes). This translates to high costs of defects.
2. The value of deploying 1-20 minutes earlier is low.
To be precise, we should have a calculation in our heads like “value(having it earlier) - cost of defects * prob(defects)”. For the cases we usually think of, this is currently negative: the expected cost of a defect outweighs the value of being a few minutes earlier. So we choose to let the pipelines do the work.
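To make the calculation concrete, here is a minimal sketch of it in Python. The function name and all numbers are made up for illustration; only the formula itself comes from the text above.

```python
def net_value_of_skipping_pipeline(value_of_earlier: float,
                                   cost_of_defect: float,
                                   prob_defect: float) -> float:
    """value(having it earlier) - cost of defects * prob(defects)."""
    return value_of_earlier - cost_of_defect * prob_defect


# Hypothetical "classic" case: being a few minutes earlier is worth little,
# while a defect is expensive. The result is negative, so we wait for the pipeline.
classic = net_value_of_skipping_pipeline(
    value_of_earlier=10.0, cost_of_defect=10_000.0, prob_defect=0.05
)

# Hypothetical real-time case: earlier data is worth a lot, and defects are
# cheaper and rarer. The result is positive, so shipping first pays off.
realtime = net_value_of_skipping_pipeline(
    value_of_earlier=200.0, cost_of_defect=1_000.0, prob_defect=0.01
)

print(classic < 0, realtime > 0)
```

The sign of the result is what matters: negative means the pipeline is worth the wait, positive means it is not.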
Data is different, and Software will be too
The thing is, data actually is different in my opinion. Real-time data will become more and more important, so the value of having data “earlier”, in machine learning systems and in decision systems, will increase over time. By a lot.
The second thing is that, at least for data, we can do very simple things to decrease both components of the defect term in (1): the probability of a defect and its cost.
So, what do we do instead of that “pipeline” which runs for 1-20 mins?
A different World - FastCI
I like to think about a very different concept: fastCI. It’s a simple approach:
1. Do a 0.5-second “statistical screening”
2. Deploy & deliver into production first!
3. Do more thorough testing afterward in production. Possibly even just statistical as well.
The idea of (2) is to increase the value of what we get by having it earlier.
The idea of (1) is to screen statistically for “really bad breakdowns”. This brings down prob(defects) * cost of defects.
The idea of (3) is then simply to catch the not so bad stuff afterward.
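The three steps could be sketched as follows. This is only an illustration under stated assumptions: `ReferenceProfile`, `quick_screen`, `fast_ci`, and the thresholds are hypothetical names and values, and the screen (a null-rate check plus a crude comparison of the batch mean against a reference mean) is just one possible “statistical screening”, not a prescribed one.

```python
from dataclasses import dataclass
from statistics import fmean
from typing import Callable, Optional, Sequence


@dataclass
class ReferenceProfile:
    """Hypothetical summary of what 'normal' data looked like so far."""
    mean: float
    stdev: float
    max_null_rate: float = 0.2  # assumed tolerance for missing values
    max_z: float = 4.0          # assumed tolerance, in reference stdevs


def quick_screen(batch: Sequence[Optional[float]], ref: ReferenceProfile) -> bool:
    """Step 1: cheap check that only catches really bad breakdowns."""
    if not batch:
        return False
    values = [v for v in batch if v is not None]
    null_rate = (len(batch) - len(values)) / len(batch)
    if not values or null_rate > ref.max_null_rate:
        return False
    if ref.stdev <= 0:
        return fmean(values) == ref.mean
    # Crude distance of the batch mean from the reference mean.
    z = abs(fmean(values) - ref.mean) / ref.stdev
    return z <= ref.max_z


def fast_ci(batch: Sequence[Optional[float]],
            ref: ReferenceProfile,
            deploy: Callable[[Sequence[Optional[float]]], None],
            thorough_check: Callable[[Sequence[Optional[float]]], bool]) -> str:
    if not quick_screen(batch, ref):      # (1) fast statistical screening
        return "rejected"
    deploy(batch)                         # (2) deploy into production first
    if thorough_check(batch):             # (3) test thoroughly afterward
        return "deployed"
    return "deployed_needs_followup"


ref = ReferenceProfile(mean=100.0, stdev=10.0)
production = []
good = [98.0, 101.0, None, 103.0, 99.0]
broken = [0.0, 0.0, 0.0, 0.0, 0.0]
print(fast_ci(good, ref, production.append, lambda b: True))
print(fast_ci(broken, ref, production.append, lambda b: True))
```

In this sketch the thorough check runs right after the deploy call; in a real setup it would more likely run asynchronously against production.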
How to make this work
To make this work, you will have to figure out what the statistical screening in (1) looks like for your data. You can also do piece-wise roll-outs in (2), minimizing the “blast radius” while you do (3).
Both will result in much quicker delivery times for data-related things, and fewer defects. 
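A piece-wise roll-out that limits the blast radius might look like the sketch below. Everything here is an assumption for illustration: the function `staged_rollout`, the callbacks, and the stage fractions (10%, 50%, 100%) are hypothetical and not taken from any particular tool.

```python
from typing import Callable, List, Sequence


def staged_rollout(consumers: Sequence[str],
                   release_to: Callable[[str], None],
                   healthy: Callable[[Sequence[str]], bool],
                   rollback: Callable[[str], None],
                   stages: Sequence[float] = (0.1, 0.5, 1.0)) -> bool:
    """Release to a growing share of consumers, checking production in between."""
    released: List[str] = []
    for fraction in stages:
        target = max(1, round(len(consumers) * fraction))
        for consumer in consumers[len(released):target]:
            release_to(consumer)
            released.append(consumer)
        # Step (3) from above: test in production, on the released share only.
        if not healthy(released):
            for consumer in reversed(released):
                rollback(consumer)
            return False
    return True


consumers = [f"consumer-{i}" for i in range(10)]
live: List[str] = []
ok = staged_rollout(consumers, live.append, lambda released: True, live.remove)
print(ok, len(live))
```

If the health check fails at an early stage, only the small share of consumers released so far ever saw the change.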
FastCI for Software
The thing is, this is just a different approach to CI/CD: one focused on statistical testing and a deploy-first mentality, rather than on a “100% guaranteed bug-free” promise, which obviously never holds anyway.
This leads me to think that this approach is also very reasonable for software components. Think about recommendation engines that update in real time and would really like to act on every single fluctuation. For that, they need a generic framework that is fast and statistical, one that deploys first and then tests.
As software integrates more and more data, the value calculated above will change for software too.
And as software gets written more quickly and in smaller iterations by machines, and less by humans, that value changes as well.
Software Components Aren’t Cars
What makes me believe that there is something to gain in the statistical approach in general for software is the idea that software components simply aren’t cars.
The idea of a “unit test” is a lovely thing. If it fails, we know for sure something is broken. If it passes, we know for sure that part works. But the key point is, we do not know whether we’ve covered all areas with tests. So at the end of the day, we just know with, say, 95% confidence that we have covered everything important with tests, and that these tests pass in 100% of the cases we defined. Which means we know with 95% confidence that the software works.
So why do we insist on having a test that is 100% correct and possibly takes quite a bit of time to run, if we could probably find a simple way of getting to 95% confidence by some other measure?
The difference from cars is pretty clear: there, we simply do not have the luxury of 100%. No test on a car is a 100% guarantee that something works, ever.
So, have you seen this “fastCI” already out there? 
Neither data nor software components are cars. They have things in common, but they also differ. It might be time to focus on the differences.
I’d love some feedback if you feel like it! This article is …