I just had a discussion with a friend about data version control, and he pointed me to this comparison. The author, Guy Smilovsky, gives a decent overview of data version control tools as of 2020, and to my knowledge not much has changed in the space since (sadly). I still think some major innovation has to take place here, but so far there are really only two options for actually versioning data as code.
GFS is built for a different purpose, and as such doesn’t work the way we’d need it to in order to, say, version control a database or the inputs of a machine learning model. It might still do to manage a large ML model file, but that’s about it.
Other solutions like Pachyderm come packaged with a lot of baggage, and so aren’t really useful as standalone data version control. That leaves us with two options: DVC and lakeFS. DVC’s pipeline functionality is nice, but not a must-have. lakeFS really shines with its branching model, and I’m still looking forward to it implementing distributed version control features someday.
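For context, the “data as code” workflows the two tools enable look roughly like this. A minimal sketch, assuming a Git repo with DVC installed and a DVC remote already configured, plus a lakeFS server with `lakectl` set up; the file paths, repo name, and branch names are all illustrative, not from the comparison:

```shell
# DVC: track a dataset alongside your code in Git.
dvc add data/train.csv          # writes a small data/train.csv.dvc pointer file
git add data/train.csv.dvc data/.gitignore
git commit -m "Track training data with DVC"
dvc push                        # upload the actual data to the configured remote

# lakeFS: branch the data repository itself, Git-style.
lakectl branch create lakefs://my-repo/experiment \
    --source lakefs://my-repo/main
```

The difference in approach shows even in this sketch: DVC versions pointers to data inside an ordinary Git repo, while lakeFS puts branching and commits directly on top of the object store.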
I don’t agree with everything the author says; in particular, I really do think we just need one tool to do data versioning, and as said, DVC and lakeFS shine there. Still, the comparison is sound.