Data Versioning using DVC for MLOps pipelines

Atul Yadav

January 21, 2024

Before talking about the importance of data versioning with DVC, let’s first look at the day-to-day challenges data scientists face:

  1. Tracking the data science project to be able to reproduce the results: Live data systems continuously ingest new data points while different users run different experiments on the same datasets. This leads to multiple versions of the same dataset.
  2. Auditing data and models: Several versions of a model trained on different versions of the same dataset can create discrepancies. Without proper auditing and versioning, this creates a tangled web of datasets and experiments.

Why Is Data Versioning So Important?

Data versioning is very useful for data reproducibility, trustworthiness, compliance, and auditing. Data versions uniquely identify revisions of a dataset, and this uniqueness lets consumers of the dataset know whether and how it has changed over a given period. They can identify exactly which version of the dataset is being used.
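
To make the idea of "uniquely identifying a revision" concrete, here is a minimal sketch in plain Python. It is not how DVC is implemented; the function names (dataset_manifest, dataset_version, describe_change) and the choice of SHA-256 are assumptions made for this illustration. It derives one identifier from the contents of a folder and reports whether and how the folder changed between two snapshots.

```python
import hashlib
import tempfile
from pathlib import Path


def dataset_manifest(root: Path) -> dict:
    """Map each file's relative path to a digest of its contents."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def dataset_version(manifest: dict) -> str:
    """Derive a single identifier for the whole dataset from its manifest."""
    h = hashlib.sha256()
    for rel_path, digest in sorted(manifest.items()):
        h.update(f"{rel_path}:{digest}\n".encode())
    return h.hexdigest()


def describe_change(old: dict, new: dict) -> dict:
    """Report which files were added, removed, or modified."""
    common = old.keys() & new.keys()
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "modified": sorted(k for k in common if old[k] != new[k]),
    }


def demo():
    """Snapshot a toy dataset, change it, and compare the two snapshots."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "a.csv").write_text("x,y\n1,2\n")
        (root / "b.csv").write_text("x,y\n3,4\n")
        before = dataset_manifest(root)

        (root / "a.csv").write_text("x,y\n1,5\n")  # modify an existing file
        (root / "c.csv").write_text("x,y\n7,8\n")  # add a new file
        after = dataset_manifest(root)

        changed = dataset_version(before) != dataset_version(after)
        return changed, describe_change(before, after)
```

Calling demo() returns whether the version identifier changed together with the list of added, removed, and modified files, which is the "whether and how" information a consumer of the dataset needs.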

DATA VERSIONING USE CASES

  1. Data tracking and maintainability: As a data scientist, you want to control not only different versions of your code but also different versions of your data.
  2. Tracking model experiments: All the changes we make to the data and to the model-building process need to be tracked and measured to understand what has worked and what has not.
  3. Modularizing the data pipeline: Structure data pipelines as DAGs (directed acyclic graphs) to ensure reproducibility, facilitate experimentation, and manage dependencies.
  4. Parametrizing your code: Test and track different experiments using distinct parameters.
Source: DVC
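
The DAG idea from the list above can be sketched with Python's standard library. The stage names (prepare, featurize, train, evaluate) and the helper functions are hypothetical and are not DVC's own pipeline format; the sketch only shows how declared dependencies give a reproducible run order and tell you which stages need to be rerun after a change.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each stage maps to the set of stages it depends on.
PIPELINE = {
    "prepare": set(),
    "featurize": {"prepare"},
    "train": {"featurize"},
    "evaluate": {"train", "featurize"},
}


def execution_order(pipeline):
    """Return the stages in an order that respects every dependency."""
    return list(TopologicalSorter(pipeline).static_order())


def stages_to_rerun(pipeline, changed):
    """Return the changed stages plus everything downstream, in run order."""
    dirty = set(changed)
    order = execution_order(pipeline)
    for stage in order:
        # A stage is dirty if any of its dependencies is dirty.
        if pipeline[stage] & dirty:
            dirty.add(stage)
    return [stage for stage in order if stage in dirty]
```

For example, stages_to_rerun(PIPELINE, {"featurize"}) leaves "prepare" alone and returns the three stages that depend on the changed one.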

WHY DVC?

  1. Data Version Control (DVC) works on top of Git and adds functionality for managing your code and data together.
  2. DVC is a free, open-source tool that lets you share and monitor experiments within the workflow of a machine learning project, just like the regular Git flow used in software development.
  3. It versions models and data sets of any size through small metafiles that record the name and location of the actual files in the cloud storage your project or company uses.
  4. Seamless integration with various storage backends such as Amazon S3, Azure Blob Storage, Google Drive, and Ceph.
  5. Easy to set up and user-friendly.
Source: DVC
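
The metafile idea from point 3 can be illustrated with a small sketch. This is a simplified stand-in, not DVC's real file format or cache layout: the functions track and checkout, the ".pointer.json" suffix, and the local cache folder (standing in for cloud storage) are all assumptions made for this example. The large file goes into a content-addressed store, and only a tiny pointer file would be committed to Git.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def track(data_file: Path, cache_dir: Path) -> Path:
    """Store data_file in a content-addressed cache and write a pointer file."""
    digest = hashlib.sha256(data_file.read_bytes()).hexdigest()
    cached = cache_dir / digest[:2] / digest[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    if not cached.exists():  # identical content is stored only once
        shutil.copy2(data_file, cached)
    pointer = data_file.with_name(data_file.name + ".pointer.json")
    meta = {"path": data_file.name, "sha256": digest}
    pointer.write_text(json.dumps(meta, indent=2))
    return pointer


def checkout(pointer: Path, cache_dir: Path) -> Path:
    """Restore the data file described by a pointer file from the cache."""
    meta = json.loads(pointer.read_text())
    digest = meta["sha256"]
    target = pointer.with_name(meta["path"])
    shutil.copy2(cache_dir / digest[:2] / digest[2:], target)
    return target


def demo() -> str:
    """Track two versions of a file, then go back to the first one."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        cache = root / "cache"
        data = root / "train.csv"

        data.write_text("version 1")
        pointer = track(data, cache)
        old_pointer_text = pointer.read_text()  # what Git would have stored

        data.write_text("version 2")
        track(data, cache)

        # Simulate checking out the older commit: restore the old pointer,
        # then ask the cache for the matching data.
        pointer.write_text(old_pointer_text)
        checkout(pointer, cache)
        return data.read_text()
```

The design choice worth noticing is that switching data versions only requires switching a few lines of text in the pointer file, which is exactly the kind of change Git handles well.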

DVC BEST PRACTICES

  1. Write descriptive Git commits when versioning data sets.
  2. Use DVC only for data-related tasks such as data set versioning and data processing routines, not for logging experiment metrics and model weights.
  3. Use a data registry whenever possible to centralize data sets that can be shared across projects.

Common situations with DVC

Let’s say that two people, P1 and P2, are working together on a machine learning project with an image data set. P1 works on branch B1 and P2 works on branch B2. The initial data set data/ (tracked by data.dvc) that they are both working on is version D1:

  1. First situation: only one of P1 and P2 modifies the data set. P1 modifies the data set and creates version D2 on branch B1. P2 finishes working on a new feature without modifying the data set D1. P2 needs to merge B1 into B2 and resolve the conflict caused by the different data set versions. As only one of the branches modified the original D1 data set, P2 can simply replace their version of data.dvc (in branch B2) with branch B1’s version.
  2. Second situation: both P1 and P2 only add non-overlapping images to the data set. P1 only adds new images to the data set D1 and creates D2 on B1. P2 also only adds new images to the data set D1 and creates D3 on B2. Furthermore, the image subsets they added are disjoint. In this case, the merge can be handled by DVC’s Git merge driver (see the DVC documentation on merge conflicts for append-only data sets).
  3. Third situation: both P1 and P2 modify the data set (removal, addition, modification). P1 modifies the data set and creates version D2 in branch B1. P2 also modifies the data set D1 and creates version D3 in branch B2. P2 needs to merge B1 into B2 and resolve the conflict between D2 and D3. Here no assumption is made about the type of modifications: there can be removals, additions, and modifications to any file of the data set. If you actually want to merge all the modifications from both branches, this is the trickiest situation. Neither Git nor DVC can directly help, and you have to merge the data sets manually.
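
The three situations can be told apart with a simple three-way comparison of file listings. The sketch below is only an illustration of that reasoning, not DVC's merge driver: merge_manifests is a hypothetical helper, and each data set version is assumed to be a dictionary mapping file names to content hashes, with D1 as the common base.

```python
def merge_manifests(base, ours, theirs):
    """Three-way merge of {file name: content hash} dictionaries.

    Returns (merged, conflicts). Files changed differently on both
    sides are listed in conflicts and left out of merged.
    """
    merged = {}
    conflicts = []
    for name in sorted(base.keys() | ours.keys() | theirs.keys()):
        b, o, t = base.get(name), ours.get(name), theirs.get(name)
        if o == t:
            chosen = o          # both sides agree (including "both removed")
        elif o == b:
            chosen = t          # only their branch touched this file
        elif t == b:
            chosen = o          # only our branch touched this file
        else:
            conflicts.append(name)  # both changed it in different ways
            continue
        if chosen is not None:  # None means the file was removed
            merged[name] = chosen
    return merged, conflicts


D1 = {"img1.png": "h1", "img2.png": "h2"}

# Situation 1: only B1 changed the data, so the result is simply B1's version.
D2 = {"img1.png": "h1", "img2.png": "h2-edited", "img3.png": "h3"}
situation_1 = merge_manifests(D1, D1, D2)

# Situation 2: both branches only added disjoint files, so the result is the union.
D2_append = {**D1, "img3.png": "h3"}
D3_append = {**D1, "img4.png": "h4"}
situation_2 = merge_manifests(D1, D3_append, D2_append)

# Situation 3: both branches edited img1.png differently, which needs a manual decision.
D2_edit = {"img1.png": "h1-from-P1"}
D3_edit = {"img1.png": "h1-from-P2", "img2.png": "h2"}
situation_3 = merge_manifests(D1, D3_edit, D2_edit)
```

In the third case the function can still apply the uncontested change (P1 removing img2.png), but it hands img1.png back as a conflict, which mirrors the manual merge described above.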

CONCLUSION

In this post, we learned why data versioning matters and how DVC can help data scientists. There are also paid tools on the market for data versioning, and there are many more features to explore.

Thank you so much for spending your precious time reading this article. If you have any questions or comments, leave them below. I will try to answer them as well as I can.

See you in the next post!