Versioning data in Verta

We’ll walk briefly through the concept of data versioning and how the Verta client handles it.

Data versioning is a key part of how Verta versions models; as a result, Verta supports datasets as first-class citizens of the system. This means that datasets (and dataset versions) can be created independent of models and experiments, reused across multiple models, and queried independently.

Context

In Verta, a dataset is a collection of dataset versions, each of which contains metadata about the actual data. A new version of a dataset is created only if the system detects a change in that metadata.

The metadata stored for a dataset version depends on the type of dataset (e.g. RDBMS, file system, blob store). For instance, if a dataset is of type S3, then the version information consists of metadata about every object in a specified S3 bucket, such as each object’s size, date modified, and checksum.
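To make the idea concrete, the sketch below models per-object metadata for an S3-style dataset and compares two snapshots to decide whether a new version is warranted. The names ObjectMetadata and needs_new_version, along with the sample values, are hypothetical and chosen only for illustration; this is not Verta's internal implementation.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ObjectMetadata:
    """Hypothetical record of what a version stores about one object."""
    size: int            # object size in bytes
    last_modified: str   # modification timestamp, e.g. ISO 8601
    checksum: str        # content checksum


# A snapshot maps each object key to its metadata.
Snapshot = Dict[str, ObjectMetadata]


def needs_new_version(previous: Snapshot, current: Snapshot) -> bool:
    """Return True if any object was added, removed, or changed."""
    return previous != current


# Made-up sample snapshots for illustration only.
old = {"important-data.csv": ObjectMetadata(1024, "2020-01-01T00:00:00Z", "abc123")}
same = {"important-data.csv": ObjectMetadata(1024, "2020-01-01T00:00:00Z", "abc123")}
edited = {"important-data.csv": ObjectMetadata(2048, "2020-02-01T00:00:00Z", "def456")}

print(needs_new_version(old, same))    # False
print(needs_new_version(old, edited))  # True
```

Comparing metadata rather than file contents keeps the check cheap, since the data itself never has to be read.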

The Verta client library provides a variety of APIs to perform operations on datasets and dataset versions. Once dataset versions have been created, any experiment run can be associated with one or more dataset versions using run.log_dataset_version() as shown in this guide.

Setup

This guide assumes a basic familiarity with Verta’s client interface.

Imagine we have some CSV files stored on Amazon S3 that we would like to associate with our workflows. We’d like to keep track of which files we’re accessing, and when they are accessed.

First, we’ll need to install boto3, the official Python library for Amazon Web Services:

pip install boto3

Installing the Verta client does not install boto3 automatically, because boto3 is not required for core functionality. It is, however, required for data versioning with S3.

Don’t worry: Verta doesn’t download or store the actual S3 objects. Instead, it records just enough information for you to later identify the snapshot of data that was used.
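As an illustration of metadata-only access, the sketch below uses boto3's head_object call, which returns an object's headers without transferring its body. The helper describe_object is a hypothetical name, and the code is an assumption about what such a lookup could look like rather than a description of Verta's internals.

```python
from typing import Any, Dict


def describe_object(s3_client: Any, bucket: str, key: str) -> Dict[str, Any]:
    """Collect identifying metadata for one S3 object without downloading it.

    `s3_client` is expected to behave like `boto3.client("s3")`.
    """
    response = s3_client.head_object(Bucket=bucket, Key=key)
    return {
        "key": key,
        "size": response["ContentLength"],          # size in bytes
        "last_modified": response["LastModified"],  # modification time
        "etag": response["ETag"].strip('"'),        # S3 wraps ETags in quotes
    }


# Usage (requires boto3 and valid AWS credentials):
#   import boto3
#   info = describe_object(boto3.client("s3"), "datasets", "important-data.csv")
```

Passing the client in as an argument keeps the helper independent of how credentials are configured.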

After installation, make sure AWS credentials are set up in your local environment, following AWS’s official instructions.
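A quick sanity check can save some debugging later. The sketch below looks in two common places for AWS credentials: the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables, and the shared credentials file at ~/.aws/credentials. The function name credential_sources is hypothetical, and the check is deliberately incomplete: it does not cover other mechanisms such as IAM roles or a relocated credentials file.

```python
import os
from pathlib import Path
from typing import List, Mapping, Optional


def credential_sources(env: Optional[Mapping[str, str]] = None,
                       home: Optional[Path] = None) -> List[str]:
    """List the common places where AWS credentials appear to be configured."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else home
    found = []
    # Environment variables read by the AWS SDKs.
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        found.append("environment variables")
    # Shared credentials file, as written by `aws configure`.
    if (home / ".aws" / "credentials").is_file():
        found.append("shared credentials file")
    return found


print(credential_sources() or "no credentials found in the usual places")
```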

Version logging

Now, in Python, we’ll instantiate the Client:

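from verta import Client

# host, email, and dev_key are placeholders for your own Verta connection details
client = Client(host=host, email=email, dev_key=dev_key)
# connection successfully established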

The client can be used to create an S3Dataset:

dataset = client.set_dataset(name="Important Data",
                             type="s3")

A Dataset is a collection of related DatasetVersions, and we’ll need to create an S3DatasetVersion to be logged:

dataset_version = dataset.create_version(bucket_name="datasets",
                                         key="important-data.csv")

Note that key is optional; if it is omitted, the entire bucket is tracked instead.

As we track our data science workflow, we can log this dataset version:

# run is an existing ExperimentRun, e.g. one returned by client.set_experiment_run()
run.log_dataset_version("training_data", dataset_version)
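The three steps above (create the dataset, create a version, log it to a run) can be gathered into one small helper. This is a minimal sketch that uses only the calls shown in this guide; the function name log_s3_data and its parameters are hypothetical, and client and run are assumed to be a connected Client and an existing ExperimentRun.

```python
from typing import Any, Optional


def log_s3_data(client: Any, run: Any, dataset_name: str, bucket_name: str,
                key: Optional[str] = None, log_key: str = "training_data") -> Any:
    """Create an S3 dataset version and attach it to an experiment run."""
    dataset = client.set_dataset(name=dataset_name, type="s3")
    if key is None:
        # Without a key, the whole bucket is tracked.
        version = dataset.create_version(bucket_name=bucket_name)
    else:
        version = dataset.create_version(bucket_name=bucket_name, key=key)
    run.log_dataset_version(log_key, version)
    return version


# Usage, with `client` and `run` set up as in this guide:
#   log_s3_data(client, run, "Important Data", "datasets", key="important-data.csv")
```

Returning the version lets the caller reuse it, for example to log the same data against several runs.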

Version viewing

Once a dataset version is logged, it can be viewed in the Verta Web App.

You’ll find the dataset version in the Datasets section of the ExperimentRun page:

../_images/dataset-version-section.png

Clicking on training_data will direct you to the DatasetVersion page:

../_images/dataset-version-popup.png

And there, you’ll find information about your dataset version:

../_images/dataset-version-page.png