Versioning data in Verta
We'll walk briefly through the concept of data versioning and how the Verta client handles it.
Data versioning is a key part of how Verta versions models; as a result, Verta supports datasets as first-class citizens of the system. This means that datasets (and dataset versions) can be created independent of models and experiments, reused across multiple models, and queried independently.
In Verta, a dataset is conceptualized as a collection of dataset versions, where each version contains metadata about the actual data. A new version of a dataset is created only if the system detects a change in the associated metadata.
The metadata stored for a dataset version depends on the type of dataset (e.g. RDBMS, file system, blob store). For instance, if a dataset is of type S3, then the version information consists of metadata about every object in a specified S3 bucket, such as each object's size, date modified, and checksum.
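To make the idea concrete, here is a minimal sketch of metadata-based versioning for a local directory, using only the Python standard library. It is not Verta's implementation: the function names snapshot_metadata and needs_new_version are hypothetical, and the sketch records only each file's size and SHA-256 checksum, so the comparison depends purely on content.

```python
import hashlib
import os


def snapshot_metadata(root):
    """Return {relative_path: {"size": ..., "sha256": ...}} for every file under root."""
    metadata = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            with open(path, "rb") as f:
                checksum = hashlib.sha256(f.read()).hexdigest()
            metadata[os.path.relpath(path, root)] = {
                "size": os.path.getsize(path),
                "sha256": checksum,
            }
    return metadata


def needs_new_version(previous, current):
    """A new version is warranted only if the recorded metadata differs."""
    return previous != current
```

Taking two snapshots of an unchanged directory yields equal metadata, so no new version would be created. Editing, adding, or removing a file changes the snapshot and would trigger one.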
The Verta client library provides a variety of APIs to perform operations on datasets and dataset versions. Once dataset versions have been created, any experiment run can be associated with one or more dataset versions using ExperimentRun.log_dataset_version(), as shown in this guide.
This guide assumes a basic familiarity with Verta's client interface.
First, we'll need to install boto3, the official Python library for Amazon Web Services:
pip install boto3
Installing the Verta client did not install boto3 automatically since it's not required for core functionality, but it is required for data versioning with S3.
Don't worry—Verta doesn't download or store the actual S3 object; instead, we take in just enough information for you to later identify the snapshot of data that was used.
Now, in Python, we'll instantiate the Client:
from verta import Client
# host is the address of your Verta server; email and dev_key are your account credentials
client = Client(host, email=email, dev_key=dev_key)
# connection successfully established
The client can be used to create a Dataset:
dataset = client.create_dataset(name="Census Income Data")
A Dataset is a collection of related DatasetVersions, and we'll need to create an S3DatasetVersion to be logged:
from verta.dataset import S3
dataset_version = dataset.create_version(S3("s3://verta-starter/census-train.csv"))
If we wished to track the entire bucket, we could pass in
As we track our data science workflow, we can log this dataset version:
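The call for this step is not shown here, so the following is only a sketch. It assumes run is an ExperimentRun obtained from the client (for example, from client.set_experiment_run()) and uses the key "training_data", which is the name that appears on the ExperimentRun page later in this guide. The helper name log_training_data is hypothetical.

```python
def log_training_data(run, dataset_version, key="training_data"):
    """Associate `dataset_version` with the experiment run `run` under `key`."""
    # ExperimentRun.log_dataset_version() takes a string key and a dataset version.
    run.log_dataset_version(key, dataset_version)
    return key
```

With an experiment run in hand, log_training_data(run, dataset_version) would attach the dataset version created earlier.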
Once a dataset version is logged, it can be viewed in the Verta Web App.
You'll find the dataset version in the Datasets section of the ExperimentRun page:
Clicking on training_data will direct you to the DatasetVersion page:
And there, you'll find information about your dataset version: