Migrate legacy data versioning
In verta==0.16.0, the dataset versioning interface was overhauled to be more flexible, robust, and consistent with other ModelDB entities. This quick guide presents an overview of how the APIs have changed and what code changes may be necessary to continue using dataset versioning.
Local (path) example
There are two main changes to interacting with local datasets:
- client.set_dataset() no longer accepts a type argument.
- Instead of taking a str for a local path, dataset.create_version() now takes a Path object, which itself takes the str.
Before:
dataset = client.set_dataset(name="Census Income Local", type="local")
version = dataset.create_version(path="census-train.csv")
After:
from verta.dataset import Path
dataset = client.set_dataset(name="Census Income Local") # no `type`
version = dataset.create_version(Path("census-train.csv")) # new argument type
S3 example
There are two main changes to interacting with S3 datasets:
- client.set_dataset() no longer accepts a type argument.
- Instead of taking a bucket_name and/or key, dataset.create_version() now takes an S3 object, which itself takes the bucket and key in the form f"s3://{bucket_name}/{key}".
Before:
dataset = client.set_dataset(name="Census Income S3", type="s3")
version = dataset.create_version(
bucket_name="verta-starter", key="census-train.csv"
)
After:
from verta.dataset import S3
dataset = client.set_dataset(name="Census Income S3") # no `type`
version = dataset.create_version(
S3("s3://verta-starter/census-train.csv") # one argument
)
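When many call sites need migrating, the old bucket_name and key pairs can be converted mechanically. The helper below is a minimal sketch: legacy_s3_url is a hypothetical name, not part of verta, and it assumes that a missing key should yield a bucket-only URL of the form s3://{bucket_name}.

```python
def legacy_s3_url(bucket_name, key=None):
    """Build the URL string for S3() from legacy-style arguments.

    Hypothetical migration helper; not part of verta.
    """
    if not bucket_name:
        raise ValueError("bucket_name must be a non-empty string")
    if key is None:
        # Assumption: a bucket-only URL stands for the whole bucket.
        return f"s3://{bucket_name}"
    return f"s3://{bucket_name}/{key}"


print(legacy_s3_url("verta-starter", "census-train.csv"))
# s3://verta-starter/census-train.csv
```

The returned string would then be wrapped as S3(legacy_s3_url(...)) before being passed to dataset.create_version().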
Prefixes are also supported, so long as they end in a slash '/':
version = dataset.create_version(
S3("s3://verta-starter/models/") # all keys that begin with "models/"
)
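To illustrate what a prefix selects, the sketch below mimics prefix matching over a made-up list of keys using ordinary string comparison. It does not call verta or S3, and the key names are invented for the example. It also shows how the trailing slash keeps a similarly named sibling such as models-archive/ out of the match.

```python
def keys_under_prefix(keys, prefix):
    """Return the keys that begin with `prefix` (plain string matching)."""
    return [key for key in keys if key.startswith(prefix)]


# Hypothetical bucket listing, for illustration only.
keys = [
    "models/v1.pkl",
    "models/v2.pkl",
    "models-archive/v0.pkl",
    "census-train.csv",
]

print(keys_under_prefix(keys, "models/"))
# ['models/v1.pkl', 'models/v2.pkl']
print(keys_under_prefix(keys, "models"))
# ['models/v1.pkl', 'models/v2.pkl', 'models-archive/v0.pkl']
```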
base_path and other attributes
Several attributes of the old Dataset and DatasetVersion classes have been ported over and are still usable; however, most of them have been deprecated and will raise warnings accordingly, with guidance on how to update them.
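During a migration it can help to collect these warnings in one place instead of letting them scroll past. The sketch below uses Python's standard warnings module. It assumes the deprecation notices are issued through that module, and the warning category shown here is a guess that may differ from what verta actually emits. legacy_call is a stand-in, not a verta API.

```python
import warnings


def legacy_call():
    # Stand-in for a deprecated attribute or method; not a verta API.
    warnings.warn("use the new interface instead", DeprecationWarning)
    return "ok"


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # record every warning, even repeats
    result = legacy_call()

# Each recorded entry carries the message and the warning category.
messages = [str(w.message) for w in caught]
print(result, messages)
# ok ['use the new interface instead']
```

Replacing "always" with "error" would instead turn each warning into an exception, which makes leftover legacy usages fail loudly in a test run.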
dataset_version.base_path in particular is no longer available as a direct attribute. Instead, to obtain usable paths of particular files, you may retrieve them by accessing the components of the dataset version:
version = dataset.create_version(S3("s3://verta-starter/census-train.csv"))
version.list_components()[0].path
# s3://verta-starter/census-train.csv
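If existing code used a base path to build file locations, one rough substitute is the longest directory-like prefix shared by all component paths. The sketch below is an approximation, not the definition the old base_path attribute used. shared_prefix is a hypothetical helper, and the SimpleNamespace objects are stand-ins for the items returned by version.list_components(), of which only the path attribute is used.

```python
import os.path
from types import SimpleNamespace


def shared_prefix(components):
    """Longest common directory-like prefix of the components' paths."""
    paths = [component.path for component in components]
    prefix = os.path.commonprefix(paths)  # character-wise common prefix
    # Trim back to the last "/" so the result never ends mid-name.
    return prefix[: prefix.rfind("/") + 1]


# Stand-ins for the objects returned by version.list_components().
components = [
    SimpleNamespace(path="s3://verta-starter/models/a.pkl"),
    SimpleNamespace(path="s3://verta-starter/models/b.pkl"),
]

print(shared_prefix(components))
# s3://verta-starter/models/
```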
Tips and tricks
One advantage of this new system is that you can preview the contents of your dataset version-to-be before creating it in ModelDB:
from verta.dataset import S3
S3("s3://verta-starter/census-train.csv")
# S3 Version
# s3://verta-starter/census-train.csv
# 3271573 bytes
# last modified: 2019-05-24 07:25:26
# MD5 checksum: 64af2ff44dd04acceb277d024939b619
The content object can also be recovered from a dataset version:
version = dataset.create_version(S3("s3://verta-starter/census-train.csv"))
version.get_content()
# S3 Version
# s3://verta-starter/census-train.csv
# 3271573 bytes
# last modified: 2019-05-24 07:25:26
# MD5 checksum: 64af2ff44dd04acceb277d024939b619
In addition, because a dataset is no longer restricted to a particular type, its versions can be different types themselves:
from verta.dataset import Path, S3
dataset = client.create_dataset()
dataset.create_version(Path("census-train.csv"))
dataset.create_version(S3("s3://verta-starter/census-train.csv"))
For complete functionality, please see the updated API reference!