Automated monitoring for Verta endpoints

Overview

For live endpoints deployed on the Verta inference service, model monitoring is fully automated (currently available for tabular data). This capability is deeply integrated with our deployment and registry modules.

Get started with automated monitoring by following these steps:

1. Log reference data

You need to log reference data (training data) when registering a model. The logged reference data is used for drift detection and for comparing live versus reference data distributions. Additionally, you can track and visualize feature distributions for your training data in the web UI, on the Registered Model Version detail page. Learn more about how to log and visualize training data distributions.

This is how you log reference data via the client:

# register a model
registered_model = client.get_or_create_registered_model(name="monitoring-demo")

# create a model version
from verta.environment import Python
from verta.utils import ModelAPI

model_version = registered_model.create_standard_model_from_sklearn(
    model,
    environment=Python(requirements=["scikit-learn"]),
    model_api=ModelAPI(X_train, Y_train),
    name="v1",
)

# profile the training data for downstream monitoring
model_version.log_training_data_profile(X_train, Y_train)
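The snippet above assumes a trained `model` and the tabular `X_train` and `Y_train` already exist. A minimal sketch of producing them with scikit-learn (the toy dataset and feature names here are placeholders, not part of the Verta API):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Build a small tabular dataset; the column names are illustrative.
features, labels = make_classification(n_samples=200, n_features=4, random_state=0)
X_train = pd.DataFrame(features, columns=["f0", "f1", "f2", "f3"])
Y_train = pd.DataFrame(labels, columns=["label"])

# Train the model that will be registered as a standard sklearn model.
model = LogisticRegression()
model.fit(X_train, Y_train["label"])
```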

Verta uploads profiles of your training data to facilitate downstream monitoring. Individual data points are not uploaded; the client only passes along numerical and categorical distributions of the columns in your data.
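To illustrate what such a profile contains (a conceptual sketch, not Verta's internal format): a numerical column can be summarized as histogram bucket counts and a categorical column as per-category counts, so no raw values leave your environment:

```python
from collections import Counter

def numeric_profile(values, n_buckets=5):
    """Summarize a numeric column as histogram bucket counts."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets or 1.0  # guard against a constant column
    counts = [0] * n_buckets
    for v in values:
        idx = min(int((v - lo) / width), n_buckets - 1)
        counts[idx] += 1
    return {"min": lo, "max": hi, "bucket_counts": counts}

def categorical_profile(values):
    """Summarize a categorical column as per-category counts."""
    return dict(Counter(values))

print(numeric_profile([1, 2, 2, 3, 9, 10]))   # → bucket_counts [3, 1, 0, 0, 2]
print(categorical_profile(["a", "b", "a"]))   # → {'a': 2, 'b': 1}
```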

Note that for automated monitoring to work, you must log an accurate model API and training data profile.

View the reference data on the Registered Model Version detail page.

2. Deploy a model

When a model is deployed, Verta creates a live endpoint and automatically creates a monitored entity that is directly connected to the endpoint (a 1:1 mapping).

Deploy a model using the client (you can also deploy from the web UI):

# deploy a model to a live endpoint
endpoint = client.create_endpoint("monitoring-demo")
endpoint.update(model_version, wait=True)

A monitored entity is automatically logged and takes the name of the endpoint. You can access it from the web UI under Operations > Monitoring.

3. Access the monitored entity and statistical summaries

Statistical summaries (histograms, missing values, etc.) are automatically defined using information from the endpoint and the reference data.

The system creates a range of histogram distributions, missing-value summaries, and more. Visit the "Data Metrics" page to view, query, and filter all the logged summaries.

You can navigate to a summary's detailed view by clicking the summary chart title on the Data Metrics page.

4. Review system defined alerts and thresholds

Drift alerts are automatically configured with predefined thresholds (you can also update the system-defined thresholds).

The Manage Alerts page lists all the configured alerts. Click an alert configuration to review the alert details and reference data, and to update the alert threshold.
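To make the idea of a drift score against a threshold concrete, here is a sketch using the population stability index (PSI), one common drift statistic; this is illustrative only and not necessarily the statistic or threshold Verta computes:

```python
import math

def psi(reference_counts, live_counts, eps=1e-6):
    """Population stability index between two bucketed distributions."""
    ref_total = sum(reference_counts)
    live_total = sum(live_counts)
    score = 0.0
    for r, l in zip(reference_counts, live_counts):
        p = r / ref_total + eps  # reference bucket proportion
        q = l / live_total + eps  # live bucket proportion
        score += (p - q) * math.log(p / q)
    return score

THRESHOLD = 0.2  # hypothetical alert threshold
print(psi([10, 20, 30], [10, 20, 30]) < 0.01)      # identical distributions: no drift
print(psi([10, 20, 30], [30, 20, 10]) > THRESHOLD)  # shifted distribution: alert
```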

5. Run live predictions, get alerted and perform root cause analysis

As you start running live predictions, you can query, aggregate, and visualize various summary statistics, get alerted on drift, and perform root cause analysis.

The "Dashboard" page in the web UI shows all the alerted summaries. You can also visit the "Active Alerts" page to review the list of active alerts and resolve them.

Each alerted summary has a detail view that shows the start and end times of alerts, a time-series view of the summary, the reference sample, and differences in distribution, so you can drill down and perform root cause analysis.
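The start and end times of an alert can be thought of as the contiguous range where the drift score stays above its threshold. A sketch under that assumption (the scores and threshold are illustrative):

```python
def alert_windows(series, threshold):
    """Return (start, end) index pairs where values exceed the threshold."""
    windows, start = [], None
    for i, value in enumerate(series):
        if value > threshold and start is None:
            start = i  # alert begins
        elif value <= threshold and start is not None:
            windows.append((start, i - 1))  # alert ends
            start = None
    if start is not None:
        windows.append((start, len(series) - 1))  # still ongoing at end of series
    return windows

scores = [0.05, 0.1, 0.35, 0.4, 0.1, 0.5]
print(alert_windows(scores, 0.2))  # → [(2, 3), (5, 5)]
```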

Concepts

Downsampling

To monitor and view high volumes of live data, the system may downsample. This occurs in real time, as the data comes in. The downsampling logic is fully configurable for each deployment; for example, we can configure a limit of 10 samples per worker per second. So if you see lower data resolution than expected, downsampling could be the cause.
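As a sketch of that kind of limit (the 10-samples-per-worker-per-second figure above), a per-second counter that drops samples past the cap might look like this; the class and its behavior are illustrative, not Verta's implementation:

```python
class Downsampler:
    """Keep at most `limit` samples per one-second window (per worker)."""

    def __init__(self, limit):
        self.limit = limit
        self.window = None  # current one-second window
        self.count = 0      # samples kept in the current window

    def accept(self, timestamp):
        second = int(timestamp)
        if second != self.window:
            self.window, self.count = second, 0  # new window, reset the counter
        if self.count < self.limit:
            self.count += 1
            return True
        return False  # over the cap: sample is dropped

sampler = Downsampler(limit=10)
kept = sum(sampler.accept(100 + i / 25) for i in range(25))  # 25 samples in one second
print(kept)  # → 10
```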

Alerted summary detail chart

If a particular summary has an active alert, the chart shows the aggregated samples that are alerting, with the alert window highlighted in red. Bell icons above the aggregated samples represent the different alert states:

  • Red bell icon - Alert is ongoing

  • Blue bell icon - Current alert has ended

  • Green bell icon - The alert has been resolved

Examples

Below are links to a few end-to-end notebook examples that showcase how to deploy models and log automated monitoring.