One of the most challenging yet crucial steps in operationalizing models is deployment and release. ML frameworks and libraries are diverse, and ML development and software delivery lack shared tooling. As a result, releasing a model into a product can take many months.
One of Verta's key innovations is a model deployment and release system. It works seamlessly with models built in more than a dozen frameworks and languages, and it integrates with state-of-the-art DevOps and software delivery systems.
A Verta endpoint is a containerized microservice that serves a model and lets you make real-time predictions through a REST API or client libraries. You can deploy an endpoint through the client, the web app, or the CLI.
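The following sketch shows what a real-time prediction call over REST involves: a JSON POST request sent to the endpoint's URL. It uses only the Python standard library. The URL, the bearer-token `Authorization` header, and the payload shape are all hypothetical placeholders. It is not Verta's documented request format, so take the actual values from your endpoint's details.

```python
import json
import urllib.request


def build_prediction_request(endpoint_url: str, token: str, features: list) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for a real-time prediction.

    The URL, the bearer-token auth scheme, and the payload shape are
    hypothetical assumptions for illustration only.
    """
    body = json.dumps(features).encode("utf-8")
    return urllib.request.Request(
        endpoint_url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # hypothetical auth scheme
        },
        method="POST",
    )


def predict(request: urllib.request.Request, timeout: float = 10.0):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    req = build_prediction_request(
        "https://verta.example.com/predict/my-endpoint",  # hypothetical URL
        token="YOUR_TOKEN",
        features=[[5.1, 3.5, 1.4, 0.2]],
    )
    # Only inspect the request here; call predict(req) against a live endpoint.
    print(req.get_method(), req.full_url)
```

Building the request separately from sending it keeps the payload easy to inspect in tests. It also means a CI/CD job can reuse the same request for a smoke test against a freshly deployed endpoint.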
An endpoint in Verta can be deployed from a Registered Model Version. You can use the Verta Model Registry to manage model versions and deploy endpoints from them. An endpoint can also be deployed directly from an Experiment Run.
Models can be packaged in a particular format (e.g., a Docker container, Spark UDF, or Python package) and pushed for deployment to any infrastructure. Deployments can target Spark, Kafka, UDFs, or any custom format. All APIs are exposed, so your CI/CD system can drive any part of the process.
Quick deploy is a one-click option in which the Verta platform applies default configurations and chooses an endpoint path for you. It is recommended for quick iterations: you can deploy an endpoint right away and update its configuration later.
Configurable deploy lets you choose all configurations before deploying an endpoint. Supported advanced configurations include autoscaling, environment variables, and compute resources.
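The difference between the two deploy modes can be modeled as "defaults plus generated path" versus "defaults plus explicit overrides". In the minimal sketch below, the `EndpointConfig` fields, their default values, and the path-naming rule are all hypothetical. They do not reflect Verta's actual defaults.

```python
from dataclasses import dataclass, field, replace


@dataclass
class EndpointConfig:
    # Hypothetical fields and default values, for illustration only.
    path: str
    cpu_millicores: int = 500
    memory_mb: int = 1024
    autoscaling: bool = False
    env: dict = field(default_factory=dict)


def quick_deploy_config(model_name: str) -> EndpointConfig:
    """Quick deploy: apply defaults and derive an endpoint path automatically."""
    return EndpointConfig(path="/" + model_name.lower().replace(" ", "-"))


def configurable_deploy_config(model_name: str, **overrides) -> EndpointConfig:
    """Configurable deploy: start from the defaults and set chosen fields up front."""
    return replace(quick_deploy_config(model_name), **overrides)
```

Updating a quick-deployed endpoint later amounts to the same kind of `replace` call applied to its existing configuration.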
Verta endpoints provide easy access to build details and run logs. The build ID and model details are appended each time the endpoint is updated with a new release.
Autoscaling lets the system adapt to changing query loads. Verta models are always warm: at least one replica of a deployed model is always running and can serve requests instantly. Verta manages model replicas so that models scale with traffic patterns. You can define the scale-up and scale-down replica configuration and trigger autoscaling based on memory utilization, CPU utilization, or the number of requests per worker.
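Below is a minimal sketch of how an autoscaler could turn one of these metrics into a replica count. It borrows the proportional rule used by the Kubernetes Horizontal Pod Autoscaler, `ceil(current_replicas * current_metric / target_metric)`, as an assumed stand-in. Verta's own scaling algorithm is not described here. Clamping to `min_replicas >= 1` reflects the "always warm" guarantee.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float, target_metric: float,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Return the replica count an autoscaler would aim for.

    current_metric / target_metric can be CPU %, memory %, or requests per
    worker. The proportional rule is an assumption borrowed from the
    Kubernetes HPA, not Verta's documented algorithm.
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    if min_replicas < 1:
        raise ValueError("min_replicas must be at least 1 to keep the model warm")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    # Clamp between the scale-down floor and the scale-up ceiling.
    return max(min_replicas, min(max_replicas, raw))
```

For example, two replicas running at 90% CPU against a 60% target scale up to three. Zero load still leaves one replica running.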
By default, endpoints are created in shared mode, in which endpoints share resources. Endpoints can also be updated to run in dedicated resource mode, which reserves entire nodes for an endpoint's model inference and prevents noisy-neighbor problems. To change the resource mode, go to the performance section of the endpoint's update tab.
Environment variables let you control specific logic in your application dynamically at runtime. For example, to give production and development endpoints different behavior, you can set a variable such as VAR_DEV or VAR_PROD and have the system behave differently based on its value. Another example is a PREPROCESSING_METHOD variable set to either none or tokenizer.
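This sketch shows how model code might branch on the PREPROCESSING_METHOD example from the paragraph above. The function name and the lowercase whitespace split are hypothetical stand-ins for a real tokenizer. Defaulting to `none` when the variable is unset is also a design assumption.

```python
import os


def preprocess(text: str) -> list:
    """Choose preprocessing from the PREPROCESSING_METHOD environment variable.

    Hypothetical helper: "tokenizer" uses a simple lowercase whitespace split
    as a stand-in for a real tokenizer; "none" passes the text through.
    """
    method = os.environ.get("PREPROCESSING_METHOD", "none")
    if method == "tokenizer":
        return text.lower().split()
    if method == "none":
        return [text]
    # Fail loudly on a typo in the endpoint's configuration.
    raise ValueError(f"unsupported PREPROCESSING_METHOD: {method!r}")
```

Because the value is read from the environment rather than hard-coded, the same build can behave differently across endpoints that set different environment variables.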
The rollout strategy lets you define how you want to deploy or update your service. You can choose direct or canary rollouts for endpoints.
With a direct rollout, all of your workloads are switched to the new model version at once.
A canary rollout updates versions by progressively shifting traffic from instances running one version to instances running another. Because the change initially affects only a small share of requests and users, you can analyze its impact before it reaches everyone. You can also control what percentage of traffic is routed to the new deployment, for a controlled traffic cutover.
This is a safe deployment choice with the option to hold the updates if the release does not meet performance expectations.
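One common way to route a given percentage of traffic to a canary is to hash each request ID into 100 buckets. Requests whose bucket falls below the canary percentage go to the new deployment. The sketch below illustrates this under the assumption of hash-based bucketing. It is a swapped-in technique for illustration, not a description of Verta's router.

```python
import hashlib


def routes_to_canary(request_id: str, canary_percent: int) -> bool:
    """Decide whether a request is served by the new (canary) deployment.

    Hashing the request ID gives a stable decision per ID while sending
    roughly canary_percent% of traffic to the canary. Illustrative only.
    """
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return bucket < canary_percent
```

Raising `canary_percent` between phases only adds buckets. Requests already routed to the canary therefore stay there as the rollout progresses.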
You can configure the following canary parameters:
- Interval: The duration of each phase of the rollout (for example, 30s, 1m, 5m, 15m, 30m, or 1hr)
- Steps: The percentage of production workload that receives the new build in each phase of the canary rollout
- Rules & Thresholds: One or more rules and thresholds that determine when to automatically roll back the release. For example, a rule can roll back the release if the error percentage exceeds a maximum threshold or if the average latency exceeds an acceptable limit.
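The parameters above combine into a simple control loop. For each step, the new build receives that percentage of traffic for one interval, and the observed metrics are checked against every rule. Any breach triggers a rollback. In this minimal sketch, the metric names and the `observe` callback are hypothetical, and no real waiting or traffic shifting takes place. Treating a missing metric as passing is also an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    metric: str       # hypothetical metric name, e.g. "error_percent" or "avg_latency_ms"
    max_value: float  # roll back if the observed value exceeds this threshold


def run_canary(steps: List[int], interval_s: int, rules: List[Rule],
               observe: Callable[[int, int], Dict[str, float]]) -> str:
    """Walk through the canary phases, rolling back on any threshold breach.

    observe(percent, interval_s) stands in for shifting `percent`% of traffic
    to the new build and collecting metrics for `interval_s` seconds.
    """
    for percent in steps:
        metrics = observe(percent, interval_s)
        for rule in rules:
            # A metric missing from the observation is treated as passing.
            if metrics.get(rule.metric, 0.0) > rule.max_value:
                return f"rolled back at {percent}%"
    return "promoted"
```

For example, a rollout with `steps=[10, 25, 50, 100]` whose error rate climbs once half the traffic is on the new build would return `"rolled back at 50%"`.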
To update an endpoint for a new rollout, you can deploy a model either from Model Registry (with a Registered Model Version) or Projects (with an Experiment Run).
Deploying through the Catalog (Model Registry) is the recommended approach for production and staged releases. You need to choose the Registered Model and the specific Registered Model Version to deploy.
Use an Experiment Run when you want to quickly deploy and test a model for ad hoc purposes. An Experiment Run corresponds to one execution of a modeling script tracked with Verta’s experiment management module, and represents a particular configuration of an Experiment. To deploy an Experiment Run, you need to provide its Experiment Run ID from the modeling project.
A new build is triggered when you update an endpoint. You also have the option to update with an existing build.
Once an endpoint is live, you can perform different levels of updates.
- Make a dynamic update to the endpoint by changing autoscaling parameters or by enabling or disabling autoscaling.
- Make a build update that reuses the same image, for example by changing compute resource allocation or environment variables.
- Perform a completely new rollout using a different image.
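These three update levels can be read as a priority order: an image change dominates, then resource or environment changes, then autoscaling-only changes. The sketch below classifies a desired change accordingly. The `EndpointState` fields are hypothetical and chosen to mirror the list above. They are not Verta's data model.

```python
from dataclasses import dataclass, field


@dataclass
class EndpointState:
    # Hypothetical fields mirroring the three update levels.
    image: str
    cpu_millicores: int
    memory_mb: int
    env: dict = field(default_factory=dict)
    autoscaling_enabled: bool = False
    min_replicas: int = 1
    max_replicas: int = 1


def update_level(current: EndpointState, desired: EndpointState) -> str:
    """Classify a change as 'none', 'dynamic', 'build', or 'rollout'."""
    if desired.image != current.image:
        return "rollout"  # different image: completely new rollout
    if (desired.cpu_millicores, desired.memory_mb, desired.env) != (
            current.cpu_millicores, current.memory_mb, current.env):
        return "build"    # same image, new resources or environment variables
    if (desired.autoscaling_enabled, desired.min_replicas, desired.max_replicas) != (
            current.autoscaling_enabled, current.min_replicas, current.max_replicas):
        return "dynamic"  # autoscaling-only change
    return "none"
```

Checking the most disruptive change first ensures that a request touching several fields is classified by its heaviest component.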