Optimize endpoints
Many deep learning models today are extremely large and require substantial CPU and memory to run. Verta therefore provides several ways to specify the resources available to a deployed model (i.e., the endpoint).
The following are ways to optimize a Verta endpoint that serves large, compute-heavy models.
1. Assigning resources to large models
There are primarily two types of resources that can be adjusted in the Verta system for models that run on CPUs.
CPU units
This resource limit controls how many cores are available to the container running your model. CPU units can be set to any value between the MIN and MAX configured for your Verta instance. For higher CPU requirements, contact your Verta admin.
Memory
This resource limit controls the amount of memory available to the container running your model. For deep learning models, this is the main resource limit that will impact your ability to successfully make predictions. For example, TFHub models often require up to 5 GB of RAM to run.
Memory units can be set to any value between the MIN and MAX configured for your Verta instance. For higher memory requirements, contact your Verta admin.
If a successfully deployed Verta model has inadequate memory, Verta will try to increase its memory automatically. You can check for any errors in the log tab.
Because Verta containerizes models to run on Kubernetes, both resources are specified in the same way that resources are specified on Kubernetes.
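To make the Kubernetes notation concrete, here is a minimal sketch that converts a few common quantity strings into plain numbers. It assumes the standard Kubernetes conventions (a trailing `m` on CPU means millicores, and memory suffixes such as `Mi` are powers of two); the helper names `parse_cpu` and `parse_memory` are illustrative and are not part of Verta or Kubernetes, and only a subset of suffixes is handled.

```python
from decimal import Decimal

# Subset of Kubernetes memory suffixes: binary (Ki, Mi, ...) and decimal (k, M, ...).
_MEMORY_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_cpu(quantity: str) -> Decimal:
    """Return CPU cores for a quantity such as '250m' or '0.25'."""
    if quantity.endswith("m"):
        # Millicores: 1000m equals one full core.
        return Decimal(quantity[:-1]) / 1000
    return Decimal(quantity)


def parse_memory(quantity: str) -> int:
    """Return bytes for a quantity such as '512Mi', '5G', or '1024'."""
    # Check two-letter binary suffixes before one-letter decimal ones.
    for suffix in sorted(_MEMORY_SUFFIXES, key=len, reverse=True):
        if quantity.endswith(suffix):
            number = Decimal(quantity[: -len(suffix)])
            return int(number * _MEMORY_SUFFIXES[suffix])
    # No suffix: the value is already in bytes.
    return int(Decimal(quantity))


if __name__ == "__main__":
    print(parse_cpu("250m"))      # a quarter of a core
    print(parse_memory("512Mi"))  # 512 * 2**20 bytes
```

Note that `5G` (decimal) and `5Gi` (binary) differ by roughly 7%, which can matter when a model sits close to its memory limit.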
To customize resources, go to the update tab of the endpoint in the web UI, select the option to enable customization of resources, and specify CPU and memory limits.
Here is a web UI configuration example:
The resource setting can be set up in code as follows:
Endpoint.update() provides a parameter for configuring the endpoint's compute resources. It can be used alongside any update strategy. The resources parameter specifies the computational resources that will be available to the model when it is deployed.
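A minimal sketch of such a call is shown below. It assumes the Verta Python client exposes a `Resources` class in `verta.endpoint.resources` that accepts `cpu` and `memory` arguments, and that `endpoint` and `model_version` were obtained from the client elsewhere. The wrapper name `apply_resources` is hypothetical and exists only to keep the sketch self-contained.

```python
def apply_resources(endpoint, model_version, cpu=0.25, memory="512Mi"):
    """Update an endpoint so each replica gets the given CPU and memory.

    `endpoint` and `model_version` are assumed to come from the Verta client;
    this helper is illustrative and not part of the Verta API.
    """
    # Imported inside the function so that defining the helper does not
    # require the Verta client to be installed.
    from verta.endpoint.resources import Resources

    # CPU is given in cores; memory uses Kubernetes-style units.
    resources = Resources(cpu=cpu, memory=memory)

    # The resources argument can be combined with any update strategy.
    endpoint.update(model_version, resources=resources)
    return resources
```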
In this example, each replica will be provided a quarter of a CPU core and 512 MiB of RAM. For more information about available resources and units, see the update-resources API documentation.
2. Setting autoscaling parameters
One of the important requirements of a model serving system is its ability to scale up and down in response to changing model prediction requests. Verta provides a flexible mechanism to autoscale an endpoint’s replicas in response to changing workloads.
To enable autoscaling, you can use either the UI or the client APIs to set up autoscaling parameters.
In the UI, remember to check the autoscaling box and select the autoscaling metric thresholds to enable the functionality.
To customize how autoscaling works, Verta provides a few different configurations:
Max and min replicas:
These are, respectively, the maximum and minimum number of replicas of the model container that can be running at any given time. You can tune the max replica count depending on how many requests you expect at peak workload and the latency you can tolerate. A good starting point is 2 (MIN) and 10 (MAX).
Max and min growth factors:
The max_scale and min_scale parameters limit how quickly the system scales up and down. For example, max_scale is the factor for scaling up: 2 means the number of replicas can at most double per time interval when the autoscaling metric exceeds its threshold. The min_scale parameter is the factor for scaling down: 0.5 means the number of replicas can at most halve per time interval when the metric drops below its threshold. You can configure these parameters for safety, to avoid responding too abruptly to bursts or to quiet periods.
Metrics:
These are the metrics and their thresholds that will trigger autoscaling. For example, you may want to trigger autoscaling when the average CPU utilization of the model reaches a threshold of 0.7. In this case, you can use the CPU utilization target metric. An average CPU utilization of 0.7 means that, on average, each pod's containers are using 70% of their CPU limit.
The baseline autoscaling configuration recommended by Verta uses the RequestsPerWorkerTarget metric set to 1. With this target, autoscaling triggers when the number of in-flight requests is greater than the number of workers available to handle them.
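The interaction between replica bounds, growth factors, and a metric target can be sketched with a small simulation. This is an illustrative model only, not Verta's actual algorithm: it assumes a proportional rule similar to the Kubernetes Horizontal Pod Autoscaler (desired replicas scale with observed / target), and the function name `next_replicas` is hypothetical.

```python
import math


def next_replicas(current, observed, target,
                  min_replicas=2, max_replicas=10,
                  min_scale=0.5, max_scale=2.0):
    """Return the replica count for the next interval (illustrative model)."""
    # Assumed proportional rule: scale replicas by observed / target.
    desired = math.ceil(current * observed / target)

    # Growth factors limit how far the count can move in one interval.
    upper = math.floor(current * max_scale)
    lower = math.ceil(current * min_scale)
    desired = max(lower, min(upper, desired))

    # The replica bounds are applied last and always hold.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # A burst of 10 in-flight requests with a requests-per-worker target of 1
    # (treating each replica as one worker for simplicity).
    replicas = 2
    for _ in range(3):
        per_worker = 10 / replicas
        replicas = next_replicas(replicas, per_worker, target=1.0)
        print(replicas)  # prints 4, then 8, then 10
```

With max_scale set to 2, the burst is absorbed over three intervals instead of jumping straight from 2 to 10 replicas, which is the safety behavior the growth factors are meant to provide.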
The baseline autoscaling can be set up in code as follows:
Endpoint.update() provides a parameter for configuring the endpoint's autoscaling behavior. The autoscaling parameter takes an Autoscaling object, which is used to establish upper and lower bounds for the number of replicas running the model. Autoscaling must also have at least one metric associated with it, which sets a threshold for triggering a scale-up.
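A minimal sketch of the baseline configuration follows. It assumes the Verta Python client provides `Autoscaling` in `verta.endpoint.autoscaling` (with `min_replicas`, `max_replicas`, `min_scale`, and `max_scale` arguments and an `add_metric` method) and `RequestsPerWorkerTarget` in `verta.endpoint.autoscaling.metrics`; check these against the client version you have installed. The wrapper name `apply_baseline_autoscaling` is hypothetical, and the default values are the starting points suggested in this section.

```python
def apply_baseline_autoscaling(endpoint, model_version,
                               min_replicas=2, max_replicas=10,
                               min_scale=0.5, max_scale=2.0):
    """Update an endpoint with the baseline requests-per-worker autoscaling.

    `endpoint` and `model_version` are assumed to come from the Verta client;
    this helper is illustrative and not part of the Verta API.
    """
    # Imported inside the function so that defining the helper does not
    # require the Verta client to be installed.
    from verta.endpoint.autoscaling import Autoscaling
    from verta.endpoint.autoscaling.metrics import RequestsPerWorkerTarget

    # Bounds on the replica count and on the per-interval growth factors.
    autoscaling = Autoscaling(
        min_replicas=min_replicas,
        max_replicas=max_replicas,
        min_scale=min_scale,
        max_scale=max_scale,
    )

    # At least one metric is required; a target of 1 is the baseline.
    autoscaling.add_metric(RequestsPerWorkerTarget(1))

    endpoint.update(model_version, autoscaling=autoscaling)
    return autoscaling
```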
Note: Endpoint resources specify resources available to each model container. Autoscaling determines when and how model containers will be replicated to support a larger number of requests.
3. Running endpoints in dedicated resource mode
By default, endpoints are created in shared mode. In shared mode, resources are shared between endpoints.
You can opt to configure dedicated machines for high-throughput and compute-heavy models by updating an endpoint to run in dedicated resource mode. Dedicated resource mode reserves entire nodes for the endpoint's model inference and prevents noisy-neighbor problems.
To change the resource mode, go to the performance section of the endpoint's update tab.