Endpoint autoscaling¶
Through an endpoint update, you can configure the deployment’s autoscaling behavior: upper and lower bounds on the number of replicas, the rate of scaling, and the metrics that trigger it.
Using the client¶
Endpoint.update() provides a parameter for configuring the endpoint’s autoscaling behavior. It can be used alongside any update strategy.
from verta.endpoint.update import DirectUpdateStrategy
endpoint.update(
    model_version, DirectUpdateStrategy(),
    autoscaling=autoscaling,
)
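For example, the same autoscaling argument can accompany a gradual rollout instead of a direct one. The CanaryUpdateStrategy and rule class in this sketch are assumptions about the client’s update-strategy interface rather than anything shown on this page, so verify the names against the Endpoint Update documentation; the interval, step, and latency values are illustrative, and the autoscaling object is the one built in the next snippet.
from verta.endpoint.update import CanaryUpdateStrategy
from verta.endpoint.update.rules import MaximumAverageLatencyThresholdRule  # assumed rule class

# canary rollout in increments (interval in seconds, step as a ratio; values are illustrative)
strategy = CanaryUpdateStrategy(interval=60, step=0.2)
# a canary rollout is assumed to need at least one rule to evaluate at each step
strategy.add_rule(MaximumAverageLatencyThresholdRule(0.1))

endpoint.update(
    model_version, strategy,
    autoscaling=autoscaling,
)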
autoscaling takes an Autoscaling object, which establishes upper and lower bounds for the number of replicas running the model. The Autoscaling object must also have at least one metric associated with it, which sets a threshold for triggering a scale-up.
from verta.deployment.autoscaling import Autoscaling
from verta.deployment.autoscaling.metrics import CpuUtilizationTarget
autoscaling = Autoscaling(max_replicas=4, min_scale=0.5)
autoscaling.add_metric(CpuUtilizationTarget(0.75))
Here, CPU utilization exceeding 75% will lead to more replicas being created. For the full list of available metrics, see the Autoscaling Metrics API documentation.
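For instance, a single Autoscaling object can carry more than one metric, so that several signals can trigger a scale-up. MemoryUtilizationTarget below is an assumption rather than a class shown on this page; check the Autoscaling Metrics API documentation for the exact metric classes and their expected value ranges.
from verta.deployment.autoscaling import Autoscaling
from verta.deployment.autoscaling.metrics import CpuUtilizationTarget
# MemoryUtilizationTarget is assumed here; confirm it against the metrics API docs
from verta.deployment.autoscaling.metrics import MemoryUtilizationTarget

autoscaling = Autoscaling(max_replicas=4, min_scale=0.5)
# scale up when CPU utilization exceeds 75%
autoscaling.add_metric(CpuUtilizationTarget(0.75))
# scale up when memory utilization exceeds 60% (assumed metric class; value is illustrative)
autoscaling.add_metric(MemoryUtilizationTarget(0.6))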
Using the CLI¶
Autoscaling can also be configured via the CLI:
verta deployment update endpoint /some-path --model-version-id "<id>" \
    --strategy direct \
    --autoscaling '{"max_replicas": 4, "min_scale": 0.5}' \
    --autoscaling-metric '{"metric": "cpu_utilization", "parameters": [{"name": "target", "value": "0.75"}]}'
--autoscaling takes a JSON string specifying the replica and scale bounds; --autoscaling-metric takes a JSON string specifying a metric and its values. The Python API documentation for Endpoint Autoscaling and Autoscaling Metrics contains JSON-equivalent examples for each object.
To set multiple metrics, --autoscaling-metric can be provided more than once.
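For example, both a CPU and a memory threshold could be set in one command. The memory_utilization metric name below is an assumption, not an identifier shown on this page; confirm it against the Autoscaling Metrics documentation.
verta deployment update endpoint /some-path --model-version-id "<id>" \
    --strategy direct \
    --autoscaling '{"max_replicas": 4, "min_scale": 0.5}' \
    --autoscaling-metric '{"metric": "cpu_utilization", "parameters": [{"name": "target", "value": "0.75"}]}' \
    --autoscaling-metric '{"metric": "memory_utilization", "parameters": [{"name": "target", "value": "0.6"}]}'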