Deploying a Model Version
A model version must be deployed before it can receive inference requests. During deployment, you can choose configurations such as the number of replicas, CPU/memory requests, autoscaling policy, and environment variables. The set of configurations used to deploy a model version is called a deployment.
A model may have any number of versions, but at most 2 model versions can be deployed at any given time.
When a model version is deployed, a Model Version Endpoint is created. The URL is of the following format:
For example, a Model named my-model within a Project named my-project, with the base domain models.id.merlin.dev, will have a Model Version Endpoint for version 1 as follows:
A Model Version Endpoint has several states:
pending: The initial state of a Model Version Endpoint.
running: Once deployed, a Model Version Endpoint is in running state and is accessible.
serving: A Model Version Endpoint is in serving state if a Model Endpoint is created from it.
terminated: Once undeployed, a Model Version Endpoint is in terminated state.
failed: A Model Version Endpoint is in failed state if an error occurred during deployment.
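The states above can be sketched as a small state machine. This is only an illustration built from the descriptions in this list: the EndpointState enum, the ALLOWED_TRANSITIONS mapping, and can_transition are hypothetical names, not part of the Merlin SDK, and the platform may allow transitions that the text does not mention.

```python
from enum import Enum


class EndpointState(Enum):
    """States of a Model Version Endpoint, as listed in this document."""
    PENDING = "pending"
    RUNNING = "running"
    SERVING = "serving"
    TERMINATED = "terminated"
    FAILED = "failed"


# Assumption: only the transitions that the state descriptions imply are listed.
# Transitions out of SERVING, TERMINATED, and FAILED are not described in the
# text, so they are left empty here rather than guessed.
ALLOWED_TRANSITIONS = {
    EndpointState.PENDING: {EndpointState.RUNNING, EndpointState.FAILED},
    EndpointState.RUNNING: {EndpointState.SERVING, EndpointState.TERMINATED},
    EndpointState.SERVING: set(),
    EndpointState.TERMINATED: set(),
    EndpointState.FAILED: set(),
}


def can_transition(current: EndpointState, target: EndpointState) -> bool:
    """Return True if the sketch's transition table allows current -> target."""
    return target in ALLOWED_TRANSITIONS[current]
```

For instance, can_transition(EndpointState.PENDING, EndpointState.RUNNING) is allowed in this sketch, while a terminated endpoint has no outgoing transitions.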
Depending on the type of the model being deployed, there may be an intermediate step to build the Docker image (using Kaniko). This is applicable to PyFunc models.
A model version can be deployed via the SDK or the UI.
Here is an example of deploying a Model Version Endpoint using the Merlin Python SDK:
The Deploy option can be selected from the model versions view.
The deployment modes supported by Merlin have their own advantages and disadvantages, listed below.
Serverless Deployment:
Pros: Supports more advanced autoscaling policy (RPS, Concurrency); supports scale down to zero.
Cons: Slower compared to RAW_DEPLOYMENT due to infrastructure overhead.
Raw Deployment:
Pros: Relatively faster compared to SERVERLESS deployments; less infrastructure overhead and more cost efficient.
Cons: Supports only autoscaling based on CPU usage.
Users are able to configure the deployment mode of their model via the SDK or the UI.
The example below configures the deployment mode to use RAW_DEPLOYMENT:
Merlin supports a configurable autoscaling policy so that users have control over the autoscaling behavior of their models. There are 4 types of autoscaling metrics in Merlin:
CPU Utilization: The autoscaling is based on the ratio of the model service's CPU usage to its CPU request. This autoscaling policy is available on all deployment modes.
Memory Utilization: The autoscaling is based on the ratio of the model service's memory usage to its memory request. This autoscaling policy is available only on SERVERLESS deployment mode.
Model Throughput (RPS): The autoscaling is based on RPS per replica of the model service. This autoscaling policy is available only on SERVERLESS deployment mode.
Concurrency: The autoscaling is based on the number of concurrent requests served by a replica of the model service. This autoscaling policy is available only on SERVERLESS deployment mode.
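The availability rules in the list above can be expressed as a small lookup. This is a minimal sketch for illustration: the metric labels, the SUPPORTED_MODES table, and is_metric_supported are hypothetical and are not Merlin SDK identifiers; only the SERVERLESS and RAW_DEPLOYMENT mode names come from this document.

```python
# Assumption: metric labels below are illustrative names for the four
# autoscaling metrics described above, not identifiers from the Merlin SDK.
SUPPORTED_MODES = {
    "CPU_UTILIZATION": {"SERVERLESS", "RAW_DEPLOYMENT"},  # all deployment modes
    "MEMORY_UTILIZATION": {"SERVERLESS"},
    "RPS": {"SERVERLESS"},
    "CONCURRENCY": {"SERVERLESS"},
}


def is_metric_supported(metric: str, deployment_mode: str) -> bool:
    """Check whether an autoscaling metric is available for a deployment mode."""
    if metric not in SUPPORTED_MODES:
        raise ValueError(f"unknown autoscaling metric: {metric}")
    return deployment_mode in SUPPORTED_MODES[metric]
```

A check like this could run before deployment to catch, for example, an RPS-based policy paired with RAW_DEPLOYMENT.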
Users can update the autoscaling policy via the SDK or the UI.
Below is an example of configuring the autoscaling policy of a SERVERLESS deployment to use the RPS metric.
By default, Merlin determines the CPU limits of all model deployments using platform-level configured values. These CPU limits can either be calculated as a factor of the user-defined CPU request value for each deployment (e.g. 2x of the CPU request value) or as a constant value across all deployments.
However, users can override this platform-level configured value by setting the CPU limit explicitly in the UI or via the SDK.
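The resolution order described above (an explicit user value first, otherwise the platform default as either a factor of the request or a constant) can be sketched as follows. The function name, its parameters, and the use of millicores as the unit are assumptions made for illustration; this is not how Merlin necessarily implements it.

```python
from typing import Optional


def resolve_cpu_limit_millicores(
    cpu_request: int,
    user_limit: Optional[int] = None,
    platform_factor: Optional[float] = None,
    platform_constant: Optional[int] = None,
) -> int:
    """Return the CPU limit in millicores for a deployment (hypothetical helper).

    An explicit user_limit always wins. Otherwise exactly one platform-level
    setting must be provided: a factor applied to the CPU request, or a
    constant limit shared by all deployments.
    """
    if user_limit is not None:
        return user_limit
    # Exactly one of the two platform-level settings must be configured.
    if (platform_factor is None) == (platform_constant is None):
        raise ValueError("configure exactly one of platform_factor or platform_constant")
    if platform_factor is not None:
        return round(cpu_request * platform_factor)
    return platform_constant
```

With a request of 500 millicores and a platform factor of 2, this sketch yields a limit of 1000 millicores unless the user sets a limit explicitly.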
On the UI:
On the SDK:
When a model version is deployed, the model container is built with a liveness probe by default. The liveness probe periodically checks that your model is still alive, and restarts the pod automatically if it is deemed to be dead.
However, should you wish to disable this probe, you may do so by providing an environment variable to the model service with the following value:
This can be supplied via the deploy function, for example:
The liveness probe is also available for the transformer. More details can be found at:
Merlin supports 2 types of deployment mode: SERVERLESS and RAW_DEPLOYMENT. Under the hood, SERVERLESS deployment uses KNative as the serving stack. On the other hand, RAW_DEPLOYMENT uses native .