
Configure an alerting backend


Last updated 2 years ago

The Turing UI exposes alerting configurations for various Prometheus metrics, derived from the Kube state metrics as well as the Knative default metrics. When a user configures an alert, the API publishes the corresponding Prometheus alerting rules using GitOps, with the Git repository serving as an inventory of alerts. Appropriate CI/CD jobs may be configured on that repository to apply the changes as desired (e.g., publishing alerts to a Slack channel).

GitOps Configuration

Currently, only GitLab repositories may be configured for publishing alerts. The required client configuration (such as the GitLab token) may be set at deploy time under AlertConfig.GitLab (please refer to the sample Helm values file for an example).

Available Metrics

| Name | Prometheus Metric | Source |
| --- | --- | --- |
| throughput | revision_request_count | Knative |
| latency95p | revision_request_latencies_bucket | Knative |
| error_rate | revision_request_count | Knative |
| cpu_util | container_cpu_usage_seconds_total, kube_pod_container_resource_requests{resource="cpu"} | Kube state |
| memory_util | container_memory_usage_bytes, kube_pod_container_resource_requests{resource="memory"} | Kube state |
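
To make the cpu_util entry more concrete, the following is a minimal Python sketch of how its two source metrics might be combined into a single alert expression. The helper name `cpu_util_expr` and its parameters are hypothetical; the label names (`environment`, `pod`), the pod regex, and the 1m rate window are copied from the sample alert configuration on this page, not from the Turing source code.

```python
def cpu_util_expr(environment: str, pod_prefix: str, threshold: float,
                  window: str = "1m") -> str:
    """Build a cpu_util alert expression (hypothetical helper).

    The result is CPU usage as a percentage of the requested CPU,
    compared against the given threshold.
    """
    # Label selector shared by both metrics; the regex matches pods whose
    # names start with the given prefix (assumption taken from the sample).
    selector = f'environment="{environment}",pod=~"{pod_prefix}-[0-9]*.*"'

    # Per-cluster CPU seconds consumed per second over the rate window.
    usage = (
        "sum by(cluster) "
        f"(rate(container_cpu_usage_seconds_total{{{selector}}}[{window}]))"
    )
    # Per-cluster CPU requested by the same pods.
    requests = (
        "sum by(cluster) "
        f'(kube_pod_container_resource_requests{{resource="cpu",{selector}}})'
    )
    return f"{usage} / {requests} * 100 > {threshold:g}"


if __name__ == "__main__":
    print(cpu_util_expr("staging", "turing-hello-turing-router", 90))
```

Called with `"staging"`, `"turing-hello-turing-router"`, and `90`, the helper reproduces the `expr` of the warning rule in the sample configuration. A memory_util expression would presumably follow the same usage-over-requests shape, but its exact form is not shown on this page.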

Sample Alert Configuration

groups:
    - name: development_foo_turing-hello-turing-router_cpu_util
      rules:
        - alert: turing-hello-turing-router_cpu_util_violation_development
          expr: |-
            sum by(cluster) (rate(container_cpu_usage_seconds_total{environment="staging",pod=~"turing-hello-turing-router-[0-9]*.*"}[1m])) / sum by(cluster) (kube_pod_container_resource_requests{resource="cpu",environment="staging",pod=~"turing-hello-turing-router-[0-9]*.*"}) * 100 > 90
          for: 5m
          labels:
            owner: foo
            service_name: turing-hello-turing-router
            severity: warning
          annotations:
            dashboard: http://monitoring.com/turing-dashboard?var-cluster=test-kube-cluster&var-project=test-project&var-experiment=turing-hello
            description: 'cpu_util for the past 5m: {{ $value }}%'
            playbook: http://docs.com/Alert+Troubleshooting+Playbook
            summary: 'cpu_util is higher than the threshold: 90%'
        - alert: turing-hello-turing-router_cpu_util_violation_development
          expr: |-
            sum by(cluster) (rate(container_cpu_usage_seconds_total{environment="staging",pod=~"turing-hello-turing-router-[0-9]*.*"}[1m])) / sum by(cluster) (kube_pod_container_resource_requests{resource="cpu",environment="staging",pod=~"turing-hello-turing-router-[0-9]*.*"}) * 100 > 95
          for: 5m
          labels:
            owner: foo
            service_name: turing-hello-turing-router
            severity: critical
          annotations:
            dashboard: http://monitoring.com/turing-dashboard?var-cluster=test-kube-cluster&var-project=test-project&var-experiment=turing-hello
            description: 'cpu_util for the past 5m: {{ $value }}%'
            playbook: http://docs.com/Alert+Troubleshooting+Playbook
            summary: 'cpu_util is higher than the threshold: 95%'
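
Since CI/CD jobs on the alerts repository decide how published rules are applied, one plausible early step is a structural check of each rules file. The sketch below is an illustration only, not part of Turing: the function name `lint_rule_groups` is hypothetical, it expects the YAML to have already been parsed into a Python dict by a parser of your choice, and treating `labels.severity` as mandatory is an assumption based on the sample above. Only `alert` and `expr` are required by Prometheus itself.

```python
from typing import List

# Keys Prometheus requires on every alerting rule.
REQUIRED_RULE_KEYS = ("alert", "expr")


def lint_rule_groups(doc: dict) -> List[str]:
    """Return human-readable problems found in a parsed rules file."""
    problems = []
    for group in doc.get("groups", []):
        group_name = group.get("name", "<unnamed group>")
        seen = set()
        for index, rule in enumerate(group.get("rules", [])):
            where = f"{group_name} rule #{index}"
            for key in REQUIRED_RULE_KEYS:
                if not rule.get(key):
                    problems.append(f"{where}: missing '{key}'")
            # Assumption: every published rule carries a severity label,
            # as both rules in the sample configuration do.
            severity = rule.get("labels", {}).get("severity")
            if severity is None:
                problems.append(f"{where}: missing 'labels.severity'")
            # The sample reuses one alert name for its warning and critical
            # rules, so only an identical (name, severity) pair is flagged.
            identity = (rule.get("alert"), severity)
            if identity in seen:
                problems.append(f"{where}: duplicate alert/severity pair")
            seen.add(identity)
    return problems
```

A CI job could fail the pipeline whenever the returned list is non-empty and print each entry as a log line. For stricter validation of the expressions themselves, Prometheus ships `promtool check rules`, which this sketch does not replace.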