Onboard custom models for inferencing with the AI toolchain operator (KAITO) on Azure Kubernetes Service (AKS)

As an AI engineer or developer, you might have to prototype and deploy AI workloads with a range of different model weights. AKS gives you the option to deploy inferencing workloads using open-source model presets that are supported out of the box and managed in the KAITO model registry, or to dynamically download a model from the HuggingFace registry onto your AKS cluster at runtime.

In this article, you learn how to onboard a sample HuggingFace model for inferencing with the AI toolchain operator add-on on Azure Kubernetes Service (AKS), without having to manage custom container images.

Prerequisites

  • An Azure account with an active subscription. If you don't have an account, you can create one for free.

  • An AKS cluster with the AI toolchain operator add-on enabled. For more information, see Enable KAITO on an AKS cluster.

  • This example deployment requires quota for the Standard_NCads_A100_v4 virtual machine (VM) family in your Azure subscription. If you don't have quota for this VM family, request a quota increase.
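
    You can check your subscription's current usage and limit for this VM family in your target region before requesting an increase; a minimal sketch using the Azure CLI (the region name is a placeholder, and the grep filter matches the NCads A100 v4 family rows):

    az vm list-usage --location <your-region> --output table | grep -i ncads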

    Note

    Currently, only the HuggingFace runtime supports inference with the KAITO custom model deployment template.

Choose an open-source language model from HuggingFace

In this example, we use the BigScience Bloom-1B7 small language model. Alternatively, you can choose from thousands of text-generation models supported on HuggingFace.

  1. Connect to your AKS cluster using the az aks get-credentials command.

    az aks get-credentials --resource-group <resource-group-name> --name <aks-cluster-name>
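
    You can confirm that kubectl now points at the intended cluster before continuing, for example:

    kubectl get nodes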
    
  2. Clone the KAITO project GitHub repository using the git clone command.

    git clone https://github.com/kaito-project/kaito.git
    

Deploy your model inferencing workload using the KAITO workspace template

  1. Navigate to the kaito directory and copy the sample deployment YAML manifest. Replace the default values in the following fields with your model's requirements. For this example, we specify the bigscience/bloom-1b7 HuggingFace model ID for the BigScience Bloom-1B7 model:

    • instanceType: The minimum VM size for this inference service deployment is Standard_NC24ads_A100_v4. For larger models, choose a VM SKU in the Standard_NCads_A100_v4 family with higher memory capacity.
    • MODEL_ID: Replace with your model's specific HuggingFace identifier, which is the part of the model card URL that follows https://huggingface.co/.
    • "--torch_dtype": Set to "float16" for compatibility with V100 GPUs. For A100, H100, or newer GPUs, use "bfloat16".
    • (Optional) HF_TOKEN: Specify the values in this section only if you're deploying a private or gated Hugging Face model for inference (see the Secret creation example after the manifest).
    apiVersion: kaito.sh/v1beta1
    kind: Workspace
    metadata:
      name: workspace-custom-llm
    resource:
      instanceType: "Standard_NC24ads_A100_v4" # Replace with the required VM SKU based on model requirements
      labelSelector:
        matchLabels:
          apps: custom-llm
    inference:
      template:
        spec:
          containers:
            - name: custom-llm-container
              image: mcr.microsoft.com/aks/kaito/kaito-base:0.0.8 # KAITO base image which includes hf runtime
              livenessProbe:
                failureThreshold: 3
                httpGet:
                  path: /health
                  port: 5000
                  scheme: HTTP
                initialDelaySeconds: 600
                periodSeconds: 10
                successThreshold: 1
                timeoutSeconds: 1
              readinessProbe:
                failureThreshold: 3
                httpGet:
                  path: /health
                  port: 5000
                  scheme: HTTP
                initialDelaySeconds: 30
                periodSeconds: 10
                successThreshold: 1
                timeoutSeconds: 1
              resources:
                requests:
                  nvidia.com/gpu: 1  # Request 1 GPU; adjust as needed
                limits:
                  nvidia.com/gpu: 1  # Optional: Limit to 1 GPU
              command:
                - "accelerate"
              args:
                - "launch"
                - "--num_processes"
                - "1"
                - "--num_machines"
                - "1"
                - "--gpu_ids"
                - "all"
                - "tfs/inference_api.py"
                - "--pipeline"
                - "text-generation"
                - "--trust_remote_code"
                - "--allow_remote_files"
                - "--pretrained_model_name_or_path"
                - "bloom-1b7"
                - "--torch_dtype"
                - "bfloat16"
              # env:
              #   HF_TOKEN is required only for private or gated Hugging Face models
              #   Uncomment and configure this block if needed
              #   - name: HF_TOKEN
              #     valueFrom:
              #       secretKeyRef:
              #         name: hf-token-secret     # Replace with your Kubernetes Secret name
              #         key: HF_TOKEN             # Replace with the specific key holding the token
              volumeMounts:
                - name: dshm
                  mountPath: /dev/shm
          volumes:
          - name: dshm
            emptyDir:
              medium: Memory
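
    (Optional) If you reference a private or gated Hugging Face model, create the Kubernetes Secret used by the commented env block before applying the manifest; a minimal sketch (the Secret name and key match the commented example, and the token value is a placeholder):

    kubectl create secret generic hf-token-secret --from-literal=HF_TOKEN=<your-hugging-face-token>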
    
  2. Save these changes to your custom-model-deployment.yaml file.

  3. Run the deployment in your AKS cluster using the kubectl apply command.

    kubectl apply -f custom-model-deployment.yaml
    

Test your custom model inferencing service

  1. Track the live resource changes in your KAITO workspace using the kubectl get workspace command.

    kubectl get workspace workspace-custom-llm -w
    

    Note

    Machine readiness can take up to 10 minutes, and workspace readiness up to 20 minutes.
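
    If the workspace stays in a not-ready state longer than expected, you can inspect its conditions and recent events, for example:

    kubectl describe workspace workspace-custom-llm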

  2. Check your language model inference service and get the service IP address using the kubectl get svc command.

    export SERVICE_IP=$(kubectl get svc workspace-custom-llm -o jsonpath='{.spec.clusterIP}')
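
    You can confirm the variable was populated before sending a test request:

    echo $SERVICE_IP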
    
  3. Test your custom model inference service with a sample input of your choice using the OpenAI API format:

    kubectl run -it --rm --restart=Never curl --image=curlimages/curl -- curl -X POST http://$SERVICE_IP/v1/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "bigscience/bloom-1b7",
        "prompt": "What sport should I play in rainy weather?",
        "max_tokens": 20
      }'
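
    Alternatively, you can test from your local machine by port forwarding the service; a sketch assuming the service listens on port 80, as implied by the URL above:

    kubectl port-forward svc/workspace-custom-llm 8080:80

    You can then send the same JSON request to http://localhost:8080/v1/completions from a second terminal.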
    

Clean up resources

If you no longer need these resources, you can delete them to avoid incurring extra Azure compute charges.

Delete the KAITO inference workspace using the kubectl delete workspace command.

kubectl delete workspace workspace-custom-llm
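
If you created the optional hf-token-secret for a private or gated model, delete it as well:

kubectl delete secret hf-token-secret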

Next steps

In this article, you learned how to onboard a HuggingFace model for inferencing with the AI toolchain operator add-on directly to your AKS cluster. To learn more about AI and machine learning on AKS, see the following articles: