As an AI engineer or developer, you might need to prototype and deploy AI workloads with a range of different model weights. AKS gives you the option to deploy inferencing workloads using open-source presets supported out of the box and managed in the KAITO model registry, or to dynamically download models from the Hugging Face registry at runtime onto your AKS cluster.

In this article, you learn how to onboard a sample Hugging Face model for inferencing with the AI toolchain operator add-on on Azure Kubernetes Service (AKS), without having to manage custom images.
Prerequisites
- An Azure account with an active subscription. If you don't have an account, you can create one for free.
- An AKS cluster with the AI toolchain operator add-on enabled. For more information, see Enable KAITO on an AKS cluster.
- Quota for the `Standard_NCads_A100_v4` virtual machine (VM) family in your Azure subscription, which this example deployment requires. If you don't have quota for this VM family, request a quota increase.

Note

Currently, only the Hugging Face runtime supports inference with the KAITO custom model deployment template.
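If you haven't completed these prerequisites yet, the following sketch shows one way to check quota and enable the add-on from the Azure CLI. This is a minimal sketch, not the authoritative procedure: the `grep` filter is approximate, and the `--enable-ai-toolchain-operator` flag may require a recent Azure CLI version or the aks-preview extension, so follow the linked guides if these commands don't match your environment.

```bash
# Check regional vCPU quota for the NCads A100 v4 VM family (filter is approximate)
az vm list-usage --location <region> --output table | grep -i "ncads"

# Enable the AI toolchain operator add-on on an existing cluster
# (the add-on also requires the OIDC issuer to be enabled)
az aks update \
    --resource-group <resource-group-name> \
    --name <aks-cluster-name> \
    --enable-oidc-issuer \
    --enable-ai-toolchain-operator
```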
Choose an open-source language model from Hugging Face

In this example, we use the BigScience Bloom-1B7 small language model. Alternatively, you can choose from thousands of text-generation models supported on Hugging Face.
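If you want to browse alternatives programmatically rather than on the website, the public Hugging Face Hub API can list text-generation models. This query is illustrative and not part of the KAITO workflow; the `grep` extraction assumes the API returns compact JSON, and `jq` is cleaner if you have it installed.

```bash
# List five popular text-generation models from the Hugging Face Hub API
curl -s "https://huggingface.co/api/models?pipeline_tag=text-generation&sort=downloads&direction=-1&limit=5" \
  | grep -o '"id":"[^"]*"'
```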
1. Connect to your AKS cluster using the `az aks get-credentials` command.

```azurecli
az aks get-credentials --resource-group <resource-group-name> --name <aks-cluster-name>
```

2. Clone the KAITO project GitHub repository using the `git clone` command.

```bash
git clone https://github.com/kaito-project/kaito.git
```
Deploy your model inferencing workload using the KAITO workspace template
1. Navigate to the `kaito` directory and copy the sample deployment YAML manifest. Replace the default values in the following fields with your model's requirements. For this example, we specify the `bigscience/bloom-1b7` Hugging Face model ID for the BigScience Bloom-1B7 model:

    - `instanceType`: The minimum VM size for this inference service deployment is `Standard_NC24ads_A100_v4`. For larger model sizes, you can choose a VM in the `Standard_NCads_A100_v4` family with higher memory capacity.
    - `MODEL_ID`: Replace with your model's specific Hugging Face identifier, which can be found after `https://huggingface.co/` in the model card URL.
    - `"--torch_dtype"`: Set to `"float16"` for compatibility with V100 GPUs. For A100, H100, or newer GPUs, use `"bfloat16"`.
    - (Optional) `HF_TOKEN`: Specify the values in this section only if you're deploying a private or gated Hugging Face model for inference. A sketch for creating the referenced Secret follows this list.
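If you do need `HF_TOKEN`, the commented `env` block in the manifest below expects a Kubernetes Secret holding your Hugging Face access token. A minimal sketch for creating it, using the `hf-token-secret` name and `HF_TOKEN` key from the manifest's placeholders (`<your-hf-token>` is your own token):

```bash
# Create the Secret referenced by the manifest's commented env block
kubectl create secret generic hf-token-secret \
    --from-literal=HF_TOKEN=<your-hf-token>
```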
```yaml
apiVersion: kaito.sh/v1beta1
kind: Workspace
metadata:
  name: workspace-custom-llm
resource:
  instanceType: "Standard_NC24ads_A100_v4" # Replace with the required VM SKU based on model requirements
  labelSelector:
    matchLabels:
      apps: custom-llm
inference:
  template:
    spec:
      containers:
        - name: custom-llm-container
          image: mcr.microsoft.com/aks/kaito/kaito-base:0.0.8 # KAITO base image, which includes the Hugging Face runtime
          livenessProbe:
            failureThreshold: 3
            httpGet:
              path: /health
              port: 5000
              scheme: HTTP
            initialDelaySeconds: 600
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          readinessProbe:
            failureThreshold: 3
            httpGet:
              path: /health
              port: 5000
              scheme: HTTP
            initialDelaySeconds: 30
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          resources:
            requests:
              nvidia.com/gpu: 1 # Request 1 GPU; adjust as needed
            limits:
              nvidia.com/gpu: 1 # Optional: Limit to 1 GPU
          command:
            - "accelerate"
          args:
            - "launch"
            - "--num_processes"
            - "1"
            - "--num_machines"
            - "1"
            - "--gpu_ids"
            - "all"
            - "tfs/inference_api.py"
            - "--pipeline"
            - "text-generation"
            - "--trust_remote_code"
            - "--allow_remote_files"
            - "--pretrained_model_name_or_path"
            - "bigscience/bloom-1b7"
            - "--torch_dtype"
            - "bfloat16"
          # env: # HF_TOKEN is required only for private or gated Hugging Face models
          # Uncomment and configure this block if needed
          # - name: HF_TOKEN
          #   valueFrom:
          #     secretKeyRef:
          #       name: hf-token-secret # Replace with your Kubernetes Secret name
          #       key: HF_TOKEN # Replace with the specific key holding the token
          volumeMounts:
            - name: dshm
              mountPath: /dev/shm
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
```

2. Save these changes to your `custom-model-deployment.yaml` file.

3. Run the deployment in your AKS cluster using the `kubectl apply` command.

```bash
kubectl apply -f custom-model-deployment.yaml
```
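After you apply the manifest, KAITO provisions a dedicated GPU node and schedules the inference pod onto it. As an optional sanity check, you can watch both resources come up; this is a generic sketch, and exact node and pod names vary by cluster.

```bash
# Watch for the Standard_NC24ads_A100_v4 node that KAITO provisions
kubectl get nodes -w

# In a second shell, watch the inference pod start in the current namespace
kubectl get pods -w
```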
Test your custom model inferencing service
1. Track the live resource changes in your KAITO workspace using the `kubectl get workspace` command.

```bash
kubectl get workspace workspace-custom-llm -w
```

Note

Machine readiness can take up to 10 minutes, and workspace readiness up to 20 minutes.
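If readiness stalls well beyond those windows, describing the workspace is a reasonable first debugging step; the conditions and events it prints usually indicate whether node provisioning or the inference deployment is blocked.

```bash
# Inspect workspace conditions and events when readiness stalls
kubectl describe workspace workspace-custom-llm
```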
2. Check your language model inference service and get the service IP address using the `kubectl get svc` command.

```bash
export SERVICE_IP=$(kubectl get svc workspace-custom-llm -o jsonpath='{.spec.clusterIP}')
```

3. Test your custom model inference service with a sample input of your choice using the OpenAI API format:

```bash
kubectl run -it --rm --restart=Never curl --image=curlimages/curl -- curl -X POST http://$SERVICE_IP/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "bigscience/bloom-1b7",
    "prompt": "What sport should I play in rainy weather?",
    "max_tokens": 20
  }'
```
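If you prefer to test from your local machine instead of a temporary curl pod, port-forwarding is an alternative. This is a minimal sketch that assumes the service serves HTTP on port 80, as implied by the in-cluster request above.

```bash
# Forward the inference service to localhost (assumes service port 80)
kubectl port-forward svc/workspace-custom-llm 8080:80 &

# Send the same OpenAI-format completion request
curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "bigscience/bloom-1b7",
    "prompt": "What sport should I play in rainy weather?",
    "max_tokens": 20
  }'

# Stop the port-forward when finished
kill %1
```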
Clean up resources
If you no longer need these resources, you can delete them to avoid incurring extra Azure compute charges.
Delete the KAITO inference workspace using the `kubectl delete workspace` command.

```bash
kubectl delete workspace workspace-custom-llm
```
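Deleting the workspace should also prompt KAITO to tear down the GPU node it provisioned for this deployment; as a final check, you can confirm that the node is gone.

```bash
# Confirm the auto-provisioned GPU node has been removed
kubectl get nodes
```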
Next steps
In this article, you learned how to onboard a Hugging Face model for inferencing with the AI toolchain operator add-on directly on your AKS cluster. To learn more about AI and machine learning on AKS, see the following articles:

- Azure Kubernetes Service