In this article, you configure and deploy an AI toolchain operator (KAITO) inference workspace on Azure Kubernetes Service (AKS) with support for OpenAI-style tool calling. You also learn how to validate tool calling functionality using vLLM metrics and local function mocks.
What is tool calling?
Tool calling enables large language models (LLMs) to interface with external functions, APIs, or services. Instead of just generating text, an LLM can decide:
- "I need to call a weather API."
- "I need to use a calculator."
- "I should search a database."
It does this by invoking a defined “tool” with parameters it chooses based on the user’s request. Tool calling is useful for:
- Chatbots that book, summarize, or calculate.
- Enterprise LLM applications where hallucination must be minimized.
- Agent frameworks (AutoGen, LangGraph, LangChain, AgentOps, etc.).
In production environments, AI-enabled applications often demand more than natural language generation; they require the ability to take action based on user intent. Tool calling empowers LLMs to extend beyond text responses by invoking external tools, APIs, or custom logic in real time. This bridges the gap between language understanding and execution, enabling developers to build interactive AI assistants, agents, and automation workflows that are both accurate and useful. Instead of relying on static responses, LLMs can now access live data, trigger services, and complete tasks on behalf of users, both safely and reliably.
When deployed on AKS, tool calling becomes scalable, secure, and production ready. Kubernetes provides the flexibility to orchestrate inference workloads using high-performance runtimes like vLLM, while ensuring observability and governance of tool usage. With this pattern, AKS operators and app developers can more seamlessly update models or tools independently and deploy advanced AI features without compromising reliability.
As a result, tool calling on AKS is now a foundational pattern for building modern AI apps that are context-aware, action-capable, and enterprise-ready.
Tool calling with KAITO
To streamline this deployment model, the AI toolchain operator (KAITO) add-on for AKS provides a managed solution for running inference services with tool calling support. By leveraging KAITO inference workspaces, you can quickly spin up scalable, GPU-accelerated model endpoints with built-in support for tool calling and OpenAI-compatible APIs. This eliminates the operational overhead of configuring runtimes, managing dependencies, or scaling infrastructure manually.
Prerequisites
- This article assumes that you have an existing AKS cluster. If you don't have a cluster, create one by using the Azure CLI, Azure PowerShell, or the Azure portal.
- Your AKS cluster is running Kubernetes version 1.33 or higher. To upgrade your cluster, see Upgrade your AKS cluster.
- Azure CLI version 2.77.0 or later installed and configured. To find your version, run az --version. To install or update, see Install the Azure CLI.
- The AI toolchain operator add-on enabled on your cluster. If you still need to enable it, see the sketch after this list.
- A deployed KAITO inference workspace that supports tool calling. Refer to the official KAITO tool calling documentation for the models that support tool calling with vLLM.
- The workspace-phi-4-mini-toolcall KAITO workspace deployed with the default configuration, as sketched after this list.
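If you still need to enable the add-on or deploy the workspace, the following is a minimal sketch of those steps. It assumes placeholder resource group and cluster names and a manifest file named workspace-phi-4-mini-toolcall.yaml taken from the KAITO tool calling documentation; confirm the exact add-on flags and manifest contents against the AKS and KAITO documentation for your versions.

```bash
# Enable the AI toolchain operator add-on on an existing cluster
# (the add-on docs pair this with enabling the OIDC issuer; verify for your setup).
az aks update \
    --resource-group <resource-group> \
    --name <cluster-name> \
    --enable-oidc-issuer \
    --enable-ai-toolchain-operator

# Point kubectl at the cluster.
az aks get-credentials --resource-group <resource-group> --name <cluster-name>

# Deploy the tool calling workspace; the manifest filename here is a placeholder for
# the workspace-phi-4-mini-toolcall manifest published in the KAITO documentation.
kubectl apply -f workspace-phi-4-mini-toolcall.yaml
```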
Confirm the KAITO inference workspace is running
Monitor your workspace deployment with the kubectl get command.

```bash
kubectl get workspace workspace-phi-4-mini-toolcall -w
```

In the output, verify that the resource (ResourceReady) and inference (InferenceReady) conditions are ready and that the workspace succeeded (WorkspaceSucceeded is True).
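If you prefer to script this readiness check, the following is a minimal sketch that reads the workspace status conditions directly; it assumes the workspace reports readiness through standard Kubernetes status.conditions, which is how the ResourceReady, InferenceReady, and WorkspaceSucceeded values above are surfaced.

```bash
# Print each workspace condition and its status; expect ResourceReady,
# InferenceReady, and WorkspaceSucceeded to all report True once deployment completes.
kubectl get workspace workspace-phi-4-mini-toolcall \
  -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
```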
Confirm the inference API is ready to serve
Once the workspace is ready, find the service endpoint using the kubectl get command.

```bash
kubectl get svc workspace-phi-4-mini-toolcall
```

Note

The output might be a ClusterIP or internal address. Check which port(s) the service listens on. The default KAITO inference API is served on port 80 for HTTP. If the service is only reachable inside the cluster, you can port-forward it locally.

Port-forward the inference service for testing using the kubectl port-forward command.

```bash
kubectl port-forward svc/workspace-phi-4-mini-toolcall 8000:80
```

Check the /v1/models endpoint to confirm the LLM is available using curl.

```bash
curl http://localhost:8000/v1/models
```

To confirm that the LLM is deployed and the API is working, verify your output is similar to the following:

```json
{
  "object": "list",
  "data": [
    {
      "id": "phi-4-mini-instruct",
      ...
    }
  ]
}
```
Test named function tool calling

In this example, the workspace-phi-4-mini-toolcall workspace supports named function tool calling by default, so you can confirm that the LLM accepts a "tool" spec in OpenAI-style requests and returns a "function call" structure.

The Python snippet used in this section is from the KAITO documentation and uses an OpenAI-compatible client. This example:

- Initializes the OpenAI-compatible client to talk to a local inference server. The server is assumed to be running at http://localhost:8000/v1 and accepts OpenAI-style API calls.
- Simulates the backend logic for a tool called get_weather. (In a real scenario, this would call a weather API.)
- Describes the tool interface; the Phi-4-mini LLM sees this tool and decides whether to use it based on the user's input.
- Sends a sample chat message to the model and provides the tool spec. The setting tool_choice="auto" allows the LLM to decide whether it should call a tool based on the prompt.
- Simulates the execution of the tool, because the user's request is relevant to the get_weather tool, by calling the local function with the model's chosen arguments.
```python
from openai import OpenAI
import json

# local server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

def get_weather(location: str, unit: str) -> str:
    return f"Getting the weather for {location} in {unit}..."

tool_functions = {"get_weather": get_weather}

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["location", "unit"]
        }
    }
}]

response = client.chat.completions.create(
    model="phi-4-mini-instruct",  # or client.models.list().data[0].id
    messages=[{"role": "user", "content": "What's the weather like in San Francisco?"}],
    tools=tools,
    tool_choice="auto"
)

# Inspect response
tool_call = response.choices[0].message.tool_calls[0].function
args = json.loads(tool_call.arguments)

print("Function called:", tool_call.name)
print("Arguments:", args)
print("Result:", tool_functions[tool_call.name](**args))
```

Your output should look similar to the following:
```
Function called: get_weather
Arguments: {"location": "San Francisco, CA", "unit": "fahrenheit"}
Result: Getting the weather for San Francisco, CA in fahrenheit...
```

The tool_calls field comes back, meaning the Phi-4-mini LLM decided to invoke the function. A sample tool call has now been successfully parsed and executed based on the model's decision, confirming end-to-end tool calling behavior with the KAITO inference deployment.
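To validate this activity with vLLM metrics, as mentioned at the start of this article, you can scrape the runtime's Prometheus endpoint. This is a minimal sketch; it assumes the port-forward from the previous section is still active and that the workspace exposes the standard vLLM /metrics endpoint, where metric names carry the vllm: prefix.

```bash
# Filter the Prometheus-format metrics for vLLM counters to confirm the
# requests above (including the tool calling request) were actually served.
curl -s http://localhost:8000/metrics | grep "^vllm:"
```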
Troubleshooting
Model preset doesn’t support tool calling
If you pick a model that isn't on the supported list, tool calling might not work. Make sure you review the KAITO documentation, which explicitly lists which presets support tool calling.
Misaligned runtime
The KAITO inference workspace must use the vLLM runtime for tool calling (the Hugging Face Transformers runtime generally doesn't support tool calling in KAITO).
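One way to confirm which runtime your workspace is actually using is to check the startup logs of its inference workload. This sketch assumes the KAITO-created deployment shares the workspace name; adjust the resource name if your cluster differs.

```bash
# Inspect the inference pod's startup logs; a vLLM-based deployment logs its own
# engine banner and server arguments, while a Transformers runtime does not.
# (Deployment name assumed to match the workspace name.)
kubectl logs deployment/workspace-phi-4-mini-toolcall --tail=200 | grep -i vllm
```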
Network / endpoint issues
If you're port-forwarding, ensure the service ports are correctly forwarded. If an external MCP server is unreachable, the tool call will error out.
Timeouts
External MCP server calls might take time. Make sure the adapter or client timeout is sufficiently high.
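If you're driving the endpoint with curl while testing, you can raise the client-side timeout explicitly. This is a minimal sketch; the 120-second value and the request.json payload file are arbitrary examples.

```bash
# Allow up to 120 seconds for the full round trip, for cases where a tool call
# fans out to a slow external service before the model responds.
curl --max-time 120 http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @request.json
```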
Authentication
If the external MCP server requires authentication (API key, header, etc.), ensure you supply correct credentials.
Next steps
- Set up vLLM monitoring in the AI toolchain operator add-on with Prometheus and Grafana on AKS.
- Learn about MCP server support with KAITO and test standardized tool calling examples on your AKS cluster.