Integrate tool calling with LLM inference using the AI toolchain operator add-on on Azure Kubernetes Service (AKS)

In this article, you configure and deploy an AI toolchain operator (KAITO) inference workspace on Azure Kubernetes Service (AKS) with support for OpenAI-style tool calling. You also learn how to validate tool calling functionality using vLLM metrics and local function mocks.

What is tool calling?

Tool calling enables large language models (LLMs) to interface with external functions, APIs, or services. Instead of just generating text, an LLM can decide:

  • "I need to call a weather API."
  • "I need to use a calculator."
  • "I should search a database."

It does this by invoking a defined “tool” with parameters it chooses based on the user’s request. Tool calling is useful for:

  • Chatbots that book, summarize, or calculate.
  • Enterprise LLM applications where hallucination must be minimized.
  • Agent frameworks (AutoGen, LangGraph, LangChain, AgentOps, etc.).

In production environments, AI-enabled applications often demand more than natural language generation; they require the ability to take action based on user intent. Tool calling empowers LLMs to extend beyond text responses by invoking external tools, APIs, or custom logic in real time. This bridges the gap between language understanding and execution, enabling developers to build interactive AI assistants, agents, and automation workflows that are both accurate and useful. Instead of relying on static responses, LLMs can now access live data, trigger services, and complete tasks on behalf of users, both safely and reliably.

When deployed on AKS, tool calling becomes scalable, secure, and production ready. Kubernetes provides the flexibility to orchestrate inference workloads using high-performance runtimes like vLLM, while ensuring observability and governance of tool usage. With this pattern, AKS operators and app developers can more seamlessly update models or tools independently and deploy advanced AI features without compromising reliability.

As a result, tool calling on AKS is now a foundational pattern for building modern AI apps that are context-aware, action-capable, and enterprise-ready.

Tool calling with KAITO

To streamline this deployment model, the AI toolchain operator (KAITO) add-on for AKS provides a managed solution for running inference services with tool calling support. By leveraging KAITO inference workspaces, you can quickly spin up scalable, GPU-accelerated model endpoints with built-in support for tool calling and OpenAI-compatible APIs. This eliminates the operational overhead of configuring runtimes, managing dependencies, or scaling infrastructure manually.

Prerequisites

  • This article assumes that you have an existing AKS cluster. If you don't have a cluster, create one by using the Azure CLI, Azure PowerShell, or the Azure portal.
  • Your AKS cluster is running on Kubernetes version 1.33 or higher. To upgrade your cluster, see Upgrade your AKS cluster.
  • Install and configure Azure CLI version 2.77.0 or later. To find your version, run az --version. To install or update, see Install the Azure CLI.
  • The AI toolchain operator add-on enabled on your cluster.
  • A deployed KAITO inference workspace that supports tool calling. Refer to the official KAITO tool calling documentation for the list of models that support tool calling with vLLM.
  • You deployed the workspace-phi-4-mini-toolcall KAITO workspace with the default configuration.

Confirm the KAITO inference workspace is running

  • Monitor your workspace deployment with the kubectl get command.

    kubectl get workspace workspace-phi-4-mini-toolcall -w
    

    In the output, verify that the resource (ResourceReady) and inference (InferenceReady) conditions are True and that the workspace succeeded (WorkspaceSucceeded is True), similar to the following example output.
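
    The exact columns depend on your KAITO version, and the instance type reflects the GPU SKU provisioned for your workspace:

    NAME                            INSTANCE                   RESOURCEREADY   INFERENCEREADY   WORKSPACESUCCEEDED   AGE
    workspace-phi-4-mini-toolcall   <your-gpu-instance-type>   True            True             True                 10m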

Confirm the inference API is ready to serve

  1. Once the workspace is ready, find the service endpoint using the kubectl get command.

    kubectl get svc workspace-phi-4-mini-toolcall
    

    Note

    The output might be a ClusterIP or internal address. Check which port(s) the service listens on. The default KAITO inference API is on port 80 for HTTP. If it's only internal, you can port-forward locally.

  2. Port-forward the inference service for testing using the kubectl port-forward command.

    kubectl port-forward svc/workspace-phi-4-mini-toolcall 8000:80
    
  3. Check the /v1/models endpoint to confirm the LLM is available using curl.

    curl http://localhost:8000/v1/models
    

    If the LLM is deployed and the API is working, your output should be similar to the following:

    ...
    {
      "object": "list",
      "data": [
        {
          "id": "phi-4-mini-instruct",
          ...
          ...
        }
      ]
    }
    ...
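
    Optionally, you can run the same check with the OpenAI-compatible Python client that the next section uses. This snippet assumes the port-forward from the previous step is still running and that the openai Python package is installed:

    from openai import OpenAI

    # Point the OpenAI-compatible client at the port-forwarded KAITO endpoint.
    # The API key is required by the client but ignored by the local server.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

    # List the models served by the inference workspace and print their IDs.
    for model in client.models.list().data:
        print(model.id)  # expect an ID such as "phi-4-mini-instruct"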
    

Test named function tool calling

In this example, the workspace-phi-4-mini-toolcall workspace supports named function tool calling by default, so you can confirm that the LLM accepts a tools specification in OpenAI-style requests and returns a function call structure.

The Python snippet in this section is from the KAITO documentation and uses an OpenAI-compatible client.

  • Run the following Python example to verify the end-to-end tool calling flow. The example:

    • Initializes the OpenAI-compatible client to talk to a local inference server. The server is assumed to be running at http://localhost:8000/v1 and accepts OpenAI-style API calls.
    • Simulates the backend logic for a tool called get_weather. (In a real scenario, this would call a weather API.)
    • Describes the tool interface; the Phi-4-mini LLM will see this tool and decide whether to use it based on the user's input.
    • Sends a sample chat message to the model and provides the tool spec. The setting tool_choice="auto" allows the LLM to decide if it should call a tool based on the prompt.
    • Because the user's request is relevant to the get_weather tool, the example simulates the tool's execution by calling the local function with the model's chosen arguments.

    from openai import OpenAI
    import json
    
    # local server
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
    
    def get_weather(location: str, unit: str) -> str:
        return f"Getting the weather for {location} in {unit}..."
    
    tool_functions = {"get_weather": get_weather}
    
    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
                },
                "required": ["location", "unit"]
            }
        }
    }]
    
    response = client.chat.completions.create(
        model="phi-4-mini-instruct",   # or client.models.list().data[0].id
        messages=[{"role": "user", "content": "What's the weather like in San Francisco?"}],
        tools=tools,
        tool_choice="auto"
    )
    
    # Inspect response
    tool_call = response.choices[0].message.tool_calls[0].function
    args = json.loads(tool_call.arguments)
    print("Function called:", tool_call.name)
    print("Arguments:", args)
    print("Result:", tool_functions[tool_call.name](**args))
    

    Your output should look similar to the following:

    Function called: get_weather  
    Arguments: {'location': 'San Francisco, CA', 'unit': 'fahrenheit'}
    Result: Getting the weather for San Francisco, CA in fahrenheit...
    

    The response includes a tool_calls field, which means the Phi-4-mini LLM decided to invoke the function. A sample tool call was parsed and executed based on the model's decision, confirming end-to-end tool calling behavior with the KAITO inference deployment. To complete the loop, you can optionally return the tool's result to the model for a final natural-language answer, as shown in the following sketch.
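
    This continuation is a minimal sketch (not from the KAITO documentation) that reuses the client, tools, response, tool_call, args, and tool_functions objects from the previous snippet. It appends the assistant's tool call and the tool's output to the conversation, then asks the model for a final answer; the quality of that answer depends on the model's chat template.

    # Send the tool result back to the model so it can produce a final answer.
    followup = client.chat.completions.create(
        model="phi-4-mini-instruct",
        messages=[
            {"role": "user", "content": "What's the weather like in San Francisco?"},
            response.choices[0].message,  # assistant message containing the tool call
            {
                "role": "tool",
                "tool_call_id": response.choices[0].message.tool_calls[0].id,
                "content": tool_functions[tool_call.name](**args),
            },
        ],
        tools=tools,
    )

    print(followup.choices[0].message.content)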

Troubleshooting

Model preset doesn’t support tool calling

If you choose a model preset that isn't on the supported list, tool calling might not work. Review the KAITO documentation, which explicitly lists the presets that support tool calling.

Misaligned runtime

The KAITO inference workspace must use the vLLM runtime for tool calling. The Hugging Face Transformers runtime generally doesn't support tool calling in KAITO.

Network / endpoint issues

If you're port-forwarding, ensure the service ports are correctly forwarded. If an external MCP server is unreachable, the tool call errors out.

Timeouts

External MCP server calls might take time. Make sure the adapter or client timeout is sufficiently high, as in the following example.
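
For example, if the slow call happens in your own client code, you can raise the request timeout (and retry budget) when you construct the OpenAI-compatible client used earlier. The values here are illustrative:

    from openai import OpenAI

    # Allow up to 120 seconds per request and two automatic retries.
    client = OpenAI(
        base_url="http://localhost:8000/v1",
        api_key="dummy",
        timeout=120.0,
        max_retries=2,
    )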

Authentication

If the external MCP server requires authentication (API key, header, etc.), ensure you supply correct credentials.

Next steps