Integrate an MCP server with an LLM Inference Service on Azure Kubernetes Service (AKS) with the AI toolchain operator add-on

In this article, you connect an MCP-compliant tool server with an AI toolchain operator (KAITO) inference workspace on Azure Kubernetes Service (AKS), enabling secure and modular tool calling for LLM applications. You also learn how to validate end-to-end tool invocation by integrating the model with the MCP server and monitoring real-time function execution through structured responses.

Model Context Protocol (MCP)

As an extension of KAITO inference with tool calling, the Model Context Protocol (MCP) provides a standardized way to define and expose tools for language models to call.

Tool calling with MCP makes it easier to connect language models to real services and actions without tightly coupling logic into the model itself. Instead of embedding every function or API call into your application code, MCP lets you run a standalone tool server that exposes standardized tools or APIs that any compatible LLM can use. This clean separation means you can update tools independently, share them across models, and manage them like any other microservice.
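For example, a standalone MCP tool server can be a single small Python program. The following is a minimal sketch using the Python MCP SDK's FastMCP helper; the server name, tool, and transport choice here are illustrative assumptions rather than the reference server used later in this article.

    # Minimal sketch of a standalone MCP tool server (illustrative only).
    # Assumes the Python MCP SDK is installed, for example with: pip install mcp
    from datetime import datetime
    from zoneinfo import ZoneInfo

    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("time-tools")

    @mcp.tool()
    def get_current_time(timezone: str = "UTC") -> str:
        """Return the current time in the given IANA timezone as an ISO 8601 string."""
        return datetime.now(ZoneInfo(timezone)).isoformat()

    if __name__ == "__main__":
        # Expose the tool over HTTP so any compatible LLM or agent framework can call it.
        mcp.run(transport="streamable-http")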

You can bring your own (BYO) internal MCP servers or seamlessly connect external ones to your KAITO inference workspace on AKS.

MCP with AI toolchain operator (KAITO) on AKS

You can register an external MCP server in a uniform, schema-driven format and serve it to any compatible inference endpoint, including those deployed with a KAITO workspace. This approach allows for externalizing business logic, decoupling model behavior from tool execution, and reusing tools across agents, models, and environments.

In this guide, you register a pre-defined MCP server, test real calls issued by an LLM running in a KAITO inference workspace, and confirm that the entire tool execution path (from model prompt to MCP function invocation) works as intended. You have the flexibility to scale or swap tools independently of your model.

Prerequisites

Connect to a reference MCP server

In this example, we'll use a reference Time MCP Server, which provides time and time zone conversion capabilities, enabling LLMs to get the current time and convert times between zones using standardized timezone names.

Port-forward the KAITO inference service

  1. Confirm that your KAITO workspace is ready and retrieve the inference service endpoint using the kubectl get command.

    kubectl get svc workspace-phi-4-mini-toolcall
    

    Note

    The output might show a ClusterIP or another internal address. Check which port(s) the service listens on; the default KAITO inference API listens on port 80 over HTTP. If the service is only reachable inside the cluster, you can port-forward it locally.

  2. Port-forward the inference service for testing using the kubectl port-forward command.

    kubectl port-forward svc/workspace-phi-4-mini-toolcall 8000:80
    
  3. Check the /v1/models endpoint using curl to confirm that the Phi-4-mini-instruct LLM is available.

    curl http://localhost:8000/v1/models
    

    Your Phi-4-mini-instruct OpenAI-compatible inference API is now available at the following address (a quick Python smoke test follows these steps):

    http://localhost:8000/v1/chat/completions
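Before wiring up any tools, you can send a plain chat completion to confirm the endpoint responds. The following is a minimal sketch, assuming the port-forward above is active and the openai Python package is installed; the placeholder API key is accepted because the local endpoint doesn't enforce authentication.

    # Quick smoke test against the port-forwarded KAITO inference endpoint (sketch).
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

    # Discover the served model name instead of hard-coding it.
    model = client.models.list().data[0].id

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Say hello in one sentence."}],
    )
    print(response.choices[0].message.content)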
    

Confirm the reference MCP server is valid

This example assumes that the Time MCP server is hosted at https://mcp.example.com.

  • Confirm the server returns tools using curl (a Python equivalent follows the expected output).

    curl https://mcp.example.com/mcp/list_tools
    

    Expected output:

    {
      "tools": [
        {
          "name": "get_current_time",
          "description": "Get the current time in a specific timezone",
          "arguments": {
            "timezone": "string"
          }
        },
        {
          "name": "convert_time",
          "description": "Convert time between two timezones",
          "arguments": {
            "source_timezone": "string",
            "time": "string",
            "target_timezone": "string"
          }
        }
      ]
    }
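You can run the same check from Python if you prefer to script the validation. This is a sketch that assumes the /mcp/list_tools endpoint shown above and the requests package:

    # Sketch: verify the MCP server advertises the expected tools.
    import requests

    MCP_BASE_URL = "https://mcp.example.com"  # replace with your MCP server address

    resp = requests.get(f"{MCP_BASE_URL}/mcp/list_tools", timeout=10)
    resp.raise_for_status()

    tool_names = [tool["name"] for tool in resp.json().get("tools", [])]
    print("Available tools:", tool_names)

    # The Time MCP server used in this article should expose these two tools.
    assert {"get_current_time", "convert_time"} <= set(tool_names)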
    

Connect MCP server to the KAITO workspace using API request

KAITO automatically picks up tool definitions that are either declared directly in API requests or registered dynamically inside the inference runtime (vLLM with the MCP tool loader).
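To illustrate the first pattern, you can declare a tool inline in an OpenAI-compatible chat completion request. The following is a hedged sketch against the port-forwarded endpoint from the previous section; the function schema mirrors the get_current_time tool exposed by the Time MCP server, and executing the resulting tool call remains your application's responsibility.

    # Hedged sketch: declare a tool inline in an OpenAI-compatible chat request.
    # Assumes the KAITO inference service is port-forwarded to localhost:8000 and
    # that the served model supports tool calling.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
    model = client.models.list().data[0].id

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "What time is it in Europe/Paris?"}],
        tools=[
            {
                "type": "function",
                "function": {
                    "name": "get_current_time",
                    "description": "Get the current time in a specific timezone",
                    "parameters": {
                        "type": "object",
                        "properties": {"timezone": {"type": "string"}},
                        "required": ["timezone"],
                    },
                },
            }
        ],
    )

    # If the model chooses to call the tool, the call and its arguments appear here;
    # your application (or a framework like Autogen) is responsible for executing it.
    print(response.choices[0].message.tool_calls)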

In this guide, you create a Python virtual environment and send a tool-calling request to the Phi-4-mini-instruct inference endpoint, using a tool definition loaded from the MCP server.

  1. Define a new working directory for this test project.

    mkdir kaito-mcp
    cd kaito-mcp
    
  2. Create a Python virtual environment and activate it so that all packages are local to your test project.

    uv venv
    source .venv/bin/activate
    
  3. Install the open-source Autogen framework and its dependencies to test the tool-calling functionality:

    uv pip install "autogen-ext[openai]" "autogen-agentchat" "autogen-ext[mcp]"
    
  4. Create a test file named test.py that:

    • Connects to the Time MCP server and loads the get_current_time tool.
    • Connects to your KAITO inference service running at localhost:8000.
    • Sends an example query like “What time is it in Europe/Paris?”
    • Enables automatic selection and calling of the get_current_time tool.
    import asyncio
    
    from autogen_agentchat.agents import AssistantAgent
    from autogen_agentchat.ui import Console
    from autogen_core import CancellationToken
    from autogen_core.models import ModelFamily, ModelInfo
    from autogen_ext.models.openai import OpenAIChatCompletionClient
    from autogen_ext.tools.mcp import (StreamableHttpMcpToolAdapter,
                                    StreamableHttpServerParams)
    from openai import OpenAI
    
    
    async def main() -> None:
        # Create server params for the Time MCP service
        server_params = StreamableHttpServerParams(
            url="https://mcp.example.com/mcp",
            timeout=30.0,
            terminate_on_close=True,
        )
    
        # Load the get_current_time tool from the server
        adapter = await StreamableHttpMcpToolAdapter.from_server_params(server_params, "get_current_time")
    
        # Fetch model name from KAITO's local OpenAI-compatible API
        model = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy").models.list().data[0].id
    
        model_info: ModelInfo = {
            "vision": False,
            "function_calling": True,
            "json_output": True,
            "family": ModelFamily.UNKNOWN,
            "structured_output": True,
            "multiple_system_messages": True,
        }
    
        # Connect to the KAITO inference workspace
        model_client = OpenAIChatCompletionClient(
            base_url="http://localhost:8000/v1",
            api_key="dummy",
            model=model,
            model_info=model_info
        )
    
        # Define the assistant agent
        agent = AssistantAgent(
            name="time-assistant",
            model_client=model_client,
            tools=[adapter],
            system_message="You are a helpful assistant that can provide time information."
        )
    
        # Run a test task that invokes the tool
        await Console(
            agent.run_stream(
                task="What time is it in Europe/Paris?",
                cancellation_token=CancellationToken()
            )
        )
    
    if __name__ == "__main__":
        asyncio.run(main())
    
  5. Run the test script in your virtual environment.

    uv run test.py
    

    In the output of this test, you should expect the following:

    • The model correctly generates a tool call using the MCP name and expected arguments.
    • Autogen sends the tool call to the MCP server, which runs the tool logic and returns a result.
    • The Phi-4-mini-instruct LLM interprets the raw tool output and provides a natural language response.
    ---------- TextMessage (user) ----------
    What time is it in Europe/Paris?
    
    ---------- ToolCallRequestEvent (time-assistant) ----------
    [FunctionCall(id='chatcmpl-tool-xxxx', arguments='{"timezone": "Europe/Paris"}', name='get_current_time')]
    
    ---------- ToolCallExecutionEvent (time-assistant) ----------
    [FunctionExecutionResult(content='{"timezone":"Europe/Paris","datetime":"2025-09-17T17:43:05+02:00","is_dst":true}', name='get_current_time', call_id='chatcmpl-tool-xxxx', is_error=False)]
    
    ---------- ToolCallSummaryMessage (time-assistant) ----------
    The current time in Europe/Paris is 5:43 PM (CEST).
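    If you'd rather capture the result programmatically than stream it to the console, you can call agent.run() and inspect the returned messages. A minimal sketch of the change, assuming the same agent defined in test.py:

    # Inside main(), replace the Console(agent.run_stream(...)) block with:
    result = await agent.run(
        task="What time is it in Europe/Paris?",
        cancellation_token=CancellationToken()
    )

    # The last message holds the agent's final answer after the tool call completes.
    print(result.messages[-1].content)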
    

Experiment with more MCP tools

You can test the other tools available from this MCP server, such as convert_time.

  1. In your test.py file from the previous step, update your adapter definition to the following:

    adapter = await StreamableHttpMcpToolAdapter.from_server_params(server_params, "convert_time")
    
  2. Update your task definition to invoke the new tool. For example:

    task="Convert 9:30 AM New York time to Tokyo time."
    
  3. Save and run the Python script.

    uv run test.py
    

    Expected output:

    9:30 AM in New York is 10:30 PM in Tokyo.
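Rather than swapping the adapter for each tool, you can also load every tool the server advertises and let the model pick the right one per task. The following is a sketch of that variation, assuming Autogen's mcp_server_tools helper and the same server_params and model_client defined in test.py:

    # Sketch: inside main(), load all tools exposed by the MCP server at once.
    from autogen_ext.tools.mcp import mcp_server_tools

    tools = await mcp_server_tools(server_params)

    agent = AssistantAgent(
        name="time-assistant",
        model_client=model_client,
        tools=tools,  # includes both get_current_time and convert_time
        system_message="You are a helpful assistant that can provide time information."
    )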
    

Troubleshooting

The following list outlines common errors when testing KAITO inference with an external MCP server and how to resolve them:

  • Tool not found: Ensure that your tool name matches one declared in /mcp/list_tools.
  • 401 Unauthorized: If your MCP server requires an auth token, update server_params to include headers with the token (see the sketch after this list).
  • Connection refused: Ensure the KAITO inference service is port-forwarded correctly (for example, to localhost:8000).
  • Tool call ignored: Review the KAITO tool calling documentation to find vLLM models that support tool calling.
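For the 401 Unauthorized case, the change is usually limited to the server parameters. A hedged sketch, assuming your MCP server expects a bearer token; the header name and token value are placeholders:

    # Sketch: pass an auth header to the MCP server via StreamableHttpServerParams.
    server_params = StreamableHttpServerParams(
        url="https://mcp.example.com/mcp",
        headers={"Authorization": "Bearer <YOUR_TOKEN>"},  # placeholder token
        timeout=30.0,
        terminate_on_close=True,
    )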

Next steps

In this article, you learned how to connect a KAITO workspace to an external reference MCP server using Autogen to enable tool calling through the OpenAI-compatible API. You also validated that the LLM could discover, invoke, and integrate results from MCP-compliant tools on AKS. To learn more, see the following resources: