Integrate an MCP server with an LLM Inference Service on Azure Kubernetes Service (AKS) with the AI toolchain operator add-on

In this article, you connect an MCP-compliant tool server with an AI toolchain operator (KAITO) inference workspace on Azure Kubernetes Service (AKS), enabling secure and modular tool calling for LLM applications. You also learn how to validate end-to-end tool invocation by integrating the model with the MCP server and monitoring real-time function execution through structured responses.

Model Context Protocol (MCP)

As an extension of KAITO inference with tool calling, the Model Context Protocol (MCP) provides a standardized way to define and expose tools for language models to call.

Tool calling with MCP makes it easier to connect language models to real services and actions without tightly coupling logic into the model itself. Instead of embedding every function or API call into your application code, MCP lets you run a standalone tool server that exposes standardized tools or APIs that any compatible LLM can use. This clean separation means you can update tools independently, share them across models, and manage them like any other microservice.
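For example, a standalone MCP tool server can be a single small Python program. The following is a minimal sketch using the Python MCP SDK's FastMCP helper; the server name, tool, and transport choice here are illustrative assumptions rather than the reference server used later in this article.

    # Minimal sketch of a standalone MCP tool server (illustrative only).
    # Assumes the Python MCP SDK is installed, for example with: pip install mcp
    from datetime import datetime
    from zoneinfo import ZoneInfo

    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("time-tools")

    @mcp.tool()
    def get_current_time(timezone: str = "UTC") -> str:
        """Return the current time in the given IANA timezone as an ISO 8601 string."""
        return datetime.now(ZoneInfo(timezone)).isoformat()

    if __name__ == "__main__":
        # Expose the tool over HTTP so any compatible LLM or agent framework can call it.
        mcp.run(transport="streamable-http")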

You can bring your own (BYO) internal MCP servers or seamlessly connect external ones to your KAITO inference workspace on AKS.

MCP with AI toolchain operator (KAITO) on AKS

You can register an external MCP server in a uniform, schema-driven format and serve it to any compatible inference endpoint, including those deployed with a KAITO workspace. This approach allows for externalizing business logic, decoupling model behavior from tool execution, and reusing tools across agents, models, and environments.

In this guide, you register a pre-defined MCP server, test real calls issued by an LLM running in a KAITO inference workspace, and confirm that the entire tool execution path (from model prompt to MCP function invocation) works as intended. You have the flexibility to scale or swap tools independently of your model.

Prerequisites

Connect to a reference MCP server

In this example, we'll use a reference Time MCP Server, which provides time and time zone conversion capabilities, enabling LLMs to get the current time and convert times between zones using standardized timezone names.

Port-forward the KAITO inference service

  1. Confirm that your KAITO workspace is ready and retrieve the inference service endpoint using the kubectl get command.

    kubectl get svc workspace-phi-4-mini-toolcall
    

    Note

    The output might show a ClusterIP or another internal address. Check which port(s) the service listens on; the default KAITO inference API listens on port 80 over HTTP. If the service is only reachable inside the cluster, you can port-forward it locally.

  2. Port-forward the inference service for testing using the kubectl port-forward command.

    kubectl port-forward svc/workspace-phi-4-mini-toolcall 8000:80
    
  3. Check the /v1/models endpoint using curl to confirm that the Phi-4-mini-instruct LLM is available.

    curl http://localhost:8000/v1/models
    

    Your Phi-4-mini-instruct OpenAI-compatible inference API is now available at the following address (a quick Python smoke test follows these steps):

    http://localhost:8000/v1/chat/completions
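Before wiring up any tools, you can send a plain chat completion to confirm the endpoint responds. The following is a minimal sketch, assuming the port-forward above is active and the openai Python package is installed; the placeholder API key is accepted because the local endpoint doesn't enforce authentication.

    # Quick smoke test against the port-forwarded KAITO inference endpoint (sketch).
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

    # Discover the served model name instead of hard-coding it.
    model = client.models.list().data[0].id

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Say hello in one sentence."}],
    )
    print(response.choices[0].message.content)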
    

Confirm the reference MCP server is valid

This example assumes that the Time MCP server is hosted at https://mcp.example.com.

  • Confirm the server returns tools using curl (a Python equivalent follows the expected output).

    curl https://mcp.example.com/mcp/list_tools
    

    Expected output:

    {
      "tools": [
        {
          "name": "get_current_time",
          "description": "Get the current time in a specific timezone",
          "arguments": {
            "timezone": "string"
          }
        },
        {
          "name": "convert_time",
          "description": "Convert time between two timezones",
          "arguments": {
            "source_timezone": "string",
            "time": "string",
            "target_timezone": "string"
          }
        }
      ]
    }
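You can run the same check from Python if you prefer to script the validation. This is a sketch that assumes the /mcp/list_tools endpoint shown above and the requests package:

    # Sketch: verify the MCP server advertises the expected tools.
    import requests

    MCP_BASE_URL = "https://mcp.example.com"  # replace with your MCP server address

    resp = requests.get(f"{MCP_BASE_URL}/mcp/list_tools", timeout=10)
    resp.raise_for_status()

    tool_names = [tool["name"] for tool in resp.json().get("tools", [])]
    print("Available tools:", tool_names)

    # The Time MCP server used in this article should expose these two tools.
    assert {"get_current_time", "convert_time"} <= set(tool_names)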
    

Connect MCP server to the KAITO workspace using API request

KAITO automatically picks up tool definitions that are either declared directly in API requests or registered dynamically inside the inference runtime (vLLM with the MCP tool loader).
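To illustrate the first pattern, you can declare a tool inline in an OpenAI-compatible chat completion request. The following is a hedged sketch against the port-forwarded endpoint from the previous section; the function schema mirrors the get_current_time tool exposed by the Time MCP server, and executing the resulting tool call remains your application's responsibility.

    # Hedged sketch: declare a tool inline in an OpenAI-compatible chat request.
    # Assumes the KAITO inference service is port-forwarded to localhost:8000 and
    # that the served model supports tool calling.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
    model = client.models.list().data[0].id

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "What time is it in Europe/Paris?"}],
        tools=[
            {
                "type": "function",
                "function": {
                    "name": "get_current_time",
                    "description": "Get the current time in a specific timezone",
                    "parameters": {
                        "type": "object",
                        "properties": {"timezone": {"type": "string"}},
                        "required": ["timezone"],
                    },
                },
            }
        ],
    )

    # If the model chooses to call the tool, the call and its arguments appear here;
    # your application (or a framework like Autogen) is responsible for executing it.
    print(response.choices[0].message.tool_calls)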

In this guide, you create a Python virtual environment and send a tool-calling request to the Phi-4-mini-instruct inference endpoint, using a tool definition loaded from the MCP server.

  1. Define a new working directory for this test project.

    mkdir kaito-mcp
    cd kaito-mcp
    
  2. Create a Python virtual environment and activate it so that all packages are local to your test project.

    uv venv
    source .venv/bin/activate
    
  3. Install the open-source Autogen framework and its dependencies to test the tool-calling functionality:

    uv pip install "autogen-ext[openai]" "autogen-agentchat" "autogen-ext[mcp]"
    
  4. Create a test file named test.py that:

    • Connects to the Time MCP server and loads the get_current_time tool.
    • Connects to your KAITO inference service running at localhost:8000.
    • Sends an example query like “What time is it in Europe/Paris?”
    • Enables automatic selection and calling of the get_current_time tool.
    import asyncio
    
    from autogen_agentchat.agents import AssistantAgent
    from autogen_agentchat.ui import Console
    from autogen_core import CancellationToken
    from autogen_core.models import ModelFamily, ModelInfo
    from autogen_ext.models.openai import OpenAIChatCompletionClient
    from autogen_ext.tools.mcp import (StreamableHttpMcpToolAdapter,
                                    StreamableHttpServerParams)
    from openai import OpenAI
    
    
    async def main() -> None:
        # Create server params for the Time MCP service
        server_params = StreamableHttpServerParams(
            url="https://mcp.example.com/mcp",
            timeout=30.0,
            terminate_on_close=True,
        )
    
        # Load the get_current_time tool from the server
        adapter = await StreamableHttpMcpToolAdapter.from_server_params(server_params, "get_current_time")
    
        # Fetch model name from KAITO's local OpenAI-compatible API
        model = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy").models.list().data[0].id
    
        model_info: ModelInfo = {
            "vision": False,
            "function_calling": True,
            "json_output": True,
            "family": ModelFamily.UNKNOWN,
            "structured_output": True,
            "multiple_system_messages": True,
        }
    
        # Connect to the KAITO inference workspace
        model_client = OpenAIChatCompletionClient(
            base_url="http://localhost:8000/v1",
            api_key="dummy",
            model=model,
            model_info=model_info
        )
    
        # Define the assistant agent
        agent = AssistantAgent(
            name="time-assistant",
            model_client=model_client,
            tools=[adapter],
            system_message="You are a helpful assistant that can provide time information."
        )
    
        # Run a test task that invokes the tool
        await Console(
            agent.run_stream(
                task="What time is it in Europe/Paris?",
                cancellation_token=CancellationToken()
            )
        )
    
    if __name__ == "__main__":
        asyncio.run(main())
    
  5. Run the test script in your virtual environment.

    uv run test.py
    

    In the output of this test, you should expect the following:

    • The model correctly generates a tool call using the MCP name and expected arguments.
    • Autogen sends the tool call to the MCP server, which runs the tool logic and returns a result.
    • The Phi-4-mini-instruct LLM interprets the raw tool output and provides a natural language response.
    ---------- TextMessage (user) ----------
    What time is it in Europe/Paris?
    
    ---------- ToolCallRequestEvent (time-assistant) ----------
    [FunctionCall(id='chatcmpl-tool-xxxx', arguments='{"timezone": "Europe/Paris"}', name='get_current_time')]
    
    ---------- ToolCallExecutionEvent (time-assistant) ----------
    [FunctionExecutionResult(content='{"timezone":"Europe/Paris","datetime":"2025-09-17T17:43:05+02:00","is_dst":true}', name='get_current_time', call_id='chatcmpl-tool-xxxx', is_error=False)]
    
    ---------- ToolCallSummaryMessage (time-assistant) ----------
    The current time in Europe/Paris is 5:43 PM (CEST).
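    If you'd rather capture the result programmatically than stream it to the console, you can call agent.run() and inspect the returned messages. A minimal sketch of the change, assuming the same agent defined in test.py:

    # Inside main(), replace the Console(agent.run_stream(...)) block with:
    result = await agent.run(
        task="What time is it in Europe/Paris?",
        cancellation_token=CancellationToken()
    )

    # The last message holds the agent's final answer after the tool call completes.
    print(result.messages[-1].content)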
    

Experiment with more MCP tools

You can test the other tools available from this MCP server, such as convert_time.

  1. In your test.py file from the previous step, update your adapter definition to the following:

    adapter = await StreamableHttpMcpToolAdapter.from_server_params(server_params, "convert_time")
    
  2. Update your task definition to invoke the new tool. For example:

    task="Convert 9:30 AM New York time to Tokyo time."
    
  3. Save and run the Python script.

    uv run test.py
    

    Expected output:

    9:30 AM in New York is 10:30 PM in Tokyo.
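Rather than swapping the adapter for each tool, you can also load every tool the server advertises and let the model pick the right one per task. The following is a sketch of that variation, assuming Autogen's mcp_server_tools helper and the same server_params and model_client defined in test.py:

    # Sketch: inside main(), load all tools exposed by the MCP server at once.
    from autogen_ext.tools.mcp import mcp_server_tools

    tools = await mcp_server_tools(server_params)

    agent = AssistantAgent(
        name="time-assistant",
        model_client=model_client,
        tools=tools,  # includes both get_current_time and convert_time
        system_message="You are a helpful assistant that can provide time information."
    )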
    

Troubleshooting

The following list outlines common errors when testing KAITO inference with an external MCP server and how to resolve them:

  • Tool not found: Ensure that your tool name matches one declared in /mcp/list_tools.
  • 401 Unauthorized: If your MCP server requires an auth token, update server_params to include headers with the token (see the sketch after this list).
  • Connection refused: Ensure the KAITO inference service is port-forwarded correctly (for example, to localhost:8000).
  • Tool call ignored: Review the KAITO tool calling documentation to find vLLM models that support tool calling.
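For the 401 Unauthorized case, the change is usually limited to the server parameters. A hedged sketch, assuming your MCP server expects a bearer token; the header name and token value are placeholders:

    # Sketch: pass an auth header to the MCP server via StreamableHttpServerParams.
    server_params = StreamableHttpServerParams(
        url="https://mcp.example.com/mcp",
        headers={"Authorization": "Bearer <YOUR_TOKEN>"},  # placeholder token
        timeout=30.0,
        terminate_on_close=True,
    )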

Next steps

In this article, you learned how to connect a KAITO workspace to an external reference MCP server using Autogen to enable tool calling through the OpenAI-compatible API. You also validated that the LLM could discover, invoke, and integrate results from MCP-compliant tools on AKS. To learn more, see the following resources: