Important
- Foundry Local is available in preview. Public preview releases provide early access to features that are in active deployment.
- Features, approaches, and processes can change or have limited capabilities before General Availability (GA).
Foundry Local runs ONNX models on your device with high performance. Although the model catalog offers precompiled options out of the box, any model in the ONNX format works.
Use Olive to compile models in Safetensor or PyTorch format to ONNX. Olive optimizes models for ONNX, making them suitable for deployment in Foundry Local. It uses techniques like quantization and graph optimization to improve performance.
This guide shows how to:
- Convert and optimize models from Hugging Face to run in Foundry Local. The examples use the Llama-3.2-1B-Instruct model, but any generative AI model from Hugging Face works.
- Run your optimized models with Foundry Local.
Prerequisites
- Python 3.10 or later
Install Olive
Olive optimizes models and converts them to the ONNX format.
pip install olive-ai[auto-opt]
Sign in to Hugging Face
The Llama-3.2-1B-Instruct model requires Hugging Face authentication.
huggingface-cli login
Note
Create a Hugging Face token and request model access before proceeding.
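If you prefer not to sign in interactively, the Hugging Face CLI also accepts a token directly. A minimal sketch, assuming your access token is stored in the HF_TOKEN environment variable:
# Non-interactive sign-in; assumes the token is exported as HF_TOKEN
huggingface-cli login --token $HF_TOKEN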
Compile the model
Step 1: Run the Olive auto-opt command
Use the Olive auto-opt command to download, convert, quantize, and optimize the model:
olive auto-opt \
--model_name_or_path meta-llama/Llama-3.2-1B-Instruct \
--trust_remote_code \
--output_path models/llama \
--device cpu \
--provider CPUExecutionProvider \
--use_ort_genai \
--precision int4 \
--log_level 1
Note
The compilation process takes about 60 seconds, plus download time.
The command uses the following parameters:
| Parameter | Description |
|---|---|
| model_name_or_path | Model source: Hugging Face ID, local path, or Azure AI Model registry ID |
| output_path | Where to save the optimized model |
| device | Target hardware: cpu, gpu, or npu |
| provider | Execution provider (for example, CPUExecutionProvider, CUDAExecutionProvider) |
| precision | Model precision: fp16, fp32, int4, or int8 |
| use_ort_genai | Creates inference configuration files |
Tip
If you have a local copy of the model, you can use a local path instead of the Hugging Face ID. For example, --model_name_or_path models/llama-3.2-1B-Instruct. Olive handles the conversion, optimization, and quantization automatically.
Step 2: Rename the output model
Olive creates a generic model directory. Rename it for easier reuse:
cd models/llama
mv model llama-3.2
Step 3: Create chat template file
A chat template is a structured format that defines how input and output messages are processed for a conversational AI model. It specifies the roles (for example, system, user, assistant) and the structure of the conversation, ensuring that the model understands the context and generates appropriate responses.
Foundry Local requires a chat template JSON file named inference_model.json to generate responses. The template includes the model name and a PromptTemplate object. The object contains a {Content} placeholder that Foundry Local injects at runtime with the user prompt.
{
  "Name": "llama-3.2",
  "PromptTemplate": {
    "assistant": "{Content}",
    "prompt": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\nYou are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n{Content}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
  }
}
Create the chat template file with the apply_chat_template method from the Hugging Face library:
Note
This example uses the Hugging Face library (a dependency of Olive) to create a chat template. If you're using the same Python virtual environment, you don't need to install it. In a different environment, install it with pip install transformers.
# generate_inference_model.py
# This script generates the inference_model.json file for the Llama-3.2 model.
import json
import os

from transformers import AutoTokenizer

model_path = "models/llama/llama-3.2"

# Load the tokenizer that was downloaded with the model
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Build a minimal conversation; {Content} is the placeholder Foundry Local fills at runtime
chat = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "{Content}"},
]

# Render the model's chat template as a prompt string
template = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

json_template = {
    "Name": "llama-3.2",
    "PromptTemplate": {
        "assistant": "{Content}",
        "prompt": template
    }
}

# Write inference_model.json next to the model files
json_file = os.path.join(model_path, "inference_model.json")
with open(json_file, "w") as f:
    json.dump(json_template, f, indent=2)
Run the script using:
python generate_inference_model.py
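To confirm that the template landed next to the model files, you can print it back. This optional check assumes you run the command from the same directory as the script, so the relative path matches:
cat models/llama/llama-3.2/inference_model.json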
Run the model
Run your compiled model with the Foundry Local CLI, REST API, or OpenAI Python SDK. First, change the model cache directory to the models directory you created in the previous step:
foundry cache cd models
foundry cache ls # should show llama-3.2
Caution
Change the model cache back to the default directory when you're done:
foundry cache cd ./foundry/cache/models
Using the Foundry Local CLI
foundry model run llama-3.2 --verbose
Using the OpenAI Python SDK
Use the OpenAI Python SDK to interact with the Foundry Local REST API. Install it with:
pip install openai
pip install foundry-local-sdk
Then run the model with the following code:
import openai
from foundry_local import FoundryLocalManager

modelId = "llama-3.2"

# Create a FoundryLocalManager instance. This starts the Foundry Local service
# if it's not already running and loads the specified model.
manager = FoundryLocalManager(modelId)

# The remaining code uses the OpenAI Python SDK to interact with the local model.
# Configure the client to use the local Foundry service.
client = openai.OpenAI(
    base_url=manager.endpoint,
    api_key=manager.api_key  # API key is not required for local usage
)

# Set the model to use and generate a streaming response
stream = client.chat.completions.create(
    model=manager.get_model_info(modelId).id,
    messages=[{"role": "user", "content": "What is the golden ratio?"}],
    stream=True
)

# Print the streaming response
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)
Tip
Use any language that supports HTTP requests. For more information, see Integrated inferencing SDKs with Foundry Local.
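For example, a minimal curl request to the OpenAI-compatible chat completions endpoint looks like the following sketch. It assumes the address and port reported by foundry service status (the port is assigned dynamically) and the llama-3.2 model name created in the earlier steps:
# Replace the port with the one reported by `foundry service status`
curl http://localhost:5273/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2",
    "messages": [{"role": "user", "content": "What is the golden ratio?"}]
  }'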
Reset the model cache
After you finish using the custom model, reset the model cache to the default directory:
foundry cache cd ./foundry/cache/models