Compile Hugging Face models to run on Foundry Local

Important

  • Foundry Local is available in preview. Public preview releases provide early access to features that are in active deployment.
  • Features, approaches, and processes can change or have limited capabilities before General Availability (GA).

Foundry Local runs ONNX models on your device with high performance. Although the model catalog offers precompiled options out of the box, any model in the ONNX format works.

Use Olive to compile models in Safetensor or PyTorch format to ONNX. Olive optimizes models for ONNX, making them suitable for deployment in Foundry Local. It uses techniques like quantization and graph optimization to improve performance.

This guide shows how to:

  • Convert and optimize models from Hugging Face to run in Foundry Local. The examples use the Llama-3.2-1B-Instruct model, but any generative AI model from Hugging Face works.
  • Run your optimized models with Foundry Local.

Prerequisites

  • Python 3.10 or later

Install Olive

Olive optimizes models and converts them to the ONNX format.

pip install olive-ai[auto-opt]

Tip

Install Olive in a virtual environment with venv or conda.
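A minimal sketch with venv (the environment name .venv is only a convention) looks like this:

python -m venv .venv
source .venv/bin/activate   # On Windows: .venv\Scripts\activate
pip install olive-ai[auto-opt]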

Sign in to Hugging Face

The Llama-3.2-1B-Instruct model requires Hugging Face authentication.

huggingface-cli login
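If you need a noninteractive sign-in (for example, in a script), the Hugging Face Hub client also reads an access token from the HF_TOKEN environment variable. The placeholder value below is illustrative:

export HF_TOKEN=<your-hugging-face-access-token>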

Compile the model

Step 1: Run the Olive auto-opt command

Use the Olive auto-opt command to download, convert, quantize, and optimize the model:

olive auto-opt \
    --model_name_or_path meta-llama/Llama-3.2-1B-Instruct \
    --trust_remote_code \
    --output_path models/llama \
    --device cpu \
    --provider CPUExecutionProvider \
    --use_ort_genai \
    --precision int4 \
    --log_level 1

Note

The compilation process takes about 60 seconds, plus download time.

The command uses the following parameters:

  • model_name_or_path: Model source (Hugging Face ID, local path, or Azure AI Model registry ID)
  • output_path: Where to save the optimized model
  • device: Target hardware (cpu, gpu, or npu)
  • provider: Execution provider (for example, CPUExecutionProvider or CUDAExecutionProvider)
  • precision: Model precision (fp16, fp32, int4, or int8)
  • use_ort_genai: Creates inference configuration files

Tip

If you have a local copy of the model, you can use a local path instead of the Hugging Face ID. For example, --model_name_or_path models/llama-3.2-1B-Instruct. Olive handles the conversion, optimization, and quantization automatically.

Step 2: Rename the output model

Olive writes the compiled output to a directory named model inside the output path. Rename it so the model is easier to reference:

cd models/llama
mv model llama-3.2
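If you're working in a Windows shell where mv isn't available, ren performs the same rename (assuming the same output layout):

cd models\llama
ren model llama-3.2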

Step 3: Create the chat template file

A chat template is a structured format that defines how input and output messages are processed for a conversational AI model. It specifies the roles (for example, system, user, assistant) and the structure of the conversation, ensuring that the model understands the context and generates appropriate responses.

Foundry Local requires a chat template JSON file named inference_model.json to generate responses. The template includes the model name and a PromptTemplate object. The object contains a {Content} placeholder that Foundry Local replaces with the user's prompt at runtime.

{
  "Name": "llama-3.2",
  "PromptTemplate": {
    "assistant": "{Content}",
    "prompt": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\nYou are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n{Content}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
  }
}

Create the chat template file with the apply_chat_template method from the Hugging Face Transformers library:

Note

This example uses the Hugging Face Transformers library (a dependency of Olive) to create the chat template. If you're working in the same Python virtual environment where you installed Olive, you don't need to install it separately. In a different environment, install it with pip install transformers.

# generate_inference_model.py
# This script generates the inference_model.json file for the Llama-3.2 model.
import json
import os
from transformers import AutoTokenizer

model_path = "models/llama/llama-3.2"

tokenizer = AutoTokenizer.from_pretrained(model_path)
chat = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "{Content}"},
]


template = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

json_template = {
  "Name": "llama-3.2",
  "PromptTemplate": {
    "assistant": "{Content}",
    "prompt": template
  }
}

json_file = os.path.join(model_path, "inference_model.json")

with open(json_file, "w") as f:
    json.dump(json_template, f, indent=2)

Run the script using:

python generate_inference_model.py
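As a quick check, confirm that the file was written next to the model files; the path below follows the model_path used in the script:

cat models/llama/llama-3.2/inference_model.json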

Run the model

Run your compiled model with the Foundry Local CLI, REST API, or OpenAI Python SDK. First, change the model cache directory to the models directory you created in the previous step:

foundry cache cd models
foundry cache ls  # should show llama-3.2

Caution

Change the model cache back to the default directory when you're done:

foundry cache cd ./foundry/cache/models

Using the Foundry Local CLI

foundry model run llama-3.2 --verbose

Using the OpenAI Python SDK

Use the OpenAI Python SDK to interact with the Foundry Local REST API. Install it along with the Foundry Local SDK:

pip install openai
pip install foundry-local-sdk

Then run the model with the following code:

import openai
from foundry_local import FoundryLocalManager

modelId = "llama-3.2"

# Create a FoundryLocalManager instance. This starts the Foundry Local service if it's not already running and loads the specified model.
manager = FoundryLocalManager(modelId)

# The remaining code uses the OpenAI Python SDK to interact with the local model.

# Configure the client to use the local Foundry service
client = openai.OpenAI(
    base_url=manager.endpoint,
    api_key=manager.api_key  # API key is not required for local usage
)

# Set the model to use and generate a streaming response
stream = client.chat.completions.create(
    model=manager.get_model_info(modelId).id,
    messages=[{"role": "user", "content": "What is the golden ratio?"}],
    stream=True
)

# Print the streaming response
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)

Tip

Use any language that supports HTTP requests. For more information, see Integrated inferencing SDKs with Foundry Local.
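For example, a raw HTTP request might look like the following sketch. The port and the /v1/chat/completions path are assumptions here; check the endpoint that FoundryLocalManager (or the Foundry Local service) reports on your machine, and use the model ID shown by foundry cache ls:

curl http://localhost:5273/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama-3.2",
        "messages": [{"role": "user", "content": "What is the golden ratio?"}]
      }'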

Reset the model cache

After you finish using the custom model, reset the model cache to the default directory:

foundry cache cd ./foundry/cache/models