Important
- Foundry Local is available in preview. Public preview releases provide early access to features that are in active deployment.
- Features, approaches, and processes can change or have limited capabilities before General Availability (GA).
Foundry Local runs ONNX models on your device with high performance. Although the model catalog offers precompiled options out of the box, any model in the ONNX format works.
Use Olive to compile models in Safetensor or PyTorch format to ONNX. Olive optimizes models for ONNX, making them suitable for deployment in Foundry Local. It uses techniques like quantization and graph optimization to improve performance.
This guide shows how to:
- Convert and optimize models from Hugging Face to run in Foundry Local. The examples use the Llama-3.2-1B-Instruct model, but any generative AI model from Hugging Face works.
- Run your optimized models with Foundry Local.
Prerequisites
- Python 3.10 or later
Install Olive
Olive optimizes models and converts them to the ONNX format.
pip install olive-ai[auto-opt]
Sign in to Hugging Face
The Llama-3.2-1B-Instruct model requires Hugging Face authentication.
huggingface-cli login
Note
Create a Hugging Face token and request model access before proceeding.
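If you prefer not to sign in interactively, the Hugging Face CLI also accepts a token directly. A minimal sketch, assuming your access token is stored in the HF_TOKEN environment variable:
# Non-interactive sign-in; assumes the token is exported as HF_TOKEN
huggingface-cli login --token $HF_TOKEN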
Compile the model
Step 1: Run the Olive auto-opt command
Use the Olive auto-opt command to download, convert, quantize, and optimize the model:
olive auto-opt \
--model_name_or_path meta-llama/Llama-3.2-1B-Instruct \
--trust_remote_code \
--output_path models/llama \
--device cpu \
--provider CPUExecutionProvider \
--use_ort_genai \
--precision int4 \
--log_level 1
Note
The compilation process takes about 60 seconds, plus download time.
The command uses the following parameters:
| Parameter | Description |
|---|---|
| model_name_or_path | Model source: Hugging Face ID, local path, or Azure AI Model registry ID |
| output_path | Where to save the optimized model |
| device | Target hardware: cpu, gpu, or npu |
| provider | Execution provider (for example, CPUExecutionProvider, CUDAExecutionProvider) |
| precision | Model precision: fp16, fp32, int4, or int8 |
| use_ort_genai | Creates inference configuration files |
Tip
If you have a local copy of the model, you can use a local path instead of the Hugging Face ID. For example, --model_name_or_path models/llama-3.2-1B-Instruct. Olive handles the conversion, optimization, and quantization automatically.
Step 2: Rename the output model
Olive creates a generic model directory. Rename it for easier reuse:
cd models/llama
mv model llama-3.2
Step 3: Create chat template file
A chat template is a structured format that defines how input and output messages are processed for a conversational AI model. It specifies the roles (for example, system, user, assistant) and the structure of the conversation, ensuring that the model understands the context and generates appropriate responses.
Foundry Local requires a chat template JSON file named inference_model.json to generate responses. The template includes the model name and a PromptTemplate object. The object contains a {Content} placeholder that Foundry Local injects at runtime with the user prompt.
{
  "Name": "llama-3.2",
  "PromptTemplate": {
    "assistant": "{Content}",
    "prompt": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\nYou are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n{Content}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
  }
}
Create the chat template file with the apply_chat_template method from the Hugging Face library:
Note
This example uses the Hugging Face library (a dependency of Olive) to create a chat template. If you're using the same Python virtual environment, you don't need to install it. In a different environment, install it with pip install transformers.
# generate_inference_model.py
# This script generates the inference_model.json file for the Llama-3.2 model.
import json
import os

from transformers import AutoTokenizer

model_path = "models/llama/llama-3.2"

# Load the tokenizer that was downloaded with the model
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Build a minimal conversation; {Content} is the placeholder Foundry Local fills at runtime
chat = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "{Content}"},
]

# Render the model's chat template as a prompt string
template = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

json_template = {
    "Name": "llama-3.2",
    "PromptTemplate": {
        "assistant": "{Content}",
        "prompt": template
    }
}

# Write inference_model.json next to the model files
json_file = os.path.join(model_path, "inference_model.json")
with open(json_file, "w") as f:
    json.dump(json_template, f, indent=2)
Run the script using:
python generate_inference_model.py
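To confirm that the template landed next to the model files, you can print it back. This optional check assumes you run the command from the same directory as the script, so the relative path matches:
cat models/llama/llama-3.2/inference_model.json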
Run the model
Run your compiled model with the Foundry Local CLI, REST API, or OpenAI Python SDK. First, change the model cache directory to the models directory you created in the previous step:
foundry cache cd models
foundry cache ls # should show llama-3.2
Caution
Change the model cache back to the default directory when you're done:
foundry cache cd ./foundry/cache/models
Using the Foundry Local CLI
foundry model run llama-3.2 --verbose
Using the OpenAI Python SDK
Use the OpenAI Python SDK to interact with the Foundry Local REST API. Install it with:
pip install openai
pip install foundry-local-sdk
Then run the model with the following code:
import openai
from foundry_local import FoundryLocalManager

modelId = "llama-3.2"

# Create a FoundryLocalManager instance. This starts the Foundry Local service
# if it's not already running and loads the specified model.
manager = FoundryLocalManager(modelId)

# The remaining code uses the OpenAI Python SDK to interact with the local model.
# Configure the client to use the local Foundry service.
client = openai.OpenAI(
    base_url=manager.endpoint,
    api_key=manager.api_key  # API key is not required for local usage
)

# Set the model to use and generate a streaming response
stream = client.chat.completions.create(
    model=manager.get_model_info(modelId).id,
    messages=[{"role": "user", "content": "What is the golden ratio?"}],
    stream=True
)

# Print the streaming response
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)
Tip
Use any language that supports HTTP requests. For more information, see Integrated inferencing SDKs with Foundry Local.
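For example, a minimal curl request to the OpenAI-compatible chat completions endpoint looks like the following sketch. It assumes the address and port reported by foundry service status (the port is assigned dynamically) and the llama-3.2 model name created in the earlier steps:
# Replace the port with the one reported by `foundry service status`
curl http://localhost:5273/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2",
    "messages": [{"role": "user", "content": "What is the golden ratio?"}]
  }'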
Reset the model cache
After you finish using the custom model, reset the model cache to the default directory:
foundry cache cd ./foundry/cache/models