Dependency caching in AzureML pipeline

Marco Bignotti 50 Reputation points
2025-10-01T09:01:34.6833333+00:00

Hi!

I am using the uv python package manager to run my pipelines. For instance, I have a data prep component where the entry point is:

code: ../../../../
command: uv run --extra extra_group --no-dev --locked src/my_package/train.py
environment: azureml:my-dev-env@latest

In this way, I can submit the job and automatically sync both the dependencies and the changes made to my project/package. However, I lose uv's dependency caching, which is what makes it so fast: each time I re-submit a job, everything is re-installed (and that can take a while when heavy packages are involved). Do you know if there's a way to cache the dependencies and benefit from them across re-runs?

Thank you so much!

Azure Machine Learning

Answer accepted by question author
  1. Alex Burlachenko 18,390 Reputation points Volunteer Moderator
    2025-10-01T09:47:51.3633333+00:00

    Marco hey,

    you are right, losing that uv cache speed is a huge pain point for iterative development.

    the core issue is that each time azureml starts a new job, it provisions a fresh compute node. that node has a clean disk, so the uv cache from a previous job is gone. to keep the cache, you need a way to persist the ~/.cache/uv directory between job runs.

    you can mount an azure blob storage container or a file share to your training job. then, you can configure uv to use a custom cache directory located on that mounted storage.

    create a datastore that points to an azure storage account. then, in your job configuration, mount this datastore to a path inside the container, for example /uv_cache.

    you need to tell uv to use this mounted path. you can set the UV_CACHE_DIR environment variable in your job to point to /uv_cache. uv will then use this persistent location for its cache instead of the local volatile disk.

    your command would look something like this:

    UV_CACHE_DIR=/uv_cache uv run --extra extra_group --no-dev --locked src/my_package/train.py

    the first time the job runs, it will populate the cache in the blob storage. on subsequent runs, uv will find the already downloaded and compiled packages in that persistent location, making the dependency installation step much faster.
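    as a concrete sketch, a standalone command job yaml wiring this together could look like the following. note the datastore name, cache path, and compute target here are placeholders, not taken from your setup, so adjust them to your workspace.

    # job.yml (sketch; datastore name, paths, and compute are placeholders)
    $schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
    code: ../../../../
    environment: azureml:my-dev-env@latest
    compute: azureml:my-cluster
    inputs:
      uv_cache:
        type: uri_folder
        path: azureml://datastores/my_cache_datastore/paths/uv_cache
        mode: rw_mount
    # azureml mounts the input at a runtime-generated path, so resolve it
    # in the command rather than hardcoding a mount point
    command: >-
      UV_CACHE_DIR=${{inputs.uv_cache}}
      uv run --extra extra_group --no-dev --locked src/my_package/train.py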

    this pattern of using a mounted storage for a package cache is a universal technique. you could do the same thing for pip's cache or conda packages to speed up any python environment setup.
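    for instance, the pip equivalent would be something along these lines (the path is just a placeholder for wherever your datastore is mounted):

    PIP_CACHE_DIR=/uv_cache/pip pip install -r requirements.txt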

    mount a persistent datastore to your job and set the UV_CACHE_DIR environment variable to a path on that mount. this should preserve your uv cache across pipeline runs.

    regards,

    Alex

    and "yes" if you would follow me at Q&A - personaly thx.
    P.S. If my answer help to you, please Accept my answer
    

    https://ctrlaltdel.blog/


Answer accepted by question author
  1. Nikhil Jha (Accenture International Limited) 2,220 Reputation points Microsoft External Staff Moderator
    2025-10-17T20:02:03.3166667+00:00

    Hello Marco Bignotti,

    The error message, Function not implemented (os error 38), is the key. This almost certainly means the underlying storage you're mounting with uri_folder is an Azure Blob Container. The rw_mount for Blob Storage is FUSE-based, and it does not support the full set of POSIX file system operations (like hard linking or certain file rename operations) that uv's cache needs to function.

    The good news is that your YAML structure for the component (inputs: cache_dir: type: uri_folder...) is perfectly correct for AzureML v2. The problem isn't the YAML; it's the type of storage you are passing to it.
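    For reference, a minimal component YAML following that structure might look like the sketch below (the name, code path, and environment are illustrative assumptions, not taken from your repo):

    # prep-component.yml (sketch; name, code path, and environment are assumptions)
    $schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
    name: my_prep_component
    type: command
    inputs:
      cache_dir:
        type: uri_folder
    code: ../../../../
    environment: azureml:my-dev-env@latest
    # the mount mode (rw_mount) is set by the pipeline job that binds this input
    command: >-
      uv run --extra extra_group --no-dev --locked
      --cache-dir ${{inputs.cache_dir}}
      src/my_package/train.py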

    As Alex correctly pointed out, you must use a storage service that supports a full file system.

    Recommendation: Use an Azure File Share Datastore

    The fix is to create a new datastore in your workspace that points to an Azure File Share (which uses the SMB/CIFS protocol) instead of a Blob Container. This gives you a mounted file system with much more complete POSIX semantics, which uv's cache can work with.

    Steps:

    1. Create an Azure File Share: In the Azure portal, go to a Storage Account (or create a new one) and create a new File Share. Let's call it uvcache.
    2. Create a New Datastore: In your AzureML Workspace, go to Datastores and create a new one.
      • Datastore type: Select "Azure File Share".
      • Name: Give it a clear name, e.g., uv_cache_fileshare.
      • Point it to the Storage Account and File Share (uvcache) you just created (a declarative datastore YAML sketch is included after the pipeline example below).
    3. Update Your Pipeline Job YAML: Your component YAML is correct and doesn't need to change. You just need to update the pipeline job that calls this component to pass in the new File Share datastore.
    # pipeline-job.yml
    $schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
    type: pipeline
    jobs:
      my_prep_job:
        type: command
        component: azureml:my_prep_component@latest # the component that declares the cache_dir input
        inputs:
          # This is the key change:
          # Point to your NEW Azure File Share datastore
          cache_dir:
            type: uri_folder
            path: azureml://datastores/uv_cache_fileshare/paths/uv_cache_data
            mode: rw_mount
    

    Your component's command will then work as-is:

    uv run --extra extra_group --no-dev --locked --cache-dir ${{inputs.cache_dir}} src/my_package/train.py
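    For step 2, if you prefer to define the datastore declaratively, a minimal Azure File Share datastore YAML could look like the sketch below (the storage account name and key are placeholders); you would register it with az ml datastore create --file datastore.yml:

    # datastore.yml (sketch; account_name and account_key are placeholders)
    $schema: https://azuremlschemas.azureedge.net/latest/azureFile.schema.json
    name: uv_cache_fileshare
    type: azure_file
    description: Persistent uv cache for pipeline jobs
    account_name: mystorageaccount
    file_share_name: uvcache
    credentials:
      account_key: <storage-account-key>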



