
Managing libraries in environment secured by outbound access protection

Microsoft Fabric allows admins to control and restrict outbound connections from workspace items to external resources. When outbound access protection is enabled, the service blocks access to public repositories such as PyPI and Conda, which prevents installing public libraries or downloading the dependencies of custom packages.

This article covers how to install libraries from PyPI when outbound access protection is enabled for your workspace.

In an environment with restricted outbound access, the service can't connect to public repositories to download libraries and their dependencies. We recommend that you upload the packages and their dependencies directly as custom packages in the environment.

Step 1: Prerequisites

To get started, you need your library specification as a requirements.txt file, a compute resource that can build a Python virtual environment, and the setup file for the Fabric runtime.

Important

  • The runtime setup file contains a few Microsoft-hosted private libraries that can't be resolved from public repositories. Make sure to remove them from the setup file.
  • Libraries hosted by Microsoft: 'library-metadata-cooker', 'mmlspark', 'azureml-synapse', 'notebookutils', 'flt-python', 'synapse-jupyter-notebook', 'synapse-jupyter-proxy', 'azure-synapse-ml-predict', 'fsspec_wrapper', 'horovod', 'sqlanalyticsconnectorpy', 'synapseml', 'control-script', 'impulse-python-handler', 'chat-magics', 'ds-copilot', 'fabric-connection', 'chat-magics-fabric', 'dscopilot-installer', 'sqlanalyticsfabricconnectorpy', 'geoanalytics-fabric', 'spark-mssql-connector-fabric35', 'flaml', 'semantic-link-sempy', 'synapseml-*', 'prose-pandas2pyspark', 'prose-suggestions', 'kqlmagiccustom'

Screenshot that shows the example of Runtime setup file.
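As a sketch of that cleanup, the following shell snippet strips a few of the Microsoft-hosted package names from a runtime setup file. The file name and the sample contents here are illustrative stand-ins; run it against your actual downloaded setup file with the full package list above.

```shell
# Illustrative stand-in for the real Fabric runtime setup file.
cat > runtime-setup.yml <<'EOF'
dependencies:
  - pip:
    - numpy==1.26.4
    - notebookutils==1.0.0
    - synapseml-core==1.0.0
EOF

# Drop every line that mentions a Microsoft-hosted package.
# This is a partial list; extend it with the full list above.
for pkg in notebookutils synapseml mmlspark library-metadata-cooker; do
  sed -i "/${pkg}/d" runtime-setup.yml
done

cat runtime-setup.yml
```

Because the loop matches on substrings, a single entry such as `synapseml` also removes variants like `synapseml-core`, which matches the `synapseml-*` pattern in the list above.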

Step 2: Set up the Virtual Python Environment in your compute resource

Create the Virtual Python Environment, which aligns with the Fabric Runtime's Python version, in your compute resource by running the following script.

wget https://repo.anaconda.com/miniconda/Miniconda3-py310_24.1.2-0-Linux-x86_64.sh
bash Miniconda3-py310_24.1.2-0-Linux-x86_64.sh -b -p /usr/lib/miniconda3
chmod -R 755 /usr/lib/miniconda3/
export PATH="/usr/lib/miniconda3/bin:$PATH"
sudo apt-get update
sudo apt-get -yq install gcc g++
conda env create -n <custom-env-name> -f Python311-CPU.yml
source activate <custom-env-name>

Step 3: Identify and download the required wheels

Run the following script against your requirements.txt file, which lists all the packages and versions that you intend to install in the Spark runtime. It prints the names of the wheel files and dependencies that pip resolves for your input library requirements.

pip install -r <input-user-req.txt> > pip_output.txt
grep -E "Downloading|Using cached" pip_output.txt

Now, you can download the listed wheels from PyPI and directly upload them to the Fabric environment.
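Alternatively, `pip download` can fetch the wheels and their dependencies into a local folder in one step. The following is a sketch with an illustrative requirements file; run it on a machine that has outbound access to PyPI and substitute your own requirements.

```shell
# Illustrative requirements file; use your real one instead.
printf 'pytest==8.2.2\n' > requirements.txt

# Download the wheels and their dependencies into a local folder, ready to
# upload as custom packages. This needs outbound access to PyPI, so the
# "|| echo" guard only keeps the sketch from aborting on an offline machine.
mkdir -p wheels
pip download -r requirements.txt -d wheels --only-binary=:all: \
  || echo "pip download failed; check outbound access to PyPI"

ls wheels
```

The `--only-binary=:all:` flag restricts the download to prebuilt wheels, which is what you want for direct upload to the environment.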

Host PyPI mirror on Azure Storage account

A PyPI mirror is a replica of the official PyPI repository. The replica can be either full or partial, and it can be hosted in several ways within Azure. We recommend using an Azure Storage account. The storage account is protected behind your organization's virtual network, so only approved targets and endpoints can access it.

Important

Hosting a PyPI mirror is ideal for organizations that rely on a large set of PyPI libraries and prefer not to manage individual wheel files manually. In exchange, the organization bears the setup cost and the periodic monitoring and updating needed to keep the mirror in sync with PyPI.

Step 1: Prerequisites

  • Compute resource: a Linux system, Windows Subsystem for Linux, or an Azure VM
  • Azure Storage account: to store the mirrored packages.
  • Other utilities: Bandersnatch, that is, the PyPI mirroring tool that handles synchronization; and Azure CLI, BlobFuse2, or AzCopy for efficient file synchronization.

Initial Setup

The entire PyPI repository is large and constantly growing, so mirroring it involves a significant one-time initial effort. See the PyPI Statistics.

Maintenance

To keep the mirror in sync with PyPI, periodic monitoring and updating are required.
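One common way to automate that upkeep is a scheduled job that re-runs the Bandersnatch sync. The following crontab fragment is a sketch; the schedule and the config file path are assumptions.

```
# /etc/crontab fragment: re-sync the mirror at 02:00 every night.
0 2 * * *  root  bandersnatch --config /etc/bandersnatch.conf mirror
```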

Note

The following factors of your compute resource contribute to the setup and maintenance effort:

  • Network speed
  • Server resources: the CPU, memory, and disk I/O of the compute resource running Bandersnatch affect the synchronization speed.
  • Disk speed: the speed of the storage system affects how quickly Bandersnatch can write data to disk.
  • Initial setup vs. maintenance sync: the initial sync (when you first set up Bandersnatch) generally takes longer because it downloads the entire repository. On a typical setup with decent network and hardware, it might range from 8 to 48 hours. Subsequent syncs, which only update new or changed packages, are faster.

Step 2: Set up Python on your compute resource

Run the following script to set up the corresponding Python version.

wget https://repo.anaconda.com/miniconda/Miniconda3-py310_24.1.2-0-Linux-x86_64.sh
bash Miniconda3-py310_24.1.2-0-Linux-x86_64.sh -b -p /usr/lib/miniconda3
chmod -R 755 /usr/lib/miniconda3/

# Add Python executable to PATH
export PATH="/usr/lib/miniconda3/bin:$PATH"

Step 3: Set up Bandersnatch

Bandersnatch is a PyPI mirroring tool that downloads all of PyPI and the associated index files to a local filesystem. Refer to the Bandersnatch documentation to create a bandersnatch.conf file.
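For reference, a minimal bandersnatch.conf might look like the following. The mirror directory and the allowlisted packages are illustrative assumptions; check the Bandersnatch documentation for the full set of options.

```ini
[mirror]
# Where the mirror is written on the local filesystem
directory = /srv/pypi-mirror
# Upstream PyPI endpoint
master = https://pypi.org
timeout = 10
workers = 3

[plugins]
# Enable partial mirroring via the allowlist plugin
enabled =
    allowlist_project

[allowlist]
# Only mirror these projects (illustrative)
packages =
    pytest
    numpy
```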

Run the following script to set up Bandersnatch. The command performs a one-time synchronization with PyPI; the initial sync takes a long time to run.

# Install Bandersnatch
pip install bandersnatch

# Execute mirror command
bandersnatch --config <path-to-bandersnatch.conf> mirror

Note

Bandersnatch's mirror filtering supports partial mirroring through plugins such as the allowlist, enabling more efficient management of dependencies. By filtering unnecessary packages, it reduces the size of the full mirror, minimizing both cost and maintenance effort. For example, if the mirror is intended solely for Fabric, you can exclude Windows binaries to optimize storage. We recommend evaluating these filtering options based on your specific use case.

After the commands complete successfully, Bandersnatch creates the subfolders of your mirror directory on the local filesystem.

Screenshot that shows the PyPI mirror created by bandersnatch.

Step 4: Verify local mirror setup (optional)

You can use an HTTP server to serve your local PyPI mirror. The following command starts an HTTP server on port 8000 that serves the contents of the mirror directory.

cd <directory-to-mirror>
python -m http.server 8000

# Configure pip to Use the Local PyPI Mirror
pip install <package> --index-url http://localhost:8000/simple

Step 5: Upload the mirror to the storage account

Enable Static website on your Azure Storage account. This feature lets you host static content, such as the PyPI index pages in this case. Enabling it automatically generates a container named $web.

Screenshot that shows the storage account example.

Then use Azure CLI, AzCopy, or BlobFuse2 to upload the local mirror from your devbox to your Azure Storage account.

  • Upload the packages folder to a container of your choice on the storage account.
  • Upload the simple, pypi, local-stats, and json folders to the $web container of your storage account.
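With AzCopy, for example, the two uploads might look like the following sketch. The local mirror path, storage account name, container, and SAS token are all placeholders to replace with your own values.

```shell
# Upload the packages folder to your chosen container.
azcopy copy "/srv/pypi-mirror/web/packages" \
  "https://<storage-account-name>.blob.core.windows.net/<container>?<sas-token>" --recursive

# Upload the index folders to the $web container that static website hosting created.
for folder in simple pypi local-stats json; do
  azcopy copy "/srv/pypi-mirror/web/${folder}" \
    "https://<storage-account-name>.blob.core.windows.net/\$web?<sas-token>" --recursive
done
```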

Step 6: Use the mirror in a Fabric environment

To access the Azure Storage account, add two managed private endpoints in the Fabric workspace.

Screenshot that shows the private endpoints example.

Then you can install the library from the Azure Storage account by providing a YAML file in the environment, or by using an inline %pip install command in a notebook session.

  • YAML file example:
dependencies:
  - pip
  - pip:
    - pytest==8.2.2
    - --index-url https://<storage-account-name>.z5.web.core.windows.net/simple
  • %pip command example
%pip install pytest --index-url https://<storage-account-name>.z5.web.core.windows.net/simple