Hello Angus McKay,
I understand how frustrating it can be when an Azure Machine Learning (AML) compute instance gets stuck on ‘Starting up’ (or ‘Setting up’) and continues to accrue costs, especially when it is part of a pipeline. Let’s go through some ways to resolve this without deleting the compute.
Check the Compute Status via Azure CLI
Sometimes the portal may show the instance as stuck, but you can inspect and manage it using the Azure CLI. The commands below require the ml extension (az extension add --name ml):
# List compute instances in your workspace
az ml compute list --workspace-name <workspace-name> --resource-group <resource-group>
# Check the state of the specific compute instance
az ml compute show --name <compute-name> --workspace-name <workspace-name> --resource-group <resource-group>
If the instance is in a ‘Creating’ or ‘Starting’ state for an unusually long time, you can attempt to restart it:
az ml compute restart --name <compute-name> --workspace-name <workspace-name> --resource-group <resource-group>
Even if the portal’s stop button isn’t working, the CLI may succeed:
az ml compute stop --name <compute-name> --workspace-name <workspace-name> --resource-group <resource-group>
This can sometimes push the compute into a recoverable state without deleting it.
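To confirm whether a stop or restart actually took effect, the state check above can be polled rather than run by hand. The following is a minimal sketch, not an official procedure. The function names are hypothetical, the list of transitional state names is an assumption, and it assumes `az ml compute show` reports the instance state in a top-level `state` field (check the JSON output of the show command and adjust the query if yours differs).

```shell
# Sketch: poll a compute instance until it leaves a transitional state.
# Assumption: the state is exposed as a top-level `state` field in the
# output of `az ml compute show`.

# Print the current state (args: compute name, workspace, resource group).
get_compute_state() {
  az ml compute show --name "$1" --workspace-name "$2" --resource-group "$3" \
    --query state --output tsv
}

# Return 0 when the state is one we treat as transitional.
# These state names are assumptions; extend the list as needed.
is_transitional() {
  case "$1" in
    Creating|Starting|Stopping|Restarting|SettingUp|UserSettingUp) return 0 ;;
    *) return 1 ;;
  esac
}

# Poll until the state is no longer transitional or the attempts run out.
# args: compute name, workspace, resource group, max attempts, delay in seconds
# Prints the last observed state; returns 0 if settled, 1 if still
# transitional, 2 if the state could not be read.
wait_for_settled_state() {
  local attempts="${4:-20}"
  local delay="${5:-30}"
  local state=""
  local i=1
  while [ "$i" -le "$attempts" ]; do
    state="$(get_compute_state "$1" "$2" "$3")" || return 2
    if ! is_transitional "$state"; then
      echo "$state"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "$state"
  return 1
}
```

For example, after issuing the stop command you could run wait_for_settled_state <compute-name> <workspace-name> <resource-group> 20 30 to check every 30 seconds for up to 10 minutes. A non-zero return means the instance is still stuck (or could not be queried), which is useful evidence to attach to a support request.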
Check Activity Logs and Metrics in the Azure Portal for the compute resource.
Look for failed provisioning events or quota issues that might prevent the VM from starting.
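Ensure your subscription has sufficient vCPU quota (including the GPU VM family, if you use a GPU SKU) in the workspace’s region for the compute SKU you’re using, and that any workspace-level quota allocation is not exhausted.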
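The same checks can also be done from the CLI if the portal is slow or unhelpful. This is a rough sketch under a few assumptions: the function names are hypothetical, the one-day look-back window is arbitrary, the output projection assumes the usual activity log field names (eventTimestamp, operationName, status), and the arguments passed to `az ml compute list-usage` should be verified against `az ml compute list-usage --help` for your extension version.

```shell
# Sketch: surface failed operations and Azure ML quota usage from the CLI.
# Assumption: the look-back window and the JMESPath projection are
# illustrative and may need adjusting for your environment.

# List failed activity-log events (args: resource group, look-back window).
list_failed_events() {
  az monitor activity-log list \
    --resource-group "$1" \
    --offset "${2:-1d}" \
    --status Failed \
    --query "[].{time:eventTimestamp, operation:operationName.localizedValue, status:status.value}" \
    --output table
}

# Show Azure ML quota usage for a region (args: region, workspace, resource group).
show_ml_quota_usage() {
  az ml compute list-usage --location "$1" --workspace-name "$2" \
    --resource-group "$3" --output table
}
```

Failed events around the time the instance was started, or a VM family whose usage is at its limit, would point to a provisioning or quota problem rather than a fault in the instance itself.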
If the above steps fail:
You can create a new compute instance of the same SKU and clone the pipeline or notebook jobs to run on it.
This avoids deleting your old compute and lets your work continue while support investigates the stuck instance.
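If you prefer to create the replacement from the CLI, something along these lines may work. It is only a sketch: the function name is hypothetical, it assumes `az ml compute show` reports the VM size in a top-level `size` field, and it does not copy any other settings (networking, identity, setup scripts, schedules) from the old instance. You would still need to point your pipeline or notebook at the new compute yourself.

```shell
# Sketch: create a replacement compute instance with the same VM size.
# Assumption: the VM size is exposed as a top-level `size` field in the
# output of `az ml compute show`. Other settings are NOT copied.

# args: stuck instance name, new instance name, workspace, resource group
create_replacement_instance() {
  local size
  size="$(az ml compute show --name "$1" --workspace-name "$3" \
    --resource-group "$4" --query size --output tsv)" || return 1
  # Stop if the size could not be determined.
  [ -n "$size" ] || return 1
  az ml compute create --name "$2" --type computeinstance --size "$size" \
    --workspace-name "$3" --resource-group "$4"
}
```

Keep in mind that the stuck instance may continue to accrue charges while it exists, so it is worth following up with support promptly.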
I hope this is helpful to you. Let me know if you have any other questions.
Thank you!