Hello Himanshu,
Thanks for getting back to me. --resource-group is not an argument to az batch pool list, but if I run the command without it I get this result:
az batch pool list --account-name <batchaccountname>
<urllib3.connection.HTTPSConnection object at 0x1046715e0>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known
--force-delete is not an argument to az batch pool delete, and deleting via the CLI without it didn't work either.
The "Stop" button is destructive: for an Azure Batch pool, the Stop action in the portal is a "deallocate" operation. If you interrupt a node while it is running tasks or during its own internal cleanup process, it can get stuck in a transient state such as leaving pool. The node is then neither fully operational nor fully deallocated.
The node had already been in a leavingpool state for a day before I clicked the Stop button. I think stopping it may have been what finally triggered the full cleanup.
Terraform initiates a delete command and waits for a successful response from Azure. If the Azure Batch service acknowledges the delete but the compute resources (virtual machines or disks) take too long or get stuck, Terraform's operation times out while the process continues (or fails) on the Azure side, and this leaves the resource in a "Deleting" ghost state.
What resource are you referring to here? The Azure resource? I'm not sure why a timeout on the client side would affect resource cleanup on the server side. If you're referring to the Terraform resource: I'm not worried about that, as this is a development subscription where I can simply start from scratch if need be.
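To make my question about the timeout concrete, here is a minimal sketch of how I understand a client that waits on a long-running delete. It is a toy simulation only, not Terraform's or Azure's actual code: FakeServer, wait_for_delete, ticks_needed and max_polls are all hypothetical names I made up for illustration. In this model the client giving up sends nothing to the server, so the server-side delete still runs to completion.

```python
class FakeServer:
    """Hypothetical stand-in for a service running a long delete."""

    def __init__(self, ticks_needed):
        self.ticks_needed = ticks_needed  # work units the delete requires
        self.ticks_done = 0
        self.state = "deleting"

    def tick(self):
        # Server-side work advances whether or not anyone is polling.
        if self.state == "deleting":
            self.ticks_done += 1
            if self.ticks_done >= self.ticks_needed:
                self.state = "deleted"

    def get_state(self):
        return self.state


def wait_for_delete(server, max_polls):
    """Client-side wait: poll up to max_polls times, then give up."""
    for _ in range(max_polls):
        server.tick()  # simulated time passing between polls
        if server.get_state() == "deleted":
            return True
    return False  # client timed out; no cancel request is sent


server = FakeServer(ticks_needed=5)
client_saw_completion = wait_for_delete(server, max_polls=3)
state_at_timeout = server.get_state()

# The client has stopped waiting, but the server keeps working.
for _ in range(10):
    server.tick()
final_state = server.get_state()

print(client_saw_completion, state_at_timeout, final_state)
```

If the real service behaves differently from this model, for example if a client timeout somehow interrupts the server-side cleanup, that is the part I would like explained.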
I'm going to hold off on moving the batch account to a different resource group for now, as I'm hoping that internal support will be able to take a look at what's going wrong and tell me how I can avoid this situation in the future.