Azure Machine Learning Workspace
Azure Machine Learning is a cloud service for accelerating and managing the machine learning project lifecycle. Machine learning professionals, data scientists, and engineers can use it in their day-to-day workflows to train and deploy models and to manage MLOps.
Made by: Massdriver
Official: Yes
Azure Machine Learning Workspace
Azure Machine Learning Workspace is a foundational resource in Azure that provides a collaborative environment where data scientists can build, train, and deploy machine learning models.
Design Decisions
- Security:
  - The workspace uses Azure Key Vault to store sensitive information and encryption keys.
  - Managed identities are used for secure and seamless access.
- Identity Management:
  - A user-assigned managed identity is used for both compute instances and clusters to support role-based access control.
  - Azure roles are assigned to individual resources to enforce least-privilege access.
- Scalability:
  - Compute clusters are configured with auto-scaling settings to optimize resource use based on workload demands.
  - Instances can be dedicated and customized per workload requirements.
- High Availability:
  - Resources are distributed across availability zones to minimize service disruption.
Runbook
Unable to Access Azure Machine Learning Workspace
If you are experiencing issues accessing the Azure Machine Learning Workspace, the following troubleshooting steps might help.
Check Role Assignments and Permissions
Ensure the user or service principal has the required roles assigned.
az role assignment list --scope /subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.MachineLearningServices/workspaces/<workspace-name> --assignee <user-or-service-principal-id>
Expected Output: A list of roles assigned to the user or service principal. Verify that "AzureML Data Scientist" or a similar role is included.
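The role check above can also be scripted. The following is a minimal sketch, not part of the original runbook: `has_role` is a hypothetical helper that reads role names (one per line) on standard input and succeeds only if the named role is present. It assumes the role names are taken from the `roleDefinitionName` field of the `az role assignment list` output.

```shell
# Hypothetical helper: succeed if the exact role name given as $1
# appears on stdin (one role name per line).
has_role() {
  grep -Fxq -- "$1"
}

# Usage sketch (requires a logged-in Azure CLI; placeholders as in the command above):
#   az role assignment list --scope "<workspace-resource-id>" \
#     --assignee "<user-or-service-principal-id>" \
#     --query "[].roleDefinitionName" --output tsv \
#     | has_role "AzureML Data Scientist" && echo "role assigned" || echo "role missing"
```

If the role was granted at the resource group or subscription level rather than on the workspace itself, adding `--include-inherited` to the `az role assignment list` call should surface it.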
Compute Instance Not Starting
If a compute instance is not starting, you can use the following commands to diagnose the issue.
Check Instance Status
az ml compute show -n <instance-name> -w <workspace-name> -g <resource-group>
Expected Output: The status of the compute instance. Look for states like "creating", "running", or "failed".
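To turn the reported state into a next step, a small classifier can help. This is an illustrative sketch under stated assumptions: `classify_compute_state` is a hypothetical helper, the state names it matches are examples rather than an exhaustive list, and it assumes the state string has already been extracted from the command output (for instance with `--query state --output tsv`, assuming the output exposes a `state` field).

```shell
# Hypothetical helper: map a compute instance state string to a suggested action.
# Matching is case-insensitive; the state names below are illustrative assumptions.
classify_compute_state() {
  state=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$state" in
    running)                      echo "ok" ;;
    creating|starting|settingup)  echo "wait" ;;
    stopped|stopping)             echo "start" ;;
    *failed*|unusable)            echo "investigate" ;;
    *)                            echo "unknown" ;;
  esac
}

# Usage sketch (requires the Azure CLI with the ml extension):
#   state=$(az ml compute show -n "<instance-name>" -w "<workspace-name>" \
#             -g "<resource-group>" --query state --output tsv)
#   classify_compute_state "$state"
```

An "investigate" result is the cue to move on to the activity logs.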
Review Activity Logs
az monitor activity-log list --resource-id <compute-instance-resource-id>
Expected Output: Details of recent activities and errors related to the compute instance.
Poor Performance of Machine Learning Experiment
If you notice poor performance with your machine learning experiment, the following steps can help identify bottlenecks.
Check VM Size Options
The following Azure CLI command does not report utilization. It lists the sizes the VM can be resized to, which shows whether a larger size is available for the workload:
az vm list-vm-resize-options --resource-group <resource-group> --name <vm-name> --output table
Expected Output: A table of the VM sizes available for resizing, including core count and memory. Compare them with the current VM size to confirm it is appropriate for the workload.
Log into the Compute Instance
Use SSH or RDP (based on configuration) to log into the compute instance and check system metrics like CPU, memory, and disk usage.
# SSH Example
ssh <username>@<vm-ip-address>
top
Expected Output: A live view of running processes. Sustained high CPU or memory utilization indicates resource constraints.
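`top` covers CPU and memory but not disk usage. As a rough sketch for the disk check, the hypothetical helper below reads POSIX-format `df -P` output on standard input and prints the mount points whose usage exceeds a given percentage. It assumes mount points contain no spaces.

```shell
# Hypothetical helper: print mount points whose usage is above $1 percent.
# Expects `df -P` output on stdin (header line first, "Capacity" in column 5).
mounts_over_threshold() {
  awk -v t="$1" 'NR > 1 {
    use = $5
    sub(/%/, "", use)            # strip the trailing percent sign
    if (use + 0 > t + 0) print $6
  }'
}

# Usage sketch, run on the compute instance:
#   df -P | mounts_over_threshold 90
```

An empty result means no file system is above the threshold.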
Issues with Data Access in Storage Account
If there are issues accessing data in the associated storage account:
Check Storage Account Permissions
Ensure the necessary roles are assigned for the storage account.
az role assignment list --scope /subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account-name> --assignee <user-or-service-principal-id>
Expected Output: A list of roles assigned to the user or service principal. Verify roles like "Storage Blob Data Contributor" are included.
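Management roles such as "Owner" or "Contributor" do not by themselves grant access to blob data through Azure AD authentication, so it is worth checking specifically for a data-plane role. The sketch below is an assumption-laden illustration: `first_blob_data_role` is a hypothetical helper that reads role names (one per line) on standard input and prints the first built-in blob data role it finds. It relies on `grep -m`, which GNU and BSD grep support.

```shell
# Hypothetical helper: print the first blob data-plane role found on stdin
# (one role name per line); exit non-zero if none is present.
first_blob_data_role() {
  grep -Fx -m 1 \
    -e 'Storage Blob Data Reader' \
    -e 'Storage Blob Data Contributor' \
    -e 'Storage Blob Data Owner'
}

# Usage sketch (requires a logged-in Azure CLI; placeholders as in the command above):
#   az role assignment list --scope "<storage-account-resource-id>" \
#     --assignee "<user-or-service-principal-id>" \
#     --query "[].roleDefinitionName" --output tsv | first_blob_data_role \
#     || echo "no blob data role assigned at this scope"
```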
Key Vault Access Issues
If you encounter access issues with Azure Key Vault, use the following steps:
Check Key Vault Access Policies
Ensure the relevant access policies are in place.
az keyvault show --name <keyvault-name> --query "properties.accessPolicies"
Expected Output: Access policies configured for the Key Vault. Validate policies for required permissions like "Get", "List", "Set", etc.
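Access policies only apply when the vault uses the access policy permission model. If the vault has Azure RBAC authorization enabled, access is governed by role assignments instead, and the policy list can be empty or irrelevant. The following sketch shows one way to tell the two apart. `keyvault_permission_model` is a hypothetical helper, and it assumes the value passed in comes from the `properties.enableRbacAuthorization` field of `az keyvault show`.

```shell
# Hypothetical helper: interpret the enableRbacAuthorization value of a vault.
# Anything other than "true" (including an empty or null value) is treated
# as the access policy model.
keyvault_permission_model() {
  case "$1" in
    true|True) echo "rbac" ;;
    *)         echo "access-policy" ;;
  esac
}

# Usage sketch (requires a logged-in Azure CLI):
#   rbac=$(az keyvault show --name "<keyvault-name>" \
#            --query "properties.enableRbacAuthorization" --output tsv)
#   keyvault_permission_model "$rbac"
```

When the result is "rbac", check role assignments on the vault with `az role assignment list` rather than inspecting access policies.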
| Variable | Type | Description |
|---|---|---|
| compute.cluster[].idle_duration | string | The idle time before the cluster scales down to the minimum node count. |
| compute.cluster[].max_nodes | integer | The maximum number of nodes in the compute cluster. |
| compute.cluster[].min_nodes | integer | The minimum number of nodes in the compute cluster. |
| compute.cluster[].name | string | The name of the compute cluster. Must be unique within the region. |
| compute.cluster[].size | string | The size of the compute cluster. |
| compute.instance[].name | string | The name of the compute instance. Must be unique within the region. |
| compute.instance[].size | string | The size of the compute instance. |
| compute.instance[].user | string | The Azure AD user for the compute instance. Must be a valid Object ID for the Azure AD user. |
| workspace.high_business_impact | boolean | If your workspace contains sensitive data, you can enable high business impact features to help protect your data. This controls the amount of data Microsoft collects for diagnostic purposes and enables additional encryption in Microsoft managed environments. This cannot be changed after the workspace is created. |
| workspace.location | string | The region of the workspace. This cannot be changed after the workspace is created. |
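
As a small illustration of how `compute.cluster[].min_nodes` and `compute.cluster[].max_nodes` relate, the sketch below checks a pair of values before they are used in a configuration. `valid_node_range` is a hypothetical helper, not part of the bundle, and the only rule it assumes is that both values are non-negative integers with the minimum no greater than the maximum.

```shell
# Hypothetical helper: succeed if $1 (min_nodes) and $2 (max_nodes) are
# non-negative integers and min_nodes <= max_nodes.
valid_node_range() {
  min=$1
  max=$2
  # Reject empty input and anything containing a non-digit character.
  case "$min$max" in
    ''|*[!0-9]*) return 1 ;;
  esac
  [ -n "$min" ] && [ -n "$max" ] && [ "$min" -le "$max" ]
}

# Usage sketch:
#   valid_node_range 0 4 && echo "range ok" || echo "invalid range"
```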