Set up data quality for Azure Databricks Unity Catalog

To use Unity Catalog, you must enable your Azure Databricks workspace for Unity Catalog by attaching the workspace to a Unity Catalog metastore. All new workspaces are automatically enabled for Unity Catalog when you create them, but an account admin might need to enable Unity Catalog manually for older workspaces. Regardless of whether your workspace is enabled for Unity Catalog automatically, you need to complete the following steps to get started with Unity Catalog:

  • Create catalogs and schemas to contain database objects like tables and volumes.
  • Create managed storage locations to store the managed tables and volumes in these catalogs and schemas.
  • Grant user access to catalogs, schemas, and database objects.

Workspaces that are automatically enabled for Unity Catalog provision a workspace catalog with broad privileges granted to all workspace users. This catalog is a convenient starting point for trying out Unity Catalog.

For detailed setup instructions, see Set up and manage Unity Catalog.
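The following notebook cell is a minimal sketch of those setup steps, run from an Azure Databricks notebook. The catalog, schema, storage path, and group names are hypothetical examples, and the managed storage location assumes an external location already covers that path:

    # Minimal sketch of the Unity Catalog setup steps above (hypothetical names).
    # Create a catalog with a managed storage location; the storage path is
    # assumed to be covered by an existing external location.
    spark.sql("""
        CREATE CATALOG IF NOT EXISTS sales_catalog
        MANAGED LOCATION 'abfss://container@storageaccount.dfs.core.windows.net/sales'
    """)

    # Create a schema inside the catalog to hold tables and volumes.
    spark.sql("CREATE SCHEMA IF NOT EXISTS sales_catalog.bronze")

    # Grant a workspace group access to the catalog, the schema, and its tables.
    spark.sql("GRANT USE CATALOG ON CATALOG sales_catalog TO `data-engineers`")
    spark.sql("GRANT USE SCHEMA, SELECT ON SCHEMA sales_catalog.bronze TO `data-engineers`")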

When you're scanning Azure Databricks Unity Catalog, Microsoft Purview supports:

  • Metastore
  • Catalogs
  • Schemas
  • Tables, including columns
  • Views, including columns

When setting up a scan, you can choose to scan the entire Unity Catalog or scope the scan to a subset of catalogs.

Configure Data Map scan to catalog Databricks Unity Catalog data in Microsoft Purview

  • Register an Azure Databricks workspace in Microsoft Purview
  • Scan registered Azure Databricks workspace
    • Enter the scan name
    • Select Unity Catalog as the extraction method
    • Connect via integration runtime (Azure integration runtime, Managed Virtual Network IR, or a Kubernetes supported self-hosted integration runtime you created)
    • Select Access Token Authentication while creating a credential. For more information, see Credentials for source authentication in Microsoft Purview.
      • Specify the HTTP path of the Databricks SQL warehouse that Microsoft Purview connects to in order to perform the scan (one way to look it up is sketched after this list)
    • On the Scope your scan page, select the catalogs you want to scan.
    • Select a scan rule set for classification. You can choose between the system default, existing custom rule sets, or create a new rule set inline. Check the Classification article to learn more.
    • For Scan trigger, choose whether to set up a schedule or run the scan once.
    • Review your scan and select Save and Run.
  • View your scans and scan runs to complete cataloging your data.
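Microsoft Purview needs the SQL warehouse's HTTP path mentioned above. You can copy it from the warehouse's Connection details tab in the Databricks UI; the following Python sketch shows another way to look it up with the Databricks REST API. The workspace URL and token are placeholders, and the /sql/1.0/warehouses/<id> path format is the typical one:

    # Sketch: list SQL warehouses and derive the HTTP path Microsoft Purview needs.
    # Replace the placeholder workspace URL and personal access token.
    import requests

    WORKSPACE_URL = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
    TOKEN = "<databricks-personal-access-token>"                          # placeholder

    resp = requests.get(
        f"{WORKSPACE_URL}/api/2.0/sql/warehouses",
        headers={"Authorization": f"Bearer {TOKEN}"},
        timeout=30,
    )
    resp.raise_for_status()

    for warehouse in resp.json().get("warehouses", []):
        # The HTTP path is typically /sql/1.0/warehouses/<warehouse-id>.
        print(warehouse["name"], f"/sql/1.0/warehouses/{warehouse['id']}")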

After the scan, the data assets in Unity Catalog (UC) are available in Microsoft Purview Unified Catalog search. Find more details about how to connect to and manage Azure Databricks Unity Catalog in Microsoft Purview.

Important

  • Select Access Token Authentication while creating a credential.
  • Place the access token in your Azure Key Vault and connect the key vault to the connection manager.
  • Make sure to grant the service's managed identity (MSI) read access to the Key Vault secrets.
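As one way to carry out the Key Vault step above, the following Python sketch stores a Databricks personal access token as a Key Vault secret by using the azure-identity and azure-keyvault-secrets packages. The vault URL and secret name are placeholders, and you still need to grant the Microsoft Purview managed identity access to read secrets from that vault:

    # Sketch: store the Databricks access token as a Key Vault secret so the
    # Microsoft Purview credential can reference it. Names are placeholders.
    from azure.identity import DefaultAzureCredential
    from azure.keyvault.secrets import SecretClient

    vault_url = "https://my-purview-kv.vault.azure.net"   # placeholder vault URL
    secret_name = "databricks-access-token"               # placeholder secret name
    access_token = "<databricks-personal-access-token>"   # token generated in Databricks

    client = SecretClient(vault_url=vault_url, credential=DefaultAzureCredential())
    secret = client.set_secret(secret_name, access_token)
    print(f"Stored secret '{secret.name}', version {secret.properties.version}")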

Set up connection to Databricks Unity Catalog for a data quality scan

At this point, you have the scanned asset ready for cataloging and governance. Associate the scanned asset with a data product in a governance domain. Then, on the Data quality tab, add a new Azure Databricks connection as described in the following steps.

  1. In the Microsoft Purview portal, open Unified Catalog.

  2. Under Health management, select Data quality.

  3. Select a governance domain from the list, then select Connections from the Manage dropdown list.

  4. Configure connection on the Connections page:

    • Add connection name and description.
    • Select source type Azure Databricks.
    • Select Azure subscription.
    • Select workspace URL.
    • Add the Databricks metastore ID (one way to look it up is sketched after these steps).
    • Select Unity Catalog as the extraction method.
    • Select the HTTP path.
    • Select the Unity Catalog name.
    • Select schema name.
    • Select table name.
    • Select authentication method - Access Token
      • Add Azure subscription
      • Key vault connection
      • Secret name
      • Secret version
    • Select the Enable managed V-Net checkbox if your Databricks workspace is running in a virtual network.
    • Region is selected automatically.
    • Create a new virtual network if one hasn't been created yet.
  5. Test the connection. If your Databricks workspace is in a virtual network, you won't be able to test the connection.
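One way to look up the Databricks metastore ID used in the connection is to query it through your SQL warehouse, as in the following sketch with the databricks-sql-connector package (you can also find the metastore details in the Databricks account console). The hostname, HTTP path, and token are placeholders, and current_metastore() is assumed to return the metastore identifier on your runtime:

    # Sketch: query the Unity Catalog metastore identifier through a SQL warehouse.
    # Replace the placeholder hostname, HTTP path, and access token.
    from databricks import sql

    with sql.connect(
        server_hostname="adb-1234567890123456.7.azuredatabricks.net",  # placeholder
        http_path="/sql/1.0/warehouses/<warehouse-id>",                # placeholder
        access_token="<databricks-personal-access-token>",             # placeholder
    ) as connection:
        with connection.cursor() as cursor:
            cursor.execute("SELECT current_metastore()")
            print(cursor.fetchone()[0])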

Screenshot that shows how to set up the Databricks Unity Catalog connection.

Screenshot that shows how to configure the Databricks connection token.

Important

  • Data quality stewards need read-only access to Azure Databricks Unity Catalog to set up a data quality connection.
  • If public access is disabled, you need to select the Allow trusted Microsoft services checkbox for Key Vault. This requirement applies only to Key Vault, not to your Azure Databricks workspace.
  • Virtual network support is generally available in all supported Azure regions. It's temporarily included in the Data Governance SKUs to maintain flexibility during this phase; virtual network pricing isn't yet included in billing.

Profiling and data quality scanning for data in Azure Databricks Unity Catalog databases

After you successfully complete the connection setup, you can profile your data, create and apply rules, and run a data quality scan for your data in Azure Databricks Unity Catalog databases. Follow the step-by-step guidance in these resources:

Important

  • The fully qualified domain name (FQDN) for a data asset follows a pattern like databricks://(metastore-id)/catalogs/(catalog-name)/schemas/(schema-name)/tables/(table-name). You can find the FQDN details for your Azure Databricks data asset on the Data Map asset page.
  • If your connection parameters (on the connection page) don't match the FQDN, the connection might still work, but you'll see a connection error on the data quality overview page for the selected Databricks asset. Ensure that all corresponding fields are filled in correctly.
  • An XS SQL warehouse with one node is the default SQL warehouse on the Azure Databricks workspace, and it isn't suitable compute for production-grade usage, especially for medium or large datasets. Review the Databricks SQL warehouse reference documentation and apply appropriate vertical scaling (an XS, S, M, L, or XL SQL warehouse) and horizontal scaling (8, 16, 32, or 64 nodes) to scale and parallelize processing effectively. It's recommended to start with a medium (M) SQL warehouse (scaling 1-8) and then adjust as needed.
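As an illustration of the FQDN pattern in the first bullet above, the following small helper builds the expected fully qualified name from hypothetical metastore, catalog, schema, and table values so you can compare it with what the Data Map asset page shows:

    # Sketch: build the expected FQDN for a Unity Catalog table from the pattern
    # in the note above. All identifier values are hypothetical.
    def databricks_table_fqdn(metastore_id: str, catalog: str, schema: str, table: str) -> str:
        return (
            f"databricks://{metastore_id}/catalogs/{catalog}"
            f"/schemas/{schema}/tables/{table}"
        )

    print(
        databricks_table_fqdn(
            metastore_id="11111111-2222-3333-4444-555555555555",  # hypothetical
            catalog="sales_catalog",
            schema="bronze",
            table="orders",
        )
    )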

Resources