Set up data source connection for data quality in Unified Catalog

Data source connections set up the authentication needed to profile your data for a statistical snapshot, or to scan your data for data quality anomalies and scoring.

Setting up data source connections is the fourth step in the data quality life cycle for a data asset. Previous steps are:

  1. Assign users data quality steward permissions in Unified Catalog to use all data quality features.
  2. Register and scan a data source in your Microsoft Purview Data Map.
  3. Add data assets to a data product.

Prerequisites

  1. To create connections to data assets, your users must be in the data quality steward role.
  2. You need at least read access to the data source for which you're setting up the connection.

Supported multicloud data sources

See the supported data sources article for the list of supported data sources, including the file formats supported for data profiling and data quality scanning, with and without virtual network support.

Currently, data quality scans can run only with Managed Identity as the authentication option. Data quality services run on Apache Spark 3.4 and Delta Lake 2.4.

Important

To access these sources, either set your Microsoft Azure Storage sources to have an open firewall, set them to Allow Trusted Azure Services, or use private endpoints by following the guidance in the data quality managed virtual network configuration guide.
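
As a rough aid for choosing among the three options above, the following Python sketch classifies a storage account's network settings. It assumes the settings are supplied as a dictionary shaped like the `networkAcls` portion of the Azure Resource Manager representation of a storage account (`defaultAction` and `bypass`). That input shape, the helper name `access_path`, and the returned labels are assumptions for illustration only; the function doesn't call Azure or Microsoft Purview.

```python
from typing import Dict


def access_path(network_acls: Dict[str, str]) -> str:
    """Classify how a data quality scan could reach a storage account.

    `network_acls` is assumed to look like:
    {"defaultAction": "Deny", "bypass": "Logging, Metrics, AzureServices"}
    """
    # Assumption: a missing defaultAction means the firewall is open.
    default_action = network_acls.get("defaultAction", "Allow")
    if default_action == "Allow":
        return "open firewall"

    # `bypass` is treated as a comma-separated list of exemptions.
    bypass = {item.strip() for item in network_acls.get("bypass", "").split(",")}
    if "AzureServices" in bypass:
        return "trusted Azure services"

    # Neither option applies, so the remaining route is a private endpoint
    # through a data quality managed virtual network.
    return "private endpoint required"


if __name__ == "__main__":
    print(access_path({"defaultAction": "Deny", "bypass": "None"}))
```

A result of `private endpoint required` means that neither firewall option is in place, so the managed virtual network configuration guide is the path to follow.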

Set up data source connection

Follow these steps to create a new connection for the data products and data assets in a governance domain.

  1. In Unified Catalog, select Health management, then select Data quality.
  2. Select a governance domain from the list.
  3. From the Manage dropdown list, select Connections.
  4. On the Connections page, select New.
  5. On the Create connection flyout pane, enter a Display name and an optional Description.
  6. Select a Source type.
  7. Select one of the data sources: Azure subscription, Data Map, or enter a data source manually. Depending on which data source you choose, enter the required access details. The connection is then tested.
  8. If the test connection is successful, select Submit to complete the connection setup.
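
The form fields in steps 5 through 7 can be pictured as a small record with a check that runs before you submit. The following Python sketch is illustrative only: `ConnectionDraft`, `SourceSelection`, and `missing_fields` are hypothetical names, not part of any Microsoft Purview SDK or API, and the rule that a manually entered source needs an endpoint is an assumption about what "required access details" might include.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class SourceSelection(Enum):
    """The three ways step 7 lets you pick a data source."""
    AZURE_SUBSCRIPTION = "Azure subscription"
    DATA_MAP = "Data Map"
    MANUAL = "Enter manually"


@dataclass
class ConnectionDraft:
    """Hypothetical stand-in for the Create connection flyout pane."""
    display_name: str                  # required (step 5)
    source_type: str                   # required (step 6)
    selection: SourceSelection         # how the source is chosen (step 7)
    description: Optional[str] = None  # optional (step 5)
    endpoint: Optional[str] = None     # assumed to be needed for manual entry


def missing_fields(draft: ConnectionDraft) -> List[str]:
    """Return the names of required fields that are still empty."""
    missing = []
    if not draft.display_name.strip():
        missing.append("display_name")
    if not draft.source_type.strip():
        missing.append("source_type")
    # Assumption: manual entry requires an endpoint; other modes look it up.
    if draft.selection is SourceSelection.MANUAL and not (draft.endpoint or "").strip():
        missing.append("endpoint")
    return missing


if __name__ == "__main__":
    draft = ConnectionDraft(
        display_name="Sales lake",
        source_type="Azure Data Lake Storage Gen2",
        selection=SourceSelection.MANUAL,
    )
    print(missing_fields(draft))
```

An empty result corresponds to a form that is ready for the connection test in step 7; the test itself and the Submit action happen only in the Unified Catalog UI.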

Tip

  • You can also create a connection to your resources by using private endpoints and a Microsoft Purview Data Quality managed virtual network. Learn more about setting up managed virtual networks for data quality.
  • Connection setup steps vary for native connectors. See the native connector articles for the steps to set up connections for the Azure Databricks, Snowflake, Google BigQuery, and Azure Synapse connectors.
  • To set up an Azure Dedicated SQL Pool (formerly SQL DW) connection, select Azure SQL database as the source type and add sqldatawarehouse.database.windows.net as the endpoint name.
  • The virtual network region is automatically populated from the selected source region. Find details on managing virtual network provisioning.

Grant Microsoft Purview permissions on the source

After you create the connection, grant the Microsoft Purview managed identity permissions on your data sources so that it can scan them.

Next steps

  1. Configure and run data profiling for an asset in your data source.
  2. Set up data quality rules based on the profiling results, and apply them to your data asset.
  3. Configure and run a data quality scan on a data product to assess the quality of all supported assets in the data product.
  4. Review your scan results to evaluate your data product's current data quality.