Share via


Set up data quality for Microsoft Azure Synapse serverless and data warehouse

Microsoft Azure Synapse Analytics is an enterprise analytics service that accelerates time to insight across data warehouses and big data systems. It brings together the best SQL technologies used in enterprise data warehousing, Apache Spark technologies for big data, and Azure Data Explorer for log and time series analytics. For more details, see the Azure Synapse Analytics documentation.

The following example shows a Synapse workspace with an instance of Dedicated Synapse Data Warehouse (DWH) Table EMPLOYEE and a Serverless Database (SQL_ON_DEMAND) with SynapseSalesDelta table.

Screenshot of synapse analytics workspace.

After you scan the assets, they're available in Microsoft Purview. The following example shows an Employee Table on Synapse Analytics Dedicated instance.

Azure Synapse Analytics Dedicated (Data Warehouse)

Set up Data Map scan

To scan Azure Synapse Analytics Dedicated (Data Warehouse) follow these instructions. To grant the necessary managed identity permissions on the Dedicated DWH instance, follow these steps.

Screenshot of data map scan configuration.

After you scan the assets, they're available in Microsoft Purview Unified Catalog. The following example shows an Employee Table on Synapse Analytics Dedicated instance:

Screenshot of data map scan result.

Set up connection to your Synapse dedicated data warehouse

At this point, you have the scanned asset ready for cataloging and governance. Associate the scanned asset to the data product in a governance domain. At the Data Quality Tab, add a new Azure SQL Database Connection: Get the Database Name entered manually.

  1. In Unified Catalog, go to Health management > Data quality and select a governance domain.

  2. On the governance domain's details page, select Manage, then select Connections.

  3. On the Connections page, select New and configure connection with the following information, as seen in the example below:

    • Add connection name and description.
    • Select source type Azure Synapse Analytics.
    • Select Azure subscription.
    • Select Workspace name.
    • Select Dedicated SQL endpoint.
    • Select serverless SQL endpoint.
    • Select Endpoint type.
    • Select Database.
    • Add MSI as Credential.

    Screenshot of how to set up data source connection.

  4. Test the connection. After configuring the data source connection and successfully testing it, you can proceed to configure and run data profiling and data quality scans.

  5. If your Synapse data source is located behind a private endpoint, you need to enable managed virtual networks. Follow the steps at Set up managed virtual networks.

Important

Data Quality stewards need read only access to synapse dedicated data warehouse to setup data quality connection. For managed virtual network setup, you can't test the connection.

Profiling and data quality scanning for data in Synapse dedicated data warehouse

After you set up the connection, you can profile your data, create and apply rules, and run a data quality scan for your data in Synapse warehouse. Follow the step-by-step guidelines described in the following articles:

Important

  • The performance of the queries and even their successful runs depend on the DW configuration you have for your dedicated database instances.
  • Respective data quality assessment jobs or any other data quality job induce a connection on the Dedicated DW and might fail if the instance is under provisioned or fails on concurrency limits. You need to be aware of the DW configuration. Its concurrency has hard limits for any instance in time.
  • Concurrency limits might lead to job termination. DW limits (such 1000 DW) provide the power to run the queries.

Azure Synapse Analytics Serverless

Set up Data Map scan

To scan Azure Synapse Analytics Serverless, follow these instructions. To grant the necessary managed identity permissions on the Dedicated DWH instance, follow these steps. Once scanned, the serverless assets are available in Unified Catalog.

Screenshot of data map scan configuration for serverless.

Set up connection to your synapse Serverless

At this point, you have the scanned asset ready for cataloging and governance. Associate the scanned asset to the data product in a governance domain. In Data Quality, add a new SQL Database Connection: Get the Database Name entered manually.

  1. In Unified Catalog, go to Health management > Data quality and select a governance domain.

  2. On the governance domain's details page, select Manage, then select Connections.

  3. On the Connections page, select New and configure connection with the following information, as seen in the example below

    • Add connection name and description.
    • Select source type Azure Synapse Analytics.
    • Select Azure subscription.
    • Select Workspace name.
    • Select Dedicated SQL endpoint.
    • Select serverless SQL endpoint.
    • Select Endpoint type.
    • Select Database.
    • Add MSI as Credential.

    Screenshot of how to set up data source connection for synapse serverless.

  4. Test the connection. After configuring the data source connection and successfully testing it, you can proceed to configure and run data profiling and data quality scans.

  5. If your Synapse data source is located behind a private endpoint, you need to enable managed virtual networks. Follow the steps at Set up managed virtual networks.

Important

  • Data Quality stewards need read only access to synapse dedicated data warehouse to set up data quality connection.
  • In Synapse serverless setup, the external table points to Delta formatted data stored in ADLS Gen2.
  • Synapse Connector only detects and supports sql.azuresynapse.net. If Fully Qualified Name (FQN) generated by your Data Map scan contains database.windows.net, then your Synapse connection for data quality scan will fail.

Profiling and data quality scanning for data in Synapse serverless

After you set up the connection, you can profile your data, create and apply rules, and run a data quality scan for your data in Synapse warehouse. Follow the step-by-step guidelines described in these articles:

Important

  • The data quality assessments and profiling run on Spark in the background. You have multiple connections where each Spark node has a connection SPID. Hence, Data Warehouse might run into current query limits if you use or schedule beyond Data Warehouse limits, resulting in failures. But for Azure Synapse Serverless SQL Table, no such concurrency limits apply. It depends on the Serverless Delta Parquet optimizations you have on your ADLS Gen2 instance. The engine closely resonates with Databricks Serverless Data Warehouse. Both operate on external Lakehouse sources such as DELTA format tables.

Resources