Note
Access to this page requires authorization. You can try signing in or changing directories.
Access to this page requires authorization. You can try changing directories.
Applies to: Azure Synapse Analytics
This article helps you identify the reasons and apply mitigations for common performance issues with queries on an Azure Synapse Analytics dedicated SQL pool.
Follow the steps to troubleshoot the issue or execute the steps in the notebook via Azure Data Studio. The first three steps walk you through collecting telemetry, which describes the lifecycle of a query. The references at the end of the article help you analyze potential opportunities found in the data collected.
Note
Before attempting to open this notebook, make sure that Azure Data Studio is installed on your local machine. To install it, go to Learn how to install Azure Data Studio.
Important
Most of the reported performance issues are caused by:
- Outdated statistics
- Unhealthy clustered columnstore indexes (CCIs)
To save troubleshooting time, make sure that the statistics are created and up-to-date, and rebuild clustered columnstore indexes in the dedicated SQL pool.
Step 1: Identify the request_id (also known as QID)
The request_id of the slow query is required to research potential reasons for a slow query. Use the following script as a starting point for identifying the query you want to troubleshoot. Once the slow query is identified, note down the request_id value.
First, monitor the active queries. This query is ordered by the newest rows first.
-- Monitor active queries
SELECT *
FROM sys.dm_pdw_exec_requests
WHERE [status] NOT IN ('Completed', 'Failed', 'Cancelled')
      AND session_id <> session_id()
-- AND [label] = '<YourLabel>'
-- AND resource_allocation_percentage is not NULL
ORDER BY submit_time DESC;
Then, find the active queries that have the longest run time, starting with the longest-running queries.
-- Find top 10 longest running queries
SELECT TOP 10 *
FROM sys.dm_pdw_exec_requests
ORDER BY total_elapsed_time DESC;
To better target the slow queries, use the following tips when you run the script:
- Sort by either submit_time DESCortotal_elapsed_time DESCto have the longest-running queries present at the top of the result set.
- Use OPTION(LABEL='<YourLabel>')in your queries and then filter thelabelcolumn to identify them.
- Consider filtering out any QIDs that don't have a value for resource_allocation_percentagewhen you know the target statement is contained in a batch. Be cautious with this filter as it might also filter out some queries that are being blocked by other sessions.
Step 2: Determine where the query is taking time
Run the following script to find the step that might cause the performance issue of the query. Update the variables in the script with the values described in the following table. Change the @ShowActiveOnly value to 0 to get the full picture of the distributed plan. Take note of the StepIndex, Phase, and Description values of the slow step identified from the result set.
| Parameter | Description | 
|---|---|
| @QID | The request_idvalue is obtained in Step 1. | 
| @ShowActiveOnly | Setting the value to 0shows all steps for the query.Setting the value to 1shows only the currently active step. | 
DECLARE @QID AS VARCHAR (16) = '<request_id>', @ShowActiveOnly AS BIT = 1;
-- Retrieve session_id of QID
DECLARE @session_id AS VARCHAR (16) = (SELECT session_id
                                       FROM sys.dm_pdw_exec_requests
                                       WHERE request_id = @QID);
-- Blocked by Compilation or Resource Allocation (Concurrency)
SELECT @session_id AS session_id, @QID AS request_id, -1 AS [StepIndex], 'Compilation' AS [Phase],
   'Blocked waiting on '
       + MAX(CASE WHEN waiting.type = 'CompilationConcurrencyResourceType' THEN 'Compilation Concurrency'
                  WHEN waiting.type LIKE 'Shared-%' THEN ''
                  ELSE 'Resource Allocation (Concurrency)' END)
       + MAX(CASE WHEN waiting.type LIKE 'Shared-%' THEN ' for ' + REPLACE(waiting.type, 'Shared-', '')
             ELSE '' END) AS [Description],
   MAX(waiting.request_time) AS [StartTime], GETDATE() AS [EndTime],
   DATEDIFF(ms, MAX(waiting.request_time), GETDATE())/1000.0 AS [Duration],
   NULL AS [Status], NULL AS [EstimatedRowCount], NULL AS [ActualRowCount], NULL AS [TSQL]
FROM sys.dm_pdw_waits waiting
WHERE waiting.session_id = @session_id
      AND ([type] LIKE 'Shared-%'
           OR [type] IN ('ConcurrencyResourceType', 'UserConcurrencyResourceType', 'CompilationConcurrencyResourceType'))
      AND [state] = 'Queued'
GROUP BY session_id
-- Blocked by another query
UNION ALL
SELECT @session_id AS session_id,
       @QID AS request_id,
       -1 AS [StepIndex],
       'Compilation' AS [Phase],
       'Blocked by ' + blocking.session_id + ':' + blocking.request_id + ' when requesting ' + waiting.type + ' on ' + QUOTENAME(waiting.object_type) + waiting.object_name AS [Description],
       waiting.request_time AS [StartTime],
       GETDATE() AS [EndTime],
       DATEDIFF(ms, waiting.request_time, GETDATE()) / 1000.0 AS [Duration],
       NULL AS [Status],
       NULL AS [EstimatedRowCount],
       NULL AS [ActualRowCount],
       COALESCE (blocking_exec_request.command, blocking_exec_request.command2) AS [TSQL]
FROM sys.dm_pdw_waits AS waiting
     INNER JOIN
     sys.dm_pdw_waits AS blocking
     ON waiting.object_type = blocking.object_type
        AND waiting.object_name = blocking.object_name
     INNER JOIN
     sys.dm_pdw_exec_requests AS blocking_exec_request
     ON blocking.request_id = blocking_exec_request.request_id
WHERE waiting.session_id = @session_id
      AND waiting.state = 'Queued'
      AND blocking.state = 'Granted'
      AND waiting.type != 'Shared'
-- Request Steps
UNION ALL
SELECT @session_id AS session_id,
       @QID AS request_id,
       step_index AS [StepIndex],
       'Execution' AS [Phase],
       operation_type + ' (' + location_type + ')' AS [Description],
       start_time AS [StartTime],
       end_time AS [EndTime],
       total_elapsed_time / 1000.0 AS [Duration],
       [status] AS [Status],
       CASE WHEN estimated_rows > -1 THEN estimated_rows END AS [EstimatedRowCount],
       CASE WHEN row_count > -1 THEN row_count END AS [ActualRowCount],
       command AS [TSQL]
FROM sys.dm_pdw_request_steps
WHERE request_id = @QID
      AND [status] = CASE @ShowActiveOnly WHEN 1 THEN 'Running' ELSE [status] END
ORDER BY StepIndex;
Step 3: Review step details
Run the following script to review the details of the step identified in the previous step. Update the variables in the script with the values described in the following table. Change the @ShowActiveOnly value to 0 to compare all distribution timings. Take note of the wait_type value for the distribution that might cause the performance issue.
| Parameter | Description | 
|---|---|
| @QID | The request_idvalue is obtained in Step 1. | 
| @StepIndex | The StepIndexvalue is identified in Step 2. | 
| @ShowActiveOnly | Setting the value to 0shows all distributions for the givenStepIndexvalue.Setting the value to 1shows only the currently active distributions for the givenStepIndexvalue. | 
DECLARE @QID VARCHAR(16) = '<request_id>', @StepIndex INT = <StepIndex>, @ShowActiveOnly BIT = 1;
WITH dists
AS (SELECT request_id, step_index, 'sys.dm_pdw_sql_requests' AS source_dmv,
       distribution_id, pdw_node_id, spid, 'NativeSQL' AS [type], [status],
       start_time, end_time, total_elapsed_time, row_count
    FROM sys.dm_pdw_sql_requests
    WHERE request_id = @QID AND step_index = @StepIndex
    UNION ALL
    SELECT request_id, step_index, 'sys.dm_pdw_dms_workers' AS source_dmv,
       distribution_id, pdw_node_id, sql_spid AS spid, [type],
       [status], start_time, end_time, total_elapsed_time, rows_processed as row_count
    FROM sys.dm_pdw_dms_workers
    WHERE request_id = @QID AND step_index = @StepIndex
   )
SELECT sr.step_index, sr.distribution_id, sr.pdw_node_id, sr.spid,
       sr.type, sr.status, sr.start_time, sr.end_time,
       sr.total_elapsed_time, sr.row_count, owt.wait_type, owt.wait_time
FROM dists sr
   LEFT JOIN sys.dm_pdw_nodes_exec_requests owt
      ON sr.pdw_node_id = owt.pdw_node_id
         AND sr.spid = owt.session_id
         AND ((sr.source_dmv = 'sys.dm_pdw_sql_requests'
                 AND sr.status = 'Running') -- sys.dm_pdw_sql_requests status
              OR (sr.source_dmv = 'sys.dm_pdw_dms_requests'
                     AND sr.status not LIKE 'Step[CE]%')) -- sys.dm_pdw_dms_workers final statuses
WHERE sr.request_id = @QID
      AND ((sr.source_dmv = 'sys.dm_pdw_sql_requests' AND sr.status =
               CASE WHEN @ShowActiveOnly = 1 THEN 'Running' ELSE sr.status END)
           OR (sr.source_dmv = 'sys.dm_pdw_dms_workers' AND sr.status NOT LIKE
                  CASE WHEN @ShowActiveOnly = 1 THEN 'Step[CE]%' ELSE '' END))
      AND sr.step_index = @StepIndex
ORDER BY distribution_id
Step 4: Diagnose and mitigate
Compilation phase issues
- According to the - Descriptionvalues obtained in Step 2, check the relevant section for more information from the following table.- Description - Common Cause - Compilation Concurrency- Blocked: Compilation Concurrency - Resource Allocation (Concurrency)- Blocked: resource allocation 
- If the query is in "Running" status identified in Step 1, but there's no step information in Step 2, check the cause that best fits your scenario to get more information from the following table. - Scenario - Common Cause - Statement contains complex join-filter logic or performs joins in - WHEREclause- Complex query or older JOIN syntax - Statement is a long-running - DROP TABLEor- TRUNCATE TABLEstatement- Long-running DROP TABLE or TRUNCATE TABLE - CCIs have high percentage of deleted or open rows (see Optimizing clustered columnstore indexes) - Unhealthy CCIs (generally) 
- Analyze the result set in Step 1 for one or more - CREATE STATISTICSstatements executed immediately after the slow query submission. Check the cause that best fits your scenario from the following table.- Scenario - Common Cause - Statistics created unexpectedly - Delay from auto-create statistics - Statistics creation failed after 5 minutes - Auto-create statistics timeouts 
Blocked: Compilation Concurrency
Concurrency Compilation blocks rarely occur. However, if you encounter this type of block, it signifies that a large volume of queries were submitted in a short time and have been queued to begin compilation.
Mitigations
Reduce the number of queries submitted concurrently.
Blocked: resource allocation
Being blocked for resource allocation means that your query is waiting to execute based on:
- The amount of memory granted based on the resource class or workload group assignment associated with the user.
- The amount of available memory on the system or workload group.
- (Optional) The workload group/classifier importance.
Mitigations
- Wait for the blocking session to complete.
- Evaluate the resource class choice. For more information, see concurrency limits.
- Evaluate if it's preferable to Kill the blocking session.
Complex query or older JOIN syntax
You might encounter a situation where the default query optimizer methods are proven ineffective as the compilation phase takes a long time. It might occur if the query:
- Involves a high number of joins and/or subqueries (complex query).
- Utilizes joiners in the FROMclause (not ANSI-92 style joins).
Though these scenarios are atypical, you have options to attempt to override the default behavior to reduce the time it takes for the query optimizer to choose a plan.
Mitigations
- Use ANSI-92 style joins.
- Add query hints: OPTION(FORCE ORDER, USE HINT ('FORCE_LEGACY_CARDINALITY_ESTIMATION')). For more information, see FORCE ORDER and Cardinality Estimation (SQL Server).
- Break the query into multiple, less complex steps.
Long-running DROP TABLE or TRUNCATE TABLE
For execution time efficiencies, the DROP TABLE and TRUNCATE TABLE statements defer storage cleanup to a background process. However, if your workload performs a high number of DROP/TRUNCATE TABLE statements in a short time frame, it's possible that metadata becomes crowded and causes subsequent DROP/TRUNCATE TABLE statements to execute slowly.
Mitigations
Identify a maintenance window, stop all workloads, and run DBCC SHRINKDATABASE to force an immediate cleanup of previously dropped or truncated tables.
Unhealthy CCIs (generally)
Poor clustered columnstore index (CCI) health requires extra metadata, which can cause the query optimizer to take more time to determine an optimal plan. To avoid this situation, ensure that all of your CCIs are in good health.
Mitigations
Assess and correct clustered columnstore index health in a dedicated SQL pool.
Delay from auto-create statistics
The automatic create statistics option, AUTO_CREATE_STATISTICS is ON by default to help ensure the query optimizer can make good distributed plan decisions. However, the auto-creation process itself can make an initial query take longer than subsequent executions of the same.
Mitigations
If the first execution of query consistently requires statistics to be created, you'll need to manually create statistics prior to the execution of the query.
Auto-create statistics timeouts
The automatic create statistics option, AUTO_CREATE_STATISTICS is ON by default to help ensure the query optimizer can make good distributed plan decisions. The auto-creation of statistics occurs in response to a SELECT statement and has a 5-minute threshold to complete. If the size of data and/or the number of statistics to be created require longer than the 5-minute threshold, the auto-creation of statistics will be abandoned so that the query can continue execution. The failure to create the statistics can negatively affect the query optimizer's ability to generate an efficient distributed execution plan, resulting in poor query performance.
Mitigations
Manually create the statistics instead of relying on the auto-create feature for the identified tables/columns.
Execution phase issues
- Use the following table to analyze the result set in Step 2. Determine your scenario and check the common cause for detailed information and the possible mitigation steps. - Scenario - Common Cause - EstimatedRowCount/- ActualRowCount< 25%- Inaccurate estimates - The - Descriptionvalue indicates- BroadcastMoveOperationand the query references a replicated table.- Uncached replicated tables - 1. - @ShowActiveOnly= 0
 2. High or unexpected number of steps (- step_index) is observed.
 3. Data types of joiner columns aren't identical between tables.- Mismatched data type/size - 1. The - Descriptionvalue indicates- HadoopBroadcastOperation,- HadoopRoundRobinOperationor- HadoopShuffleOperation.
 2. The- total_elapsed_timevalue of a given- step_indexis inconsistent between executions.- Ad hoc external table queries 
- Check the - total_elapsed_timevalue obtained in Step 3. If it's significantly higher in a few distributions in a given step, follow the these steps:- Check the data distribution for every table referenced in the - TSQLfield for associated- step_idby running the following command against each:- DBCC PDW_SHOWSPACEUSED(<table>);
- If <minimum rows value>/<maximum rows value> > 0.1, go to Data skew (stored). 
- Otherwise, go to In-flight data skew. 
 
Inaccurate estimates
Have your statistics up-to-date to ensure that the query optimizer generates an optimal plan. When the estimated row count is significantly less than the actual counts, the statistics need to be maintained.
Mitigations
First, check the accuracy of the statistics on a dedicated SQL pool. If necessary, create or update the statistics.
Uncached replicated tables
If you have created replicated tables, and you fail to warm the replicated table cache properly, unexpected poor performance will result due to extra data movements or the creation of a suboptimal distributed plan.
Mitigations
- Warm the replicated cache after DML operations.
- If there are frequent DML operations, change the distribution of the table to ROUND_ROBIN.
Mismatched data type/size
When joining tables, make sure that the data type and size of the joining columns match. Otherwise, it will result in unnecessary data movements that will decrease the availability of CPU, IO, and network traffic to the remainder of the workload.
Mitigations
Rebuild the tables to correct the related table columns that don't have identical data type and size.
Ad hoc external table queries
Queries against external tables are designed with the intention of bulk loading data into the dedicated SQL pool. Ad hoc queries against external tables might suffer variable durations due to external factors, such as concurrent storage container activities.
Mitigations
Load data into the dedicated SQL pool first and then query the loaded data.
Data skew (stored)
Data skew means the data isn't distributed evenly across the distributions. Each step of the distributed plan requires all distributions to complete before moving to the next step. When your data is skewed, the full potential of the processing resources, such as CPU and IO, can't be achieved, resulting in slower execution times.
Mitigations
Review our guidance for distributed tables to assist your choice of a more appropriate distribution column.
In-flight data skew
In-flight data skew is a variant of the data skew (stored) issue. But, it's not the distribution of data on disk that is skewed. The nature of the distributed plan for particular filters or grouped data causes a ShuffleMoveOperation type operation. This operation produces a skewed output to be consumed downstream.
Mitigations
- Ensure that statistics are created and up-to-date. You can verify their accuracy by following the steps outlined in Check statistics accuracy on a dedicated SQL pool.
- Change the order of your GROUP BYcolumns to lead with a higher-cardinality column.
- Create multi-column statistics if joins cover multiple columns.
- Add query hint OPTION(FORCE_ORDER)to your query.
- Refactor the query.
Wait type issues
If none of the above common issues apply to your query, the Step 3 data affords the opportunity to determine which wait types (in wait_type and wait_time) are interfering with query processing for the longest-running step. There are a large number of wait types, and they're grouped into related categories due to similar mitigations. Follow these steps to locate the wait category of your query step:
- Identify the wait_typein Step 3 that is taking the most time.
- Locate the wait type in wait categories mapping table and identify the wait category it included in.
- Expand the section related to the wait category from the following list for recommended mitigations.
Compilation
Follow these steps to mitigate wait type issues of the Compilation category:
- Rebuild indexes for all objects involved in the problematic query.
- Update statistics on all objects involved in the problematic query.
- Test the problematic query again to validate whether the issue persists.
If the issue persists, then:
- Create a .sql file with: - SET QUERY_DIAGNOSTICS ON; <Your_SQL>; SET QUERY_DIAGNOSTICS OFF;
- Open a Command Prompt window and run the following command: - sqlcmd −S <servername>.database.windows.net −d <databasename> −U <username> −G −I −i .\<sql_file_name>.sql −y0 −o .\<output_file_name>.txt
- Open <output_file_name>.txt in a text editor. Locate and copy paste the distribution-level execution plans (lines that begin with - <ShowPlanXML>) from the longest-running step identified in Step 2 into separate text files with a .sqlplan extension.- Note: Each step of the distributed plan will typically have recorded 60 distribution-level execution plans. Make sure that you're preparing and comparing execution plans from the same distributed plan step. 
- The Step 3 query frequently reveals a few distributions that take much longer than others. In SQL Server Management Studio, compare the distribution-level execution plans (from the .sqlplan files created) of a long-running distribution to a fast-running distribution to analyze potential causes for differences. 
Lock, Worker Thread
- Consider changing tables that undergo frequent, small changes to utilize a row store index instead of CCI.
- Batch up your changes and update the target with more rows on a less frequent basis.
Buffer IO, Other Disk IO, Tran Log IO
Unhealthy CCIs
Unhealthy CCIs contribute to increased IO, CPU, and memory allocation, which, in turn, negatively impacts the query performance. To mitigate this issue, try one of the following methods:
- Assess and correct clustered columnstore index health in a dedicated SQL pool.
- Run and review the output of the query listed at Optimizing clustered columnstore indexes to get a baseline.
- Follow the steps to rebuild indexes to improve segment quality, targeting the tables involved in the example problem query.
Outdated statistics
Outdated statistics can cause the generation of an unoptimized distributed plan, which involves more data movement than necessary. Unnecessary data movement increases the workload not only on your data at rest but also on the tempdb. Because IO is a shared resource across all queries, performance impacts can be felt by the entire workload.
The optimizer relies on statistics to estimate the number of rows that will be returned by a query. Statistics allow the query optimizer to choose the most efficient plan or perform the best move operation (for example, a Shuffle Move Operation or Broad Cast Move Operation) to align the data during the join condition. The best join condition depends on the table distribution type.
For example, if the actual number of rows for a given table is 60 million and the estimated number of rows is 1,000 (at the control node level), the optimizer might choose a Broadcast move operation. This behavior is because the cost is perceived to be lower compared to a Shuffle Move, given the optimizer's assumption that the table contains only 1,000 rows. However, once the actual execution begins, the engine will move 60 million rows as part of the execution using a Broadcast move, which can be an expensive operation considering both the data size and row count. Consequently, if the data size is substantial, it might lead to performance issues for the query itself and other queries, resulting in high CPU usage.
To remedy this situation, ensure all statistics are up-to-date and a maintenance plan is in place to keep them updated for user workloads. You can verify the accuracy of the statistics by following the steps outlined in Check statistics accuracy on a dedicated SQL pool.
Heavy IO workloads
Your overall workload might be reading large amounts of data. Synapse dedicated SQL pools scale resources in accordance with the DWU. In order to achieve better performance, consider either or both:
- Utilizing a larger resource class for your queries.
- Increase compute resources.
CPU, Parallelism
| Scenario | Mitigation | 
|---|---|
| Poor CCI Health | Assess and correct clustered columnstore index health in a dedicated SQL pool | 
| User queries contain transformations | Move all formatting and other transformation logic into ETL processes so the formatted versions are stored | 
| Workload improperly prioritized | Implement workload isolation | 
| Insufficient DWU for workload | Consider increasing compute resources | 
Network IO
If the issue occurs during a RETURN operation in Step 2,
- Reduce the number of concurrent parallel processes.
- Scale out the most affected process to another client.
For all other data movement operations, it's probable that the network issues appear to be internal to the dedicated SQL pool. To attempt to quickly mitigate this issue, follow these steps:
- Scale your dedicated SQL pool to DW100c
- Scale back to your desired DWU level
SQL CLR
Avoid frequent use of the FORMAT() function by implementing an alternate way of transforming the data (for example, CONVERT() with style).