Edit

Share via


Use Azure Synapse Analytics to design an enterprise BI solution

Power BI
Azure Synapse Analytics
Azure Data Factory
Microsoft Entra ID
Azure Blob Storage

This article describes how to transfer data from an on-premises data warehouse to a cloud environment and then use a business intelligence (BI) model to serve the data. You can use this approach as an end goal or a first step toward full modernization with cloud-based components.

This guidance builds on the Azure Synapse Analytics end-to-end scenario. This process uses Azure Synapse Analytics pipelines to ingest data from a SQL database into SQL pools. Then it performs data transformation for analysis. This article focuses on Azure Synapse Analytics pipelines, but you can also use Azure Data Factory pipelines or Fabric Data Factory pipelines to perform these tasks.

When to use this architecture

You can use various methods to meet business requirements for enterprise BI. Various aspects define business requirements, such as current technology investments, human skills, the timeline for modernization, future goals, and whether you have a preference for platform as a service (PaaS) or software as a service (SaaS).

Consider the following design approaches:

The architecture in this article assumes that you use Azure Synapse Analytics data warehouse as the persistent layer of the enterprise semantic model and you use Power BI for business intelligence. This PaaS approach has the flexibility to accommodate various business requirements and preferences.

Architecture

Diagram that shows the enterprise BI architecture with Azure Synapse Analytics.

Download a Visio file of this architecture.

Workflow

Data source

Ingestion and data storage

Analysis and reporting

Components

This scenario uses the following components:

  • Azure SQL Database is an Azure-hosted PaaS SQL server. This architecture uses SQL Database to demonstrate the flow of data for the migration scenario.

  • Data Lake Storage provides flexible cloud storage for unstructured data that's used for persisting intermediate migration results.

  • Azure Synapse Analytics is an enterprise analytics service for data warehousing and big data systems. Azure Synapse Analytics serves as main compute and persistent storage in enterprise semantic modeling and servicing.

  • Power BI Premium is a BI tool that presents and visualizes data in this scenario.

  • Microsoft Entra ID is a multicloud identity and network solution suite that supports the authentication and authorization flow.

Simplified architecture

Diagram that shows the enterprise BI simplified architecture.

Scenario details

In this scenario, an organization has a SQL database that contains a large on-premises data warehouse. The organization wants to use Azure Synapse Analytics to perform analysis, then deliver these insights via Power BI to users and analytics.

Authentication

Microsoft Entra ID authenticates users who connect to Power BI dashboards and apps. Single sign-on connects users to the data source in an Azure Synapse Analytics provisioned pool. Authorization occurs on the source.

Incremental loading

When you run an automated extract, transform, load (ETL) or extract, load, transform (ELT) process, you should load only the data that changed since the previous run. This process is called an incremental load. Conversely, a full load loads all the data. To perform an incremental load, determine how to identify the changed data. You can use a high water mark value approach, which tracks the latest value of a date-time column or a unique integer column in the source table.

You can use temporal tables in SQL Server. Temporal tables are system-versioned tables that store data change history. The database engine automatically records the history of every change in a separate history table. To query the historical data, you can add a FOR SYSTEM_TIME clause to a query. Internally, the database engine queries the history table, but it's transparent to the application.

Temporal tables support dimension data, which can change over time. Fact tables usually represent an immutable transaction such as a sale, in which case keeping the system version history doesn't make sense. Instead, transactions usually have a column that represents the transaction date. The column can be used as the watermark value. For example, in the AdventureWorks data warehouse, the SalesLT.* tables have a LastModified field.

Here's the general flow for the ELT pipeline:

  1. For each table in the source database, track the cutoff time when the last ELT job ran. Store this information in the data warehouse. On initial setup, all times are set to 1-1-1900.

  2. During the data export step, the cutoff time is passed as a parameter to a set of stored procedures in the source database. These stored procedures query any records that are changed or created after the cutoff time. For all tables in the example, you can use the ModifiedDate column.

  3. When the data migration is complete, update the table that stores the cutoff times.

Data pipeline

This scenario uses the AdventureWorks sample database as a data source. The incremental data load pattern ensures that only data that's modified or added after the most recent pipeline run is loaded.

Metadata-driven copy tool

The built-in metadata-driven copy tool within Azure Synapse Analytics pipelines incrementally loads all tables that are contained in the relational database.

  1. Use a wizard interface to connect the Copy Data tool to the source database.

  2. After it connects, configure incremental loading or full loading for each table.

  3. The Copy Data tool creates the pipelines and SQL scripts needed to generate the control table. This table stores data, such as the high watermark value or column for each table, for the incremental loading process.

  4. After these scripts run, the pipeline loads all source data warehouse tables into the Azure Synapse Analytics dedicated pool.

Screenshot that shows the metadata-driven Copy Data tool in Azure Synapse Analytics.

Before the tool loads the data, it creates three pipelines to iterate over the tables in the database.

The pipelines do the following tasks:

  • Count the number of objects, such as tables, to be copied in the pipeline run.

  • Iterate over each object to be loaded or copied.

  • After a pipeline iterates over each object, it does the following tasks:

    • Checks whether a delta load is required. Otherwise, the pipeline completes a normal full load.

    • Retrieves the high watermark value from the control table.

    • Copies data from the source tables into the staging account in Data Lake Storage.

    • Loads data into the dedicated SQL pool via the selected copy method, such as the PolyBase or Copy command.

    • Updates the high watermark value in the control table.

Load data into an Azure Synapse Analytics SQL pool

The copy activity copies data from the SQL database into the Azure Synapse Analytics SQL pool. This example's SQL database is in Azure, so it uses the Azure integration runtime to read data from the SQL database and write the data into the specified staging environment.

The copy statement then loads data from the staging environment into the Azure Synapse Analytics dedicated pool.

Use Azure Synapse Analytics pipelines

Pipelines in Azure Synapse Analytics define an ordered set of activities to complete an incremental load pattern. Manual or automatic triggers start the pipeline.

Transform the data

The sample database in this reference architecture is small, so replicated tables that have no partitions are created. For production workloads, distributed tables can improve query performance. For more information, see Guidance for designing distributed tables in Azure Synapse Analytics. The example scripts run the queries via a static resource class.

In a production environment, consider creating staging tables that have round-robin distribution. Then transform and move the data into production tables that have clustered columnstore indexes, which offer the best overall query performance. Columnstore indexes are optimized for queries that scan many records.

Columnstore indexes don't perform optimally for singleton lookups, or looking up a single row. If you need to perform frequent singleton lookups, you can add a nonclustered index to a table, which increases speed. However, singleton lookups are typically less common in data warehouse scenarios than online transaction processing workloads. For more information, see Index tables in Azure Synapse Analytics.

Note

Clustered columnstore tables don't support varchar(max), nvarchar(max), or varbinary(max) data types. If you use those data types, consider a heap or clustered index. You might also consider putting these columns into a separate table.

Use Power BI Premium to access, model, and visualize data

Power BI Premium supports several options to connect to data sources on Azure. You can use Azure Synapse Analytics provisioned pools to do the following tasks:

  • Import: The data is imported into the Power BI model.
  • DirectQuery: Data is pulled directly from relational storage.
  • Composite model: Combine Import for some tables and DirectQuery for others.

This scenario uses the DirectQuery dashboard because it has a small amount of data and low model complexity. DirectQuery delegates the query to the underlying compute engine and uses security capabilities on the source. DirectQuery ensures that results are always consistent with the latest source data.

Import mode can provide the lowest query latency. Consider import mode if:

  • The model fits entirely within the memory of Power BI.
  • The data latency between refreshes is acceptable.
  • You require complex transformations between the source system and the final model.

In this case, the end users want full access to the most recent data with no delays in Power BI refreshing, and they want all historical data, which exceeds the Power BI dataset capacity. A Power BI dataset can handle 25-400 GB, depending on the capacity size. The data model in the dedicated SQL pool is already in a star schema and doesn't require transformation, so DirectQuery is an appropriate choice.

Screenshot that shows the dashboard in Power BI.

Use Power BI Premium to manage large models, paginated reports, and deployment pipelines. Take advantage of the built-in Azure Analysis Services endpoint. You can also have dedicated capacity with unique value proposition.

When the BI model grows or dashboard complexity increases, you can switch to composite models and import parts of lookup tables via hybrid tables, and import preaggregated data. You can enable query caching within Power BI for imported datasets and use dual tables for the storage mode property.

Within the composite model, datasets serve as a virtual pass-through layer. When users interact with visualizations, Power BI generates SQL queries to Azure Synapse Analytics SQL pools. Power BI determines whether to use in-memory or DirectQuery storage based on efficiency. The engine decides when to switch from in-memory to DirectQuery and pushes the logic to the Azure Synapse Analytics SQL pool. Depending on the context of the query tables, they can act as either cached (imported) or non-cached composite models. You can choose which table to cache into memory, combine data from one or more DirectQuery sources, or combine DirectQuery source data and imported data.

When you use DirectQuery with an Azure Synapse Analytics provisioned pool:

  • Use Azure Synapse Analytics result set caching to cache query results in the user database for repetitive use. This approach improves query performance to milliseconds and reduces compute resource usage. Queries that use cached results sets don't consume any concurrency slots in Azure Synapse Analytics, so they don't count against existing concurrency limits.

  • Use Azure Synapse Analytics materialized views to precompute, store, and maintain data like a table. Queries that use all data or a subset of the data in materialized views can achieve faster performance without needing to directly reference the defined materialized view to use it.

Considerations

These considerations implement the pillars of the Azure Well-Architected Framework, which is a set of guiding tenets that you can use to improve the quality of a workload. For more information, see Well-Architected Framework.

Reliability

Reliability helps ensure that your application can meet the commitments that you make to your customers. For more information, see Design review checklist for Reliability.

Azure Synapse Analytics reliability

Do not disable geo-backup. By default, Azure Synapse Analytics takes a full backup of your data in Dedicated SQL Pool every 24 hours for disaster recovery. This gives you a base RPO of 24 hours. It is not recommended to turn this feature off. For more information, see Geo-backups.

Security

Security provides assurances against deliberate attacks and the misuse of your valuable data and systems. For more information, see Design review checklist for Security.

Cloud modernization introduces security concerns, such as data breaches, malware infections, and malicious code injection. You need a cloud provider or service solution that can address your concerns because inadequate security measures can create major problems.

This scenario addresses the most demanding security concerns by using a combination of layered security controls: network, identity, privacy, and authorization controls. An Azure Synapse Analytics provisioned pool stores most of the data. Power BI accesses the data via DirectQuery through single sign-on. You can use Microsoft Entra ID for authentication. There are also extensive security controls for data authorization within the provisioned pools.

Some common security questions include:

  • Define who can see what data.

    • Ensure that your data complies with federal, local, and company guidelines to mitigate data breach risks. Azure Synapse Analytics provides multiple data protection capabilities to achieve compliance.
  • Determine how to verify a user's identity.

  • Choose a network security technology to protect the integrity, confidentiality, and access of your networks and data.

  • Choose tools to detect and notify you of threats.

    • Use Azure Synapse Analytics threat detection capabilities, such as SQL auditing, SQL threat detection, and vulnerability assessment to audit, protect, and monitor databases.
  • Determine how to protect data in your storage account.

    • Use Azure Storage accounts for workloads that require fast and consistent response times or that have a high number of input/output operations (IOPs) per second. Storage accounts can store all your data objects and have several storage account security options.

Cost Optimization

Cost Optimization focuses on ways to reduce unnecessary expenses and improve operational efficiencies. For more information, see Design review checklist for Cost Optimization.

This section provides information about pricing for different services involved in this solution, and mentions decisions made for this scenario with a sample dataset. Use this starting configuration in the Azure pricing calculator, and adjust it to fit your scenario.

Azure Synapse Analytics cost optimization

Azure Synapse Analytics is a serverless architecture that you can use to scale your compute and storage levels independently. Compute resources incur costs based on usage. You can scale or pause these resources on demand. Storage resources incur costs per terabyte, so your costs increase as you ingest data.

Azure Synapse Analytics pipelines

Three main components influence the price of a pipeline:

  • Data pipeline activities and integration runtime hours
  • Data flows cluster size and implementation
  • Operation charges

For pricing details, see the Data Integration tab on Azure Synapse Analytics pricing.

The price varies depending on components or activities, frequency, and the number of integration runtime units.

For the sample dataset, which uses the standard Azure-hosted integration runtime, copy data activity serves as the core of the pipeline. It runs on a daily schedule for all the entities (tables) in the source database. The scenario doesn't contain data flows. And it doesn't incur operational costs because the pipelines run less than one million operations per month.

Azure Synapse Analytics dedicated pool and storage

For the sample dataset, you can provision 500 data warehouse units (DWUs) to provide a smooth experience for analytical loads. You can maintain compute during business hours for reporting purposes. If the solution moves to production, use reserved data warehouse capacity as a cost-efficient strategy. Use various techniques to maximize cost and performance metrics.

For pricing details for an Azure Synapse Analytics dedicated pool, see the Data Warehousing tab on Azure Synapse Analytics pricing. Under the dedicated consumption model, customers incur costs for each provisioned DWU, per hour of uptime. Also consider data storage costs, including the size of your data at rest, snapshots, and geo-redundancy.

Blob storage

Consider using the Azure Storage reserved capacity to reduce storage costs. With this model, you get a discount if you reserve fixed storage capacity for one or three years. For more information, see Optimize costs for blob storage with reserved capacity. This scenario doesn't use persistent storage.

Power BI Premium

This scenario uses Power BI Premium workspaces with built-in performance enhancements to accommodate demanding analytical needs.

For more information, see Power BI pricing.

Operational Excellence

Operational Excellence covers the operations processes that deploy an application and keep it running in production. For more information, see Design review checklist for Operational Excellence.

  • Use an Azure DevOps release pipeline and GitHub Actions to automate the deployment of an Azure Synapse Analytics workspace across multiple environments. For more information, see Continuous integration and continuous delivery for an Azure Synapse Analytics workspace.

  • Put each workload in a separate deployment template, and store the resources in source control systems. You can deploy the templates together or individually as part of a continuous integration and continuous delivery (CI/CD) process. This approach simplifies the automation process. This architecture has four main workloads:

    • The data warehouse server and related resources
    • Azure Synapse Analytics pipelines
    • Power BI assets, including dashboards, apps, and datasets
    • An on-premises to cloud simulated scenario
  • Consider staging your workloads where practical. Deploy your workload to various stages. Run validation checks at each stage before you move to the next stage. This approach pushes updates to your production environments in a controlled way and minimizes unanticipated deployment problems. Use blue-green deployment and canary release strategies to update live production environments.

  • Use a rollback strategy to handle failed deployments. For example, you can automatically redeploy an earlier, successful deployment from your deployment history. Use the --rollback-on-error flag in the Azure CLI.

  • Use Azure Monitor to analyze the performance of your data warehouse and the entire Azure analytics platform for an integrated monitoring experience. Azure Synapse Analytics provides a monitoring experience within the Azure portal to show insights about your data warehouse workload. Use the Azure portal to monitor your data warehouse. It provides configurable retention periods, alerts, recommendations, and customizable charts and dashboards for metrics and logs.

For more information, see the following resources:

Performance Efficiency

Performance Efficiency refers to your workload's ability to scale to meet user demands efficiently. For more information, see Design review checklist for Performance Efficiency.

This section provides details about sizing decisions to accommodate this dataset.

Azure Synapse Analytics provisioned pool

You can use various data warehouse configurations.

DWUs Number of compute nodes Number of distributions per node
DW100c 1 60
-- TO --
DW30000c 60 1

To see the performance benefits of scaling out, especially for larger DWUs, use at least a 1-TB dataset. To find the best number of DWUs for your dedicated SQL pool, try scaling up and down. Run queries that have different numbers of DWUs after you load your data. Scaling is quick, so you can easily experiment with various performance levels.

Find the best number of DWUs

For a dedicated SQL pool in development, select a small number of DWUs as a starting point, such as DW400c or DW200c. Monitor your application performance for each number of DWUs. Assume a linear scale, and determine how much you need to increase or decrease the DWUs. Continue making adjustments until you reach an optimum performance level for your business requirements.

Scale an Azure Synapse Analytics SQL pool

For scalability and performance optimization features of pipelines in Azure Synapse Analytics and of the copy activity that you use, see Copy activity performance and scalability guide.

For more information, see the following resources:

Power BI Premium and Fabric

This article uses the Power BI Premium F64 capacity to demonstrate BI capabilities. Dedicated Power BI capacities in Fabric range from F64 (8 vCores) to F1024 (128 vCores).

To determine how much capacity you need:

Contributors

Microsoft maintains this article. The following contributors wrote this article.

Principal authors:

Other contributors:

To see nonpublic LinkedIn profiles, sign in to LinkedIn.

Next steps