Edit

Share via


Choose an analytical data store in Microsoft Fabric

Analytical data stores are essential for storing, processing, and serving data to support various analytical workloads. Microsoft Fabric is a unified data platform that provides several analytical data stores as software as a service (SaaS). Each data store provides distinct capabilities to address different analytical requirements. Selecting the right analytical data store depends on factors such as data volume, data type, compute engine, ingestion and transformation patterns, query needs, access controls, and integration with OneLake and other Microsoft Fabric components. This article compares the analytical data stores and provides a decision guide to help you choose the best option for common workloads.

Overview of primary analytical data stores in Microsoft Fabric

This article covers SQL databases, data warehouses, lakehouses, and eventhouses as the primary analytical data stores in Microsoft Fabric. Microsoft Fabric also has other items that can store data but aren't treated as primary analytical data stores. For example, Power BI semantic models can store data, but they're typically used as a semantic layer. Other Power BI items, like Power BI Dataflows Gen 1, store data for Power BI solutions only. Similarly, Fabric Cosmos DB stores data physically but is typically optimized for operational workloads rather than analytical workloads.

SQL databases

SQL databases in Microsoft Fabric support structured data and accommodate both transactional and analytical workloads. They're ideal for moderate data volumes, typically between several gigabytes (GB) and a few terabytes (TB). SQL databases handle a broad range of data types, from integers, strings, and dates to geometry, geography, JSON, and XML.

A T-SQL-based relational engine underpins SQL databases. It handles high-frequency updates and operations that require transactional consistency and referential integrity. SQL databases support batch and transactional data ingestion. They also perform data transformation through stored procedures, views, user-defined functions, and SQL queries.

SQL databases provide low-latency queries, highly selective lookups, and concurrent data retrieval. They enforce granular access controls at the object, column, and row levels. Automatic OneLake mirroring ensures that SQL databases integrate with the broader Fabric ecosystem. You can process data by using any Fabric compute engine, run cross-warehouse queries, and connect to Direct Lake mode semantic models in Power BI.

Data warehouses

Data warehouses in Microsoft Fabric support large-scale analytical workloads and handle data volumes that range from GB to petabytes (PB). They use a relational engine that delivers high-throughput batch data ingestion and flexible transformations through stored procedures, views, and other T-SQL queries. Data warehouses excel at diverse query patterns and complex analytics over vast datasets. Efficient workload management and burstable capacity ensure high concurrency and extensive access controls at the object, column, and row levels. Fabric data warehouses store data in OneLake and expose it in Delta format to any Fabric compute engine.

Lakehouses

Lakehouses combine features of data lakes and data warehouses and provide a unified platform for structured and unstructured data. They can manage PBs of data and support structured, semi-structured, and unstructured types.

Lakehouses use a flexible, scalable Spark compute engine that supports PySpark, Spark SQL, Scala, and R for complex data engineering and data science scenarios. They support both batch and real-time ingestion to meet diverse analytical workloads.

Lakehouses sit on OneLake and store data in Delta format to promote sharing and interoperability across the enterprise. Lakehouses excel at analytical data retrieval and can query massive data volumes. An integrated SQL analytics endpoint lets you query OneLake data by using T-SQL as if it's a relational data warehouse while enforcing granular access controls at the object, column, and row levels. Similarly, an integrated eventhouse endpoint unlocks the performance and rich capabilities of the KQL language.

Eventhouses

Eventhouses in Microsoft Fabric handle real-time event processing and analytics at millions of events per second. They ingest structured, semi-structured, and unstructured data for streaming events and time-series analysis. Kusto Query Language (KQL) and a subset of T-SQL power real-time analytics and event processing in eventhouses. Real-time ingestion patterns are optimized for high-velocity streams, and batch ingestion is also supported. KQL update policies transform data and enable real-time analytics.

Eventhouses scale efficiently to support concurrent query patterns and enforce granular access controls at the object, column, and row levels. KQL databases in eventhouses support granular access controls to ensure that only authorized users can access data. You can configure eventhouses to publish data automatically to OneLake for consumption by other Fabric experiences. This configuration adds latency but enables broader integration across the Fabric ecosystem.

Comparison of analytical data stores

The following table provides a comparison of key features of analytical data stores in Microsoft Fabric. Throughout this table, the ✅ symbol indicates that the capability is readily supported and recommended. The ⚠️ symbol indicates that the capability is supported with some considerations or limitations. The ❌ symbol indicates that the capability is typically not supported or recommended.

There's no commonly accepted definition for the terms small data and big data, and definitions of these terms continue to change over time along with evolving capabilities of data platforms. In this decision guide, small data refers to the total data volumes that range from megabytes (MB) to hundreds of GB, with individual tables up to a hundred GB in size and up to tens of GB of data ingested per day. The term big data refers to total data volumes measured in tens of TB to PB, individual tables that are multiple TB in size, and data ingestion rates that exceed hundreds of GB per day. Data volumes that fall between the small data and big data thresholds can be described as moderate or medium data.

Capability Capability details SQL database Data warehouse Lakehouse Eventhouse
Data volumes
Small 1 1 1
Moderate
Big
Supported types of data
Structured
Semi-structured ⚠️ ⚠️ 
Unstructured ❌ 
Primary compute engine
Write operations T-SQL T-SQL Spark (PySpark, Spark SQL, Scala, R), Python KQL
Read operations T-SQL T-SQL T-SQL2, Spark (PySpark, Spark SQL, Scala, R), Python, KQL3 KQL, T-SQL2
Data ingestion patterns 
Typical ingestion frequency Moderate-high Moderate Moderate-high High
Recommended batch size Small-medium Medium-large Small-large Small-large
Efficiency of appends High High High High
Efficiency of updates and deletes High Moderate Moderate Low
Data ingestion tools in Microsoft Fabric 
Pipelines
Dataflows Gen 2
Shortcuts ⚠️
Eventstreams ❌  ❌ 
Spark connectors ⚠️ ⚠️ ⚠️
T-SQL commands
KQL commands
Data transformation capabilities 
Various types of supported structured data High Moderate Moderate Moderate
Parsing of semi-structured data ⚠️ ⚠️ 
Parsing of unstructured data ❌  ⚠️
SQL support (any dialect) ⚠️
SQL surface area (any dialect) Broad Moderate Broad Limited2
T-SQL surface area Broad Moderate Limited2 Limited2
Python support ⚠️
Spark support (PySpark, Spark SQL, Scala, R)
KQL support ⚠️3 
Transformation extensibility4 Moderate Moderate Very high High
Single-table transaction support
Multi-table transaction support ⚠️
Data retrieval patterns
Optimized for selective lookups ❌ 
Optimized for large scans and aggregations ⚠️
Ideal query runtime5 Milliseconds+ Tens of milliseconds+ Tens of milliseconds+ Milliseconds+
Realistic query runtime6 Subsecond+ Seconds+ Seconds+ Subsecond+
Peak query concurrency7 High High High High
Peak query throughput8 Very high High High Very high
Granular access controls 
Object-level security Yes Yes Yes Yes9
Column-level security Yes Yes Yes10 No
Row-level security Yes Yes Yes10 Yes
OneLake integration
Data available in OneLake Yes11 Yes Yes Yes12
Data stored in open format (Delta) Yes11 Yes Yes Yes12
Can be a source of shortcuts Yes11 Yes Yes Yes12
Access data via shortcuts No Yes13 Yes Yes
Cross-warehouse and lakehouse queries Yes14 Yes Yes Yes12
Compute management
Ability to customize size and configuration of compute resources Low Low High Low
Administrative skillset needed to manage or tune compute resources Low Low Moderate-high Low

Notes:

1 Data warehouses, lakehouses, and eventhouses don't have minimum data volume requirements and provide equivalent functionality across all data volumes. However, some benefits provided by these highly scalable systems might not be fully realized with small data volumes.

2 Lakehouses and eventhouses support a subset of T-SQL surface area and are limited to read-only operations.

3 Lakehouses expose an eventhouse endpoint, which supports read-only KQL operations.

4 Refers to the ability to extend data transformations by using user-defined functions, methods, referencing external modules or libraries, and other approaches.

5 Represents lower bounds of runtimes for light queries that use small volumes of data from warm cache, excluding network latency or the time needed to render results in a client application. Numerous factors influence query runtimes. Results might vary based on your specific workload.

6 Represents lower bounds of response times to mixed queries that use moderate volumes of data, excluding network latency or the time needed to render results in a client application. Numerous factors influence query runtimes. Results might vary based on your specific workload.

7 Peak number of queries that can run simultaneously, relative to other analytical data stores.

8 Peak number of queries that can be completed over a given period of time, relative to other analytical data stores. Concurrency, query duration, and other factors affect the number of queries.

9 Partial object-level security is implemented by using restricted view access policies.

10 Granular access controls are available for the SQL analytics endpoint.

11 OneLake integration is implemented via automatic database mirroring.

12 Via automatic sync from KQL Database to OneLake.

13 Indirectly, via cross-database queries to lakehouses.

14 Available for mirrored data accessed via the SQL analytics endpoint.

Decision tree for analytical store selection in Microsoft Fabric

The following decision guide helps you select a suitable data store for each use case or data product. You might need more than one analytical data store to support different workloads in your data estate.

Diagram that shows a decision tree that describes how to select an appropriate data store in Microsoft Fabric for different scenarios.

Conclusion

SQL databases, data warehouses, lakehouses, and eventhouses enable Microsoft Fabric to handle diverse analytical workloads. Each of these analytical data stores provides a unique blend of capabilities and limitations that must be matched to the workload to achieve optimal results. Some use cases can be addressed by using a single analytical data store. However, specific complex use cases that involve mixed workloads are best served by using multiple complementary analytical data stores, which are readily available in Microsoft Fabric as the unified data platform.

Contributors

Microsoft maintains this article. The following contributors wrote this article.

Principal author:

Other contributors:

To see nonpublic LinkedIn profiles, sign in to LinkedIn.

Next steps