I think you're correct to keep ADLS Gen2 as your single source of truth and implement the Medallion (Bronze→Silver→Gold) lakehouse pattern using Delta (or Parquet with strong conventions). Don't dump every view as a table in a dedicated Synapse SQL pool. Instead, materialize only what needs ultra-fast, concurrency-heavy BI access into a compute/SQL layer (Synapse Dedicated SQL or Databricks SQL) and expose most Gold datasets via serverless/external tables or Databricks SQL endpoints. Use metadata-driven, parameterized pipelines to avoid hundreds of bespoke pipelines.
ADLS + Medallion is the better default for a few reasons:
- Single source of truth: ADLS + Delta provides ACID transactions, time travel, and easy schema evolution for many consumers and teams. This reduces scattered silos and duplicated governance work (a time-travel read sketch follows this list).
- Cost elasticity: storing files in ADLS is cheap; compute (Databricks / Synapse) can be provisioned on demand. You avoid paying always-on DWUs for tables that are rarely queried.
- Operational flexibility: Delta/Parquet files can be consumed by Spark, Synapse serverless, Databricks SQL, Power BI (via semantic layer), and other tools — so you don't tie yourself to one engine.
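To make the time-travel point concrete, here's a minimal PySpark sketch of reading an older version of a Delta table; the storage path and version number are placeholders, not your actual layout:

```python
# Minimal sketch: read an earlier version of a Delta table (time travel).
# The storage path and version number are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

orders_v12 = (spark.read.format("delta")
              .option("versionAsOf", 12)   # or .option("timestampAsOf", "2024-01-01")
              .load("abfss://lake@<account>.dfs.core.windows.net/silver/orders"))
```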
Use Dedicated SQL pool for a small set of hot Gold semantic tables when you need:
- extremely low query latency for many concurrent BI users (Power BI import/DirectQuery patterns)
- workload predictability that justifies pre-allocated compute
- SQL features and tooling that consumers require and that aren't covered by serverless/Databricks SQL.
Don't move hundreds of views into dedicated tables by default — that's expensive and brittle.
Regarding pipelines:
- Metadata-driven pipelines: Build pipeline templates that take dataset metadata (source path/table, schema, partitions, cadence, transform SQL/notebook) and drive everything from a control table. ADF / Synapse Pipelines + parameterization works well. Microsoft even provides tools/wizards to bootstrap metadata-driven copy workflows.
- Parameterizable dataflows / notebooks: Use generic Spark notebooks or Synapse pipelines that accept parameters (source, target layer, incremental watermark, transformations) and execute the same logic for many views (see the sketch after this list).
- Copy Data Tool / Control Table generation: Use the Copy Data tool or automation scripts to generate control tables and sample pipelines, then extend. This reduces manual pipeline count.
- Orchestration: One orchestration pipeline (or scheduled job) that reads the metadata/control table and launches per-dataset jobs — rather than N pipelines.
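A minimal sketch of what one such generic, metadata-driven job could look like in PySpark. The control table name (ctl.pipeline_control) and its columns are hypothetical placeholders; in ADF/Synapse Pipelines the same pattern is usually a Lookup activity feeding a ForEach loop rather than a driver-side loop:

```python
# Sketch of a metadata-driven ingest job: one parameterized routine, many datasets.
# Control table name and columns (source_path, source_format, watermark_column,
# last_watermark, target_path, enabled) are illustrative placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# The orchestrator reads the control table and runs the same logic per dataset.
datasets = spark.table("ctl.pipeline_control").where("enabled = true").collect()

for ds in datasets:
    df = spark.read.format(ds["source_format"]).load(ds["source_path"])

    # Incremental load: filter on the watermark recorded for this dataset.
    if ds["watermark_column"]:
        df = df.where(F.col(ds["watermark_column"]) > F.lit(ds["last_watermark"]))

    (df.write
       .format("delta")
       .mode("append")
       .option("mergeSchema", "true")   # tolerate additive schema drift in Bronze
       .save(ds["target_path"]))
```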
Effectively, your hybrid architecture would use the following pattern:
- Bronze (raw): ingest files/CDC into ADLS Gen2 (Delta ideally). Small files are acceptable here; capture the source schema.
- Silver (cleaned/joined): dedupe, enforce schema, standardize datetimes and IDs; store as Delta, partitioned appropriately (a notebook sketch follows this list).
- Gold (aggregates, business models): keep canonical Gold datasets in ADLS Delta. For BI:
  - Expose via serverless SQL (Synapse) or Databricks SQL endpoints for ad-hoc queries (pay per query).
  - Materialize only the hottest Gold tables into the Dedicated SQL pool (or move small curated extracts) when concurrency or latency requires it; refresh via incremental loads or delta copy. Use materialized views in dedicated pools where possible to accelerate complex aggregates.
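A minimal Bronze-to-Silver sketch under those conventions; paths, column names, and the partition key are examples only, not a required schema:

```python
# Sketch of a Bronze -> Silver step: dedupe, enforce types, standardize timestamps,
# and write partitioned Delta. All names and paths are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
lake = "abfss://lake@<account>.dfs.core.windows.net"   # placeholder storage account

bronze = spark.read.format("delta").load(f"{lake}/bronze/orders")

silver = (bronze
    .dropDuplicates(["order_id"])                                  # dedupe on the business key
    .withColumn("order_ts", F.to_timestamp("order_ts"))            # standardize datetimes
    .withColumn("amount", F.col("amount").cast("decimal(18,2)"))   # enforce types
    .withColumn("order_date", F.to_date("order_ts")))              # derive the partition column

(silver.write
    .format("delta")
    .mode("overwrite")
    .partitionBy("order_date")
    .save(f"{lake}/silver/orders"))
```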
From a performance and cost perspective:
- Partitioning + file sizing: keep file sizes optimal (~100MB–1GB depending on workload), avoid many tiny files, and partition on selective columns for big scans. Use compaction/OPTIMIZE for Delta (see the maintenance sketch after this list).
- Use serverless for low/variable query loads; use dedicated only for consistent heavy usage. Serverless charges by data scanned — often cheaper for exploratory workloads.
- Materialized views / pre-aggregations: use them in Dedicated SQL pool for repeated complex queries (they're specifically recommended for performance tuning).
- Autoscaling / pause-resume: for dedicated compute, use scheduled pause/resume and reserved capacity when usage is predictable to control cost. For Databricks, use autoscaling and serverless SQL warehouses where available.
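For the compaction point, a small maintenance sketch using the Delta Lake Python API (assumes delta-spark with OPTIMIZE support, i.e. Delta Lake 2.0+ or Databricks); the path and retention window are placeholders:

```python
# Sketch of routine Delta maintenance: compact small files and clean up old versions.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
gold_path = "abfss://lake@<account>.dfs.core.windows.net/gold/sales_daily"  # placeholder

tbl = DeltaTable.forPath(spark, gold_path)
tbl.optimize().executeCompaction()   # bin-pack small files toward healthier file sizes
tbl.vacuum(168)                      # drop unreferenced files older than 7 days (hours)
```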
In regard to governance and compliance, use Microsoft Purview or a metadata/catalog solution (and/or Unity Catalog for Databricks) to track datasets, classifications, and lineage. Enforce RBAC and ACLs on ADLS containers and table access. Centralized SSoT (ADLS) simplifies policy application.
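If you're on Databricks with Unity Catalog, table-level access can then be granted centrally on top of that single source of truth; a minimal sketch, with catalog/schema/table and group names as placeholders:

```python
# Sketch of Unity Catalog grants for a BI group; object and group names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("GRANT USE CATALOG ON CATALOG gold TO `bi-analysts`")
spark.sql("GRANT USE SCHEMA ON SCHEMA gold.sales TO `bi-analysts`")
spark.sql("GRANT SELECT ON TABLE gold.sales.sales_daily TO `bi-analysts`")
```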
So your migration plan could be divided into the following stages:
- Inventory: export list of current views (schema, source systems, frequency, consumers, row counts, criticality).
- Classify: label each view as cold/occasional, warm/analyst, or hot/BI. Only hot ones are candidates for Dedicated SQL materialization.
- Create metadata control table: for each dataset include source, bronze/silver/gold target path, incremental key, schedule, notebook/template id.
- Build template pipelines: create a few robust parameterized pipelines / Spark notebooks that implement ingest, dedupe, and transform using control table values. Test them with 5–10 varied datasets.
- Automate bootstrap: use the Copy Data tool or a script to generate metadata rows and create dataset entries in bulk for many views (see the bootstrap sketch after this list).
- Expose data: point BI teams at serverless SQL/Databricks SQL over the Gold layer; measure query latency and concurrency. If SLA cannot be met, select minimal set to push into Dedicated SQL.
- Operate & optimize: run regular OPTIMIZE/compaction, enforce retention, monitor costs and slow queries, and evolve materialized sets based on real usage patterns.
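A minimal sketch of the control-table bootstrap (the "create metadata control table" and "automate bootstrap" steps); the inventory file, its columns, and the ctl.pipeline_control table are all illustrative and should be adapted to whatever your inventory export actually contains:

```python
# Sketch: bulk-generate control-table rows from an exported view inventory.
# File location, columns, and the target table name are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Inventory exported during the inventory/classification steps (one row per view).
inventory = spark.read.option("header", "true").csv("/meta/view_inventory.csv")

control = inventory.selectExpr(
    "view_name AS dataset_name",
    "source_path",
    "'parquet' AS source_format",
    "concat('/bronze/', view_name) AS bronze_path",
    "concat('/silver/', view_name) AS silver_path",
    "concat('/gold/', view_name) AS gold_path",
    "incremental_key AS watermark_column",
    "schedule",
    "tier",            # cold / warm / hot from the classification step
    "true AS enabled")

control.write.format("delta").mode("overwrite").saveAsTable("ctl.pipeline_control")
```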
If the above response helps answer your question, remember to "Accept Answer" so that others in the community facing similar issues can easily find the solution. Your contribution is highly appreciated.
hth
Marcin