Storage options and solution

azure_learner 900 Reputation points
2025-09-06T15:08:16.5033333+00:00

Hi kind friends, as we move data views from a virtualization tool to Azure, we are left with two options for landing the data: create these views in a Synapse dedicated SQL pool, so the data resides in tables within Synapse, or follow our current medallion architecture and keep the data in ADLS. The challenge is that we have hundreds upon hundreds of views, and pushing that data into ADLS would require too many pipelines.

I tend to prefer ADLS (medallion), as the data would reside in a single source of truth and would be part of a lakehouse scheme going forward. Spreading data across different silos would not be an efficient or scalable solution anyway, as it would add complexity around governance and compliance, and future enhancements would be costly.

As for the many pipelines that might need to be created, that could be addressed with parameterized and metadata-driven pipelines, which would significantly reduce the number of pipelines.

https://free.blessedness.top/en-us/answers/questions/4378500/synapse-and-use-of-databricks

https://free.blessedness.top/en-us/answers/questions/5494016/complex-views-and-its-creation

I would greatly appreciate your guidance and input on this. How should this solution be approached so that it is scalable and cost-effective as well as performant? Thanks very much in advance.

Azure Data Lake Storage
An Azure service that provides an enterprise-wide hyper-scale repository for big data analytic workloads and is integrated with Azure Blob Storage.

Answer accepted by question author
  1. Marcin Policht 64,515 Reputation points MVP Volunteer Moderator
    2025-09-06T15:45:44.87+00:00

    I think you're correct regarding keeping ADLS Gen2 as your single source of truth and implementing the Medallion (Bronze→Silver→Gold) lakehouse pattern using Delta (or Parquet with strong conventions). Don't dump every view as a table in a dedicated Synapse SQL pool. Instead, materialize only what needs ultra-fast, concurrency-heavy BI access into a compute/SQL layer (Synapse Dedicated SQL or Databricks SQL) and expose most Gold datasets via serverless/external tables or Databricks SQL endpoints. Use metadata-driven, parameterized pipelines to avoid hundreds of bespoke pipelines.

    ADLS + Medallion is the better default for a few reasons:

    • Single source of truth: ADLS + Delta provides ACID, time-travel, and easy schema evolution for many consumers and teams. This reduces scattered silos and duplicated governance work.
    • Cost elasticity: storing files in ADLS is cheap; compute (Databricks / Synapse) can be provisioned on demand. You avoid paying always-on DWUs for tables that are rarely queried.
    • Operational flexibility: Delta/Parquet files can be consumed by Spark, Synapse serverless, Databricks SQL, Power BI (via semantic layer), and other tools — so you don't tie yourself to one engine.

    Use Dedicated SQL pool for a small set of hot Gold semantic tables when you need:

    • extremely low query latency for many concurrent BI users (Power BI import/direct query patterns)
    • workload predictability that justifies pre-allocated compute
    • SQL features and tooling that consumers require and aren't satisfied by serverless/Databricks SQL.

    Don't move hundreds of views into dedicated tables by default — that's expensive and brittle.

    Regarding pipelines:

    1. Metadata-driven pipelines: Build pipeline templates that take dataset metadata (source path/table, schema, partitions, cadence, transform SQL/notebook) and drive everything from a control table. ADF / Synapse Pipelines + parameterization works well. Microsoft even provides tools/wizards to bootstrap metadata-driven copy workflows.
    2. Parameterizable dataflows / notebooks: Use generic Spark notebooks or Synapse pipelines that accept parameters (source, target layer, incremental watermark, transformations) and execute the same logic for many views.
    3. Copy Data Tool / Control Table generation: Use the Copy Data tool or automation scripts to generate control tables and sample pipelines, then extend. This reduces manual pipeline count.
    4. Orchestration: One orchestration pipeline (or scheduled job) that reads the metadata/control table and launches per-dataset jobs, rather than N pipelines (a minimal Spark-based sketch of this loop follows below).
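
    To make the control-table idea concrete, here is a minimal orchestrator sketch in PySpark. The control table name (pipeline_control), its columns (source_path, source_format, write_mode, partition_column, target_path, enabled) and the ADLS paths are assumptions for illustration, not a prescribed schema; the same loop can equally be driven from an ADF/Synapse Lookup activity feeding a ForEach over one parameterized copy or notebook activity.

```python
# Minimal metadata-driven orchestrator sketch (Synapse Spark / Databricks notebook).
# The control table name, its columns, and the storage paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One row per dataset: where to read it, how to write it, where to land it.
control = spark.read.format("delta").load(
    "abfss://metadata@yourlake.dfs.core.windows.net/control/pipeline_control"
)

for row in control.filter("enabled = true").collect():
    # Generic read driven entirely by metadata (e.g. parquet, csv, json).
    df = spark.read.format(row["source_format"]).load(row["source_path"])

    # Generic Delta write into the Bronze layer, parameterized per dataset.
    writer = df.write.format("delta").mode(row["write_mode"])  # append / overwrite
    if row["partition_column"]:
        writer = writer.partitionBy(row["partition_column"])
    writer.save(row["target_path"])
```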

    Effectively, your hybrid architecture would use the following pattern (a minimal PySpark sketch follows the list):

    1. Bronze (raw): ingest files/CDC into ADLS Gen2 (ideally as Delta); capture the source schema as-is and keep an eye on small-file buildup.
    2. Silver (cleaned/joined): dedupe, enforce schema, standardize datetimes, IDs; store as Delta, partitioned appropriately.
    3. Gold (aggregates, business models): keep canonical gold datasets in ADLS Delta. For BI:
      • Expose via serverless SQL (Synapse) or Databricks SQL endpoints for ad-hoc queries (pay per query).
      • Materialize only the hottest Gold tables into Dedicated SQL pool (or move small curated extracts) when concurrency or latency requires it; refresh via incremental loads or delta copy. Use materialized views in dedicated pools where possible to accelerate complex aggregates.
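
    As a concrete illustration of the pattern, here is a minimal PySpark sketch of one dataset flowing Bronze → Silver → Gold as Delta in ADLS. The paths and column names (order_id, order_ts, region, amount) are made up for the example; the real transforms would come from your control table and templates.

```python
# Minimal Bronze -> Silver -> Gold sketch for a single dataset (PySpark + Delta).
# Paths and column names (order_id, order_ts, region, amount) are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
lake = "abfss://data@yourlake.dfs.core.windows.net"

# Bronze: raw landing zone, schema captured as-is.
bronze = spark.read.format("delta").load(f"{lake}/bronze/sales/orders")

# Silver: dedupe, enforce types, standardize timestamps; partition for big scans.
silver = (
    bronze.dropDuplicates(["order_id"])
    .withColumn("order_ts", F.to_timestamp("order_ts"))
    .withColumn("order_date", F.to_date("order_ts"))
)
(silver.write.format("delta").mode("overwrite")
    .partitionBy("order_date")
    .save(f"{lake}/silver/sales/orders"))

# Gold: business-level aggregate that BI consumes via serverless SQL or Databricks SQL.
gold = (silver.groupBy("order_date", "region")
    .agg(F.sum("amount").alias("daily_revenue")))
gold.write.format("delta").mode("overwrite").save(f"{lake}/gold/sales/daily_revenue")
```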

    From the performance and cost perspective:

    • Partitioning + file sizing: keep file sizes optimal (~100MB–1GB depending on workload), avoid many tiny files, and partition on selective columns for big scans. Use compaction/OPTIMIZE for Delta (a compaction sketch follows this list).
    • Use serverless for low/variable query loads; use dedicated only for consistent heavy usage. Serverless charges by data scanned — often cheaper for exploratory workloads.
    • Materialized views / pre-aggregations: use them in Dedicated SQL pool for repeated complex queries (they're specifically recommended for performance tuning).
    • Autoscaling / pause-resume: for dedicated compute, use schedule/auto-pause and reserved capacity if predictable to control cost. For Databricks, use autoscaling and serverless SQL warehouses where available.
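
    For the compaction point above, a small sketch using the Delta Lake Python API (delta-spark 2.0+ or Databricks); the table path and retention window are placeholders to tune for your workload.

```python
# Compaction sketch for a Gold Delta table using the Delta Lake Python API
# (delta-spark 2.0+; also available on Databricks). Path and retention are placeholders.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

table = DeltaTable.forPath(
    spark, "abfss://data@yourlake.dfs.core.windows.net/gold/sales/daily_revenue"
)

# Compact many small files into fewer large ones (target roughly 100MB-1GB each).
table.optimize().executeCompaction()

# Alternatively, Z-order by a selective filter column to speed up big scans:
# table.optimize().executeZOrderBy("region")

# Clean up files no longer referenced, keeping 7 days (168 hours) for time travel.
table.vacuum(168)
```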

    In regard to governance and compliance, use Microsoft Purview or a metadata/catalog solution (and/or Unity Catalog for Databricks) to track datasets, classifications, and lineage. Enforce RBAC and ACLs on ADLS containers and table access. Centralized SSoT (ADLS) simplifies policy application.
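
    If you manage ADLS ACLs through code, the azure-storage-file-datalake SDK can apply them per directory. A minimal sketch, assuming a hypothetical storage account, container, Gold directory, and AAD group object ID (the zeros below are a placeholder):

```python
# Sketch: grant read access on a Gold directory to one AAD security group via
# POSIX ACLs. Account URL, container, directory, and the group object ID are
# placeholders; swap in your own values.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://yourlake.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)

gold_dir = service.get_file_system_client("data").get_directory_client("gold/sales")

# Owner keeps full access, the named group gets read + execute, everyone else nothing.
gold_dir.set_access_control(
    acl="user::rwx,group::r-x,other::---,"
        "group:00000000-0000-0000-0000-000000000000:r-x"
)
```

    Container- and account-scope RBAC role assignments are typically handled separately (ARM/Bicep, Terraform, or the portal), with ACLs layered on top for fine-grained folder access.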

    So your migration plan could be divided into the following stages:

    1. Inventory: export list of current views (schema, source systems, frequency, consumers, row counts, criticality).
    2. Classify: label each view as cold/occasional, warm/analyst, or hot/BI. Only hot ones are candidates for Dedicated SQL materialization.
    3. Create metadata control table: for each dataset include source, bronze/silver/gold target path, incremental key, schedule, notebook/template id (one possible table shape is sketched after this list).
    4. Build template pipelines: create a few robust parameterized pipelines / Spark notebooks that implement ingest, dedupe, and transform using control table values. Test them with 5–10 varied datasets.
    5. Automate bootstrap: use Copy Data tool or script to generate metadata rows and create dataset entries in bulk for many views.
    6. Expose data: point BI teams at serverless SQL/Databricks SQL over the Gold layer; measure query latency and concurrency. If SLA cannot be met, select minimal set to push into Dedicated SQL.
    7. Operate & optimize: run regular OPTIMIZE/compaction, enforce retention, monitor costs and slow queries, and evolve materialized sets based on real usage patterns.
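
    For step 3, one possible shape of the control table, created as an initially empty Delta table and matching the orchestrator sketch earlier in this answer; every column name here is an assumption to adapt rather than a required schema.

```python
# One possible shape for the step-3 control table, created as an empty Delta table.
# Column names match the orchestrator sketch earlier in this answer and are
# assumptions to adapt, not a prescribed schema.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, BooleanType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("dataset_name",     StringType(),  False),
    StructField("source_path",      StringType(),  False),
    StructField("source_format",    StringType(),  False),  # parquet, csv, json, ...
    StructField("target_path",      StringType(),  False),  # bronze landing path
    StructField("silver_path",      StringType(),  True),
    StructField("gold_path",        StringType(),  True),
    StructField("write_mode",       StringType(),  False),  # append / overwrite
    StructField("partition_column", StringType(),  True),
    StructField("incremental_key",  StringType(),  True),   # watermark column
    StructField("schedule",         StringType(),  True),   # cadence, e.g. cron text
    StructField("notebook_id",      StringType(),  True),   # transform template id
    StructField("enabled",          BooleanType(), False),
])

(spark.createDataFrame([], schema)
    .write.format("delta").mode("ignore")
    .save("abfss://metadata@yourlake.dfs.core.windows.net/control/pipeline_control"))
```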

    If the above response helps answer your question, remember to "Accept Answer" so that others in the community facing similar issues can easily find the solution. Your contribution is highly appreciated.

    hth

    Marcin

    1 person found this answer helpful.

0 additional answers
