If your KPIs are analytic/aggregative (reporting, dashboards, cross-system joins), then I'd suggest keeping the MI as the OLTP system of record and building a governed lakehouse on ADLS as the analytics Single Source of Truth (SOT). Use log-based CDC (MI → ADLS) into a Bronze → Silver → Gold (medallion) pipeline (Delta/Parquet), then compute KPIs from the Gold layer with Synapse or Databricks. This gives the best balance of governance, cost, performance and scalability without breaking transactional behavior on MI. The CDC + lakehouse approach is well supported by Azure Data Factory / Synapse and the Delta Lake patterns.
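To make the medallion flow concrete, here is a minimal PySpark sketch of the bronze landing step, assuming ADF (or whatever CDC tool you use) drops change batches as Parquet files into a raw/ folder on ADLS; the storage account, paths and batch id are illustrative placeholders, not from your environment:

```python
# Minimal bronze-landing sketch (PySpark + Delta). Storage account, container,
# paths and batch id are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("bronze-landing").getOrCreate()

raw_path = "abfss://lake@<storageaccount>.dfs.core.windows.net/raw/orders/2024-06-01/"
bronze_path = "abfss://lake@<storageaccount>.dfs.core.windows.net/bronze/orders"

# Read the CDC batch exactly as it arrived (no cleansing at this stage).
batch = spark.read.parquet(raw_path)

# Append immutably to bronze, stamping lineage columns so any batch can be replayed later.
(batch
 .withColumn("_ingest_ts", F.current_timestamp())
 .withColumn("_source_batch", F.lit("2024-06-01"))
 .write.format("delta").mode("append").save(bronze_path))
```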
Regarding architecture options, consider the following:
Option A — Full lakehouse SOT (move MI data → ADLS and treat ADLS as THE canonical SOT for BI)
- Feasible? Yes for analytics/KPIs. ADLS (Gen2) + Delta / medallion pattern is a common approach.
- When to pick: If transactional integrity for operational apps is no longer required from MI (i.e., you can retire MI as the write system of record), or the business accepts eventual consistency and an analytics-only SOT.
- Pros: cheaper storage footprint (ADLS storage is low cost), unified data model combining ERP + MI data in same repo, easier lineage/governance with Purview.
- Cons / Red flags: hard to preserve OLTP transactional semantics outside the database (Delta gives ACID on the lake, but not operational constraints or referential integrity); CDC/replication edge cases (truncates, table-level operations) and schema drift; analytics queries still require compute (Databricks/Synapse), so there are compute costs.
Option B — Hybrid (recommended) — MI remains OLTP SOR; stream/replicate to ADLS as the analytics SOT
- Feasible? Highly — this is the recommended enterprise pattern: MI for operational workloads, ADLS lakehouse (Delta) for analytics; CDC or incremental copy to land raw changes (bronze), then transform to silver/gold. Azure Data Factory / Synapse / Databricks fully support CDC from MI.
- Why recommended: keeps transactional guarantees where they're needed, gives analysts a scalable, low-cost analytics SOT that joins cleanly with ERP in ADLS, and supports governance (catalog/lineage).
Option C — Federated / virtual queries (query MI + ADLS on demand)
- Feasible? Technically yes, using Synapse serverless external tables or managed private endpoints with federated queries, but heavy joins across systems will hurt MI performance. Use it only for low-volume, near-real-time queries or one-off investigations (a small sketch follows below).
- When to pick: short-term, proof-of-concept, or if you cannot tolerate any replication latency and query volumes are small.
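For completeness, a sketch of what an on-demand federated read could look like from Python, assuming a Synapse serverless SQL endpoint and the Microsoft ODBC driver; the server, database, auth mode and ADLS path are placeholders:

```python
# Ad-hoc federated read through Synapse serverless SQL (billed per TB scanned).
# Server, database, auth mode and the ADLS path are placeholders.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<workspace>-ondemand.sql.azuresynapse.net;"
    "Database=master;"
    "Authentication=ActiveDirectoryInteractive;"
)

query = """
SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://<storageaccount>.dfs.core.windows.net/lake/silver/orders/**',
    FORMAT = 'PARQUET'
) AS orders;
"""

for row in conn.cursor().execute(query):
    print(row)
```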
Regarding anticipated pitfalls mapped to your factors, and how to mitigate them:
- Data consistency (the big one)
- Pitfall: CDC is log-based and provides row-level inserts/updates/deletes, but table-level operations behave differently: TRUNCATE TABLE is blocked on CDC-enabled tables, and schema changes (DDL) are not replayed as regular DML, so either can break incremental pipelines. If you enable CDC on MI you must plan for those cases.
- Mitigation:
- Use LSN/commit timestamps in the landing data so you can reconstruct ordering and reapply changes deterministically (see the sketch after this list).
- For tables that are truncated as part of business process: either (a) change to a DELETE pattern that CDC captures, (b) perform controlled snapshot loads and mark affected partitions as replaced, or (c) record a control event in a control table that the downstream pipeline consumes.
- Keep a raw “bronze” copy of every incoming CDC batch (immutable), so you can replay/repair. (Medallion pattern.)
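Here is a sketch of such a deterministic, idempotent apply using Delta MERGE, assuming the landed batch carries the __$start_lsn / __$seqval / __$operation columns the way SQL Server CDC emits them and that the silver table keeps the same columns; table, key and path names are illustrative:

```python
# Sketch: apply one bronze CDC batch to a silver Delta table idempotently.
# Assumes SQL Server CDC metadata columns (__$start_lsn, __$seqval, __$operation)
# landed with the data and that silver carries the same columns; names illustrative.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("apply-cdc").getOrCreate()

bronze_path = "abfss://lake@<storageaccount>.dfs.core.windows.net/bronze/orders"
silver_path = "abfss://lake@<storageaccount>.dfs.core.windows.net/silver/orders"

changes = (spark.read.format("delta").load(bronze_path)
           .where(F.col("_source_batch") == "2024-06-01")
           .withColumnRenamed("__$start_lsn", "_lsn")
           .withColumnRenamed("__$seqval", "_seq")
           .withColumnRenamed("__$operation", "_op")
           .where(F.col("_op") != 3))  # drop before-images of updates

# Keep only the latest change per business key, ordered by LSN then sequence value,
# so re-running the same batch yields the same silver state (idempotent apply).
latest = F.row_number().over(
    Window.partitionBy("order_id").orderBy(F.col("_lsn").desc(), F.col("_seq").desc()))
deduped = changes.withColumn("_rn", latest).where("_rn = 1").drop("_rn")

target = DeltaTable.forPath(spark, silver_path)
(target.alias("t")
 .merge(deduped.alias("s"), "t.order_id = s.order_id")
 .whenMatchedDelete(condition="s._op = 1")            # CDC delete
 .whenMatchedUpdateAll(condition="s._op IN (2, 4)")   # insert / after-image of update
 .whenNotMatchedInsertAll(condition="s._op IN (2, 4)")
 .execute())
```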
- Performance & scalability
- Pitfall: Querying huge raw files on ADLS with serverless SQL or naive Spark jobs is expensive (scans) and slow if files/formats/partitioning are poor. Also federated queries that hit MI under heavy BI load can degrade OLTP.
- Mitigation:
- Use columnar formats (Parquet or Delta) and partition by predictable keys (date, tenant, region).
- Adopt Delta Lake features (OPTIMIZE / Z-ORDER) to reduce reads (data skipping) and use MERGE for idempotent upserts when applying CDC (a maintenance sketch follows after this list).
- Put the heavy aggregations/materialized tables into a compute tier (Databricks SQL warehouses, Synapse dedicated SQL pools) and refresh on a schedule; don't make dashboards scan the entire raw lake on each load.
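A small maintenance sketch for the Delta features mentioned above, assuming Databricks or a Delta Lake release that supports OPTIMIZE/ZORDER; the path and clustering column are illustrative:

```python
# Delta maintenance sketch: compact small files and cluster on a common filter column.
# Requires Databricks or a Delta Lake release that supports OPTIMIZE/ZORDER;
# path and column name are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-maintenance").getOrCreate()

silver_path = "abfss://lake@<storageaccount>.dfs.core.windows.net/silver/orders"

# Compaction + Z-ordering improves data skipping, so BI queries read fewer files.
spark.sql(f"OPTIMIZE delta.`{silver_path}` ZORDER BY (customer_id)")

# Remove files no longer referenced by the table (default retention is 7 days).
spark.sql(f"VACUUM delta.`{silver_path}`")
```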
- Data arrival / latency / missed changes
- Pitfall: Network failures, CDC capture job lags, or log truncation can cause missed or delayed changes. CDC capture jobs must be healthy and retention windows must be sized to allow capture to keep up.
- Mitigation: Monitoring and alerting on CDC capture lag; conservative CDC retention windows; store both CDC metadata (LSN) and a source row hash so downstream pipelines can detect gaps and reprocess windows. A minimal lag-polling sketch follows below.
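A minimal lag-polling sketch, assuming you can reach the MI with pyodbc and the ODBC driver; the server, database and threshold are placeholders, and in practice you would push the result into Azure Monitor rather than print it:

```python
# Sketch: poll CDC capture latency on the MI and flag it when capture falls behind.
# Uses the sys.dm_cdc_log_scan_sessions DMV; server, database and threshold are
# placeholders.
import pyodbc

THRESHOLD_SECONDS = 600  # alert if capture is more than 10 minutes behind

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<mi-name>.<dns-zone>.database.windows.net;"
    "Database=<source-db>;"
    "Authentication=ActiveDirectoryInteractive;"
)

row = conn.cursor().execute(
    "SELECT TOP 1 latency, end_time "
    "FROM sys.dm_cdc_log_scan_sessions "
    "ORDER BY session_id DESC;"
).fetchone()

if row and row.latency and row.latency > THRESHOLD_SECONDS:
    print(f"CDC capture lag is {row.latency}s (last scan ended {row.end_time})")
```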
- Data retrieval UX (BI & analysts)
- Pitfall: Analysts expect fast, small-result queries; scanning large raw files is slow and costly (serverless SQL bills per TB scanned).
- Mitigation: Build Gold-level denormalized KPI tables (pre-aggregated) and present them to Power BI as certified datasets (Synapse/Databricks/semantic layer). Use materialized views (or Databricks SQL caching) for low-latency interactive queries; a small gold-building sketch follows below.
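For illustration, a scheduled gold-building job might look like this, assuming an orders silver table with order_ts, region, amount and order_id columns (all illustrative):

```python
# Sketch: scheduled job that builds a pre-aggregated gold KPI table from silver,
# so Power BI reads a small certified dataset instead of scanning raw files.
# Column and path names are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("gold-kpis").getOrCreate()

silver_path = "abfss://lake@<storageaccount>.dfs.core.windows.net/silver/orders"
gold_path = "abfss://lake@<storageaccount>.dfs.core.windows.net/gold/daily_revenue"

daily_revenue = (spark.read.format("delta").load(silver_path)
                 .groupBy(F.to_date("order_ts").alias("order_date"), "region")
                 .agg(F.sum("amount").alias("revenue"),
                      F.countDistinct("order_id").alias("orders")))

# Overwrite on each scheduled run; the result is small, so dashboards stay cheap and fast.
daily_revenue.write.format("delta").mode("overwrite").save(gold_path)
```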
- Governance & compliance
- Pitfall: Two repositories (MI + ADLS) can confuse data ownership and access control if not governed.
- Mitigation: Enforce Azure RBAC + POSIX ACLs on ADLS, use Microsoft Purview for catalog/lineage, and maintain a data product registry for Gold tables. ADLS supports both ACLs and Azure RBAC; a small ACL sketch follows below.
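A small sketch of granting a BI analysts group read access on the gold folder via ACLs, using the azure-storage-file-datalake SDK; the account, container and AAD group object id are placeholders:

```python
# Sketch: grant a BI analysts group read/execute on the gold folder via ADLS ACLs.
# Requires the azure-identity and azure-storage-file-datalake packages; the account,
# container and AAD group object id are placeholders.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<storageaccount>.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)

gold_dir = service.get_file_system_client("lake").get_directory_client("gold")

# POSIX-style ACL string: owner/group/other defaults plus a named AAD group with r-x.
gold_dir.update_access_control_recursive(
    acl="user::rwx,group::r-x,other::---,group:<analysts-object-id>:r-x"
)
```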
- Cost
- Pitfall: MI is relatively expensive (vCore + included storage + always-on compute). ADLS storage is cheap, but analytics compute (Databricks, Synapse dedicated pools) or serverless scanning costs can add up, and inter-region egress charges apply if resources are not co-located.
- Mitigation:
- Use ADLS for raw storage (low cost).
- Use serverless/demand compute (e.g., Synapse serverless or autoscaling Databricks clusters) for ad-hoc workloads, and move heavy/recurring aggregations to scheduled jobs that write Gold tables (not computed on each dashboard load).
- Run a small PoC with your data volumes and use the Azure Pricing Calculator to model costs (compute hours, DB vCores, storage TB, TB scanned).
Regarding indicative pricing:
- Azure SQL Managed Instance — priced by vCore (can run from hundreds to thousands of USD per month depending on size). Good for OLTP but relatively expensive if you try to do heavy analytics in it. Use the Azure MI pricing page to get exact vCore costs.
- ADLS Gen2 storage — very cheap per GB/month; storage is usually not the highest cost. The analytics compute (Databricks DBUs, Synapse dedicated pools, or serverless SQL scanning billed per TB processed) is what drives the spend for reporting. Run the Azure Pricing Calculator with your expected TB/month scanned to estimate costs; a simple back-of-envelope model follows below.
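A trivial back-of-envelope model you can adapt; the volumes are examples and the unit prices are deliberately left at zero so you fill them in from the Pricing Calculator for your region and SKUs:

```python
# Back-of-envelope monthly cost model. The volumes are examples and the unit prices
# are deliberately zero: fill them in from the Azure Pricing Calculator for your
# region and SKUs, then compare the total against the MI vCore cost.
storage_tb = 5.0                # data kept in ADLS (hot tier)
tb_scanned_per_month = 20.0     # serverless SQL / ad-hoc scans
compute_hours_per_month = 60.0  # scheduled Spark/SQL jobs building silver/gold

price_per_tb_storage = 0.0      # <- ADLS hot storage, per TB/month
price_per_tb_scanned = 0.0      # <- Synapse serverless, per TB processed
price_per_compute_hour = 0.0    # <- Databricks/Synapse compute, per hour

monthly = (storage_tb * price_per_tb_storage
           + tb_scanned_per_month * price_per_tb_scanned
           + compute_hours_per_month * price_per_compute_hour)
print(f"Estimated analytics spend: {monthly:.2f} per month")
```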
If the above response helps answer your question, remember to "Accept Answer" so that others in the community facing similar issues can easily find the solution. Your contribution is highly appreciated.
hth
Marcin