ADLS data incorporation with other data sources

azure_learner 900 Reputation points
2025-08-26T15:24:44.77+00:00

Hi friends, we have ERP data in ADLS and have built KPIs on that dataset, but the business wants KPIs that span both the ERP data and data from a legacy application, whose database holds history going back to 1985. There is no plan to bring the legacy data into ADLS for the foreseeable future.

The ERP data sits in the data lake, whereas the legacy data resides on an Azure SQL Managed Instance.

I understand I have a few options at my disposal:

  1. Synapse, through Synapse Spark
  2. Databricks, connecting through Unity Catalog (UC)
  3. Power BI, through DirectQuery
  4. ADF, through data flows

Please suggest which route is the best option in terms of scalability and performance, what the pros and cons of the above approaches are, and which is the safest choice. What pitfalls should I be aware of?

Thankful and grateful for your informed answers. Much appreciated.

Azure Data Lake Storage
An Azure service that provides an enterprise-wide hyper-scale repository for big data analytic workloads and is integrated with Azure Blob Storage.

Answer accepted by question author
  Vinodh247 39,376 Reputation points MVP Volunteer Moderator
  2025-08-27T12:00:59.03+00:00

    Hi,

    Thanks for reaching out to Microsoft Q&A.

    TL;DR:

    1. Short term: if the KPI requirements are small and not latency sensitive, Synapse serverless SQL with Power BI will do.
    2. Medium to long term: especially for scalable KPI computation or AI/ML integration, Databricks with Unity Catalog is your best option.

    1. Synapse (Serverless SQL / Dedicated SQL)
    • Query ADLS via Synapse serverless SQL or Spark, and join with legacy data via a linked server/PolyBase or external tables.
    • Pros:
      • Good for ad hoc joins between ADLS and SQL MI.
      • Serverless SQL reduces infrastructure management.
      • Tight integration with Power BI.
    • Cons:
      • Performance is not optimal for high-volume joins.
      • Cross-source joins can get expensive as query size scales.
      • Requires manual performance tuning and partitioning.
    • Best fit: datasets are small to medium and queries are batch-oriented rather than real-time. (A minimal query sketch follows.)
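    To make the serverless route concrete, here is a minimal Python sketch that runs an OPENROWSET query over the ERP parquet files in ADLS through the serverless SQL endpoint. The workspace, database, storage account, and column names are assumptions, not your actual setup.

    ```python
    # Minimal sketch: query ERP parquet in ADLS via the Synapse serverless SQL endpoint.
    # Workspace, database, storage account, and column names are placeholders.
    import pyodbc

    conn = pyodbc.connect(
        "Driver={ODBC Driver 18 for SQL Server};"
        "Server=<workspace>-ondemand.sql.azuresynapse.net;"
        "Database=kpidb;"
        "Authentication=ActiveDirectoryInteractive;"
    )

    # Serverless SQL reads parquet straight out of the lake with OPENROWSET.
    sql = """
    SELECT product_id, SUM(amount) AS erp_sales
    FROM OPENROWSET(
        BULK 'https://<storageaccount>.dfs.core.windows.net/erp/sales/*.parquet',
        FORMAT = 'PARQUET'
    ) AS erp
    GROUP BY product_id;
    """
    for row in conn.execute(sql):
        print(row.product_id, row.erp_sales)
    ```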

    2. Synapse Spark
    • Use Synapse Spark pools to read ERP data from ADLS and connect to SQL MI via JDBC/ODBC for the legacy data.
    • Pros:
      • Scales well for large datasets.
      • Spark transformations can enrich or pre-aggregate data before KPI calculation.
      • Easy integration with Fabric or Power BI.
    • Cons:
      • Pipelines are complex to manage.
      • Real-time KPIs require more orchestration.
      • Cost can rise quickly if Spark pools are not optimized.
    • Best fit: heavy transformations and large-scale KPI computation done in batches. (See the PySpark sketch below.)
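    A PySpark sketch of this pattern, assuming hypothetical paths, hostnames, and table/column names:

    ```python
    # Synapse Spark sketch: join lake-resident ERP data with legacy rows pulled
    # over JDBC from SQL MI. All names and paths are placeholders.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()  # supplied by the Synapse Spark pool

    erp = spark.read.parquet("abfss://erp@<storageaccount>.dfs.core.windows.net/sales/")

    legacy = (
        spark.read.format("jdbc")
        .option("url", "jdbc:sqlserver://<mi-host>;databaseName=legacydb")
        .option("dbtable", "dbo.sales_history")
        .option("user", "<user>")
        .option("password", "<password>")  # prefer a Key Vault-backed secret
        .load()
    )

    # Pre-aggregate each side before the join to keep the shuffle small.
    kpi = (
        erp.groupBy("product_id").agg(F.sum("amount").alias("erp_sales"))
        .join(
            legacy.groupBy("product_id").agg(F.sum("amount").alias("legacy_sales")),
            "product_id",
            "full_outer",
        )
    )

    kpi.write.mode("overwrite").parquet(
        "abfss://gold@<storageaccount>.dfs.core.windows.net/kpi/sales/"
    )
    ```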

    3. Databricks (with Unity Catalog)
    • Read ADLS data natively, connect to Azure SQL MI using JDBC, register both in UC, and create views for the KPIs.
    • Pros:
      • Best scalability and performance for both batch and near-real-time workloads.
      • Strong governance with UC.
      • Easy to automate and orchestrate pipelines.
    • Cons:
      • Slightly higher skill requirement for setup and optimization.
      • Cost management must be done carefully to avoid overruns.
    • Best fit: if you want future-proofing, high performance, and an open architecture for expansion. (A sketch follows.)
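    One way to "register both in UC" is Lakehouse Federation, which surfaces SQL MI as a foreign catalog. A sketch run from a Databricks notebook (where `spark` is predefined); the connection, secret scope, catalog, schema, and table names are all assumptions:

    ```python
    # Databricks sketch: expose SQL MI through Lakehouse Federation, then define
    # a governed KPI view over both sources. All names are placeholders.

    spark.sql("""
    CREATE CONNECTION IF NOT EXISTS legacy_mi TYPE sqlserver
    OPTIONS (
      host '<mi-host>',
      port '1433',
      user secret('kv-scope', 'mi-user'),
      password secret('kv-scope', 'mi-password')
    )
    """)

    spark.sql("""
    CREATE FOREIGN CATALOG IF NOT EXISTS legacy
    USING CONNECTION legacy_mi OPTIONS (database 'legacydb')
    """)

    # Pre-aggregate each side to avoid fan-out before the full outer join.
    spark.sql("""
    CREATE OR REPLACE VIEW main.gold.sales_kpi AS
    WITH erp AS (
      SELECT product_id, SUM(amount) AS erp_sales
      FROM main.erp.sales GROUP BY product_id
    ),
    leg AS (
      SELECT product_id, SUM(amount) AS legacy_sales
      FROM legacy.dbo.sales_history GROUP BY product_id
    )
    SELECT COALESCE(erp.product_id, leg.product_id) AS product_id,
           erp.erp_sales, leg.legacy_sales
    FROM erp FULL OUTER JOIN leg ON erp.product_id = leg.product_id
    """)
    ```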
        
    

    4. Power BI (DirectQuery to SQL MI + Import from ADLS)
    • Use DirectQuery mode for the SQL MI data and Import mode for the ADLS data, or link via Synapse serverless.
    • Pros:
      • Fast to implement.
      • Good for exploratory or low-volume KPI reports.
    • Cons:
      • Performance degrades heavily with complex joins or large datasets.
      • High query latency with DirectQuery.
      • Governance and data transformations are limited.
    • Best fit: small-scale, lightweight dashboards, not heavy KPI processing. (A refresh-automation sketch follows.)
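    The Import side of such a composite model only stays current when the dataset is refreshed; one option is to automate that with the Power BI REST API. The dataset GUID below is hypothetical, and in practice you would acquire the AAD token via MSAL or a service principal:

    ```python
    # Sketch: queue a Power BI dataset refresh through the REST API.
    # The dataset GUID and token are placeholders.
    import requests

    dataset_id = "<dataset-guid>"
    token = "<aad-access-token>"

    resp = requests.post(
        f"https://api.powerbi.com/v1.0/myorg/datasets/{dataset_id}/refreshes",
        headers={"Authorization": f"Bearer {token}"},
        json={"notifyOption": "MailOnFailure"},
    )
    resp.raise_for_status()  # HTTP 202 means the refresh was queued
    ```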

    5. ADF with Data Flows
    • Use ADF to move legacy data periodically into ADLS, then process the KPIs downstream.
    • Pros:
      • Stable ETL orchestration.
      • Simplifies joins once both data sources are in the lake.
    • Cons:
      • You mentioned the legacy data is not moving to ADLS, which rules this out under the current policy.
      • Limited to batch; no real-time capability.
      • Additional cost and latency from data movement.
    • Best fit: only viable if you change the policy and ingest the legacy data into the lake. (A trigger sketch follows.)
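    If that policy ever changes, the copy pipeline itself would be authored in ADF Studio, but runs can be kicked off from Python with the management SDK. The subscription, resource group, factory, and pipeline names below are placeholders:

    ```python
    # Sketch: trigger an existing ADF pipeline (e.g., a nightly SQL MI -> ADLS copy).
    # All resource names are placeholders.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient

    client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

    run = client.pipelines.create_run(
        resource_group_name="<resource-group>",
        factory_name="<data-factory>",
        pipeline_name="CopyLegacyToLake",
    )
    print(run.run_id)  # poll client.pipeline_runs.get(...) for completion status
    ```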

    Please 'Upvote' (thumbs up) and 'Accept as answer' if the reply was helpful. This will benefit other community members who face the same issue.

