ADLS data incorporation with other data sources

azure_learner 900 Reputation points
2025-08-26T15:24:44.77+00:00

Hi friends, we have ERP data in ADLS and have built KPIs on that dataset, but the business wants KPIs that span both the ERP data and data from a legacy application, whose database holds history going back to 1985. There is no plan to bring the legacy data into ADLS for the foreseeable future.

The ERP data sits in the data lake, whereas the legacy data resides on an Azure SQL Managed Instance.

I understand I have a few options at my disposal:

  1. Synapse, through Synapse Spark
  2. Databricks, connecting through Unity Catalog (UC)
  3. Power BI, through DirectQuery
  4. ADF, through data flows

Please suggest which route is the best option in terms of scalability and performance, what the pros and cons of the above approaches are, and which is the safest choice. What pitfalls should I be aware of?

Thankful and grateful for your informed answers. Much appreciated.

Azure Data Lake Storage
An Azure service that provides an enterprise-wide hyper-scale repository for big data analytic workloads and is integrated with Azure Blob Storage.

Answer accepted by question author
  Vinodh247 39,376 Reputation points MVP Volunteer Moderator
  2025-08-27T12:00:59.03+00:00

    Hi,

    Thanks for reaching out to Microsoft Q&A.

    TL;DR:

    1. Short term: if the KPI requirements are small and not latency sensitive, Synapse serverless SQL with Power BI will do.
    2. Medium to long term: especially for scalable KPI computation or AI/ML integration, Databricks with Unity Catalog is your best option.

    1. Synapse (Serverless SQL / Dedicated SQL)
    • Query ADLS via Synapse serverless SQL or Spark, and join with legacy data via a linked server/PolyBase or external tables.
    • Pros:
      • Good for ad hoc joins between ADLS and SQL MI.
      • Serverless SQL reduces infrastructure management.
      • Tight integration with Power BI.
    • Cons:
      • Performance is not optimal for high-volume joins.
      • Cross-source joins can get expensive as query size scales.
      • Requires manual performance tuning and partitioning.
    • Best fit: datasets are small to medium and queries are batch-oriented rather than real-time. (A minimal query sketch follows.)
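    To make the serverless route concrete, here is a minimal Python sketch that runs an OPENROWSET query over the ERP parquet files in ADLS through the serverless SQL endpoint. The workspace, database, storage account, and column names are assumptions, not your actual setup.

    ```python
    # Minimal sketch: query ERP parquet in ADLS via the Synapse serverless SQL endpoint.
    # Workspace, database, storage account, and column names are placeholders.
    import pyodbc

    conn = pyodbc.connect(
        "Driver={ODBC Driver 18 for SQL Server};"
        "Server=<workspace>-ondemand.sql.azuresynapse.net;"
        "Database=kpidb;"
        "Authentication=ActiveDirectoryInteractive;"
    )

    # Serverless SQL reads parquet straight out of the lake with OPENROWSET.
    sql = """
    SELECT product_id, SUM(amount) AS erp_sales
    FROM OPENROWSET(
        BULK 'https://<storageaccount>.dfs.core.windows.net/erp/sales/*.parquet',
        FORMAT = 'PARQUET'
    ) AS erp
    GROUP BY product_id;
    """
    for row in conn.execute(sql):
        print(row.product_id, row.erp_sales)
    ```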

    2. Synapse Spark
    • Use Synapse Spark pools to read ERP data from ADLS and connect to SQL MI via JDBC/ODBC for the legacy data.
    • Pros:
      • Scales well for large datasets.
      • Spark transformations can enrich or pre-aggregate data before KPI calculation.
      • Easy integration with Fabric or Power BI.
    • Cons:
      • Pipelines are complex to manage.
      • Real-time KPIs require more orchestration.
      • Cost can rise quickly if Spark pools are not optimized.
    • Best fit: heavy transformations and large-scale KPI computation done in batches. (See the PySpark sketch below.)
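    A PySpark sketch of this pattern, assuming hypothetical paths, hostnames, and table/column names:

    ```python
    # Synapse Spark sketch: join lake-resident ERP data with legacy rows pulled
    # over JDBC from SQL MI. All names and paths are placeholders.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()  # supplied by the Synapse Spark pool

    erp = spark.read.parquet("abfss://erp@<storageaccount>.dfs.core.windows.net/sales/")

    legacy = (
        spark.read.format("jdbc")
        .option("url", "jdbc:sqlserver://<mi-host>;databaseName=legacydb")
        .option("dbtable", "dbo.sales_history")
        .option("user", "<user>")
        .option("password", "<password>")  # prefer a Key Vault-backed secret
        .load()
    )

    # Pre-aggregate each side before the join to keep the shuffle small.
    kpi = (
        erp.groupBy("product_id").agg(F.sum("amount").alias("erp_sales"))
        .join(
            legacy.groupBy("product_id").agg(F.sum("amount").alias("legacy_sales")),
            "product_id",
            "full_outer",
        )
    )

    kpi.write.mode("overwrite").parquet(
        "abfss://gold@<storageaccount>.dfs.core.windows.net/kpi/sales/"
    )
    ```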

    3. Databricks (with Unity Catalog)
    • Read ADLS data natively, connect to Azure SQL MI using JDBC, register both in UC, and create views for the KPIs.
    • Pros:
      • Best scalability and performance for both batch and near-real-time workloads.
      • Strong governance with UC.
      • Easy to automate and orchestrate pipelines.
    • Cons:
      • Slightly higher skill requirement for setup and optimization.
      • Cost management must be done carefully to avoid overruns.
    • Best fit: if you want future-proofing, high performance, and an open architecture for expansion. (A sketch follows.)
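    One way to "register both in UC" is Lakehouse Federation, which surfaces SQL MI as a foreign catalog. A sketch run from a Databricks notebook (where `spark` is predefined); the connection, secret scope, catalog, schema, and table names are all assumptions:

    ```python
    # Databricks sketch: expose SQL MI through Lakehouse Federation, then define
    # a governed KPI view over both sources. All names are placeholders.

    spark.sql("""
    CREATE CONNECTION IF NOT EXISTS legacy_mi TYPE sqlserver
    OPTIONS (
      host '<mi-host>',
      port '1433',
      user secret('kv-scope', 'mi-user'),
      password secret('kv-scope', 'mi-password')
    )
    """)

    spark.sql("""
    CREATE FOREIGN CATALOG IF NOT EXISTS legacy
    USING CONNECTION legacy_mi OPTIONS (database 'legacydb')
    """)

    # Pre-aggregate each side to avoid fan-out before the full outer join.
    spark.sql("""
    CREATE OR REPLACE VIEW main.gold.sales_kpi AS
    WITH erp AS (
      SELECT product_id, SUM(amount) AS erp_sales
      FROM main.erp.sales GROUP BY product_id
    ),
    leg AS (
      SELECT product_id, SUM(amount) AS legacy_sales
      FROM legacy.dbo.sales_history GROUP BY product_id
    )
    SELECT COALESCE(erp.product_id, leg.product_id) AS product_id,
           erp.erp_sales, leg.legacy_sales
    FROM erp FULL OUTER JOIN leg ON erp.product_id = leg.product_id
    """)
    ```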
        
    

    4. Power BI (DirectQuery to SQL MI + Import from ADLS)
    • Use DirectQuery mode for the SQL MI data and Import mode for the ADLS data, or link via Synapse serverless.
    • Pros:
      • Fast to implement.
      • Good for exploratory or low-volume KPI reports.
    • Cons:
      • Performance degrades heavily with complex joins or large datasets.
      • High query latency with DirectQuery.
      • Governance and data transformations are limited.
    • Best fit: small-scale, lightweight dashboards, not heavy KPI processing. (A refresh-automation sketch follows.)
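    The Import side of such a composite model only stays current when the dataset is refreshed; one option is to automate that with the Power BI REST API. The dataset GUID below is hypothetical, and in practice you would acquire the AAD token via MSAL or a service principal:

    ```python
    # Sketch: queue a Power BI dataset refresh through the REST API.
    # The dataset GUID and token are placeholders.
    import requests

    dataset_id = "<dataset-guid>"
    token = "<aad-access-token>"

    resp = requests.post(
        f"https://api.powerbi.com/v1.0/myorg/datasets/{dataset_id}/refreshes",
        headers={"Authorization": f"Bearer {token}"},
        json={"notifyOption": "MailOnFailure"},
    )
    resp.raise_for_status()  # HTTP 202 means the refresh was queued
    ```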

    5. ADF with Data Flows
    • Use ADF to move legacy data periodically into ADLS, then process the KPIs downstream.
    • Pros:
      • Stable ETL orchestration.
      • Simplifies joins once both data sources are in the lake.
    • Cons:
      • You mentioned the legacy data is not moving to ADLS, which rules this out under the current policy.
      • Limited to batch; no real-time capability.
      • Additional cost and latency from data movement.
    • Best fit: only viable if you change the policy and ingest the legacy data into the lake. (A trigger sketch follows.)
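    If that policy ever changes, the copy pipeline itself would be authored in ADF Studio, but runs can be kicked off from Python with the management SDK. The subscription, resource group, factory, and pipeline names below are placeholders:

    ```python
    # Sketch: trigger an existing ADF pipeline (e.g., a nightly SQL MI -> ADLS copy).
    # All resource names are placeholders.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient

    client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

    run = client.pipelines.create_run(
        resource_group_name="<resource-group>",
        factory_name="<data-factory>",
        pipeline_name="CopyLegacyToLake",
    )
    print(run.run_id)  # poll client.pipeline_runs.get(...) for completion status
    ```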

    Please 'Upvote' (thumbs up) and 'Accept as answer' if the reply was helpful. This will benefit other community members who face the same issue.

