Hi,
Thanks for reaching out to Microsoft Q&A.
It seems your issue is a single DP Agent bottleneck under concurrent, high-volume JDBC loads from Azure Databricks to SAP Datasphere (DSP). When the agent is overloaded, it disconnects for about 10 minutes, causing cascading task chain failures.
Recommendations:
- Scale DP Agents: Deploy multiple DP Agents (ideally 1 per DSP space or space group) to distribute load and remove the single point of failure.
- Throttle Databricks concurrency: Reduce Spark write parallelism using coalesce() or repartition() and limit the number of concurrent JDBC writers (see the JDBC write sketch after this list).
- Use JDBC batching: Set a batch size of 1,000 to 5,000 rows and disable auto-commit to reduce transaction overhead on the DP Agent.
- Add retry and backoff logic: Implement exponential backoff retries (for example 30s -> 90s -> 180s) in your orchestration to handle temporary disconnects gracefully (see the retry sketch after this list).
- Stagger task chains: Schedule chain start times to avoid overlapping execution across spaces.
- Improve monitoring: Track DP Agent CPU, memory, and connection counts; alert on rising load before disconnections occur.
- Long-term: Decouple Databricks from DSP by staging data in ADLS and letting DSP import from there, which gives you scalability and resilience (see the ADLS staging sketch below).
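
As a minimal sketch of the throttled, batched JDBC write, assuming `df` is the DataFrame you are loading: the endpoint, schema/table name, secret scope, and the SAP HANA JDBC driver class are placeholders you would replace with your own values, and the partition count and batch size should be tuned against your DP Agent's capacity.

```python
# Throttled, batched JDBC write from Databricks (placeholder endpoint/credentials).
# numPartitions caps the number of concurrent JDBC connections the write opens;
# batchsize groups rows per round trip to cut transaction overhead on the DP Agent.
jdbc_url = "jdbc:sap://<dsp-host>:<port>"   # placeholder DSP/HANA Cloud endpoint

(df
 .repartition(4)                            # limit concurrent writer tasks to 4
 .write
 .format("jdbc")
 .option("url", jdbc_url)
 .option("dbtable", "TARGET_SCHEMA.TARGET_TABLE")
 .option("user", dbutils.secrets.get("my-scope", "dsp-user"))
 .option("password", dbutils.secrets.get("my-scope", "dsp-password"))
 .option("driver", "com.sap.db.jdbc.Driver")  # assumes the SAP HANA JDBC driver is attached
 .option("batchsize", 5000)                 # 1,000-5,000 rows per batch
 .option("numPartitions", 4)                # hard cap on parallel JDBC connections
 .mode("append")
 .save())
```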
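For the retry and backoff logic, a simple Python sketch; the `load_fn` callable and the 30s/90s/180s schedule are illustrative, not an existing API:

```python
import time

def write_with_backoff(load_fn, delays=(30, 90, 180)):
    """Retry a load with backoff.

    load_fn -- callable that performs one JDBC load attempt (hypothetical).
    delays  -- seconds to wait before each retry; re-raises after the last one.
    """
    for attempt, delay in enumerate([0, *delays]):
        if delay:
            time.sleep(delay)                 # back off before retrying
        try:
            return load_fn()
        except Exception as err:              # narrow to JDBC/connection errors in practice
            if attempt == len(delays):
                raise                         # retries exhausted: let the task chain fail visibly
            print(f"Load attempt {attempt + 1} failed ({err}); retrying...")

# Usage (hypothetical load function):
# write_with_backoff(lambda: run_jdbc_load(df))
```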
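And for the long-term decoupling, the Databricks side would simply land files in ADLS and let DSP import them on its own schedule, so spikes in Databricks parallelism never reach the DP Agent. The storage account, container, and path below are placeholders:

```python
# Stage the data in ADLS Gen2 instead of writing to DSP directly (placeholder path).
staging_path = "abfss://staging@<storage-account>.dfs.core.windows.net/dsp/target_table/"

(df
 .write
 .mode("overwrite")
 .parquet(staging_path))
```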
Expected Outcome: This approach stabilizes DSP ingestion, isolates failures, and restores reliability for SAC dashboards with minimal architectural disruption.
Please 'Upvote' (thumbs-up) and 'Accept as Answer' if the reply was helpful. This will benefit other community members who face the same issue.