Apache Spark overview

Apache Spark is the technology powering compute clusters and SQL warehouses in Azure Databricks.

This page provides an overview of the documentation in this section.

Get started

Get started working with Apache Spark on Databricks.

| Topic | Description |
| --- | --- |
| Apache Spark on Azure Databricks | Get answers to frequently asked questions about Apache Spark on Azure Databricks. |
| Tutorial: Load and transform data using Apache Spark DataFrames | Follow a step-by-step guide to loading and transforming data with Spark DataFrames in Python, R, or Scala (a minimal sketch follows this table). |
| PySpark basics | Learn the basics of PySpark by walking through simple examples. |
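To give a flavor of what the DataFrames tutorial covers, here is a minimal PySpark sketch. The file path and the `city` and `population` columns are hypothetical stand-ins, not data shipped with Azure Databricks.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# In a Databricks notebook a SparkSession named `spark` already exists;
# building one here keeps the sketch self-contained.
spark = SparkSession.builder.getOrCreate()

# Hypothetical dataset: a CSV file with `city` and `population` columns.
df = spark.read.csv("/tmp/cities.csv", header=True, inferSchema=True)

# Typical transformations: filter rows, derive a column, aggregate.
result = (
    df.filter(F.col("population") > 100_000)
      .withColumn("population_millions", F.col("population") / 1_000_000)
      .groupBy("city")
      .agg(F.max("population_millions").alias("pop_m"))
)

result.show(5)
```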

Additional resources

Explore other Spark capabilities and documentation.

| Topic | Description |
| --- | --- |
| Set Spark configuration properties on Azure Databricks | Set Spark configuration properties to customize settings in your compute environment and optimize performance (a configuration sketch follows this table). |
| Structured Streaming | Get an overview of Structured Streaming, a near-real-time processing engine (a streaming sketch also follows this table). |
| Diagnose cost and performance issues using the Spark UI | Use the Spark UI to tune performance, debug, and optimize the cost of Spark jobs. |
| Use Apache Spark MLlib on Azure Databricks | Run distributed machine learning workloads with Spark MLlib and integrate with popular ML frameworks. |
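As a quick illustration of session-level configuration, the sketch below reuses the `spark` session from the earlier example. The value 64 is an arbitrary choice for illustration; cluster-wide defaults are typically set in the compute's Spark configuration instead.

```python
# Adjust a mutable Spark SQL setting for the current session.
spark.conf.set("spark.sql.shuffle.partitions", "64")

# Read the property back to confirm the setting took effect.
print(spark.conf.get("spark.sql.shuffle.partitions"))
```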
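And here is a minimal Structured Streaming sketch, again reusing the same `spark` session. It relies on the built-in `rate` source, which generates timestamped rows and is useful for demos; the window size, rows-per-second rate, and query name are arbitrary.

```python
from pyspark.sql import functions as F

# The built-in `rate` source emits a `timestamp` and a `value` column.
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# Maintain a running windowed count in an in-memory table for inspection.
query = (
    stream.groupBy(F.window("timestamp", "10 seconds"))
          .count()
          .writeStream
          .outputMode("complete")
          .format("memory")
          .queryName("rate_counts")
          .start()
)
# The query runs until stopped, e.g. with query.stop().
```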

Spark APIs

Work with Spark using your preferred programming language.

| Topic | Description |
| --- | --- |
| Reference for Apache Spark APIs | Overview of the Apache Spark API references, with links to Spark SQL, DataFrame, and RDD documentation across supported languages. |
| PySpark | Use Python with Spark, including PySpark basics, custom data sources, and Python-specific optimizations. |
| Pandas API on Spark | Use familiar pandas syntax with the scalability of Spark for distributed data processing (a short sketch follows this table). |
| R for Spark | Work with R and Spark using SparkR and sparklyr for statistical computing and data analysis. |
| Scala for Spark | Build high-performance Spark applications in Scala with native Spark APIs and type safety. |
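To show what pandas syntax on Spark looks like, here is a short sketch using the `pyspark.pandas` module. The city names and population figures are made-up sample values.

```python
import pyspark.pandas as ps

# pandas-style construction, executed by Spark under the hood.
psdf = ps.DataFrame({
    "city": ["Oslo", "Lima"],
    "population": [700_000, 9_700_000],
})

# Familiar pandas operations run distributed across the cluster.
large = psdf[psdf["population"] > 1_000_000]
print(large.head())
```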