Apache Spark overview

Apache Spark is the technology powering compute clusters and SQL warehouses in Azure Databricks.

This page provides an overview of the documentation in this section.

Get started

Get started working with Apache Spark on Databricks.

| Topic | Description |
| --- | --- |
| Apache Spark on Azure Databricks | Get answers to frequently asked questions about Apache Spark on Azure Databricks. |
| Tutorial: Load and transform data using Apache Spark DataFrames | Follow a step-by-step guide to loading and transforming data with Spark DataFrames in Python, R, or Scala (a minimal sketch follows this table). |
| PySpark basics | Learn the basics of PySpark by walking through simple examples. |
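To give a flavor of what the DataFrames tutorial covers, here is a minimal PySpark sketch. The file path and the `city` and `population` columns are hypothetical stand-ins, not data shipped with Azure Databricks.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# In a Databricks notebook a SparkSession named `spark` already exists;
# building one here keeps the sketch self-contained.
spark = SparkSession.builder.getOrCreate()

# Hypothetical dataset: a CSV file with `city` and `population` columns.
df = spark.read.csv("/tmp/cities.csv", header=True, inferSchema=True)

# Typical transformations: filter rows, derive a column, aggregate.
result = (
    df.filter(F.col("population") > 100_000)
      .withColumn("population_millions", F.col("population") / 1_000_000)
      .groupBy("city")
      .agg(F.max("population_millions").alias("pop_m"))
)

result.show(5)
```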

Additional resources

Explore other Spark capabilities and documentation.

| Topic | Description |
| --- | --- |
| Set Spark configuration properties on Azure Databricks | Set Spark configuration properties to customize settings in your compute environment and optimize performance (a configuration sketch follows this table). |
| Structured Streaming | Get an overview of Structured Streaming, a near-real-time processing engine (a streaming sketch also follows this table). |
| Diagnose cost and performance issues using the Spark UI | Use the Spark UI to tune performance, debug, and optimize the cost of Spark jobs. |
| Use Apache Spark MLlib on Azure Databricks | Run distributed machine learning workloads with Spark MLlib and integrate with popular ML frameworks. |
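As a quick illustration of session-level configuration, the sketch below reuses the `spark` session from the earlier example. The value 64 is an arbitrary choice for illustration; cluster-wide defaults are typically set in the compute's Spark configuration instead.

```python
# Adjust a mutable Spark SQL setting for the current session.
spark.conf.set("spark.sql.shuffle.partitions", "64")

# Read the property back to confirm the setting took effect.
print(spark.conf.get("spark.sql.shuffle.partitions"))
```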
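And here is a minimal Structured Streaming sketch, again reusing the same `spark` session. It relies on the built-in `rate` source, which generates timestamped rows and is useful for demos; the window size, rows-per-second rate, and query name are arbitrary.

```python
from pyspark.sql import functions as F

# The built-in `rate` source emits a `timestamp` and a `value` column.
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# Maintain a running windowed count in an in-memory table for inspection.
query = (
    stream.groupBy(F.window("timestamp", "10 seconds"))
          .count()
          .writeStream
          .outputMode("complete")
          .format("memory")
          .queryName("rate_counts")
          .start()
)
# The query runs until stopped, e.g. with query.stop().
```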

Spark APIs

Work with Spark using your preferred programming language.

| Topic | Description |
| --- | --- |
| Reference for Apache Spark APIs | Overview of the Apache Spark API references, with links to Spark SQL, DataFrame, and RDD documentation across supported languages. |
| PySpark | Use Python with Spark, including PySpark basics, custom data sources, and Python-specific optimizations. |
| Pandas API on Spark | Use familiar pandas syntax with the scalability of Spark for distributed data processing (a short sketch follows this table). |
| R for Spark | Work with R and Spark using SparkR and sparklyr for statistical computing and data analysis. |
| Scala for Spark | Build high-performance Spark applications in Scala with native Spark APIs and type safety. |
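To show what pandas syntax on Spark looks like, here is a short sketch using the `pyspark.pandas` module. The city names and population figures are made-up sample values.

```python
import pyspark.pandas as ps

# pandas-style construction, executed by Spark under the hood.
psdf = ps.DataFrame({
    "city": ["Oslo", "Lima"],
    "population": [700_000, 9_700_000],
})

# Familiar pandas operations run distributed across the cluster.
large = psdf[psdf["population"] > 1_000_000]
print(large.head())
```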