Use R for Apache Spark

Microsoft Fabric provides built-in R support for Apache Spark. It supports SparkR and sparklyr, which let you use familiar Spark or R interfaces to work with Spark. Analyze data using R through Spark batch job definitions or with interactive Microsoft Fabric notebooks.

This document gives an overview of developing Spark applications in Microsoft Fabric by using R.

Prerequisites

Create and run notebook sessions

A Microsoft Fabric notebook is a web interface for creating files that contain live code, visualizations, and narrative text. Use notebooks to validate ideas, run quick experiments, and get insights from your data. Notebooks also suit data preparation, data visualization, machine learning, and other big data scenarios.

To get started with R in Microsoft Fabric notebooks, change the primary language at the top of your notebook to SparkR (R).

You can also use multiple languages in one notebook by adding a language magic command at the start of a cell. For example, the following cell runs as R regardless of the notebook's primary language:

%%sparkr
# Enter your R code here

To learn more about notebooks in Microsoft Fabric, see How to use notebooks.

Install packages

Packages provide reusable code that you add to your projects. To use third-party or local packages in your projects, install them in a workspace or a notebook session.
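As a rough illustration of a session-level install, the sketch below installs a CRAN package only when it is not already available. The helper name `ensure_package` and the CRAN mirror URL are assumptions made for this example, not part of Microsoft Fabric; it also assumes the session can reach the mirror.

```r
# Hypothetical helper (not a Fabric API): install a CRAN package for the
# current session only if it is missing, then report whether it is available.
ensure_package <- function(pkg, repos = "https://cloud.r-project.org") {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    # Assumption: the session has network access to the chosen mirror.
    install.packages(pkg, repos = repos)
  }
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with R, so this call succeeds without installing anything.
ensure_package("stats")
```

A package installed this way should be treated as available to the current session only. For workspace-level installs, follow the R library management article.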

Learn more in R library management.

Notebook utilities

Microsoft Spark Utilities (MSSparkUtils) is a built-in package that helps you perform common tasks. Use MSSparkUtils to work with file systems, get environment variables, chain notebooks together, and work with secrets. MSSparkUtils supports R notebooks.

To get started, load the package and display the help for the file system utilities:

library(notebookutils)
mssparkutils.fs.help()

Learn more in Use Microsoft Spark Utilities.

Use SparkR

SparkR is an R package that provides a lightweight front end for using Apache Spark from R. SparkR provides a distributed DataFrame implementation that supports operations such as selection, filtering, and aggregation. SparkR also supports distributed machine learning with MLlib.
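The operations named above can be sketched with the `faithful` data set that ships with R. This is a minimal example rather than a recommended pattern. It calls `sparkR.session()` so that it is self-contained; in a notebook a session normally already exists and the call returns it.

```r
library(SparkR)

# Get the existing Spark session, or start one if none is active.
sparkR.session()

# Convert the built-in `faithful` data set into a distributed SparkDataFrame.
df <- createDataFrame(faithful)

# Filtering and selection: keep eruptions shorter than 3 minutes.
short <- select(filter(df, df$eruptions < 3), "eruptions", "waiting")

# Aggregation: count the observations for each waiting time.
counts <- summarize(groupBy(df, df$waiting), count = n(df$waiting))

# Bring the first rows of each result back to the driver as R data.frames.
head(short)
head(arrange(counts, desc(counts$count)))
```

The filter, select, and aggregation steps run in Spark; only the rows returned by `head()` are moved into local R memory.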

Learn more in How to use SparkR.

Use sparklyr

sparklyr is an R interface to Apache Spark. Use familiar R interfaces to interact with Spark. Use sparklyr in Spark batch job definitions or interactive Microsoft Fabric notebooks.
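The following sketch shows the general shape of a sparklyr workflow with dplyr verbs. The connection call is an assumption: `master = "local"` targets a local Spark installation so that the example is self-contained, and it is not the setting for a Fabric notebook. Take the connection arguments for Fabric from the How to use sparklyr article. The table name `mtcars_spark` is also just an illustrative choice.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark installation, used only to keep this sketch
# self-contained. Replace with the connection settings for your environment.
sc <- spark_connect(master = "local")

# Copy a small built-in data set to Spark and get a dplyr-compatible reference.
cars_tbl <- copy_to(sc, mtcars, "mtcars_spark", overwrite = TRUE)

# dplyr verbs are translated to Spark SQL and evaluated in Spark;
# collect() brings the small aggregated result back into R.
mpg_by_cyl <- cars_tbl %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE), n = n()) %>%
  arrange(cyl) %>%
  collect()

mpg_by_cyl

spark_disconnect(sc)
```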

Learn more in How to use sparklyr.

Use Tidyverse

Tidyverse is a collection of R packages that data scientists use for everyday data analysis. It includes packages for data import (readr), data visualization (ggplot2), data manipulation (dplyr, tidyr), and functional programming (purrr). Tidyverse packages work together and follow consistent design principles. Microsoft Fabric distributes the latest stable version of tidyverse with every runtime release.
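As a small illustration of how these packages fit together, the sketch below uses dplyr, tidyr, and purrr on the built-in `mtcars` data set. It runs in local R memory and does not touch Spark; the variable names are chosen for this example only.

```r
library(dplyr)
library(tidyr)
library(purrr)

# dplyr: average fuel efficiency and horsepower per cylinder count.
summary_wide <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), mean_hp = mean(hp))

# tidyr: reshape to one row per (cyl, metric) pair.
summary_long <- summary_wide %>%
  pivot_longer(cols = c(mean_mpg, mean_hp),
               names_to = "metric", values_to = "value")

# purrr: fit one linear model per cylinder group and extract the
# slope of mpg on weight from each model.
slopes <- mtcars %>%
  split(.$cyl) %>%
  map(~ lm(mpg ~ wt, data = .x)) %>%
  map_dbl(~ coef(.x)[["wt"]])

summary_long
slopes
```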

Learn more in How to use Tidyverse.

R visualization

The R ecosystem includes many graphing libraries. By default, each Spark instance in Microsoft Fabric includes a set of curated open-source libraries. Use the Microsoft Fabric library management capabilities to add libraries or manage their versions.
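As one example of such a library, the sketch below draws a scatter plot with ggplot2 from the built-in `mtcars` data set. It assumes ggplot2 is available in the session, which is likely if the runtime ships tidyverse, but check your environment.

```r
library(ggplot2)

# Scatter plot of fuel efficiency against weight, coloured by cylinder count.
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 2) +
  labs(
    title = "Fuel efficiency by weight",
    x = "Weight (1000 lbs)",
    y = "Miles per gallon",
    colour = "Cylinders"
  ) +
  theme_minimal()

# Printing the plot object renders it.
print(p)
```

To plot data held in a Spark DataFrame, first reduce it in Spark by aggregating or sampling, then collect the result into a local data.frame, because ggplot2 works on in-memory data.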

Learn how to create R visualizations in R visualization.