Note
Access to this page requires authorization. You can try signing in or changing directories.
Access to this page requires authorization. You can try changing directories.
Applies to: 
 SQL Server 2019 (15.x)
Important
The Microsoft SQL Server 2019 Big Data Clusters is retired. Support for SQL Server 2019 Big Data Clusters ended as of February 28, 2025. For more information, see the announcement blog post and Big data options on the Microsoft SQL Server platform.
Sparklyr provides an R interface for Apache Spark. Sparklyr is a popular way for R developers to use Spark. This article describes how to use sparklyr in a SQL Server 2019 Big Data Clusters using RStudio.
Prerequisites
Install R and RStudio Desktop
Install and configure RStudio Desktop with the following steps:
- If you are running on a Windows client, download and install R 3.6.3. Also, download and install RTools 3.5. Make sure to configure RTools binary folder on your PATH environment variable. - Warning - R version 4.x and sparklyr versions other that the one specified below are verified not to work as of SQL Server Big Data Clusters CU13. 
- Download and install RStudio Desktop. Optionally, all samples work on the R shell. 
- After installation completes, run the following commands inside of RStudio Desktop or R shell to install the required packages. When asked, confirm to compile packages from source. 
install.packages("devtools")
devtools::install_github('rstudio/sparklyr', ref = 'v1.7.0', upgrade = 'always', repos = 'https://cran.microsoft.com/snapshot/2021-06-11/')
Connect to Spark in a big data cluster
You can use sparklyr to connect from a client to the big data cluster using Livy and the HDFS/Spark gateway.
In RStudio, create an R script and connect to Spark as in the following example:
Tip
For the <AZDATA_USERNAME> and <AZDATA_PASSWORD> values, use the username and password you set during the big data cluster deployment.
Beginning with SQL Server 2019 (15.x) CU 5, when you deploy a new cluster with basic authentication all endpoints including gateway use AZDATA_USERNAME and AZDATA_PASSWORD. Endpoints on clusters that are upgraded to CU 5 continue to use root as username to connect to gateway endpoint. This change does not apply to deployments using Active Directory authentication. See Credentials for accessing services through gateway endpoint in the release notes.
For the <IP> and <PORT> values, see the documentation on connecting to a big data cluster.
library(sparklyr)
library(dplyr)
library(DBI)
#Specify the Knox username and password
config <- livy_config(user = "<AZDATA_USERNAME>", password = "<AZDATA_PASSWORD>")
httr::set_config(httr::config(ssl_verifypeer = 0L, ssl_verifyhost = 0L))
sc <- spark_connect(master = "https://<IP>:<PORT>/gateway/default/livy/v1",
                    method = "livy",
                    config = config)
Run sparklyr queries
After connecting to Spark, you can run sparklyr. The following example performs a query on the iris dataset using sparklyr:
iris_tbl <- copy_to(sc, iris)
iris_count <- dbGetQuery(sc, "SELECT COUNT(*) FROM iris")
iris_count
Distributed R computations
One feature of sparklyr is the ability to distribute R computations with spark_apply.
Because big data clusters use Livy connections, you must set packages = FALSE in the call to spark_apply. For more information, see the Livy section of the sparklyr documentation on distributed R computations. With this setting, you can only use the R packages that are already installed on your Spark cluster in the R code passed to spark_apply. The following example demonstrates this functionality:
iris_tbl %>% spark_apply(function(e) nrow(e), names = "nrow", group_by = "Species", packages = FALSE)
Next steps
For more information about big data clusters, see What are SQL Server 2019 Big Data Clusters.