The RevoScaleR package provides a set of portable, scalable, distributable data analysis functions. This page presents a curated list of functions that might be particularly interesting to Hadoop users. These functions can be called directly from the command line.
The RevoScaleR package supports two Hadoop compute contexts:
- RxSpark (recommended), a distributed compute context in which computations are parallelized and distributed across the nodes of a Hadoop cluster via Apache Spark. This provides up to a 7x performance boost compared to RxHadoopMR. For guidance, see How to use RevoScaleR on Spark.
- RxHadoopMR (deprecated), a distributed compute context on a Hadoop cluster. This compute context can be used on a node (including an edge node) of a Cloudera or Hortonworks cluster with a RHEL operating system, or a client with an SSH connection to such a cluster. For guidance, see How to use RevoScaleR on Hadoop MapReduce.
On Hadoop Distributed File System (HDFS), the XDF file format stores data in a composite set of files rather than a single file.
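As a minimal sketch of how a compute context and a composite XDF data source fit together, the following R snippet creates a Spark compute context and points an XDF data source at a directory in HDFS. The path `/share/AirlineDemoSmallXdf` is hypothetical, and calling `RxSpark()` with default arguments assumes the script runs on a node of the cluster (for example, an edge node); a remote client would also need SSH-related arguments.

```r
library(RevoScaleR)

# Create a Spark compute context with default settings and make it active.
# Assumption: this runs on a cluster node, so no SSH settings are needed.
cc <- RxSpark()
rxSetComputeContext(cc)

# HDFS file system object used by the data source below.
hdfsFS <- RxHdfsFileSystem()

# On HDFS, an XDF data source refers to a directory that holds the
# composite set of files. The path is a hypothetical example.
airXdf <- RxXdfData("/share/AirlineDemoSmallXdf", fileSystem = hdfsFS)

# Return to the local compute context when finished.
rxSetComputeContext("local")
```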
## Data Analysis Functions
#### Import and Export Functions
| Function Name | Description |
|---|---|
| rxDataStep | Transform and subset data. Creates an .xdf file, a comma-delimited text file, or data frame in memory (assuming you have sufficient memory to hold the output data) from an .xdf file or a data frame. |
| RxXdfData | Creates an efficient XDF data source object. |
| RxTextData | Creates a comma-delimited text data source object. |
| rxGetInfo | Retrieves summary information from a data source or data frame. |
| rxGetVarInfo | Retrieves variable information from a data source or data frame. |
| rxGetVarNames | Retrieves variable names from a data source or data frame. |
| RxHdfsFileSystem | Creates an HDFS file system object. |
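The data source and inspection functions are typically used together. The sketch below defines a delimited text data source on HDFS and inspects it. It assumes the code runs on a node with HDFS access; the directory `/share/AirlineDemoSmall` is a hypothetical location for CSV files.

```r
library(RevoScaleR)

# HDFS file system object; default host and port settings are assumed.
hdfsFS <- RxHdfsFileSystem()

# Delimited text data source on HDFS (hypothetical directory of CSV files).
airText <- RxTextData("/share/AirlineDemoSmall", fileSystem = hdfsFS)

# Summary information, variable metadata, and the first few rows.
rxGetInfo(airText, getVarInfo = TRUE, numRows = 5)

# Variable names only.
varNames <- rxGetVarNames(airText)
print(varNames)
```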
#### Manipulation, Cleansing, and Transformation Functions
| Function Name | Description |
|---|---|
| rxDataStep | Transform and subset data. Creates an .xdf file, a comma-delimited text file, or data frame in memory (assuming you have sufficient memory to hold the output) from an .xdf file or a data frame. |
| rxFactors | Create or recode factor variables in a composite XDF file in HDFS. A new file must be written out. |
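A short sketch of a transformation with rxDataStep follows. The HDFS paths and the column name `ArrDelay` are hypothetical; the example assumes an existing composite XDF data set and writes a new one, since the result must go to a separate output.

```r
library(RevoScaleR)

hdfsFS <- RxHdfsFileSystem()

# Hypothetical input and output locations in HDFS.
inXdf  <- RxXdfData("/share/AirlineDemoSmallXdf", fileSystem = hdfsFS)
outXdf <- RxXdfData("/share/AirlineDelayedXdf", fileSystem = hdfsFS)

# Keep rows with a known arrival delay and add a logical "Late" column.
rxDataStep(
  inData       = inXdf,
  outFile      = outXdf,
  rowSelection = !is.na(ArrDelay),
  transforms   = list(Late = ArrDelay > 15),
  overwrite    = TRUE
)
```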
#### Analysis Functions for Descriptive Statistics and Cross-Tabulations
| Function Name | Description |
|---|---|
| rxQuantile | Computes approximate quantiles for .xdf files and data frames without sorting. |
| rxSummary | Basic summary statistics of data, including computations by group. Writing by-group computations to an .xdf file is not supported. |
| rxCrossTabs | Formula-based cross-tabulation of data. |
| rxCube | Alternative formula-based cross-tabulation designed for efficient representation returning ‘cube’ results. Writing output to an .xdf file is not supported. |
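The descriptive functions share a formula interface. The following sketch assumes a hypothetical XDF data set in HDFS with a numeric column `ArrDelay` and a factor column `DayOfWeek`; results are returned to the R session rather than written to an .xdf file.

```r
library(RevoScaleR)

hdfsFS <- RxHdfsFileSystem()
airXdf <- RxXdfData("/share/AirlineDemoSmallXdf", fileSystem = hdfsFS)

# Summary statistics for one variable, overall and by group.
overall <- rxSummary(~ ArrDelay, data = airXdf)
byDay   <- rxSummary(ArrDelay ~ DayOfWeek, data = airXdf)

# Cross-tabulation of arrival delay by day of week.
tabs <- rxCrossTabs(ArrDelay ~ DayOfWeek, data = airXdf)

# The same relationship in 'cube' form.
cube <- rxCube(ArrDelay ~ DayOfWeek, data = airXdf)

# Approximate quartiles without sorting the data.
quartiles <- rxQuantile("ArrDelay", data = airXdf)
```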
#### Analysis, Learning, and Prediction Functions for Statistical Modeling
| Function Name | Description |
|---|---|
| rxLinMod | Fits a linear model to data. |
| rxLogit | Fits a logistic regression model to data. |
| rxGlm | Fits a generalized linear model to data. |
| rxCovCor | Calculates the covariance, correlation, or sum of squares / cross-product matrix for a set of variables. |
| rxDTree | Fits a classification or regression tree to data. |
| rxBTrees | Fits a classification or regression decision forest to data using a stochastic gradient boosting algorithm. |
| rxDForest | Fits a classification or regression decision forest to data. |
| rxPredict | Calculates predictions for fitted models. Output must be an XDF data source. |
| rxKmeans | Performs k-means clustering. |
| rxNaiveBayes | Fits a Naive Bayes classifier on an .xdf file or data frame for small or large data using a parallel external memory algorithm. |
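A typical fit-then-score workflow might look like the sketch below. The HDFS paths and column names are hypothetical. Because rxPredict output must be an XDF data source in this setting, predictions are written to a separate XDF location.

```r
library(RevoScaleR)

hdfsFS  <- RxHdfsFileSystem()
airXdf  <- RxXdfData("/share/AirlineDemoSmallXdf", fileSystem = hdfsFS)
predXdf <- RxXdfData("/share/AirlinePredXdf", fileSystem = hdfsFS)

# Fit a linear model of arrival delay on day of week.
fit <- rxLinMod(ArrDelay ~ DayOfWeek, data = airXdf)
summary(fit)

# Score the data and write the predictions to an XDF data source.
rxPredict(fit, data = airXdf, outData = predXdf, overwrite = TRUE)
```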
## Compute Context Functions
| Function Name | Description |
|---|---|
| RxHadoopMR | Creates an in-data, file-based Hadoop compute context. |
| RxSpark | Creates an in-data, file-based Spark compute context. Computations are parallelized and distributed across the nodes of a Hadoop cluster via Apache Spark. |
| rxSparkConnect | Creates a persistent Spark compute context. |
| rxSparkDisconnect | Disconnects a Spark session and returns to a local compute context. |
| rxInstalledPackages | Returns the list of installed packages for a compute context. |
| rxFindPackage | Returns the path to one or more packages for a compute context. |
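The persistent Spark context is usually bracketed by a connect and a disconnect call. This is a minimal sketch that assumes the script runs on a cluster node with default settings; `reset = TRUE` asks for a fresh Spark application rather than reusing an existing one.

```r
library(RevoScaleR)

# Start a persistent Spark application and make it the active compute context.
cc <- rxSparkConnect(reset = TRUE)

# ... run RevoScaleR analyses here; they reuse the same Spark application ...

# End the Spark application and return to the local compute context.
rxSparkDisconnect(cc)
```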
## Data Source Functions
Not all data source types are available in every compute context. Both Hadoop compute contexts can use the XDF and delimited text data sources; the Hive and Parquet data sources listed in the table are used with the Spark compute context.
| Function Name | Description |
|---|---|
| RxXdfData | Creates an efficient XDF data source object. |
| RxTextData | Creates a comma-delimited text data source object. |
| RxHiveData | Generates a Hive Data Source object. |
| RxParquetData | Generates a Parquet Data Source object. |
| rxSparkDataOps | Lists cached RxParquetData or RxHiveData data source objects. |
| rxSparkRemoveData | Removes cached RxParquetData or RxHiveData data source objects. |
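As an illustration of the Hive and Parquet data sources, the sketch below defines one of each inside a Spark compute context and runs a summary against the Hive source. The Hive table name `airline_demo`, the Parquet path, and the column `ArrDelay` are all hypothetical.

```r
library(RevoScaleR)

# Hive and Parquet data sources are used from a Spark compute context.
cc <- rxSparkConnect(reset = TRUE)

# Hive data source defined by a query (table name is hypothetical).
hiveData <- RxHiveData(query = "select * from airline_demo")

# Parquet data source pointing at a hypothetical HDFS path.
parquetData <- RxParquetData("/share/AirlineDemoParquet")

# Data sources of either type can be passed to the analysis functions.
rxSummary(~ ArrDelay, data = hiveData)

rxSparkDisconnect(cc)
```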
## High Performance Computing and Distributed Computing Functions
The Hadoop compute context has a number of helpful functions used for high performance computing and distributed computing. Learn more about the entire set of functions in the Distributed Computing guide.
| Function Name | Description |
|---|---|
| rxExec | Run an arbitrary R function on nodes or cores of a cluster. |
| rxGetJobStatus | Get the status of a non-waiting distributed computing job. |
| rxGetJobResults | Get the return object(s) of a non-waiting distributed computing job. |
| rxGetJobOutput | Get the console output from a non-waiting distributed computing job. |
| rxGetJobs | Get the available distributed computing job information objects. |
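The job functions apply when the compute context is non-waiting. The following sketch assumes a Spark compute context created with `wait = FALSE`, so that rxExec returns a job object instead of blocking; the function passed to rxExec is an arbitrary illustration.

```r
library(RevoScaleR)

# Non-waiting compute context: distributed calls return a job object.
# Assumption: the script runs on a cluster node with default settings.
cc <- RxSpark(wait = FALSE)
rxSetComputeContext(cc)

# Run a small function four times on the cluster.
job <- rxExec(function() Sys.info()[["nodename"]], timesToRun = 4)

# Check the job, then fetch its results if it has already finished.
status <- rxGetJobStatus(job)
if (status == "finished") {
  results <- rxGetJobResults(job)
  print(results)
}

rxSetComputeContext("local")
```

In practice the status check would be repeated until the job finishes; a single check is shown to keep the sketch short.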
## Hadoop Convenience Functions
RevoScaleR also provides some wrapper functions for accessing Hadoop/HDFS functionality via R. These functions require access to Hadoop, either locally or remotely via the RxHadoopMR or RxSpark compute contexts.
| Function Name | Description |
|---|---|
| rxHadoopCommand | Execute an arbitrary Hadoop command. Allows you to run basic Hadoop commands. |
| rxHadoopVersion | Return the current Hadoop version. |
| rxHadoopCopyFromClient | Copy a file from a remote client to the Hadoop cluster's local file system, and then to HDFS. |
| rxHadoopCopyFromLocal | Copy a file from the native file system to HDFS. Wraps the Hadoop fs -copyFromLocal command. |
| rxHadoopCopy | Copy a file in the Hadoop Distributed File System (HDFS). Wraps the Hadoop fs -cp command. |
| rxHadoopRemove | Remove a file in HDFS. Wraps the Hadoop fs -rm command. |
| rxHadoopListFiles | List files in an HDFS directory. Wraps the Hadoop fs -ls or fs -lsr command. |
| rxHadoopMakeDir | Make a directory in HDFS. Wraps the Hadoop fs -mkdir command. |
| rxHadoopMove | Move a file in HDFS. Wraps the Hadoop fs -mv command. |
| rxHadoopRemoveDir | Remove a directory in HDFS. Wraps the Hadoop fs -rmr command. |
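A few of these wrappers can be combined to stage a file in HDFS. This sketch assumes local access to Hadoop; the HDFS directory `/share/demo` and the local file `/tmp/AirlineDemoSmall.csv` are hypothetical.

```r
library(RevoScaleR)

# Report the Hadoop version visible to this session.
rxHadoopVersion()

# Create an HDFS directory and copy a local file into it.
rxHadoopMakeDir("/share/demo")
rxHadoopCopyFromLocal("/tmp/AirlineDemoSmall.csv", "/share/demo")

# List the directory contents, then remove the directory.
rxHadoopListFiles("/share/demo")
rxHadoopRemoveDir("/share/demo")
```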
