# RevoScaleR Functions for Spark on Hadoop

2018-01-29

The RevoScaleR package provides a set of portable, scalable, distributable data analysis functions. This page presents a curated list of functions that might be particularly interesting to Hadoop users. These functions can be called directly from the command line.

The RevoScaleR package supports two Hadoop compute contexts:

- RxSpark (recommended), a distributed compute context in which computations are parallelized and distributed across the nodes of a Hadoop cluster via Apache Spark. This provides up to a 7x performance boost compared to RxHadoopMR. For guidance, see How to use RevoScaleR on Spark.
- RxHadoopMR (deprecated), a distributed compute context on a Hadoop cluster. This compute context can be used on a node (including an edge node) of a Cloudera or Hortonworks cluster with a RHEL operating system, or a client with an SSH connection to such a cluster. For guidance, see How to use RevoScaleR on Hadoop MapReduce.

On Hadoop Distributed File System (HDFS), the XDF file format stores data in a composite set of files rather than a single file.

## Data Analysis Functions

#### Import and Export Functions

| Function Name | Description |
|---------------|-------------|
| `rxDataStep` | Transform and subset data. Creates an .xdf file, a comma-delimited text file, or a data frame in memory (assuming you have sufficient memory to hold the output data) from an .xdf file or a data frame. |
| `RxXdfData` | Creates an efficient XDF data source object. |
| `RxTextData` | Creates a comma-delimited text data source object. |
| `rxGetInfo` | Retrieves summary information from a data source or data frame. |
| `rxGetVarInfo` | Retrieves variable information from a data source or data frame. |
| `rxGetVarNames` | Retrieves variable names from a data source or data frame. |
| `RxHdfsFileSystem` | Creates an HDFS file system object. |
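
As a rough illustration of how these functions fit together, the following sketch converts a comma-delimited text file in HDFS to XDF and then inspects the result. The helper name `import_csv_to_xdf` and both path arguments are hypothetical. The sketch assumes that RevoScaleR is installed and that a Spark or Hadoop compute context is already active; it is wrapped in a function so that nothing runs until you call it.

```r
# Hypothetical helper, not part of RevoScaleR. Call it only on a machine
# that has RevoScaleR installed and can reach HDFS.
import_csv_to_xdf <- function(csv_path, xdf_path) {
  hdfs <- RevoScaleR::RxHdfsFileSystem()

  # Data source objects describe where the data lives; they do not load it.
  csv_source <- RevoScaleR::RxTextData(csv_path, fileSystem = hdfs)
  xdf_target <- RevoScaleR::RxXdfData(xdf_path, fileSystem = hdfs)

  # Copy the text data into XDF format, replacing any earlier output.
  RevoScaleR::rxDataStep(inData = csv_source, outFile = xdf_target,
                         overwrite = TRUE)

  # Return summary information, including per-variable details.
  RevoScaleR::rxGetInfo(xdf_target, getVarInfo = TRUE)
}
```

A call might look like `import_csv_to_xdf("/share/in/flights.csv", "/share/out/flightsXdf")`, where both paths are placeholders for locations in your own HDFS.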

#### Manipulation, Cleansing, and Transformation Functions

| Function Name | Description |
|---------------|-------------|
| `rxDataStep` | Transform and subset data. Creates an .xdf file, a comma-delimited text file, or a data frame in memory (assuming you have sufficient memory to hold the output) from an .xdf file or a data frame. |
| `rxFactors` | Create or recode factor variables in a composite XDF file in HDFS. A new file must be written out. |

#### Analysis Functions for Descriptive Statistics and Cross-Tabulations

| Function Name | Description |
|---------------|-------------|
| `rxQuantile` | Computes approximate quantiles for .xdf files and data frames without sorting. |
| `rxSummary` | Computes basic summary statistics of data, including computations by group. Writing by-group computations to an .xdf file is not supported. |
| `rxCrossTabs` | Formula-based cross-tabulation of data. |
| `rxCube` | Alternative formula-based cross-tabulation designed for efficient representation, returning "cube" results. Writing output to an .xdf file is not supported. |
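
The sketch below shows one way these descriptive functions might be combined on a single data source. The helper name `describe_delays` is hypothetical, as are the column names `ArrDelay` (assumed numeric) and `DayOfWeek` (assumed to be a factor). It also assumes that RevoScaleR is installed and that a compute context is already set.

```r
# Hypothetical helper, not part of RevoScaleR. `data_source` is any
# RevoScaleR data source (or data frame) containing the assumed columns.
describe_delays <- function(data_source) {
  list(
    # Summary statistics for one numeric variable.
    summary = RevoScaleR::rxSummary(~ ArrDelay, data = data_source),

    # Approximate quartiles, computed without sorting the data.
    quartiles = RevoScaleR::rxQuantile("ArrDelay", data = data_source,
                                       probs = c(0.25, 0.5, 0.75)),

    # Cross-tabulation of the numeric variable by the factor variable.
    cube = RevoScaleR::rxCube(ArrDelay ~ DayOfWeek, data = data_source)
  )
}
```

The results come back as in-memory R objects, which is consistent with the restriction above that by-group and cube output cannot be written to an .xdf file.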

#### Analysis, Learning, and Prediction Functions for Statistical Modeling

| Function Name | Description |
|---------------|-------------|
| `rxLinMod` | Fits a linear model to data. |
| `rxLogit` | Fits a logistic regression model to data. |
| `rxGlm` | Fits a generalized linear model to data. |
| `rxCovCor` | Calculates the covariance, correlation, or sum of squares / cross-product matrix for a set of variables. |
| `rxDTree` | Fits a classification or regression tree to data. |
| `rxBTrees` | Fits a classification or regression decision forest to data using a stochastic gradient boosting algorithm. |
| `rxDForest` | Fits a classification or regression decision forest to data. |
| `rxPredict` | Calculates predictions for fitted models. Output must be an XDF data source. |
| `rxKmeans` | Performs k-means clustering. |
| `rxNaiveBayes` | Fits Naive Bayes classifiers on an .xdf file or data frame, for small or large data, using a parallel external memory algorithm. |
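
A minimal sketch of the fit-then-predict pattern follows. The helper name `fit_and_score`, its arguments, and the column names `ArrDelay` and `DayOfWeek` are hypothetical. The sketch assumes that RevoScaleR is installed and that a Spark or Hadoop compute context is active.

```r
# Hypothetical helper, not part of RevoScaleR.
fit_and_score <- function(train_source, score_source, predictions_path) {
  # Fit a linear model; the formula uses assumed column names.
  model <- RevoScaleR::rxLinMod(ArrDelay ~ DayOfWeek, data = train_source)

  # Predictions are written to an XDF data source in HDFS, as the
  # table above requires.
  predictions <- RevoScaleR::RxXdfData(
    predictions_path,
    fileSystem = RevoScaleR::RxHdfsFileSystem()
  )
  RevoScaleR::rxPredict(model, data = score_source, outData = predictions,
                        overwrite = TRUE)

  list(model = model, predictions = predictions)
}
```

The same pattern should apply to the other model-fitting functions in the table: swap `rxLinMod` for, say, `rxLogit` and adjust the formula to match your data.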

## Compute Context Functions

| Function Name | Description |
|---------------|-------------|
| `RxHadoopMR` | Creates an in-data, file-based Hadoop compute context. |
| `RxSpark` | Creates an in-data, file-based Spark compute context. Computations are parallelized and distributed across the nodes of a Hadoop cluster via Apache Spark. |
| `rxSparkConnect` | Creates a persistent Spark compute context. |
| `rxSparkDisconnect` | Disconnects a Spark session and returns to a local compute context. |
| `rxInstalledPackages` | Returns the list of installed packages for a compute context. |
| `rxFindPackage` | Returns the path to one or more packages for a compute context. |
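
One possible way to pair `rxSparkConnect` with `rxSparkDisconnect` is sketched below. The wrapper name `with_spark_context` is hypothetical. It assumes RevoScaleR is installed on a node with access to the cluster, and it forwards any extra arguments to `rxSparkConnect` unchanged.

```r
# Hypothetical wrapper, not part of RevoScaleR: run `work` inside a
# persistent Spark compute context and always disconnect afterwards.
with_spark_context <- function(work, ...) {
  # reset = TRUE asks for a fresh Spark application rather than reusing one.
  cc <- RevoScaleR::rxSparkConnect(reset = TRUE, ...)

  # Disconnect even if `work` fails, returning to a local compute context.
  on.exit(RevoScaleR::rxSparkDisconnect(cc), add = TRUE)

  work()
}
```

For example, `with_spark_context(function() RevoScaleR::rxHadoopVersion())` would open a session, query the Hadoop version, and close the session again.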

## Data Source Functions

Not all data source types are available in every compute context. For the Hadoop compute contexts, the following data sources can be used, along with two functions for managing cached data source objects.

| Function Name | Description |
|---------------|-------------|
| `RxXdfData` | Creates an efficient XDF data source object. |
| `RxTextData` | Creates a comma-delimited text data source object. |
| `RxHiveData` | Generates a Hive data source object. |
| `RxParquetData` | Generates a Parquet data source object. |
| `rxSparkListData` | Lists cached `RxParquetData` or `RxHiveData` data source objects. |
| `rxSparkRemoveData` | Removes cached `RxParquetData` or `RxHiveData` data source objects. |
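
As a small sketch, Hive and Parquet data sources could be created like this. The helper name `make_spark_sources` and both arguments are hypothetical. The sketch assumes that RevoScaleR is installed and that a Spark compute context has been established before the returned objects are used.

```r
# Hypothetical helper, not part of RevoScaleR.
make_spark_sources <- function(hive_table, parquet_path) {
  list(
    # Refer to an existing Hive table by name.
    hive = RevoScaleR::RxHiveData(table = hive_table),

    # Refer to Parquet data stored at a path in HDFS.
    parquet = RevoScaleR::RxParquetData(
      parquet_path,
      fileSystem = RevoScaleR::RxHdfsFileSystem()
    )
  )
}
```

Either element of the returned list can then be passed as the `data` argument to an analysis function such as `rxSummary`.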

## High Performance Computing and Distributed Computing Functions

The Hadoop compute context has a number of helpful functions used for high performance computing and distributed computing. Learn more about the entire set of functions in the Distributed Computing guide.

| Function Name | Description |
|---------------|-------------|
| `rxExec` | Run an arbitrary R function on nodes or cores of a cluster. |
| `rxGetJobStatus` | Get the status of a non-waiting distributed computing job. |
| `rxGetJobResults` | Get the return object(s) of a non-waiting distributed computing job. |
| `rxGetJobOutput` | Get the console output from a non-waiting distributed computing job. |
| `rxGetJobs` | Get the available distributed computing job information objects. |
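
To give a feel for `rxExec`, the sketch below runs a trivial function several times and collects the results. The helper name `collect_node_names` is hypothetical. It assumes that RevoScaleR is installed and that a distributed compute context that waits for job completion is already active.

```r
# Hypothetical helper, not part of RevoScaleR.
collect_node_names <- function(times = 4) {
  # Each run reports the host name of the machine it executed on.
  RevoScaleR::rxExec(function() Sys.info()[["nodename"]],
                     timesToRun = times)
}
```

With a waiting compute context, the call blocks and returns a list with one element per run. With a non-waiting context, you would instead use the job functions in the table above to check on the job and fetch its results.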

## Hadoop Convenience Functions

RevoScaleR also provides some wrapper functions for accessing Hadoop/HDFS functionality via R. These functions require access to Hadoop, either locally or remotely via the RxHadoopMR or RxSpark compute contexts.

| Function Name | Description |
|---------------|-------------|
| `rxHadoopCommand` | Execute an arbitrary Hadoop command. Allows you to run basic Hadoop commands. |
| `rxHadoopVersion` | Return the current Hadoop version. |
| `rxHadoopCopyFromClient` | Copy a file from a remote client to the Hadoop cluster's local file system, and then to HDFS. |
| `rxHadoopCopyFromLocal` | Copy a file from the native file system to HDFS. Wraps the Hadoop `fs -copyFromLocal` command. |
| `rxHadoopCopy` | Copy a file in the Hadoop Distributed File System (HDFS). Wraps the Hadoop `fs -cp` command. |
| `rxHadoopRemove` | Remove a file in HDFS. Wraps the Hadoop `fs -rm` command. |
| `rxHadoopListFiles` | List files in an HDFS directory. Wraps the Hadoop `fs -ls` or `fs -lsr` command. |
| `rxHadoopMakeDir` | Make a directory in HDFS. Wraps the Hadoop `fs -mkdir` command. |
| `rxHadoopMove` | Move a file in HDFS. Wraps the Hadoop `fs -mv` command. |
| `rxHadoopRemoveDir` | Remove a directory in HDFS. Wraps the Hadoop `fs -rmr` command. |
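
A typical staging sequence built from these wrappers might look like the following sketch. The helper name `stage_file_in_hdfs` and its arguments are hypothetical. It assumes that RevoScaleR is installed and that Hadoop is reachable, either locally or through an active compute context.

```r
# Hypothetical helper, not part of RevoScaleR.
stage_file_in_hdfs <- function(local_file, hdfs_dir) {
  # Create the target directory (wraps `fs -mkdir`).
  RevoScaleR::rxHadoopMakeDir(hdfs_dir)

  # Copy the file from the native file system into HDFS
  # (wraps `fs -copyFromLocal`).
  RevoScaleR::rxHadoopCopyFromLocal(local_file, hdfs_dir)

  # List the directory contents to confirm the copy (wraps `fs -ls`).
  RevoScaleR::rxHadoopListFiles(hdfs_dir)
}
```

A matching cleanup step could call `rxHadoopRemoveDir` on the same directory once the data is no longer needed.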
