FileDataStream Class  
Data view from a file.
Constructor
FileDataStream(filename, schema, roles=None)
Examples
   from nimbusml import FileDataStream
   from nimbusml import Pipeline
   from nimbusml.ensemble import LightGbmRegressor
   from nimbusml.feature_extraction.categorical import OneHotVectorizer
   import numpy as np
   import pandas as pd
   data = pd.DataFrame(dict(real = [0.1, 2.2],
                            text = ['word','class'],
                            y = [1,3]))
   data.to_csv('data.csv', index = False, header = True)
   ds = FileDataStream.read_csv('data.csv', collapse = False,
                               numeric_dtype = np.float32, sep = ',')
   ds.head()
   #   real   text    y
   #0   0.1   word  1.0
   #1   2.2  class  3.0
   exp = Pipeline([
                OneHotVectorizer(columns = ['text']),
                LightGbmRegressor(minimum_example_count_per_leaf = 1)
               ])
   exp.fit(ds, 'y')
Remarks
FileDataStream enables training from files by streaming the
examples sequentially. Some trainers require the
full data matrix to be resident in memory, and will cache the
data if required. For trainers that implement
online or batch techniques, using FileDataStream will substantially
reduce overall memory utilization. Runtime
efficiency is also increased and data copying is minimized for
nimbusml trainers/transforms when used in
conjunction with FileDataStream text reader.
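To illustrate why sequential streaming keeps memory use low, here is a minimal sketch in plain Python. It is not nimbusml code: the helper `streaming_mean` is hypothetical and only shows the general idea of an online computation that reads one row at a time instead of loading the full data matrix.

```python
import csv
import io


def streaming_mean(lines, column, sep=","):
    """Compute the mean of one numeric column, holding one row at a time."""
    reader = csv.DictReader(lines, delimiter=sep)
    count, mean = 0, 0.0
    for row in reader:
        count += 1
        # Incremental update: no list of past values is kept in memory.
        mean += (float(row[column]) - mean) / count
    return mean


# An in-memory buffer stands in for a file opened with open('data.csv').
source = io.StringIO("real,text,y\n0.1,word,1\n2.2,class,3\n")
print(streaming_mean(source, "y"))  # 2.0
```

A trainer that needs every row at once cannot work this way, which is why such trainers cache the data.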
A schema of the data is required to describe the column names, positions, types and delimiters. This can be provided explicitly to FileDataStream by using the DataSchema class to construct it, or optionally the read_csv method can be used to infer the schema automatically. For more control over column names and index ranges, especially Vector Type columns, the schema can be designed manually.
For more details of the schema format, refer to Schema and DataSchema.
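As a rough illustration of what inferring a schema involves, the sketch below guesses the name, position and type of each column from the first rows of delimited text. It uses only the standard library; `infer_schema` and its "numeric"/"text" labels are hypothetical and do not reproduce the logic or the schema format used by read_csv or DataSchema.

```python
import csv
import io


def infer_schema(text, sep=",", nrows=100):
    """Guess (position, name, type) for each column from the first nrows rows."""
    reader = csv.reader(io.StringIO(text), delimiter=sep)
    header = next(reader)
    # Assume every column is numeric until a value fails to parse.
    numeric = [True] * len(header)
    for i, row in enumerate(reader):
        if i >= nrows:
            break
        for pos, value in enumerate(row[:len(header)]):
            try:
                float(value)
            except ValueError:
                numeric[pos] = False
    return [(pos, name, "numeric" if numeric[pos] else "text")
            for pos, name in enumerate(header)]


sample = "real,text,y\n0.1,word,1\n2.2,class,3\n"
print(infer_schema(sample))
# [(0, 'real', 'numeric'), (1, 'text', 'text'), (2, 'y', 'numeric')]
```

Only the first nrows rows are inspected, so a column whose later rows contain text would still be labeled numeric; designing the schema manually avoids that kind of surprise.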
Methods
| Method | Description |
|---|---|
| clone | Copy/clone the object. |
| read_csv | Creates a FileDataStream from a filename or a buffer. For more details of the schema format for a FileDataStream, refer to Schema. All the arguments that DataSchema.read_schema() uses apply to this method as well. |
| read_csv_pandas | Creates a FileDataStream from a filename or a buffer. The method leverages read_csv to guess the schema of a file from its first nrows rows. |
clone
Copy/clone the object.
clone()
read_csv
Creates a FileDataStream from a filename or a buffer. For more details of the schema format for a FileDataStream, refer to Schema. All the arguments that DataSchema.read_schema() uses apply to this method as well.
read_csv(filepath_or_buffer, tool=None, nrows=100, **kwargs)
Parameters
| Name | Description |
|---|---|
| filepath_or_buffer | Filename or stream. |
| tool | Parser used to guess the schema, this module ... |
| nrows | Number of rows used to guess the schema. |
| numeric_dtype | Changes all numeric types into the same one. numpy.float32 is recommended in many cases. |
| collapse | False by default. Collapses columns of the same type, following the read_csv function; uses the internal structure of a dataframe. If ... |
| sep | Separator between the data columns, such as ',' or '\t'. |
| header | Whether the input data has a header; can be True or False. |
| names | Renames the data columns. Users can specify a dictionary with the column number as the key, such as {0:'Label', 1:'GroupId', (2,None):'Features'}. This renames columns 0 and 1 to Label and GroupId, and renames columns 2:end to Features_0, ..., Features_2040. |
| dtype | Overrides the data column types. Users can specify a dictionary with the column name as the key, such as {'column1':numpy.float32}. |
| kwargs | Additional parameters sent to read_csv or the internal parser. |
Returns
| Type | Description |
|---|---|
| FileDataStream | A FileDataStream instance. |
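The renaming rule described for the names parameter of read_csv can be sketched in plain Python. The helper `expand_names` is hypothetical and is not part of nimbusml; it assumes that a tuple key is a range of column positions with an inclusive upper bound, that None means "through the last column", and that unmapped columns get a placeholder name such as c3.

```python
def expand_names(names, n_columns):
    """Expand a names mapping into one column name per position."""
    result = {}
    for key, name in names.items():
        if isinstance(key, tuple):
            start, end = key
            # Assumption: None as the upper bound means "through the last column",
            # and an explicit upper bound is inclusive.
            stop = n_columns if end is None else end + 1
            for offset, pos in enumerate(range(start, stop)):
                result[pos] = "{}_{}".format(name, offset)
        else:
            result[key] = name
    # Assumption: columns without a mapping get a placeholder name.
    return [result.get(pos, "c{}".format(pos)) for pos in range(n_columns)]


mapping = {0: 'Label', 1: 'GroupId', (2, None): 'Features'}
print(expand_names(mapping, 5))
# ['Label', 'GroupId', 'Features_0', 'Features_1', 'Features_2']
```

With 2043 columns, the same mapping would produce Features_0 through Features_2040, matching the example in the parameter description.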
read_csv_pandas
Creates a FileDataStream from a filename or a buffer.
The method leverages read_csv to guess the schema of a file from its first nrows rows.
read_csv_pandas(filepath_or_buffer, nrows=100, collapse=False, numeric_dtype=None, **kwargs)
Parameters
| Name | Description |
|---|---|
| filepath_or_buffer | Filename or stream. |
| nrows | Number of rows used to guess the schema. |
| kwargs | Additional parameters sent to read_csv or the internal parser. |
| numeric_dtype | Changes all numeric types into the same one. |
| collapse | Collapses all columns sharing the same type into one vector column. |
Returns
| Type | Description |
|---|---|
| FileDataStream | A FileDataStream instance. |
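The effect of the collapse option can be pictured as grouping columns by type, with each group standing for one vector column. The sketch below is a plain-Python illustration under that reading; `collapse_by_type` is a hypothetical helper, and the real method may order or name the resulting columns differently.

```python
def collapse_by_type(columns):
    """Group column names by type; each group stands for one vector column."""
    groups = {}
    for name, dtype in columns:
        groups.setdefault(dtype, []).append(name)
    return groups


columns = [("real", "float32"), ("text", "str"), ("y", "float32")]
print(collapse_by_type(columns))
# {'float32': ['real', 'y'], 'str': ['text']}
```

Setting numeric_dtype so that all numeric columns share one type makes such groups larger, since columns can only be merged when their types match.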