MLlib (RDD-based)
Classification

Classification model trained using Multinomial/Binary Logistic Regression. 
Train a classification model for Binary Logistic Regression using Stochastic Gradient Descent. 

Train a classification model for Multinomial/Binary Logistic Regression using Limited-memory BFGS.


Model for Support Vector Machines (SVMs). 
Train a Support Vector Machine (SVM) using Stochastic Gradient Descent. 


Model for Naive Bayes classifiers. 
Train a Multinomial Naive Bayes model. 

Train or predict a logistic regression model on streaming data. 
Clustering

A clustering model derived from the bisecting k-means method.
A bisecting k-means algorithm based on the paper “A comparison of document clustering techniques” by Steinbach, Karypis, and Kumar, with modifications to fit Spark.


A clustering model derived from the k-means method.
K-means clustering.
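The k-means entry above names the method without describing it. As a rough illustration, the following plain-Python sketch runs Lloyd-style iterations: assign each point to its nearest center, then move each center to the mean of its assigned points. The function name `kmeans`, its parameters, and the fixed iteration count are assumptions made for this example; it does not model MLlib's initialization or distributed execution.

```python
def kmeans(points, centers, iterations=10):
    """Run a fixed number of Lloyd iterations and return the final centers."""
    for _ in range(iterations):
        # Assignment step: group points by their nearest center
        # (squared Euclidean distance).
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(
                range(len(centers)),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
            clusters[nearest].append(p)
        # Update step: each center becomes the mean of its cluster.
        # An empty cluster keeps its previous center.
        centers = [
            [sum(dim) / len(cluster) for dim in zip(*cluster)] if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return centers


points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
final_centers = kmeans(points, centers=[[0.0, 0.0], [10.0, 10.0]])
```

With two well-separated groups and one starting center in each, the centers settle on the group means after the first iteration.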


A clustering model derived from the Gaussian Mixture Model method. 
Learning algorithm for Gaussian Mixtures using the expectation-maximization algorithm.


Model produced by PowerIterationClustering.
Power Iteration Clustering (PIC), a scalable graph clustering algorithm. 


Provides methods to set k, decayFactor, and timeUnit to configure the KMeans algorithm for fitting and predicting on incoming DStreams.

Clustering model which can perform an online update of the centroids. 
Train Latent Dirichlet Allocation (LDA) model. 


A clustering model derived from the LDA method. 
Evaluation

Evaluator for binary classification. 

Evaluator for regression. 

Evaluator for multiclass classification. 

Evaluator for ranking algorithms. 
Feature

Normalizes samples individually to unit L^p norm.
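To make the normalization concrete, here is a minimal plain-Python sketch of scaling a single sample to unit L^p norm. The function name `normalize` and its handling of the all-zero vector (returned unchanged) are assumptions for this example, not a description of MLlib's code.

```python
def normalize(vector, p=2.0):
    """Scale a vector so that its L^p norm equals 1."""
    if p == float("inf"):
        # The L^inf norm is the largest absolute component.
        norm = max(abs(x) for x in vector)
    else:
        norm = sum(abs(x) ** p for x in vector) ** (1.0 / p)
    if norm == 0:
        # Assumption: an all-zero vector is returned unchanged.
        return list(vector)
    return [x / norm for x in vector]


unit_l2 = normalize([3.0, 4.0])          # L^2 norm of the input is 5
unit_l1 = normalize([1.0, 1.0], p=1.0)   # L^1 norm of the input is 2
```

Each sample is scaled independently, so no statistics are shared across rows.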

Represents a StandardScaler model that can transform vectors. 

Standardizes features by removing the mean and scaling to unit variance using column summary statistics on the samples in the training set. 
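The standardization described above can be sketched in plain Python as two steps: compute per-column summary statistics on the training rows, then apply them to any vector. The names `fit_standard_scaler` and `transform` are hypothetical, and the use of the sample standard deviation (dividing by n - 1) and the mapping of constant columns to 0.0 are assumptions of this sketch.

```python
import math


def fit_standard_scaler(rows):
    """Return per-column means and sample standard deviations (needs >= 2 rows)."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    stds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
        for j in range(dims)
    ]
    return means, stds


def transform(row, means, stds):
    """Remove the mean and scale to unit variance; constant columns map to 0.0."""
    return [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]


training = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
means, stds = fit_standard_scaler(training)
scaled = transform([3.0, 10.0], means, stds)
```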

Maps a sequence of terms to their term frequencies using the hashing trick. 
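The hashing trick mentioned above maps each term to a column index by hashing it and taking the result modulo the number of features, so no vocabulary has to be stored. The sketch below illustrates the idea in plain Python. CRC32 is used only as a deterministic stand-in hash, and the name `hashing_tf` and the default `num_features` are assumptions for this example.

```python
import zlib
from collections import Counter


def hashing_tf(terms, num_features=1024):
    """Return a sparse {index: term frequency} mapping for one document."""
    counts = Counter()
    for term in terms:
        # Stand-in hash: CRC32 of the UTF-8 bytes, reduced to a column index.
        index = zlib.crc32(term.encode("utf-8")) % num_features
        counts[index] += 1
    return dict(counts)


tf = hashing_tf(["spark", "mllib", "spark"])
```

Distinct terms can collide on the same index; a larger `num_features` makes collisions less likely at the cost of a wider vector.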

Represents an IDF model that can transform term frequency vectors. 

Inverse document frequency (IDF). 
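As an illustration of inverse document frequency, the sketch below computes a smoothed IDF weight per term from a small in-memory corpus. It assumes the formula log((m + 1) / (d(t) + 1)), where m is the number of documents and d(t) is the number of documents containing term t. The function name `fit_idf` is hypothetical, and terms are keyed by string here rather than by hashed index.

```python
import math


def fit_idf(documents):
    """Return {term: idf} using the smoothed formula log((m + 1) / (d(t) + 1))."""
    m = len(documents)
    doc_freq = {}
    for doc in documents:
        # Count each term at most once per document.
        for term in set(doc):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log((m + 1) / (d + 1)) for term, d in doc_freq.items()}


idf = fit_idf([["a", "b"], ["a"], ["a", "c"]])
```

A term that appears in every document gets weight 0 under this formula, so it contributes nothing after TF-IDF weighting.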

Word2Vec creates vector representation of words in a text corpus. 

Class for the Word2Vec model.

Creates a chi-squared feature selector.

Represents a chi-squared selector model.

Scales each column of the vector by the supplied weight vector.
Frequency Pattern Mining
A parallel FP-growth algorithm to mine frequent itemsets.


An FP-growth model for mining frequent itemsets using the parallel FP-growth algorithm.
A parallel PrefixSpan algorithm to mine frequent sequential patterns. 


Model fitted by PrefixSpan.
Vector and Matrix

A dense vector represented by a value array. 

A simple sparse vector class for passing data to MLlib. 
Factory methods for working with vectors. 




Column-major dense matrix.
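In a column-major layout, the columns of the matrix are stored one after another in a single flat array, so the entry at row i, column j of a matrix with `num_rows` rows sits at offset `i + j * num_rows`. The helper name `dense_get` below is hypothetical and only illustrates the indexing.

```python
def dense_get(values, num_rows, i, j):
    """Look up entry (i, j) in a flat column-major value array."""
    return values[i + j * num_rows]


# The 2 x 3 matrix [[1, 3, 5],
#                   [2, 4, 6]] stored column by column.
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
top_middle = dense_get(values, num_rows=2, i=0, j=1)
bottom_right = dense_get(values, num_rows=2, i=1, j=2)
```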

Sparse matrix stored in compressed sparse column (CSC) format.
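The CSC layout keeps three arrays: column pointers, row indices, and the non-zero values. Column j owns the slice of row indices and values between `col_ptrs[j]` and `col_ptrs[j + 1]`. The following sketch expands such a matrix into nested lists; the function name `csc_to_dense` and its parameter names are chosen for this example only.

```python
def csc_to_dense(num_rows, num_cols, col_ptrs, row_indices, values):
    """Expand a CSC matrix into a dense list of rows."""
    dense = [[0.0] * num_cols for _ in range(num_rows)]
    for j in range(num_cols):
        # Entries of column j live in positions col_ptrs[j] .. col_ptrs[j + 1] - 1.
        for k in range(col_ptrs[j], col_ptrs[j + 1]):
            dense[row_indices[k]][j] = values[k]
    return dense


# Column 0 holds 9.0 at row 0; column 1 holds 8.0 at row 1 and 6.0 at row 2.
dense = csc_to_dense(3, 2, col_ptrs=[0, 1, 3], row_indices=[0, 1, 2], values=[9.0, 8.0, 6.0])
```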

Represents QR factors. 
Distributed Representation

Represents a distributed matrix in blocks of local matrices. 

Represents a matrix in coordinate format. 
Represents a distributively stored matrix backed by one or more RDDs. 


Represents a row of an IndexedRowMatrix. 

Represents a row-oriented distributed Matrix with indexed rows.

Represents an entry of a CoordinateMatrix. 

Represents a row-oriented distributed Matrix with no meaningful row indices.

Represents singular value decomposition (SVD) factors. 
Random
Generator methods for creating RDDs composed of i.i.d. samples from some distribution.
Recommendation

A matrix factorization model trained by regularized alternating least-squares.
Alternating Least Squares matrix factorization.

Represents a (user, product, rating) tuple. 
Regression

Class that represents the features and labels of a data point. 

A linear model that has a vector of coefficients and an intercept. 

A linear regression model derived from a least-squares fit.
Train a linear regression model with no regularization using Stochastic Gradient Descent. 


A linear regression model derived from a least-squares fit with an L2 penalty term.
Train a regression model with L2 regularization using Stochastic Gradient Descent.


A linear regression model derived from a least-squares fit with an L1 penalty term.
Train a regression model with L1 regularization using Stochastic Gradient Descent.


Regression model for isotonic regression. 
Isotonic regression. 
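Isotonic regression fits a non-decreasing sequence that is closest to the observed values in the weighted least-squares sense. One common way to compute it is the pool adjacent violators approach, sketched sequentially below. The function name `isotonic_fit` is hypothetical, positive weights are assumed, and the sketch makes no claim about how MLlib parallelizes or implements the fit.

```python
def isotonic_fit(values, weights=None):
    """Return the non-decreasing weighted least-squares fit to `values`."""
    if weights is None:
        weights = [1.0] * len(values)
    blocks = []  # each block is [weighted mean, total weight, number of points]
    for v, w in zip(values, weights):
        blocks.append([float(v), float(w), 1])
        # Merge backwards while the last two blocks violate the ordering.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            total = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / total, total, c1 + c2])
    fitted = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted


fitted = isotonic_fit([1.0, 3.0, 2.0, 4.0])
```

Here the out-of-order pair 3.0, 2.0 is pooled into a single block at its mean, while the already ordered endpoints are left as they are.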


Base class that has to be inherited by any StreamingLinearAlgorithm. 

Train or predict a linear regression model on streaming data. 
Statistics

Trait for multivariate statistical summary of a data matrix. 

Contains test results for the chi-squared hypothesis test.
Represents a (mu, sigma) tuple.

Estimate probability density at required points given an RDD of samples from the population. 
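As a sketch of what estimating a density at required points involves, the following plain-Python function averages kernels centered on each sample and evaluates the result at the query points. The Gaussian kernel, the `bandwidth` parameter, and the name `gaussian_kde` are assumptions made for this example, and the samples are held in a list rather than an RDD.

```python
import math


def gaussian_kde(samples, points, bandwidth=1.0):
    """Evaluate a Gaussian kernel density estimate at each query point."""
    # Normalizing constant shared by every kernel term.
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((p - s) / bandwidth) ** 2) for s in samples)
        for p in points
    ]


densities = gaussian_kde(samples=[0.0, 1.0, 1.5], points=[0.0, 1.0, 5.0])
```

A smaller bandwidth follows the samples more closely, while a larger one produces a smoother estimate.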



Contains test results for the Kolmogorov-Smirnov test.
Tree

A decision tree model for classification or regression. 
Learning algorithm for a decision tree model for classification or regression. 


Represents a random forest model. 
Learning algorithm for a random forest model for classification or regression. 


Represents a gradient-boosted tree model.
Learning algorithm for a gradient-boosted trees model for classification or regression.
Utilities
Mixin for classes which can load saved models using their Scala implementation.

Mixin for models that provide save() through their Scala implementation. 

Utils for generating linear data. 

Mixin for classes which can load saved models from files. 

Helper methods to load, save and preprocess data used in MLlib. 

Mixin for models and transformers which may be saved as files. 