Duomly

Posted on Dec 27, 2019 • Edited on Jun 29, 2020 • Originally published at blog.duomly.com

What is scikit learn — a beginner guide to popular machine learning Python library

#python #programming #machinelearning #beginners

This article was originally published at Scikit learn tutorial

Scikit-learn is one of the most widely used Python packages for data science and machine learning. It enables you to perform many operations and provides a variety of algorithms. Scikit-learn also offers excellent documentation of its classes, methods, and functions, as well as explanations of the background of the algorithms it implements.

Scikit-learn supports:

data preprocessing,
dimensionality reduction,
model selection,
regression,
classification,
cluster analysis.

It also provides several datasets you can use to test your models.

Scikit-learn doesn’t implement everything related to machine learning. For example, it doesn’t have comprehensive support for:

neural networks,
self-organizing maps (Kohonen’s networks),
association rule learning,
reinforcement learning, and so on.

Scikit-learn is built on NumPy and SciPy, so you need to understand at least the basics of these two libraries to effectively apply it.

Scikit-learn is an open-source package. Like most of the Python ecosystem, it’s free even for commercial use. It’s licensed under the BSD license.

This article aims to concisely present some of the possibilities of scikit-learn without going into too much detail.

Data Preprocessing

You can use scikit-learn to prepare your data for machine learning algorithms: standardize or normalize data, encode categorical variables, and more.

Let’s first define a NumPy array to work with:

>>> import numpy as np
>>> x = np.array([[0.1, 1.0, 22.8],
...               [0.5, 5.0, 41.2],
...               [1.2, 12.0, 2.8],
...               [0.8, 8.0, 14.0]])
>>> x
array([[ 0.1,  1. , 22.8],
       [ 0.5,  5. , 41.2],
       [ 1.2, 12. ,  2.8],
       [ 0.8,  8. , 14. ]])

You often need to transform data in such a way that the mean of each column (feature) is zero and the standard deviation is one. You can use the class sklearn.preprocessing.StandardScaler to do this:

>>> from sklearn.preprocessing import StandardScaler
>>> scaler = StandardScaler()
>>> scaled_x = scaler.fit_transform(x)
>>> scaler.scale_
array([ 0.40311289,  4.03112887, 14.04421589])
>>> scaler.mean_
array([ 0.65,  6.5 , 20.2 ])
>>> scaler.var_
array([1.6250e-01, 1.6250e+01, 1.9724e+02])
>>> scaled_x
array([[-1.36438208, -1.36438208,  0.18512959],
       [-0.3721042 , -0.3721042 ,  1.4952775 ],
       [ 1.36438208,  1.36438208, -1.23894421],
       [ 0.3721042 ,  0.3721042 , -0.44146288]])
>>> scaled_x.mean().round(decimals=4)
0.0
>>> scaled_x.mean(axis=0)
array([ 1.66533454e-16, -1.38777878e-17,  1.52655666e-16])
>>> scaled_x.std(axis=0)
array([1., 1., 1.])
>>> scaler.inverse_transform(scaled_x)
array([[ 0.1,  1. , 22.8],
       [ 0.5,  5. , 41.2],
       [ 1.2, 12. ,  2.8],
       [ 0.8,  8. , 14. ]])
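Standardizing is one option; normalizing to a fixed range is another. The following is a minimal sketch, assuming you want each column rescaled to the range [0, 1] with sklearn.preprocessing.MinMaxScaler. The variable names are chosen for illustration only:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

x = np.array([[0.1, 1.0, 22.8],
              [0.5, 5.0, 41.2],
              [1.2, 12.0, 2.8],
              [0.8, 8.0, 14.0]])

# Rescale each column to the default feature_range of (0, 1).
min_max_scaler = MinMaxScaler()
normalized_x = min_max_scaler.fit_transform(x)

# After the transformation, every column has minimum 0 and maximum 1.
print(normalized_x.min(axis=0))
print(normalized_x.max(axis=0))

# As with StandardScaler, the original values can be recovered.
restored_x = min_max_scaler.inverse_transform(normalized_x)
print(np.allclose(restored_x, x))
```

Which of the two scalers is the better fit depends on the algorithm and on how sensitive your data is to outliers.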

Sometimes, you’ll have categorical data and need to convert it to meaningful numbers. One way to do that is with the class sklearn.preprocessing.OneHotEncoder. Consider the following example with an array of roles in a company:

>>> from sklearn.preprocessing import OneHotEncoder
>>> roles = np.array([('Tom', 'manager'),
...                   ('Mary', 'developer'),
...                   ('Ann', 'recruiter'),
...                   ('Jim', 'developer')])
>>> roles
array([['Tom', 'manager'],
       ['Mary', 'developer'],
       ['Ann', 'recruiter'],
       ['Jim', 'developer']], dtype='<U9')
>>> encoder = OneHotEncoder()
>>> encoded_roles = encoder.fit_transform(roles[:, [1]])
>>> encoded_roles.toarray()
array([[0., 1., 0.],
       [1., 0., 0.],
       [0., 0., 1.],
       [1., 0., 0.]])

In the example above, the first column of the object encoded_roles indicates whether each employee is a developer. The second and fourth employees (Mary and Jim) are. The second column corresponds to the position of manager. Only the first employee (Tom) has this position. Finally, the third column corresponds to the recruiter, and the third employee (Ann) is the only one.
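If you’d rather not work out the column order by hand, the fitted encoder stores it. The sketch below, which uses only the role column for brevity, reads the order from the attribute .categories_ and maps the encoded rows back to labels with .inverse_transform():

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Only the role column, one row per employee.
roles = np.array([['manager'], ['developer'], ['recruiter'], ['developer']])

encoder = OneHotEncoder()
encoded_roles = encoder.fit_transform(roles).toarray()

# categories_ holds one array per input column;
# its order matches the order of the output columns.
column_names = list(encoder.categories_[0])
print(column_names)

# Map each one-hot row back to its original label.
decoded_roles = encoder.inverse_transform(encoded_roles)
print(decoded_roles.ravel())
```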

Dimensionality Reduction

Dimensionality reduction involves the selection or extraction of the most important components (features) of a multidimensional dataset. Scikit-learn offers several approaches to dimensionality reduction. One of them is principal component analysis (PCA).
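As an illustration, here is a minimal sketch of PCA with sklearn.decomposition.PCA. It assumes the bundled wine dataset as input and an arbitrary choice of two components. Standardizing first is a common practice, not a requirement of the class:

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

x, _ = load_wine(return_X_y=True)

# PCA is sensitive to feature scales, so standardize the columns first.
scaled_x = StandardScaler().fit_transform(x)

# Keep the two directions along which the data varies the most.
pca = PCA(n_components=2)
reduced_x = pca.fit_transform(scaled_x)

print(reduced_x.shape)                # 13 features reduced to 2
print(pca.explained_variance_ratio_)  # share of variance kept per component
```

The attribute .explained_variance_ratio_ helps you judge how much information is lost for a given number of components.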

Model Selection

When training and testing machine learning models, you need to split your datasets randomly into training and test sets. This includes both the inputs and their corresponding outputs. The function sklearn.model_selection.train_test_split() is useful in such cases:

>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> x, y = np.arange(1, 21).reshape(-1, 2), np.arange(3, 40, 4)
>>> x
array([[ 1,  2],
       [ 3,  4],
       [ 5,  6],
       [ 7,  8],
       [ 9, 10],
       [11, 12],
       [13, 14],
       [15, 16],
       [17, 18],
       [19, 20]])
>>> y
array([ 3,  7, 11, 15, 19, 23, 27, 31, 35, 39])
>>> x_train, x_test, y_train, y_test =\
...     train_test_split(x, y, test_size=0.4, random_state=0)
>>> x_train
array([[ 3,  4],
       [13, 14],
       [15, 16],
       [ 7,  8],
       [ 1,  2],
       [11, 12]])
>>> y_train
array([ 7, 27, 31, 15,  3, 23])
>>> x_test
array([[ 5,  6],
       [17, 18],
       [ 9, 10],
       [19, 20]])
>>> y_test
array([11, 35, 19, 39])
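For classification problems, you may also want both subsets to keep the class proportions of the full dataset. A minimal sketch, assuming the bundled wine dataset and the stratify parameter of train_test_split():

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

x, y = load_wine(return_X_y=True)

# stratify=y keeps the class proportions roughly equal in both subsets.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.4, random_state=0, stratify=y)

# Compare the share of each class in the full, training, and test sets.
print(np.bincount(y) / len(y))
print(np.bincount(y_train) / len(y_train))
print(np.bincount(y_test) / len(y_test))
```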

In addition to performing ordinary dataset splits, scikit-learn provides the means to implement cross-validation, tune the hyperparameters of your models with grid search, and calculate many quantities that show the performance of a model (e.g., the coefficient of determination, mean squared error, explained variance score, confusion matrix, classification report, F-measures, and many more).
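Cross-validation and grid search can be sketched in a few lines. The example below assumes the wine dataset, a k-nearest neighbors classifier, and an arbitrary list of candidate values for n_neighbors; none of these choices is a recommendation:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

x, y = load_wine(return_X_y=True)

# 5-fold cross-validation returns one accuracy score per fold.
scores = cross_val_score(KNeighborsClassifier(), x, y, cv=5)
print(scores.mean())

# Grid search evaluates each candidate value with 5-fold cross-validation.
param_grid = {'n_neighbors': [1, 3, 5, 7]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(x, y)
print(search.best_params_, search.best_score_)
```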

Datasets

Scikit-learn provides several datasets suitable for learning and testing your models. These are mostly well-known datasets. They are large enough to provide a sufficient amount of data for testing models, but also small enough to enable acceptable training duration.

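For example, the function sklearn.datasets.load_boston() returns data about house prices in the Boston area (the prices aren’t updated!). Note that load_boston() was deprecated in scikit-learn 1.0 and removed in version 1.2, so the examples that use it require an older release. There are 506 observations, while the input matrix has 13 columns (features):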

>>> from sklearn.datasets import load_boston
>>> x, y = load_boston(return_X_y=True)
>>> x.shape, y.shape
((506, 13), (506,))

This dataset is suitable for multi-variate regression.
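On a scikit-learn release that doesn’t include load_boston(), another bundled regression dataset can stand in. The sketch below uses load_diabetes(); this is a substitution for illustration, not the dataset this article is built around:

```python
from sklearn.datasets import load_diabetes

# A small regression dataset that ships with scikit-learn.
x, y = load_diabetes(return_X_y=True)
print(x.shape, y.shape)  # 442 observations, 10 features
```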

Another example is the dataset related to wine. It can be obtained with the function sklearn.datasets.load_wine():

>>> from sklearn.datasets import load_wine
>>> x, y = load_wine(return_X_y=True)
>>> x.shape, y.shape
((178, 13), (178,))
>>> np.unique(y)
array([0, 1, 2])

This dataset is suitable for classification. It contains 13 features of wines from three different cultivars grown in Italy. There are 178 observations.
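If you call the loader without return_X_y=True, you get a dictionary-like object that also carries the names of the features and classes. A short sketch:

```python
from sklearn.datasets import load_wine

# Without return_X_y, the loader returns a dictionary-like Bunch object.
wine = load_wine()

print(wine.data.shape, wine.target.shape)
print(wine.feature_names)  # one name per input column
print(wine.target_names)   # one name per class
```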

Regression

Scikit-learn supports a variety of regression methods, from linear regression, k-nearest neighbors, polynomial regression, support vector regression, and decision trees to ensemble methods like random forest and gradient boosting. It also supports neural networks, but not nearly to the same extent as specialized libraries like TensorFlow.

We’ll show the random forest regression here.

We usually begin our regression journey by importing the packages, classes, and functions we need:

>>> import numpy as np
>>> from sklearn.datasets import load_boston
>>> from sklearn.ensemble import RandomForestRegressor
>>> from sklearn.model_selection import train_test_split

The next step is to get the data to work with and split the set into the training and test subsets. We’ll use the Boston dataset in this article:

>>> x, y = load_boston(return_X_y=True)
>>> x_train, x_test, y_train, y_test =\
...     train_test_split(x, y, test_size=0.33, random_state=0)

Some methods require you to scale (standardize) your data, while with others, it’s optional. We’ll continue without scaling this time.

Now, we need to create our regressor and fit (train) it with the subset of data chosen for training:

>>> regressor = RandomForestRegressor(n_estimators=10, random_state=0)
>>> regressor.fit(x_train, y_train)
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=10,
                      n_jobs=None, oob_score=False, random_state=0, verbose=0,
                      warm_start=False)

Once the model is trained, we check its score (the coefficient of determination) on the training set and, more importantly, on the test set, i.e., with the data not used to fit the model:

>>> regressor.score(x_train, y_train)
0.9680930547240916
>>> regressor.score(x_test, y_test)
0.8219576562705848

A sufficiently good model can be used for predicting the outputs with some new input data x_new using the method .predict(): regressor.predict(x_new).
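Here is a minimal sketch of prediction and of two of the metrics from sklearn.metrics. It assumes the bundled diabetes dataset, so that it runs on current releases, and treats the first three test rows as the hypothetical new inputs x_new:

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

x, y = load_diabetes(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.33, random_state=0)

regressor = RandomForestRegressor(n_estimators=10, random_state=0)
regressor.fit(x_train, y_train)

# Treat the first three held-out rows as "new" inputs (an assumption).
x_new = x_test[:3]
y_pred_new = regressor.predict(x_new)
print(y_pred_new)

# .score() returns the coefficient of determination,
# the same value r2_score gives for the predictions.
y_pred = regressor.predict(x_test)
print(r2_score(y_test, y_pred), regressor.score(x_test, y_test))
print(mean_squared_error(y_test, y_pred))
```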

Classification

Scikit-learn handles classification in much the same way as regression. It supports various classification methods like logistic regression, k-nearest neighbors, support vector machines, naive Bayes, and decision trees, as well as ensemble methods like random forest, AdaBoost, and gradient boosting.

This article illustrates how to use the random forest method for classification. The approach is very similar to the one used for regression. However, now we use the wine dataset, define the classifier, and evaluate it with the classification accuracy instead of the coefficient of determination:

>>> import numpy as np
>>> from sklearn.datasets import load_wine
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.model_selection import train_test_split
>>> x, y = load_wine(return_X_y=True)
>>> x_train, x_test, y_train, y_test =\
...     train_test_split(x, y, test_size=0.33, random_state=0)
>>> classifier = RandomForestClassifier(n_estimators=10, random_state=0)
>>> classifier.fit(x_train, y_train)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)
>>> classifier.score(x_train, y_train)
1.0
>>> classifier.score(x_test, y_test)
1.0

A sufficiently good model can be used for predicting the outputs with new input data x_new using the method .predict(): classifier.predict(x_new).
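Accuracy alone can hide which classes get confused with each other. The sketch below repeats the same training setup and then, as an illustration, prints a confusion matrix, a classification report, and the class probabilities for one test sample:

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

x, y = load_wine(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.33, random_state=0)

classifier = RandomForestClassifier(n_estimators=10, random_state=0)
classifier.fit(x_train, y_train)
y_pred = classifier.predict(x_test)

# Rows are the true classes, columns are the predicted classes.
matrix = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])
print(matrix)

# Precision, recall, and F1 score for each class.
print(classification_report(y_test, y_pred))

# Class membership probabilities for the first held-out sample.
probabilities = classifier.predict_proba(x_test[:1])
print(probabilities)
```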

Cluster Analysis

Clustering is a branch of unsupervised learning extensively supported in scikit-learn. In addition to k-means clustering, it enables you to apply affinity propagation, spectral clustering, agglomerative clustering, etc.

We’ll show k-means clustering in this article. When implementing it, consider whether it makes sense to standardize or normalize your data and especially which measure of distance is suitable (in most cases it’s probably the Euclidean distance).

Again, we start with importing and getting data. This time, we’ll take NumPy and sklearn.cluster.KMeans:

>>> import numpy as np
>>> from sklearn.cluster import KMeans
>>> x = np.array([(0.0, 0.0),
...               (9.9, 8.1),
...               (-1.0, 1.0),
...               (7.1, 5.6),
...               (-5.0, -5.5),
...               (8.0, 9.8),
...               (0.5, 0.5)])
>>> x
array([[ 0. ,  0. ],
       [ 9.9,  8.1],
       [-1. ,  1. ],
       [ 7.1,  5.6],
       [-5. , -5.5],
       [ 8. ,  9.8],
       [ 0.5,  0.5]])

The next step is to scale the data. It’s not always mandatory, but in many cases it’s a really good idea. Once the data preprocessing is done, we create an instance of KMeans and fit it with our data:

>>> cluster_analyzer = KMeans(n_clusters=3, init='k-means++')
>>> cluster_analyzer.fit(x)
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)

Now, we’re ready to get the results such as the coordinates of the centers of clusters and the labels of the clusters each observation belongs to:

>>> cluster_analyzer.cluster_centers_
array([[ 8.33333333,  7.83333333],
       [-0.16666667,  0.5       ],
       [-5.        , -5.5       ]])
>>> cluster_analyzer.labels_
array([1, 0, 1, 0, 2, 0, 1], dtype=int32)

You can use the method .predict() to get the closest clusters for new observations.
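A minimal sketch of that step follows. The two points in x_new are hypothetical, and random_state and n_init are set explicitly here only to make the run repeatable. Because cluster numbering is arbitrary, the sketch compares the new labels with the labels of nearby training points instead of expecting particular numbers:

```python
import numpy as np
from sklearn.cluster import KMeans

x = np.array([(0.0, 0.0),
              (9.9, 8.1),
              (-1.0, 1.0),
              (7.1, 5.6),
              (-5.0, -5.5),
              (8.0, 9.8),
              (0.5, 0.5)])

cluster_analyzer = KMeans(n_clusters=3, init='k-means++',
                          n_init=10, random_state=0)
cluster_analyzer.fit(x)

# Hypothetical new observations, one near each of two existing groups.
x_new = np.array([(8.5, 7.0), (-4.0, -6.0)])
new_labels = cluster_analyzer.predict(x_new)
print(new_labels)

# Each new point gets the label of the cluster with the closest center.
print(new_labels[0] == cluster_analyzer.labels_[1])  # same cluster as (9.9, 8.1)
print(new_labels[1] == cluster_analyzer.labels_[4])  # same cluster as (-5.0, -5.5)
```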

Conclusions

This article shows the very basics of scikit-learn, a very popular data science and machine learning Python package. It’s one of the essential Python libraries for these purposes.

If you want to learn more about it, you can easily find many available resources. Duomly’s course on machine learning covers many functionalities of scikit-learn. As already mentioned, the official documentation is extensive and comprehensive. You should check it before applying classes or functions.

Thank you for reading!