Oluwafemi Paul Adeyemi

Supervised Learning


Supervised learning refers to machine learning methods that involve some features (input variables) and at least one target variable (output variable). A single feature can be used, but a single feature is unlikely to capture the full picture. There are two types of supervised learning, depending on whether the target is quantitative (also called numeric) or qualitative (also called categorical). If the target is qualitative, the supervised learning task is called classification; if the target is quantitative, it is called regression [1]. There are many supervised learning algorithms for classification and regression. A number of them are discussed below.

1. Linear Regression

The linear regression model is similar to the linear function $y = mx + c$ usually taught in high school, where y is the target (dependent) variable and x is the feature (independent variable). In linear regression, however, an error term is accounted for, so the straight-line equation becomes $y = mx + c + e$, which is called simple linear regression. Extending the same idea to several features gives $y = m_1x_1 + m_2x_2 + \dots + m_ix_i + c + e_i$, which is called multiple linear regression, where y is the target as before and the $x_i$'s are the features (independent variables). The idea is that there is a linear relationship between the target and the features, i.e. as the independent variable(s) change, there is an approximately linear increase or decrease in the target variable. If that is the case, linear regression can be used to build a model for the given dataset [2]. This algorithm cannot be used with categorical targets.
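
As a quick illustration, here is a minimal sketch of a multiple linear regression fitted with scikit-learn; the data, coefficients and number of features below are made up purely for this example.

```python
# A minimal sketch of multiple linear regression with scikit-learn.
# The data here is synthetic and purely illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                 # two features x1, x2
e = rng.normal(scale=0.1, size=100)           # error term
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0 + e   # y = m1*x1 + m2*x2 + c + e

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)          # estimates of m1, m2 and c
print(model.predict(X[:3]))                   # predictions for the first rows
```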

2. Logistic Regression

Logistic regression predicts the probability that an object belongs to a class, given that there are at least two classes, using the supplied set of features. For instance, during an admission process, a school may use a candidate's gender, weight, intelligence quotient, hobby, etc. as features (input variables) to predict whether the candidate should be admitted or not. Those admitted may be said to belong to class A while those not admitted belong to class B, so the target can either be A or B. Suppose that the probability that an object belongs to class A is $p_i$; then the probability that it belongs to class B is $q_i = 1 - p_i$. Logistic regression uses the formulas $p_i = \frac{1}{1 + e^{-(\beta_0 + \sum \beta_i x_i)}}$ and $\ln \left( \frac{p}{1-p} \right) = \beta_0 + \sum \beta_i x_i$. Hence, given a set of features, logistic regression produces the two probabilities $p_i$ and $q_i = 1 - p_i$, and the larger of the two determines the predicted class: class A if $p_i$ is larger, class B if $q_i$ is larger. This is a two-class logistic regression, called *binary logistic regression*, which is a special case of multinomial logistic regression, where there can be more than two classes [3]. This algorithm cannot be used when the target of interest is quantitative.
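
A minimal sketch of binary logistic regression with scikit-learn is shown below; the features and class rule are synthetic stand-ins for the admission example, not real admission data.

```python
# A minimal sketch of binary logistic regression with scikit-learn.
# Features and classes are synthetic stand-ins for the admission example.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                    # e.g. weight, IQ, test score
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)    # 1 = class A (admitted), 0 = class B

clf = LogisticRegression().fit(X, y)

proba = clf.predict_proba(X[:1])                 # [q_i, p_i] for one candidate
print(proba)                                     # the two probabilities sum to 1
print(clf.predict(X[:1]))                        # class with the larger probability
```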

3. Support Vector Machine (SVM)

SVM uses hyperplanes to divide a given dataset into two categories in order to classify data points. A margin is obtained for each hyperplane by measuring the distance from the hyperplane to the nearest points (called support vectors) on either side of it; see [4]. The hyperplane with the maximum margin is called the optimal hyperplane. This hyperplane, which is generated iteratively, minimizes the classification error. In other words, SVM aims at obtaining a maximum marginal hyperplane (MMH) that separates the dataset into categories [5]. In a two-class SVM, hyperplanes act as decision boundaries [6].

*(Scatter plot: two classes of data points separated by a single straight line, i.e. a linearly separable dataset.)*
A two-class SVM can be linear, in which case the data can be separated into two classes using a single straight line, as in the scatter plot above; the SVM is said to be non-linear if the data cannot be separated into two classes using a straight line. A kernel function is used with the SVM algorithm when the data is not linearly separable [4]. SVM can be used when the target variable has two or more classes, and the algorithm can be used for both regression and classification purposes.
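
The sketch below contrasts a linear SVM with a kernel SVM on a synthetic, non-linearly-separable dataset; the dataset and kernel choice are assumptions made only for illustration.

```python
# A minimal sketch of a two-class SVM with scikit-learn.
# An RBF kernel is used because the synthetic data is not linearly separable.
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, noise=0.1, factor=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)    # straight-line boundary
rbf_svm = SVC(kernel="rbf").fit(X_train, y_train)          # kernel trick

print("linear:", linear_svm.score(X_test, y_test))
print("rbf:   ", rbf_svm.score(X_test, y_test))
print("support vectors:", rbf_svm.support_vectors_.shape)  # points defining the margin
```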

4. Naive Bayes

This procedure uses Bayes' theorem to predict classes. The Bayes probability is $p(y_i|x) = \frac{p(x|y_i)\,p(y_i)}{\sum_{j} p(x|y_j)\,p(y_j)}$, where $p(y_i|x)$ is the probability that the target falls into the ith class given a set of features x. Hence $p(y_i|x)$ is calculated for all the classes and the observation is classified as $\hat{y} = \arg\max_i \, p(y_i|x)$. There are three basic types of Naive Bayes: Bernoulli, multinomial and Gaussian. The Bernoulli and multinomial Naive Bayes are used when we have categorical features, the Bernoulli being the simplest case of the multinomial. The Gaussian Naive Bayes is used for continuous features; in this case $p(x|y_i) = \frac{1}{\sigma_i \sqrt{2\pi}} e^{-\frac{1}{2\sigma_i^2}(x - \mu_i)^2}$ is first computed for each feature x [7]. This algorithm can only be used for classification purposes.
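
Since the iris dataset has continuous features, the Gaussian variant is the natural fit; the following is a minimal sketch with scikit-learn, with the dataset chosen only for illustration.

```python
# A minimal sketch of Gaussian Naive Bayes with scikit-learn,
# appropriate here because the iris features are continuous.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB().fit(X_train, y_train)
print(nb.predict_proba(X_test[:1]))   # p(y_i | x) for each class
print(nb.predict(X_test[:1]))         # class with the largest posterior
print(nb.score(X_test, y_test))
```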

5. Decision Tree

A decision tree is a tree that uses conditions (internal nodes or decision nodes) to split the tree (drawn upside down, with its root at the top) into branches (edges) until the end of a branch does not split anymore, i.e. until a leaf (decision) is reached. The order in which features are used to build the tree can be selected using information gain or the Gini index [5]. This algorithm can be used for classification and regression purposes.
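
Here is a minimal sketch of a decision tree classifier with scikit-learn; the dataset, depth limit and Gini criterion are illustrative choices, not requirements of the algorithm.

```python
# A minimal sketch of a decision tree classifier with scikit-learn.
# The split criterion can be the Gini index or entropy (information gain).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X_train, y_train)

print(tree.score(X_test, y_test))
print(export_text(tree))              # the internal nodes (conditions) and leaves
```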

6. Random Forest

This is an extension of the decision tree; as you know, a forest is made up of trees. Here, a number of decision trees are developed, and the average of the outcomes from the trees is taken for a regression task while the modal outcome (majority vote) is taken for a classification task. This algorithm uses a bagging scheme which can take one of two forms. The first kind of bagging, called bootstrap aggregating or simply bagging, repeatedly samples subsets of the training data with replacement and builds a tree from each subset. The second kind of bagging selects a subset of features for each candidate split in each tree (where each tree is still built from a subset of the training data), hence it is called feature bagging. The idea behind random forest is that sampling a subset of features or of the training data, or both, reduces the prediction variance of the model, since sub-sampling produces less correlated trees. See [5], [7] and [8]. This algorithm can be used for classification and regression purposes.
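
Both kinds of bagging appear directly as parameters in scikit-learn's random forest; the sketch below is illustrative, and the particular parameter values are assumptions rather than recommendations.

```python
# A minimal sketch of a random forest classifier with scikit-learn.
# bootstrap=True resamples the training data for each tree,
# and max_features controls the feature subset tried at each split.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,      # number of trees whose votes are aggregated
    bootstrap=True,        # each tree sees a bootstrap sample of the data
    max_features="sqrt",   # feature bagging at each candidate split
    random_state=0,
).fit(X_train, y_train)

print(forest.score(X_test, y_test))
```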

7. Extremely Randomized Trees

This is an extension of the random forest algorithm. Here, each tree is built using all the training data but with only a subset of the features, and the splitting of nodes is done at random. See [7] and [8]. This algorithm can be used for classification and regression purposes.
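
A minimal sketch with scikit-learn's extra-trees estimator is given below; the dataset and parameter values are, again, only illustrative.

```python
# A minimal sketch of extremely randomized trees with scikit-learn.
# By default each tree uses the whole training set (bootstrap=False)
# and split thresholds are drawn at random for the candidate features.
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

extra = ExtraTreesClassifier(n_estimators=200, max_features="sqrt", random_state=0)
extra.fit(X_train, y_train)
print(extra.score(X_test, y_test))
```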

8. K-Nearest Neighbours

This classifies a data point by selecting the k neighbours that are nearest to it, based on the measured distance between them and the data point of interest: the smaller the distance, the closer the neighbour. Suppose that a of the neighbours belong to class A while k − a belong to class B, and k − a > a; then the data point to be classified is assigned to class B. This algorithm is non-parametric and very slow on large datasets because it calculates the distance between the point of interest and every other point around it. It can be used for classification and regression purposes [7].
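
The sketch below shows k-nearest neighbours classification with scikit-learn; k = 5 and the iris dataset are arbitrary choices for illustration.

```python
# A minimal sketch of k-nearest neighbours classification with scikit-learn.
# The predicted class is the majority class among the k closest points.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)   # k = 5 neighbours
knn.fit(X_train, y_train)                   # essentially just stores the training data

print(knn.predict(X_test[:1]))              # majority vote among the 5 nearest neighbours
print(knn.score(X_test, y_test))
```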

Next: SL for Classification using Python and R

Previous: Introduction to Machine Learning

References

  1. Liu, Q., & Wu, Y. (2012). Supervised Learning. in: Seel, N.M. (eds) Encyclopedia of the Sciences of Learning. Springer, Boston MA. https://doi.org/10.1007/978-1-4419-1428-6_451
  2. Maulud, D. H., & Abdulazeez, A. M. (2020). A Review of Linear Regression Comprehensive in Machine Learning. Journal of Applied Science and Technology Trends, 1(4), 140–147
  3. Peng, J. (2002). An Introduction to Logistic Regression Analysis and Reporting. The Journal of Educational Research, 96(1), 3–14. DOI: 10.1080/00220670209598786
  4. Ruscica, T. (2019, November 23). Python Machine Learning & AI Mega Course - Learn 4 Different Areas of ML and AI [Video]. YouTube. https://www.youtube.com/watch?v=WFr2WgN9_xE
  5. Tutorialspoint (2019), Machine Learning with Python. Tutorialspoint.
  6. Decision boundary. (2023, April 29). In Wikipedia. https://en.wikipedia.org/wiki/Decision_boundary
  7. Buitinck, L., Louppe, G., Blondel, M., Pedregosa, F., Mueller, A., Grisel, O., … Varoquaux, G. (2013). API design for machine learning software: experiences from the scikit-learn project. In ECML PKDD Workshop: Languages for Data Mining and Machine Learning (pp. 108–122) (Check the scikit-learn documentation user guide)
  8. Random forest. (2023, July 25). In Wikipedia. https://en.wikipedia.org/wiki/Random_forest
