What is Machine Learning?
Machine learning is one of the hottest topics in computer science in recent years. With the rise of software like ChatGPT by OpenAI, Copilot by Microsoft, and countless other large language models that are powered by machine learning, it's becoming ever more relevant in our daily lives. From helping us write emails to answering our questions about life and the pursuit of happiness, these chatbots seem almost omnipotent.
But how do we get from machine learning to large language models? How do we get from training a model that can, for example, identify handwritten digits to something as complex as ChatGPT? I'll be publishing posts every week that goes further in depth into answering these questions, but for now, let's start with the basics.
The Basics
Machine learning is exactly what it is: making a machine (or more specifically, a computer) learn. More precisely, it's writing code to train our computers so that it can be used to make predictions later down the line. The result of this training is called a model.
In the handwritten digits example, you might think that we can just use existing images to train a model, but we can't just use any image of number. Since we each have our own way of writing digits, we can't just train on a screenshot of a number and then hope it can extrapolate to our handwriting. The numbers for a specific font are rendered exactly the same way on a computer screen whereas each time we write a number, there are extremely minor differences that could convince a computer that the number we wrote down is a different one. We need to train our model using the same type of data we want it to predict on. In other words, we want to train our model with images of handwritten digits so that it can do the same with images it hasn't seen before.
Google Colab
To get started, we can head over to Google Colab, a notebook-based coding environment. What that means is that you can split your code into individual cells that you can re-run and you can view the output of each one.
Notice how you don't need to use the print
function to view output in this scenario. The return value from the last line of a cell gets outputted by default, but if you want to see output from other lines, then you'll have to use print
.
Training A Model With scikit-learn
To Recognize Digits
Getting Our Data
There's a popular library to get started with machine learning called scikit-learn
that's already installed in Colab. We can use this to train our model using the MNIST dataset of handwritten digits.
# Importing the necessary libraries and functions that will be used later
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
mnist = fetch_openml('mnist_784')
mnist.data
mnist.target
images = mnist.data.to_numpy()
plt.imshow(images[0].reshape(28, 28), cmap='gray')
Now that we have our images, let's proceed with building our model! Since we're classifying images to a certain digit and also just to keep things simple, we should use a classification algorithm. One of the most common algorithms of this type is k-nearest neighbors, which looks at data points that are near in terms of Euclidean distance (or some other distance metric) and outputs the classification that is most common among them. This is appropriate in our scenario because images that have similar pixel weights should represent the same number.
Running Our Classifier
To train our model and test it appropriately, we'll need to split this large dataset into smaller training and testing datasets. Luckily, scikit-learn
provides a function for us named train_test_split
to do just that! However, it requires that mnist.data
and mnist.target
be combined into one data structure first. We can join them together using the join
method provided by the pandas
library.
mnist.data.join(mnist.target)
X_train, X_test, y_train, y_test = train_test_split(images, mnist.target.to_numpy(), test_size=0.2)
The test size can be adjusted arbitrarily but usually 80/20 is a good split.
We can then use the KNeighborsClassifer
that we imported at the beginning to train our model using our training dataset and then test the accuracy of our model using the test dataset. This snippet looks at the 5 nearest neighbors but feel free to change the number as you see fit!
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
accuracy_score(y_test, model.predict(X_test))
This output means that our model had an accuracy of 97% with the test dataset. That's a pretty good score! And congratulations, you just built your first machine learning model!
Where To Go From Here?
We used the scikit-learn
library in this tutorial, but for more intensive machine learning applications that, for instance, require more customization of your parameters or utilization of compute resources, it would be better to use a library like PyTorch or Tensorflow. I'll be going over both in later posts!
In the meantime, thanks for sticking with me. See you in the next one!
Top comments (0)