Andrey Germanov

Machine learning with Julia – How to Build and Deploy a Trained AI Model as a Web Service

Table of contents

Introduction
About you
Why Julia?
Install Julia and Jupyter notebook support
Julia basics
    Linear algebra
    Working with datasets
    Visualizing data
Overview of Titanic machine learning problem
Prepare the training data for machine learning
    Fix missing values
    Fix non-numeric data
    Visual data analysis
Train machine learning model
Make predictions and submit them to Kaggle
Deploy the model to production
    Export the model to a file
    Create the frontend
    Create the backend
Conclusion

Introduction

Julia is a general purpose programming language well suited for numerical analysis and computational science. It is sometimes described as the future of machine learning and the most natural replacement for Python in this field.

This article introduces the Julia language and its ecosystem, and shows how to use it to solve the Titanic machine learning competition and submit the result to Kaggle. In addition, it shows how to deploy the created machine learning model to production as a web service and how to create a web interface to send prediction requests to this service from a web browser.

By the end of the article, you will have created a simple AI-powered web application that can be used as a template for creating more complex Julia ML solutions.

About you

This is an article, not a book, so it can't cover everything and assumes that you already have some base knowledge to get the most from reading it. It is essential that you are familiar with Python machine learning and understand how to train machine learning models using the Numpy, Pandas, SciKit-Learn and Matplotlib Python libraries. I also assume that you are familiar with machine learning theory: types of machine learning problems like regression and classification, the concept and process of supervised machine learning (fit/predict and evaluating quality using metrics) and the common models used for it, including the Random Forest classifier and its implementation in the SciKit-Learn Python library. Additionally, it would help if you have previously participated in Kaggle competitions, because to understand and run all the code in this article you need an account on https://kaggle.com.

There are many books and articles already written, and many courses already released, about the topics described above. In this article, I only show how to create, train and deploy a basic machine learning model using Julia, without diving into the theoretical aspects of ML and AI.

Why Julia?

For a long time, Python has been known as the standard for data science and machine learning because of its simplicity and its great set of libraries and tools. Among others, there are such great libraries as Numpy for linear algebra with vectors and matrices, Pandas for manipulating datasets, Matplotlib for data visualization and Scikit-Learn, which provides a uniform interface to well-known machine learning models. Furthermore, Jupyter Notebooks, which let you write and run Python code interactively right in a web browser, make a comfortable environment for data researchers to design and implement the whole machine learning cycle, even if they are not very experienced in programming.

However, all this is good for research in laboratories; at some point you need to go to production, and at that moment things change dramatically. Python was created in the early nineties and was never supposed to be fast, and its core was never designed for modern technologies like distributed computing. That is why, to make complex ML tasks production ready, a lot of third-party dependencies have to be installed and a lot of tricks have to be applied to that Python code to speed it up. Some companies even rewrite or convert Python machine learning models to faster languages like C++ before deploying them to production.

Julia aims to resolve these problems. This is what the authors wrote about their reasons for creating it:

We are greedy: we want more. We want a language that’s open source, with a liberal license. We want the speed of C with the dynamism of Ruby. We want a language that’s homoiconic, with true macros like Lisp, but with obvious, familiar mathematical notation like Matlab. We want something as usable for general programming as Python, as easy for statistics as R, as natural for string processing as Perl, as powerful for linear algebra as Matlab, as good at gluing programs together as the shell. Something that is dirt simple to learn, yet keeps the most serious hackers happy. We want it interactive and we want it compiled.

Source: The Julia blog.

So, from the ML perspective, Julia gets the best of both worlds: it aims to be as fast as C and as simple as Python. In addition, it has replacements for all the libraries that Python data scientists are used to working with:

Purpose                  | Python                | Julia
Linear algebra           | Numpy                 | Built-in arrays, LinearAlgebra package
Work with datasets       | Pandas                | DataFrames.jl
Data visualization       | Matplotlib            | Plots.jl
Classic machine learning | SciKit-Learn          | MLJ.jl, ScikitLearn.jl, BetaML.jl
Neural networks          | TensorFlow or PyTorch | Flux.jl, BetaML.jl

Read more about why Julia is a great choice for machine learning here.

Furthermore, Julia has a module that adds Jupyter Notebook support, so you can write Julia code there the same way as Python. All this makes Julia ready for machine learning tasks, including Kaggle competitions, in the same environment that you would use with Python. Let's install this environment and introduce some Julia ML basics.

Install Julia and Jupyter notebook support

To install Julia, follow this link: https://julialang.org/downloads/, download the Julia package for your operating system and run it. After a successful installation, you will be able to run the julia command to enter the Julia REPL environment, where you can write and run Julia code. To exit the REPL, enter the exit() command.

Also, you can write your code in any text editor and save it to files with the .jl extension. Then you can run your Julia programs with this command:

julia <filename>.jl

In addition, you can use VSCode to develop on Julia. It has a great extension for this: https://www.julia-vscode.org/.

However, the best option for developing machine learning and data science solutions is Jupyter Notebook, so ensure that it's installed before continuing. Then, install Jupyter support for Julia using the REPL:

  • Enter the REPL using the julia command
  • Import the Pkg module
using Pkg
  • Install the IJulia package
Pkg.add("IJulia")
  • Exit the REPL with the exit() command
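
Alternatively, you can launch Jupyter directly from Julia using IJulia's notebook() function (a quick sketch; on first run, IJulia may offer to install its own Jupyter distribution via Conda if none is found):

using IJulia

# Launch the Jupyter notebook server and open it in the browser
notebook()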

Then you can run Jupyter and create notebooks with Julia support.

Sometimes the julia command does not work in the terminal after installation on macOS. You can use the following workaround to fix this: https://discourse.julialang.org/t/how-can-i-be-able-to-use-binary-command-julia-in-mac-osx-terminal/22270

Julia basics

Julia has a simple syntax. If you're familiar with Python, it will be easy to start writing Julia. You can read more about basic Julia syntax in this article. Here I will only cover the features that are required for machine learning, and only the ones that will be used to solve the Titanic Kaggle competition. To learn more about each of these libraries and modules, I will provide useful links.

Create a new Jupyter notebook to enter and run all the code samples below.

Linear algebra

Basic linear algebra features are already integrated into the Julia standard library. Each 1D array is a vector, and each 2D array works like a Numpy array by default. You do not need to include any additional packages for this. For example, if you write and run this code:

A = [
    [1 2 3]
    [4 5 6]
    [7 8 9]
]
B = [
    [7 8 9]
    [4 5 6]
    [1 2 3]
]

A*B

it will do a matrix multiplication and output the following result:

3×3 Matrix{Int64}:
 18   24   30
 54   69   84
 90  114  138

For additional features, you can import the LinearAlgebra module.

using LinearAlgebra

Then you can use functions such as det, tr or inv on matrices to get their determinants, traces or inverses. Note that the matrix from the previous example is singular (its determinant is 0), so it has no inverse; let's change its last entry to make it invertible:

using LinearAlgebra

A = [
    [1 2 3]
    [4 5 6]
    [7 8 10]
]
println("Determinant: ",det(A))
println("Trace: ",tr(A))
println("Inverse: ")
inv(A)
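
Another common linear algebra task is solving a system of equations, which Julia expresses with the backslash operator. A minimal sketch, reusing the invertible matrix from above:

using LinearAlgebra

# Solve the linear system A*x = b with the backslash operator
A = [1 2 3; 4 5 6; 7 8 10]
b = [6, 15, 25]
x = A \ b
println(x)   # approximately [1.0, 1.0, 1.0]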

Find more about linear algebra features in the LinearAlgebra module documentation.

Working with datasets

To work with datasets, you have to install the external DataFrames.jl package. In addition, to load and save datasets as CSV files, you have to add the CSV.jl package.

The Julia package manager is implemented as the Pkg module, so you have to import it and then use the add function to install the required packages. Run this in your Jupyter notebook to install them:

using Pkg
Pkg.add("DataFrames")
Pkg.add("CSV")

Then, you can import installed modules to your program:

using DataFrames, CSV

The DataFrames module exports the DataFrame data type, which you will use to construct datasets and manipulate data frame objects.

Create a data frame

This is how you can create a data frame with two columns:

df = DataFrame(name=["Julia", "Robert", "Bob","Mary"], 
age=[12,15,45,32])

This code will create and output the following dataset:


Select data from a data frame

To select data from a data frame, you can use the array syntax:

df[<rows>,<columns>]

You specify the range of rows to select in <rows> and the range of columns to select in <columns>. For example, this selects the first three rows of only the "age" column:

subs = df[1:3,"age"]

It's important to note that array indexing in Julia starts at 1, not at 0 as in many other languages. To select the first three rows and all columns, you can run this:

subs = df[1:3,:]

Also, to select a single column, you can use dot syntax:

names = df.name

As you see, each column is a native Julia array (vector).

You can use conditions to specify row ranges. For example, this selects all persons from the dataset that are older than 15:

older = df[df.age .>15,:]
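
Conditions can also be combined using element-wise boolean operators. A small sketch (the parentheses around each comparison are required):

# Select rows where age is between 16 and 40
middle = df[(df.age .> 15) .& (df.age .< 40), :]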

Sort data in a data frame

To sort data in a data frame, you can use the sort function. This will sort the dataset by age in ascending order:

sort(df,"age")

and the next code will sort it in descending order:

sort(df,"age",rev=true)

Add columns to a data frame

To add a new column, just use dot syntax:

df.sex = ["female","male","male","female"]

This adds a sex column to the data frame.


Remove columns from a data frame

The select function can be used for more complex data extraction from frames. In particular, it can extract all columns except the specified ones, which is equivalent to removing those columns:

new_df = select(df,Not("sex"))

This code returns a new data frame by selecting all columns from the original except sex.

Group and summarize data in data frame

The groupby and combine functions are used to group data and show summary information for each group. The former groups data by the specified field or fields, and the latter adds summary columns, like the number of rows in each group or the average value of some column in the group. The next code groups the data by sex, calculates the number of rows in each group and adds it as a "count" column:

group_df = groupby(df,"sex")
combine(group_df,nrow => "count")

So, the first line of this code creates a GroupedDataFrame object with rows grouped by "sex". The second line creates the "count" column with the count of items in each group. There are 2 females and 2 males in this dataset.

Also, a custom function can be used to calculate summary data. For example, this adds both row counts and average ages for each group:

combine(group_df, 
    nrow => "count", 
    "age" => ((rows) -> sum(rows)/length(rows)) => "Average Age"
)

This code adds the "Average Age" column, produced from the values of the "age" column by applying a custom anonymous function that calculates the average of the values in each group.
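
For common statistics you do not actually need an anonymous function: the mean function from the Statistics standard library can be passed directly (a small equivalent sketch):

using Statistics

# The same summary, using the built-in mean function
combine(group_df, nrow => "count", "age" => mean => "Average Age")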

These were just a few percent of all the possible manipulations that you can do with data using the DataFrames.jl library. Read more about it in the documentation.

Visualizing data

Using Plots.jl, you can create many different graphs to analyze your data, similar to Matplotlib or Seaborn in Python. To use it, you have to install the Plots package and import it:

using Pkg
Pkg.add("Plots")
using Plots

Let me provide a few examples of graphs.

Line chart

plot(
    [1,2,3,4,5],
    [3,6,9,15,16],
    title="Basic line chart",label="Line"
)

Scatter plot

plot(
    [1,2,3,4,5],
    [3,6,9,15,16],
    title="Basic scatter plot",
    label="Data",
    seriestype="scatter"
)

Bar chart

The next code generates a bar chart from the df dataset that was created earlier.

plot(
    df.name,
    df.age,
    title="Ages",
    label=nothing,
    seriestype="bar"
)
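
Plots.jl supports many other series types as well, for example histograms. A quick illustrative sketch:

# Histogram of 1000 normally distributed random values
histogram(randn(1000), bins=30, title="Basic histogram", label=nothing)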

There is much more that you can do using Plots.jl. Read more about its features in the documentation.

After this short overview of the basic data science features of Julia, it's time to create and train the first machine learning model and evaluate its quality in the competition.

Overview of Titanic machine learning problem

The "Titanic - Machine Learning from Disaster" is one of the first educational machine learning problems that you could see in books, articles or courses. In this task you are provided with a dataset of data about Titanic passengers. Each passenger data includes an ID, name, sex, ticket cost, ticket class, cabin number, port of embarkation and number of family members. For passengers in this dataset is known did they survive or not in "Survived" column. If the passenger survived, the value is 1, if not then 0. Formally, this is called a labeled or training dataset. All data columns except one called the "feature matrix", and the "Survived" column called the "labels vector".

There is also a second dataset with the same data about other passengers, but without the "Survived" column. In other words, this dataset contains only the features matrix and no labels vector. This is called the testing dataset. The task is to train a machine learning model on the training dataset and use it to predict the "Survived" values for the testing dataset or, in other words, to predict the labels vector of the testing dataset from its features matrix.

The Kaggle competition is available here: https://www.kaggle.com/competitions/titanic


Briefly read the description, then open the "Evaluation" section to discover how Kaggle will evaluate the predictions that you submit.

Prepare the training data for machine learning

The "Data" tab on the Kaggle competition page contains training and testing datasets in train.csv and test.csv files, along with descriptions for each data column.

Create a new Jupyter notebook with a Julia kernel and download these files to the same folder as your notebook.

Load train.csv into a data frame using the CSV module:

# Add packages
using Pkg
Pkg.add("DataFrames")
Pkg.add("CSV")

# Import modules
using DataFrames, CSV

# Load training data to data frame
train_df = CSV.read("train.csv", DataFrame)

In case of errors, check that the train.csv file exists in the folder where you run your notebook.

If there are no errors, it will show the first rows of the data:


As you can see, this dataset has 891 rows and 12 columns. It contains basic data about each passenger, like "Name", "Sex" and "Age". In addition, we see the "Survived" column, with 0 if the passenger did not survive and 1 if they survived.

Let's see the summary information about this data using the describe function:

describe(train_df)

This summary table shows information about each column, including the min, max, mean and median of the data in it. The basic goal of data preparation is to transform these columns into a features matrix and a labels vector. The labels vector is ready: this is the "Survived" column with numeric values. All other columns form the features matrix, and not everything is ok with them.

Let's look at the nmissing and eltype values for each column. The nmissing value shows the number of missing values in the corresponding column, and eltype shows the type of its values. The matrix should contain only numbers, but there are several columns of "String" data type. Also, the matrix should not have missing values, but we have some in the Age, Cabin and Embarked columns. Let's fix all this.
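
If you only want these two statistics, the describe function accepts the statistics to display as arguments (a small sketch):

# Show only the number of missing values and the element type of each column
describe(train_df, :nmissing, :eltype)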

Fix missing values

As the previous table shows, the Age, Embarked and Cabin columns contain missing values. The Embarked value is missing in only 2 rows, so we will not lose too much data if we just remove those rows. The DataFrames module has a handy dropmissing function that can be used for this:

train_df = dropmissing(train_df,"Embarked")

This will remove all rows with missing values in the Embarked column.

The Age column contains 177 missing values, and it's not a good idea to remove these rows, because we would lose about 20% of the data in the dataset. So, let's just fill them with something, for example the median value. The median age is 28, as displayed in the description table. Let's use the replace function to replace the missing ages with 28:

train_df.Age = replace(train_df.Age,missing=>28)
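
The value 28 here is hardcoded from the describe output. Alternatively, you can compute it with the median function from the Statistics standard library (a minimal equivalent sketch):

using Statistics

# Compute the median age, skipping missing values, instead of hardcoding it
median_age = median(skipmissing(train_df.Age))   # 28.0 for this dataset
train_df.Age = replace(train_df.Age, missing => median_age)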

The Cabin column contains 687 missing values, which is more than 50% of the dataset. There is too little data in this column for it to be useful for predictions, and it's difficult to guess which values these rows should contain when more data is missing than exists. So, let's just drop this column using the select function:

train_df = select(train_df, Not("Cabin"))

Finally, all missing data in the dataset has been fixed.

Fix non-numeric data

As said before, all data should be encoded as numbers, but the Name, Sex, Ticket and Embarked columns contain strings, and PassengerId, while numeric, is just a row identifier.

The Name and PassengerId values are unique for each passenger, which is why they can't be used by the ML model to split the data into categories or classify it. So, you can just remove these columns:

train_df = select(train_df,Not(["PassengerId","Name"]));

For the other string columns, we need to encode all text values as numbers. To do that, we first need to discover all the unique values of these columns. Let's start with Embarked:

combine(groupby(train_df,"Embarked"),nrow=>"count")

This code groups the dataset by the Embarked column and shows all possible values and their counts. There are only the "S", "C" and "Q" values, so it's easy to encode them as S=1, C=2 and Q=3. This can be done with the following replace call:

train_df.Embarked = Int64.(
    replace(train_df.Embarked, 
        "S" => 1, "C" => 2, "Q" => 3
    )
)

Also, the Int64.(...) broadcast converted the column from the "String" to the "Int64" data type.

Then, repeat the same for the Sex column:

combine(groupby(train_df,"Sex"),nrow=>"count")

and replace female=1 and male=2:

train_df.Sex = Int64.(
    replace(train_df.Sex, 
        "female" => 1, "male" => 2
    )
)

Now it's time to see the summary info for the Ticket column:

combine(groupby(train_df,"Ticket"),nrow=>"count")

Here we see that it has 680 different ticket values, which is more than 50% of the data, while we only need to predict two categories: survived or not survived. It's unlikely that this data can help the model make good predictions without additional processing that reduces the number of categories in this column. Although this goes beyond our current basic model, as additional practice you can play more with the data in this column to improve the prediction results: for example, try to find how to group tickets into more general categories and encode these categories with unique numbers, as sketched below.
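
One purely illustrative way to start such a grouping (a hedged sketch, not used further in this article) is to strip the digits and keep only each ticket's non-numeric prefix:

# Hypothetical grouping: remove digits, spaces and dots, so
# "A/5 21171" becomes "A/" and plain numeric tickets become ""
prefixes = [replace(t, r"[0-9 .]" => "") for t in train_df.Ticket]
combine(groupby(DataFrame(prefix=prefixes), "prefix"), nrow => "count")

For now, let's just remove this column: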

train_df = select(train_df,Not("Ticket"))

Now all string data is categorized, and all values have been replaced with category numbers. Let's describe the dataset again to ensure that all the problems with the data are resolved:

describe(train_df)

You can see that all columns contain only numeric data and there are no missing values in them.

Visual data analysis

Now the dataset is ready for training a machine learning model on it. But first, let's visualize the data to find some relations in it.

using Plots

# Group dataset by "Survived" column
survived = combine(groupby(train_df,"Survived"), nrow => "Count")

# Display the data on bar chart
plot(
    survived.Survived, 
    survived.Count, 
    title="Survived Passengers", 
    label=nothing, 
    seriestype="bar", 
    texts=survived.Count
)

# Modify X axis to display text labels 
# instead of numbers
xticks!([0:1:1;],["Not Survived","Survived"])

Here we see that 340 passengers survived. Now let's see how these passengers are distributed by sex.

# Group dataset by Sex column 
# and show only rows where Survived=1
survived_by_sex = combine(
    groupby(
        train_df[train_df.Survived .== 1,:],
        "Sex"), 
    nrow => "Count"
)

# Display the data on bar chart 
plot(
    survived_by_sex.Sex, 
    survived_by_sex.Count, 
    title="Survived Passengers by Sex", 
    label=nothing, 
    seriestype="bar", 
    texts=survived_by_sex.Count
)

# Modify X axis to display text 
# labels instead of numbers
xticks!([1:1:2;],["Female","Male"])

Interestingly, about twice as many females survived as males in the training dataset. Now let's see the distribution of passengers who did not survive by ticket class.

# Group dataset by PClass column 
# and show only rows where Survived=0
death_by_pclass = combine(
    groupby(
        train_df[train_df.Survived .== 0,:],
        "Pclass"), 
    nrow => "Count")

# Display the data on bar chart 
plot(
    death_by_pclass.Pclass, 
    death_by_pclass.Count, 
    title="Dead Passengers by Ticket class", 
    label=nothing, 
    seriestype="bar", 
    texts=death_by_pclass.Count
)

# Modify X axis to display 
# text labels instead of numbers
xticks!([1:1:3;],["First","Second","Third"])

This clearly shows that first and second class passengers had a better chance to survive than third class ones.

Assuming that the data in the training and testing datasets is distributed randomly, it's highly likely that a machine learning model trained on this data will predict that women in first or second class had a much better chance to survive than others. Let's remember this finding to check this hypothesis at the end of the article, after training and deploying the ML model.

Finally, let's see the cleaned training dataset again:

train_df

Now it really looks like a matrix or, to be more precise, like a system of linear algebraic equations written in matrix form. Data in matrix format is exactly what most machine learning algorithms expect as input. Let's get started.

Train machine learning model

For machine learning, we will use the ScikitLearn.jl library, which replicates the Scikit-Learn library from Python. It provides an interface to commonly used machine learning models like Logistic Regression, Decision Tree or Random Forest. ScikitLearn.jl is not a single package but a rich ecosystem with many packages, and you need to select which of them to install and import. You can find the list of supported models here. Some of them are built-in Julia models, others are imported from Python. ScikitLearn.jl also has a lot of tools to tune the learning process and evaluate results.

For this "Titanic" task, we will use the RandomForestClassifier model from the DecisionTree.jl package. Usually it works good for classification problems. Also, we will use the Cross Validation to calculate accuracy of model predictions from SciKitLearn.CrossValidation package. You have to install and import these packages before using them:

Pkg.add("DecisionTree")
Pkg.add("SciKitLearn")
using DecisionTree, SciKitLearn.CrossValidation

Then we will implement the training process. First, we need to split the training dataset into a features matrix and a labels vector. Then, we need to create the RandomForestClassifier model and train it using this data. Finally, we will evaluate the prediction accuracy of this model using the cross_val_score function.

# Put "Survived" column to labels vector
y = train_df[:,"Survived"]
# Put all other columns to features 
# matrix (important to convert to "Matrix" data type)
X = Matrix(train_df[:,Not(["Survived"])])

# Create Random Forest Classifier with 100 trees
model = RandomForestClassifier(n_trees=100)

# Train the model, using features matrix 
# and labels vector
fit!(model,X,y)

# Evaluate the accuracy of predictions 
# using Cross Validation
accuracy = minimum(
    cross_val_score(model, X, y, cv=5)
)

The cross validation splits the X and y arrays into 5 parts (folds) and returns an array of accuracies, one for each fold. Then the minimum function selects the worst accuracy from this array, which means that all the others are better than the selected one. The achieved accuracy is more than 0.78, which is 78% on our training data. It's not bad, but it does not guarantee the same result on the testing dataset. You can try to improve this value by selecting different models or by tuning their hyperparameters. For example, you can increase the number of trees (n_trees) from 100 to 1000 or reduce it to 10 and see how this changes the accuracy. After achieving the best result, it's time to use it for predictions.
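
A minimal sketch of such manual tuning, assuming the X and y variables defined above:

# Try several tree counts and report the worst-fold accuracy for each
for n in [10, 50, 100, 500, 1000]
    m = RandomForestClassifier(n_trees=n)
    println("n_trees=$n -> accuracy: ", minimum(cross_val_score(m, X, y, cv=5)))
end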

Make predictions and submit them to Kaggle

Now, when the model is ready, it's time to apply it to the data from the test.csv file, which does not have the "Survived" labels. First, we need to load it and look at the summary table, as we did for the training dataset:

test_df = CSV.read("test.csv",DataFrame)
describe(test_df)

Here you can see the same problems with the data: missing values and string columns. You need to apply exactly the same transformations to this data as you did to the training dataset, except removing any rows: Kaggle requires predictions for every row, so you can only fill missing values, not remove the rows that contain them. Fortunately, the Embarked column does not have missing values here, so there is no need to fix it. However, this dataset has a single missing value in the Fare column, which we did not have in the training set. It's not a big problem: you can just replace this missing value with the median, 14.4542.

But the first thing to do is to save the PassengerId column to a separate variable. It will be required later for the Kaggle submission.

PassengerId = test_df[:,"PassengerId"]

Then, apply all required data fixing:

# Repeat the same transformations as we did for training dataset
test_df = select(test_df,
    Not(
        ["PassengerId","Name","Ticket","Cabin"]
    )
)
test_df.Age = replace(test_df.Age,missing=>28)
test_df.Embarked = replace(
    test_df.Embarked,"S" => 1, "C" => 2, "Q" => 3
)
test_df.Embarked = convert.(Int64,test_df.Embarked)
test_df.Sex = replace(
    test_df.Sex,"female" => 1,"male" => 2
)
test_df.Sex = convert.(Int64,test_df.Sex)

# In addition, replace missing value
# in 'Fare' field with median
test_df.Fare = replace(
    test_df.Fare,
    missing=>14.4542
)

After the testing dataset is clean, you can use the trained model to make predictions:

Survived = predict(model, Matrix(test_df)) 

This code returns an array of predictions, one for each row of the testing dataset matrix, and saves it to the Survived variable.

Now it's time to submit it to Kaggle. Before doing that, look again at the "Evaluation" tab on the Kaggle Titanic competition page to see the required submission format:


The competition requires a CSV file with two columns: "PassengerId" and "Survived". You already have all this data. Let's create a data frame with these two columns and save it to CSV:

submit_df = DataFrame(PassengerId=PassengerId,Survived=Survived)
CSV.write("submission.csv",submit_df)

The first line of this code constructs the submit_df data frame with the PassengerId column that was saved previously and the Survived column with a prediction for each passenger ID. The second line saves submit_df to the submission.csv file. This is how the content of this file looks:


Finally, go to the Kaggle competition page, press the "Submit Predictions" button, upload the submission.csv file and see your result. When I did this, I received the following:


The prediction accuracy is 0.76555, which is more than 76% and close to the accuracy received on the training dataset. Not bad for the first time, but you can keep going: play with the data, try different models, change their hyperparameters, and surf the Internet for articles and Jupyter notebooks of other people who have solved the Titanic competition before. I know that it's possible to achieve up to 98% accuracy using various tricks with models and data.

Deploy the model to production

It's fun to play with machine learning on your computer, but by itself it does not mean much to the surrounding world. Usually, customers do not have Jupyter Notebooks, and they do not train models. They need simple tools that help them make decisions based on predictions from the data that they have. That is why the only really important thing is how your models will work in production. In this section, I will explain how to use Julia to create a web application that loads the machine learning model you trained to make predictions online in a web browser.

Export the model to a file

First, you need to save the model from the notebook to a file. For this, you can use the JLD2.jl module, which serializes Julia objects to an HDF5-compatible format (well known by Python data scientists) and saves them to a file.

Install and load the package to the notebook:

Pkg.add("JLD2")
using JLD2

and then save the model variable to the titanic.jld2 file:

save_object("titanic.jld2", model)
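
As an optional sanity check, you can load the object back in the notebook and confirm that it predicts the same labels as the in-memory model (model2 is just a throwaway name for this check):

# Load the model back and compare predictions on a few training rows
model2 = load_object("titanic.jld2")
println(predict(model2, X[1:3, :]) == predict(model, X[1:3, :]))   # should print true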

The work in the Jupyter notebook is now finished. All the following code should be written as a separate application. Create a folder for the new application, for example titanic, and copy the titanic.jld2 file into it.

Now create a text file titanic.jl, which will contain the code of the web application that you will write soon. Use any text editor for this, or VS Code with the Julia extension. Enter the following into titanic.jl:

using JLD2, DecisionTree
model = load_object("titanic2.jld2")
survived = predict(model,[1 2 35 0 2 144.5 1])
println(survived)

This code imports the required modules first. As you can see, just two modules are required to run the prediction process: JLD2 to load the model object, and DecisionTree to run the predict function of the RandomForestClassifier. The code loads the model from the file and then makes a prediction for a single row of data. The columns in this row should go in the same order as they were passed from the dataset when the model was trained: Pclass, Sex, Age, SibSp, Parch, Fare and Embarked. Finally, it prints the array of predictions. In this case, it will print an array with a single item, because only a single row of data was passed to the model.

You can run this code using the julia command:

julia titanic.jl

If everything works ok, it should print [0] or [1] to the console, depending on the prediction result. If you receive errors, then perhaps you need to install the JLD2 and DecisionTree packages using the Julia REPL environment, as you did in the Jupyter notebook.
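
For example, run this in the REPL (Pkg.add also accepts a list of package names):

using Pkg
Pkg.add(["JLD2", "DecisionTree"])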

Now, let's refactor this code into a function that receives a row of data and returns a survival prediction (either 0 or 1):

using JLD2, DecisionTree

# Returns 1 if a passenger with
# specified 'data' survived or 0 if not
function isSurvived(data)
    model = load_object("titanic2.jld2")
    survived = predict(model,data)
    return survived[1]
end

Create the frontend

The next step is to create a web interface that will be used to collect the data for this function. It will look as displayed in the next screenshot:


With this interface, the user can enter the data about a passenger, then press the "PREDICT" button and discover whether a passenger with this data could have survived on the Titanic. This is the HTML code of this web page:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Titanic</title>
</head>
<body>
    <table>
        <tbody>
            <tr>
                <td>Ticket class</td>
                <td>
                    <select id="pclass">
                        <option value="1">1</option>
                        <option value="2">2</option>
                        <option value="3">3</option>
                    </select>
                </td>
            </tr>
            <tr>
                <td>Sex</td>
                <td>
                    <select id="sex">
                        <option value="1">Female</option>                        
                        <option value="2">Male</option>
                    </select>
                </td>
            </tr>
            <tr>
                <td>Age</td>
                <td>
                    <input id="age" type="number"/>
                </td>
            </tr>
            <tr>
                <td># of Siblings/Spouces</td>
                <td>
                    <input id="sibsp" type="number"/>
                </td>
            </tr>
            <tr>
                <td># of Parents/children</td>
                <td>
                    <input id="parch" type="number"/>
                </td>
            </tr>
            <tr>
                <td>Fare</td>
                <td>
                    <input id="fare"/>
                </td>
            </tr>
            <tr>
                <td>Embarked</td>
                <td>
                    <select id="embarked">
                        <option value="1">S</option>
                        <option value="2">C</option>
                        <option value="3">Q</option>
                    </select>
                </td>
            </tr>
            <tr>
                <td>Survived</td>
                <td id="survived"></td>
            </tr>
            <tr>
                <td colspan="2">
                    <div>
                        <button id="submit" type="button">PREDICT</button>
                    </div>
                </td>
            </tr>
        </tbody>
    </table>
    <script>
        document.getElementById("survived").innerHTML = "";
        document.getElementById("submit").addEventListener("click",async() => {
            response = await fetch("http://localhost:8080",{
                method:"POST",
                body: JSON.stringify({
                    "pclass":parseInt(document.getElementById("pclass").value),
                    "sex":parseInt(document.getElementById("sex").value),
                    "age":parseFloat(document.getElementById("age").value),
                    "sibsp":parseInt(document.getElementById("sibsp").value),
                    "parch":parseInt(document.getElementById("parch").value),
                    "fare":parseFloat(document.getElementById("fare").value),
                    "embarked":parseInt(document.getElementById("embarked").value),
                })
            });
            const survivedCode =  parseInt(await response.text());
            document.getElementById("survived").innerHTML = survivedCode ? "YES" : "NO"
        })
    </script>
    <style>
        input,select {
            width:100%;
        }
        td {
            padding:5px;
        }
        td > div {
            text-align: center;
        }
        #survived {
            font-weight: bold;
            color:green;
        }
    </style>
</body>
</html>

Create an index.html file in the same folder and copy this code into it. The HTML part of the file contains a simple form with all the data fields. As you can see, all values are encoded to the same numbers that we used for the data in the training and testing datasets. The JavaScript part of this code defines the handler of the "PREDICT" button. When the user clicks it, the script collects all the entered data and serializes it as a JSON string. Then it makes an AJAX request to the web service running on port 8080 of localhost (which has not been created yet) and sends this JSON to it. So, the web service should be able to receive HTTP POST requests with a JSON body in the following format:

{
     "pclass": 1,
        "sex": 1,
        "age": 32,
      "sibsp": 5,
      "parch": 6,
       "fare": 123.44,
   "embarked": 1
}
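
Once the backend from the next section is running, you can also test this endpoint directly from Julia, without the browser. A minimal sketch using the HTTP.jl and JSON3.jl packages, with made-up passenger values:

using HTTP, JSON3

# Send a test prediction request to the local web service
body = JSON3.write((pclass=1, sex=1, age=32.0, sibsp=0, parch=0, fare=123.44, embarked=1))
resp = HTTP.post("http://localhost:8080", [], body)
println(String(resp.body))   # prints 0 or 1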

Create the backend

Now it's time to modify the titanic.jl file to make it work as a web server that can display the index.html page, receive a POST request from it, parse the body of this request as JSON, make a prediction based on this JSON data and return the prediction to the web page.

Creating a web server in Julia is as simple as in Python, Go, or Node.js. Using the HTTP.jl package, you can create and run a web server with just a few lines of code:

using HTTP

function handler(req)
    # handle HTTP request
end

HTTP.serve(handler, 8080)

The HTTP.serve function runs the web server on the specified port. Each time the web server receives a client request, it calls the specified handler function, passing an HTTP request object to it as the req argument. The function should read this request, process it and return a response to the calling client.

The req.url field contains the URL of the received request, the req.method field contains the request method, like GET or POST, and the req.body field contains the POST body of the request in binary format. The HTTP request object contains a lot of other information, all of which you can find in the HTTP.jl documentation. Our web application will only check the request method. If the received request is a POST request, it will parse req.body into a JSON object and send the data from this object to the isSurvived function to make a prediction and return it to the client browser. For all other request types, it will just return the content of the index.html file to display the web interface. This is how the whole source of the titanic.jl web service looks:

using JLD2, DecisionTree

# Returns 1 if a passenger with
# specified 'data' survived or 0 if not
function isSurvived(data)
    model = load_object("titanic.jld2")
    survived = predict(model,data)
    return survived[1]
end

using HTTP,JSON3

function handle(req)
    if req.method == "POST"
        form = JSON3.read(String(req.body))
        survived = isSurvived([
            form.pclass
            form.sex
            form.age
            form.sibsp
            form.parch
            form.fare
            form.embarked
        ])
        return HTTP.Response(200,"$survived")
    end
    return HTTP.Response(200,read("./index.html"))
end

HTTP.serve(handle, 8080)

Before running it, you need to install the HTTP.jl and JSON3.jl packages by running Pkg.add("HTTP") and Pkg.add("JSON3") in the Julia REPL environment.

The web service code goes right after the isSurvived function. First, the required modules are imported: HTTP to create the web server and JSON3 to parse the JSON from the request body. Then, the handle function is defined. It checks the method of each received request and, if it equals POST, converts the stringified JSON body of the request into the form object. Then, using the fields of this object, the isSurvived function is called. It's important to put the array items in the correct order here. Finally, the prediction result is returned to the client using the HTTP.Response function.

For all other request types, the function returns the content of the index.html file via the HTTP.Response(200,read("./index.html")) line.

Finally, the HTTP.serve function starts a web server on port 8080 that waits for HTTP requests and handles them using the handle function defined above.

Now you can run this by typing julia titanic.jl in a terminal or by pressing Ctrl+F5 in VSCode. Then you can access the web interface from a web browser at http://localhost:8080 and play with the service: enter data in the form, press the PREDICT button and see either YES or NO in the Survived line, depending on the prediction result. You can check the hypothesis that we made from the bar charts: women in first or second class have a much better chance to survive than others.

Conclusion

In this article, I introduced the Julia programming language along with its ecosystem and explained why it's so great for machine learning. I showed how to set up a comfortable development environment and gave a brief overview of the common Julia modules used for data science. Then I guided you through the process of training a machine learning model for the Titanic competition and showed how to make predictions and submit them to the Kaggle platform for scoring. Finally, I showed how to export this model to an external application, create a web service with this model and build a web interface that lets you enter data into a form and predict whether a person with this data would have survived on the Titanic.

For all the topics that were explained only briefly, I provided links to more thorough documentation. In addition, I would highly recommend reading the Julia Data Science online book and exploring the great set of machine learning examples in the Julia Academy Data Science GitHub repository.

See the source code of this article including the Jupyter Notebook and the web service in this repository:

https://github.com/AndreyGermanov/julia_titanic_model

Have fun coding and never stop learning!

Subscribe to the newsletter on my website: https://germanov.dev/#newsletter and follow me on social networks to know first about new articles like this one and other software development news:

LinkedIn: https://www.linkedin.com/in/andrey-germanov-dev/
Twitter: https://twitter.com/GermanovDev
Facebook: https://www.facebook.com/AndreyGermanovDev

Top comments (4)

Antonello Lobianco

Hello, thanks.

This is how you can do the same using the BetaML equivalent imputer / random forest:

cd(@__DIR__)  
using Pkg             
Pkg.activate(".")  
#Pkg.instantiate()

# Import modules
using DataFrames, CSV, BetaML

# Load training data to data frame
train_df  = CSV.read("data/train.csv", DataFrame)
train_df  = select(train_df,Not(["PassengerId","Name","Ticket"]));
y         = train_df[:,"Survived"]
X_partial = Matrix(train_df[:,Not(["Survived"])])
impMod    = RFImputer(n_trees=60,recursive_passages=2)
X         = fit!(impMod,X_partial)

sampler   = KFold(nsplits=5,nrepeats=2);
(μ,σ) = cross_validation([X,y],sampler) do trainData,valData,rng
                (xtrain,ytrain) = trainData; (xval,yval) = valData
                model = RandomForestEstimator(n_trees=100,force_classification=true)
                fit!(model,xtrain,ytrain)
                ŷval   =  predict(model,xval)
                return accuracy(collect(yval),ŷval)
        end # (0.826, 0.038)

# Actual model training
model            = RandomForestEstimator(n_trees=100,force_classification=true)
ŷ                =  fit!(model,X,y)
inSampleAccuracy = accuracy(y,ŷ) # 0.9472

# Saving models
model_save("titanic_model.jld";imputation_model=impMod,titanic_model=model)

# Test (possibly in production...)
test_df         = CSV.read("data/test.csv", DataFrame)
impModel, model = model_load("titanic_model.jld","imputation_model","titanic_model")
PassengerId     = test_df[:,"PassengerId"]
test_df         = select(test_df,Not(["PassengerId","Name","Ticket"]));
Xtest_partial   = Matrix(test_df)
Xtest           = predict(impModel,Xtest_partial)
Survived        = mode(predict(model,Xtest))
submit_df       = DataFrame(PassengerId=PassengerId,Survived=Survived)
CSV.write("submission_betaml.csv",submit_df) # 0.766

Unsurprisingly, the accuracy is the same, as the general algorithm is the same (although the code implementation of the underlying libraries is very different).

Andrey Germanov

Great, thank you! Also, I will try to use Neural network for this to compare results.
Perhaps will write article about this too.

Antonello Lobianco

Here it is...

Predicting titanic survivals using BetaML NeuralNetworkEstimator.

Note that compared to Random Forests, which "digest" any kind of data without too many complaints, here we need to clean the data a bit...


cd(@__DIR__)  
using Pkg             
Pkg.activate(".")  
#Pkg.instantiate()

# Import modules
using DataFrames, CSV, BetaML

# Load training data to data frame
train_df  = CSV.read("data/train.csv", DataFrame)
y         = collect(train_df.Survived) .+1

encoder_y, encoder_sex, encoder_embarked = OneHotEncoder(), OneHotEncoder(), OneHotEncoder()
y_oh        = fit!(encoder_y,y)
sex_oh      = fit!(encoder_sex,train_df.Sex) 
embarked_oh = fit!(encoder_embarked,train_df.Embarked)

train_df  = select(train_df,Not(["PassengerId","Name","Ticket","Cabin","Sex","Embarked","Survived"]));
X_partial = hcat(fit!(Scaler(),Matrix(train_df)),sex_oh,embarked_oh)
impMod    = RFImputer(n_trees=60,recursive_passages=2)
X         = fit!(impMod,X_partial)
(N,D)     = size(X)
DY        = size(y_oh,2)

sampler   = KFold(nsplits=5,nrepeats=2);
(μ,σ) = cross_validation([X,y],sampler) do trainData,valData,rng
                (xtrain,ytrain) = trainData; (xval,yval) = valData
                innerD = Int(round(D*2)) # These two hyper-parameters you would most likely want to tune by running different cross validations 
                epochs = 100
                layers  = [DenseLayer(D,innerD,f=relu),DenseLayer(innerD,DY,f=relu),VectorFunctionLayer(DY,f=softmax)];
                model   = NeuralNetworkEstimator(layers=layers,opt_alg=ADAM(),epochs=epochs,verbosity=NONE)
                ytrain_oh = predict(encoder_y,ytrain)
                fit!(model,xtrain,ytrain_oh)
                ŷval_oh = predict(model,xval)
                return accuracy(yval,ŷval_oh) 
        end # (0.8192, 0.018)

# Actual model training
innerD  = Int(round(D*2))
epochs  = 100
layers  = [DenseLayer(D,innerD,f=relu),DenseLayer(innerD,DY,f=relu),VectorFunctionLayer(DY,f=softmax)];
model   = NeuralNetworkEstimator(layers=layers,opt_alg=ADAM(),epochs=epochs)
ŷ_oh    =  fit!(model,X,y_oh)
inSampleAccuracy = accuracy(y,ŷ_oh) # 0.8406
hcat(y,mode(ŷ_oh))

# Saving models
model_save("titanic_model_nn.jld"; encoder_y, encoder_sex, encoder_embarked, imputation_model=impMod, titanic_model_nn=model)

# Test (possibly in production...)
test_df       = CSV.read("data/test.csv", DataFrame)
encoder_y, encoder_sex, encoder_embarked, impModel, model = model_load("titanic_model_nn.jld","encoder_y","encoder_sex", "encoder_embarked", "imputation_model","titanic_model_nn")
PassengerId   = test_df[:,"PassengerId"]

sex_oh      = predict(encoder_sex,test_df.Sex) 
embarked_oh = predict(encoder_embarked,test_df.Embarked)

test_df       = select(test_df,Not(["PassengerId","Name","Ticket","Cabin","Sex","Embarked"]));
Xtest_partial = hcat(fit!(Scaler(),Matrix(test_df)),sex_oh,embarked_oh)
Xtest         = predict(impMod,Xtest_partial)

Survived      = mode(predict(model,Xtest))
submit_df     = DataFrame(PassengerId=PassengerId,Survived=(Survived .- 1))
CSV.write("submission_betaml_nn.csv",submit_df) # 0.78
Andrey Germanov

Great! Added the BetaML library to this article as a library for ML and neural networks. Will look at it in more detail.