Machine Learning
Solving a classification problem in FinTech with Node.js and TensorFlow.
Motivation
At Kontist we provide a banking app for freelancers. The user can select a category for each of their transactions. For example, “Business expense,” “Private,” “Business income,” “Tax pay,” and more. Based on this selection, we then do tax calculations for the freelancer to support their savings.
In the current user interface flow, the user selects a category from a list every time a new transaction comes in. To improve the user experience, we would like to automate the category selection. The naïve approach is to create manual rules like, “If the sender of a transaction was used in a transaction before, then just use the same category.” Obviously, this has some shortcomings:
- Imagine the sender “Amazon.” One transaction could be “Private,” but the next one could be a “Business expense,” and our approach would fail.
- How should we categorize transactions from new and unknown senders?
- We could refine the rules by including more data, for example, by looking not only at the sender but also at the transaction amounts. But adding more and more manual rules to improve the accuracy would make the code complex and unwieldy.
So instead, the approach we took was to create a machine learning solution. First, we built a model and trained it with some existing transactions for which the category was known. Then we used that model to make predictions about upcoming transactions.
Introduction to classification
Classification is a task that assigns a label to some data based on what was learned from previous data. In our case, the labels are categories (“Business expense,” “Private,” “Business income,” “Tax pay,” et cetera) and the data are the transactions.
In general, the process looks like this:
- Define our model.
- Train the model with known data.
- Use the model to make predictions.
Choose the right inputs
Not all properties of a transaction help us with the classification. For example, it is obvious that a random UUID property cannot help the model make predictions. In fact, we found that only a couple of properties have any influence on the prediction at all. The properties that we do use as input for training and prediction are called “input features.” The categories, in turn, are called the “output labels.”
Layers and Neurons
/neurons.png (Simplified overview)
Looking at this image we can see that each input feature corresponds to one neuron at the left, and each output label corresponds to one neuron at the right.
In between, we have several neurons organized in multiple hidden layers. Neurons are connected from one layer to the next, and each connection carries a specific weight. You could say the values (also called probabilities) of the output labels are just sums of the neuron values multiplied by the weights of their connections. Put simply, training the model is the process of finding the correct weights for all connections between the neurons.
/neurons-weights.png (Sample weights; 62% of input data is predicted to be in the business expense category.)
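To make the weighted-sum idea concrete, here is a tiny sketch with made-up numbers (the inputs and weights below are purely illustrative, not taken from our model):

// Toy forward pass: a weighted sum per output neuron, followed by a
// softmax that turns the sums into probabilities adding up to 1.
const inputs = [0.2, 0.9, 0.4]; // normalized input feature values (made up)
const weightsPerOutput = [
  [0.8, 0.3, 0.5], // weights into output neuron 0, e.g. "Business expense"
  [0.1, 0.4, 0.2], // weights into output neuron 1, e.g. "Private"
];

const sums = weightsPerOutput.map((weights) =>
  weights.reduce((acc, w, i) => acc + w * inputs[i], 0)
);

const exps = sums.map(Math.exp);
const total = exps.reduce((a, b) => a + b, 0);
const probabilities = exps.map((e) => e / total);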
Our setup
The backend is a Node.js and TypeScript environment. The transaction data comes from various sources, but we can access all of it via a PostgreSQL database.
Luckily, there is already a JavaScript binding for TensorFlow (called TensorFlow.js).
So, we can define a sequential model as described above. It consists of four layers. The first is the input layer, where we enter our features. This is implicitly added to the model. In addition, we have two hidden layers and a layer for the output labels.
import * as tf from "@tensorflow/tfjs-node";

const inputFeaturesCount = ...
const categoriesCount = ...

const model = tf.sequential();

// Rule of thumb for the hidden layer size: roughly 75% of the
// combined input and output neuron count.
const units = Math.ceil((inputFeaturesCount + categoriesCount) * 0.75);

// First hidden layer; inputShape implicitly adds the input layer.
model.add(
  tf.layers.dense({
    units,
    inputShape: [inputFeaturesCount],
    activation: "relu",
  })
);

// Second hidden layer.
model.add(tf.layers.dense({ units }));

// Output layer: one neuron per category; softmax turns the raw
// values into probabilities that sum to 1.
model.add(
  tf.layers.dense({
    units: categoriesCount,
    activation: "softmax",
  })
);

model.compile({
  optimizer: "adam",
  loss: "categoricalCrossentropy",
  metrics: ["accuracy"],
});
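As a quick sanity check, the resulting architecture can be printed layer by layer:

// Prints each layer with its output shape and parameter count.
model.summary();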
Normalize everything
Before we can start to train our model, it is time to normalize the data; the input features must be numerical values.
For example, take the date of the booking, "2019-05-28 22:12." With the help of the moment.js library, this can be extracted into three input features:
import moment from "moment";

const dayOfMonth = +moment(bookingDate).format("D"); // 1-31
const dayOfWeek = +moment(bookingDate).format("d"); // 0 (Sunday) to 6
const hour = +moment(bookingDate).format("H"); // 0-23
To avoid complications, we want all the values to be normalized between 0 and 1. Therefore, as an extra step, we divide each value by its maximum possible value.
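For the date features above, that extra step could look like this (the divisors are simply the natural maximum of each feature):

// Divide each feature by its maximum value so all inputs are in [0, 1].
const normalizedDayOfMonth = dayOfMonth / 31; // days range from 1 to 31
const normalizedDayOfWeek = dayOfWeek / 6; // moment's "d" ranges from 0 to 6
const normalizedHour = hour / 23; // hours range from 0 to 23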
Another part of the preparation for training is to evenly distribute the data. In our case, we have a lot more training data that is marked as "Business Purpose" than "Private." TensorFlow offers a nice way to handle that; it allows the user to set a class weight for each label corresponding to the distribution in the training data set. Note that these class weights are not to be confused with the actual weights of the connections between the neurons.
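A sketch of how such class weights could be derived (the counts are hypothetical; the resulting object would be passed as the classWeight option of model.fit):

// Hypothetical label counts from our training data.
const labelCounts = [800, 200]; // e.g. "Business Purpose", "Private"
const totalCount = labelCounts.reduce((a, b) => a + b, 0);

// Weight each class inversely proportional to its frequency so that
// rare categories contribute as much to the loss as common ones.
const classWeight: { [label: number]: number } = {};
labelCounts.forEach((count, label) => {
  classWeight[label] = totalCount / (labelCounts.length * count);
});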
What does the crowd say?
It turns out that some good input features do not come directly from the transaction itself. We can look at how the user in question, or other users, categorized transactions with the same IBAN in the past. This can give a strong indication of how to predict future transactions.
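As an illustration (the helper below is hypothetical, not our production code), such a feature could be the share of past transactions with the same IBAN that fell into each category:

// Hypothetical helper: fraction of past transactions per category for
// one IBAN; the result is a vector of values in [0, 1], one per category.
function categoryShares(pastCategories: string[], allCategories: string[]): number[] {
  return allCategories.map((category) => {
    if (pastCategories.length === 0) return 0; // unknown sender: no history
    const matches = pastCategories.filter((c) => c === category).length;
    return matches / pastCategories.length;
  });
}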
Training
Time to train our model. We take our training data, shuffle it (see the sketch after this list), and split it into two parts:
- The actual training data (80%)
- Some validation data (20%)
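The shuffling itself only takes a few lines; a sketch, assuming features and labels are two parallel arrays holding one row of input features and the matching label per transaction:

import * as tf from "@tensorflow/tfjs-node";

// Shuffle both arrays with the same permutation: shuffle an index
// array in place, then reorder features and labels by it.
const indices = features.map((_, i) => i);
tf.util.shuffle(indices);

const shuffledFeatures = indices.map((i) => features[i]);
const shuffledLabels = indices.map((i) => labels[i]);

Shuffling first matters because the validationSplit option used below takes the validation set from the end of the data.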
First, TensorFlow uses the training data to try to find good weight values for the connections between the neurons: weights such that the sums of the neuron values multiplied by their connection weights produce output label values close to the expected ones.
The validation data will then be used to check if the training worked. We cannot use the training data to verify this; it would of course return perfect results since we just used it to create this model.
await model.fit(inputFeatureTensor, labelTensor, {
  // ...
  // Hold back the last 20% of the data for validation.
  validationSplit: 0.2,
  callbacks: [
    // Stop training once the validation loss stops improving.
    tf.callbacks.earlyStopping({
      monitor: "val_loss",
      mode: "min",
    }),
  ],
});
How does TensorFlow find these values? It iteratively adjusts the weights so that the discrepancy (the loss) between the predicted label values and the expected ones is minimized. Training is complete once this discrepancy stops improving on the validation data, which is exactly what the early-stopping callback above monitors.
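The adam optimizer we configured earlier is a refined variant of gradient descent: each weight is nudged in the direction that reduces the loss. A toy illustration for a single weight (not TensorFlow code, just the underlying idea):

// Minimize loss(w) = (w - 3)^2 by repeatedly stepping against the gradient.
const gradient = (w: number) => 2 * (w - 3); // derivative of (w - 3)^2
const learningRate = 0.1;

let w = 0;
for (let step = 0; step < 100; step++) {
  w -= learningRate * gradient(w); // move opposite to the gradient
}
// w ends up close to 3, the value that minimizes the loss.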
Making predictions
We now have a model and can start making predictions. Our incoming data must be in the same format as our training data, meaning we must apply the same normalization.
All that is left to do is call model.predict, which returns a list of the probabilities that the input matches each category. The one with the highest probability is selected as the category of the transaction.
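A sketch of that step (the feature values are made up; in practice they come out of the same normalization code as the training data):

import * as tf from "@tensorflow/tfjs-node";

// One incoming transaction, normalized like the training data.
const input = [28 / 31, 2 / 6, 22 / 23];

const output = model.predict(tf.tensor2d([input])) as tf.Tensor;
const probabilities = await output.data(); // one probability per category

// Pick the index with the highest probability; it maps to the
// predicted category.
let best = 0;
for (let i = 1; i < probabilities.length; i++) {
  if (probabilities[i] > probabilities[best]) best = i;
}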
Learnings
Native Binary
Internally, TensorFlow is a native binary that runs completely separately from Node.js; TensorFlow.js only provides bindings to it. The following sections explain two resulting considerations.
Dispose variables
TensorFlow doesn't automatically clean up memory after model operations like model.fit, model.predict, and so on. Therefore, we have to wrap these operations in tf.engine() scope calls:
tf.engine().startScope();
// ... model operations like model.fit or model.predict ...
tf.engine().endScope();
tf.engine().disposeVariables();
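A small helper can make sure the scope is always closed, even when the wrapped operation throws; this wrapper is our own sketch, not part of the TensorFlow.js API:

import * as tf from "@tensorflow/tfjs-node";

// Hypothetical wrapper: run a TensorFlow.js operation inside an engine
// scope and clean up afterwards, even if the operation fails.
async function withTfScope<T>(operation: () => Promise<T>): Promise<T> {
  tf.engine().startScope();
  try {
    return await operation();
  } finally {
    tf.engine().endScope();
    tf.engine().disposeVariables();
  }
}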
Running in parallel
We run multiple workers and processes. If they interact with the same TensorFlow.js instance, this creates a complication. Possible solutions are to run the processes in sequence, to block concurrent access, or to give each of them its own instance.
Limitations of tooling
A lot of the tools for optimizing and researching models are written in Python rather than JavaScript. For example, one cannot simply use TensorBoard to watch how the models behave. For further optimization of our machine learning code, we plan to investigate deeper integration with external tools.