One of the things I'd like to get a little more into is machine learning. The sticking point I always run into is python. Not that it's a bad language, but since it occupies roughly the same spot on the pyramid as javascript, I'd rather use javascript. This is also because while python has an amazing machine learning ecosystem, javascript has a better ecosystem for pretty much everything else, and since the web is the best delivery platform it would just be easier. That, and python dependencies are somehow worse than NPM, full of platform- and runtime-version-dependent code. Every time I try to start something it becomes a mess of installing different python versions, dependencies, or worse, VMs.
If we step into the web world there's pretty much one option for an autograd framework: tensorflow.js. I've used this a bit, but sometimes it feels a bit too high-level, and following along with pytorch code can get pretty difficult because they work differently. It also has issues with native binaries. Part of what I wanted was to understand it under the hood, but also to make something a bit simpler and lighter-weight, even if it has garbage performance.
Before jumping in I'd like to highlight this awesome video series by Andrej Karpathy which contains a lot of the ideas I built upon: https://www.youtube.com/watch?v=VMj-3S1tku0
What is an autograd?
So first we should start with what an "autograd" is. It's basically a framework that provides a computational graph. In machine learning all we're doing is putting inputs into an equation and seeing what pops out the other end. The structure of this equation is usually designed by hand, but there are usually a lot of coefficients that we want to train to get good output. The process of training involves taking some inputs, calculating the "loss" relative to the ground truth, and then adjusting the parameters so the output gets a little bit closer. Concretely, we can think of an image detector. The input might be an image, a 2d array of pixels (or a 3d array if we're adding color), and the output might be a boolean that says whether it thought there was a dog. We then start by sending it pictures of things we know contain dogs and things that do not, and each time we test whether it gets the right answer; if not, we adjust the weights in the network.
Weight adjustment (also called back propagation) is basically an application of the chain rule from calculus. If we can determine the gradient at each weight, we can figure out which direction to adjust it so that it's more likely to get the right answer. So we follow the equation backwards and adjust the weights. The autograd framework does this for us. If we were to follow it back with pen and paper it would get complicated quickly. In our framework each term becomes a node in a graph, and then we can perform these adjustments automatically.
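To make the chain rule concrete before we build anything, here's a tiny standalone numeric sketch (just an illustration, not part of the framework): for y = f(g(x)) with f(u) = u² and g(x) = 3x, the chain rule says dy/dx = f′(g(x)) · g′(x), and a finite-difference check agrees.

```javascript
// Chain rule on a toy composition: y = (3x)^2, so dy/dx = 2*(3x)*3 = 18x
const g = x => 3 * x;
const f = u => u * u;

const x = 2;
const analytic = 2 * g(x) * 3; // f'(g(x)) * g'(x) = 36 at x = 2

// Finite-difference approximation of the same derivative
const h = 1e-6;
const numeric = (f(g(x + h)) - f(g(x))) / h;

console.log(analytic, numeric); // both ≈ 36
```

The backward pass we build later is exactly this multiplication of local derivatives, applied node by node through the graph.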
Tensors
The basic building block is a tensor. A tensor is pretty much a multidimensional array. When we build these graphs we are pretty much always using blocks of values rather than single values. This is because it lets us make use of hardware: writing things in a parallelizable way means we can use the massively parallel hardware in CPUs, GPUs and T/NPUs. We'll be using the CPU here, which only does one operation at a time, but at least in terms of representation we'll think in these blocks.
Tensors really need two things: the set of values and a shape. The shape frequently confuses me, but the idea is this: each tensor has a rank which determines how many dimensions it has. So if I said a 2x2 matrix, the rank would be 2 because it has two dimensions, and the size of each dimension would be 2. If I wanted a block of 2x2x3 then this is a rank 3 tensor with dimensions 2, 2, and 3. The dimensionality is related to the block, not what the block represents. So while a 1x3 tensor can represent a 3d point, it is not a 3d tensor. The rank can also be important because a 1x3 tensor may not be compatible with an interface that takes a 1x1x3 tensor, even though they have the same number of elements. In that case we usually have to do some sort of reshaping operation to make it fit, like adding an extra dimension. We'll represent the shape as an array of dimensions, so a 2x2x3 tensor has a shape of [2,2,3].
Shape notation
One thing we have to be very careful and considerate about is how the shape is represented. I find it most understandable to read the numbers the way I say them: x, y, z. This means that [1,2,3] means x=1, y=2, z=3, or row=1, column=2, depth=3. It also means that a shape index of 0 refers to x, because the row is on the left-most side. Libraries like pytorch behave differently and assume that [1,2,3] is read like a number where the left-most digit is the most significant, e.g. [Batch, Channel, Height, Width]. You can change the notation to resemble that, and perhaps I might in the future if I find it more useful for copying pytorch code, but we just need to understand what the convention is because our results may differ.
The second part is the values. At first you might think to represent the values using nested arrays in the shape of the tensor, but this is inefficient because of how we traverse it. Instead we can just use a flat array along with the shape. Also, instead of a regular array we should use a TypedArray. These are more efficient structures and can be passed to low-level APIs like those we'd find in WebGPU or WASM, in case we want to add more hardware support. For more details on this you can see my other series about matrix math: https://dev.to/ndesmic/series/23073.
To get the length of the flat array we need to take the product of all dimensions in the shape.
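For example, a shape of [2,2,3] holds 2 · 2 · 3 = 12 values:

```javascript
// The flat length is the product of the dimensions in the shape
const shape = [2, 2, 3];
const totalLength = shape.reduce((product, size) => product * size, 1);

console.log(totalLength); // 12
```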
Once we have these we have the basic storage mechanism for the tensor node.
//tensor.js
export class Tensor {
	#shape;
	#values;
	constructor({ values, shape }){
		const totalLength = shape.reduce((product, x) => product * x, 1);
		this.#shape = shape;
		if(values){
			if(values.length != totalLength) throw new Error(`Supplied values are the wrong length, need ${totalLength}`);
			if(values instanceof Float32Array){
				this.#values = values;
			} else {
				this.#values = new Float32Array(values);
			}
		} else {
			this.#values = new Float32Array(totalLength);
		}
	}
	get totalLength(){
		return this.#values.length;
	}
	get shape(){
		return this.#shape;
	}
	get values(){
		return this.#values;
	}
}
A few things here. We pass in the shape and the values. If we don't have values then we just allocate the array. For convenience, values can be a plain array or a typed array. If it's a plain array we make a typed array from it, but if it's already a typed array we take it as-is, because then we don't need to re-allocate the array (these can get big!). It's going to be important when dealing with tensors to make sure that the shapes agree. Usually this will mean they have the same total number of items, as we can kinda pretend they are compatible if at least that is true. This means fewer validation checks overall at the expense of being a bit sloppy. Also, Tensors are immutable; we should try not to change their internal values (though there may be a need at some point, we'll avoid it since it's a good way to introduce bugs).
Operations
The operations on the tensors are basically whatever math operations we think we'll need. Basic arithmetic is a good idea, and there are many ops common in machine learning that we can add as well. Let's start with the basics: add, sub, div, mul.
Note: In python people love operator overloading because it looks clean. I dislike it because it's not clear that something new is happening between two values, especially when it's not even clear what type those values are just by looking at them (it's like string + number bugs but worse). In Javascript we don't have that feature to even debate the merits, so we implement them as actual methods.
export class Tensor {
	#shape;
	#values;
	constructor({ values, shape }){
		const totalLength = shape.reduce((product, x) => product * x, 1);
		this.#shape = shape;
		if(values){
			if(values.length != totalLength) throw new Error(`Supplied values are the wrong length, need ${totalLength}`);
			if(values instanceof Float32Array){
				this.#values = values;
			} else {
				this.#values = new Float32Array(values);
			}
		} else {
			this.#values = new Float32Array(totalLength);
		}
	}
	get totalLength(){
		return this.#values.length;
	}
	get shape(){
		return this.#shape;
	}
	get values(){
		return this.#values;
	}
	add(other){
		if(other.totalLength != this.totalLength) throw new Error(`Tensor not the right length, argument was ${other.totalLength}, needs to be ${this.totalLength}`);
		const values = new Float32Array(this.totalLength);
		for(let i = 0; i < this.totalLength; i++){
			values[i] = this.#values[i] + other.values[i];
		}
		return new Tensor({ values, shape: this.#shape });
	}
	sub(other){
		if (other.totalLength != this.totalLength) throw new Error(`Tensor not the right length, argument was ${other.totalLength}, needs to be ${this.totalLength}`);
		const values = new Float32Array(this.totalLength);
		for (let i = 0; i < this.totalLength; i++) {
			values[i] = this.#values[i] - other.values[i];
		}
		return new Tensor({ values, shape: this.#shape });
	}
	mul(other) {
		if (other.totalLength != this.totalLength) throw new Error(`Tensor not the right length, argument was ${other.totalLength}, needs to be ${this.totalLength}`);
		const values = new Float32Array(this.totalLength);
		for (let i = 0; i < this.totalLength; i++) {
			values[i] = this.#values[i] * other.values[i];
		}
		return new Tensor({ values, shape: this.#shape });
	}
	div(other) {
		if (other.totalLength != this.totalLength) throw new Error(`Tensor not the right length, argument was ${other.totalLength}, needs to be ${this.totalLength}`);
		const values = new Float32Array(this.totalLength);
		for (let i = 0; i < this.totalLength; i++) {
			values[i] = this.#values[i] / other.values[i];
		}
		return new Tensor({ values, shape: this.#shape });
	}
}
A little verbose, but I'm wary of doing code-size optimization due to function call overhead (we can evaluate later whether it's a big deal). We use it like this:
const t1 = new Tensor({ shape: [2,2], values: [1,2,3,4] });
const t2 = new Tensor({ shape: [2,2], values: [5,6,7,8] });
const result = t1.add(t2);
This doesn't really get us very much yet. We've wrapped this concept of basic math ops but we can't do much else yet.
Building a graph
In order to actually build a graph we need references to the things that created the current value. So for an expression like 1 + 2 = 3, we have 3 nodes: 1, 2, and 3. 1 and 2 were created by us, but 3 is the result of adding the two of them. As such we want a tree that looks like this (the + is contained in the Tensor but it's easier to draw it like this):
We also need the operation that spawned it. So for these we can take in 2 new options: children and op. Then when we do an op we pass these into the newly created Tensor.
//tensor.js
//class Tensor {...
//constructor and add method
#children;
#op;
constructor({ values, shape, children, op }){
	const totalLength = shape.reduce((product, x) => product * x, 1);
	this.#shape = shape;
	if(values){
		if(values.length != totalLength) throw new Error(`Supplied values are the wrong length, need ${totalLength}`);
		if(values instanceof Float32Array){
			this.#values = values;
		} else {
			this.#values = new Float32Array(values);
		}
	} else {
		this.#values = new Float32Array(totalLength);
	}
	this.#children = children ?? [];
	this.#op = op ?? null;
}
get children(){
	return this.#children;
}
// ...
add(other){
	if(other.totalLength != this.totalLength) throw new Error(`Tensor not the right length, argument was ${other.totalLength}, needs to be ${this.totalLength}`);
	const values = new Float32Array(this.totalLength);
	for(let i = 0; i < this.totalLength; i++){
		values[i] = this.#values[i] + other.values[i];
	}
	return new Tensor({ values, shape: this.#shape, children: [this, other], op: "+" });
}
//...do the same for the other ops
Getting the gradient
Next we want to get the gradient. So we build up an expression let's say:
(2 + 3) * (4 + 5) + 6 * 7 - 8
Let's instead give the terms labels
(a + b) * (c + d) + e * f - g
where:
a = 2
b = 3
c = 4
d = 5
e = 6
f = 7
g = 8
This would produce a tree like:
So for each node we want to compute the gradient. The gradient is how fast the value is changing with respect to that node, or in calculus terms it's the derivative of abcdefg with respect to that node. We first start with the final node abcdefg. When abcdefg changes, how much does abcdefg change? Well, it's as much as abcdefg changes! So the derivative of abcdefg with respect to itself is 1. Now we visit the next two nodes. How about abcdefg with respect to g? Reaching into some basic calculus, the answer is -1: the top-level operation is abcdef - g, and for a subtraction c = a - b the derivative dc/db is -1 (by the sum and power rules of differentiation, the a term is a constant that falls away and the -b term differentiates to -1). This also means the derivative with respect to abcdef is 1 (there's no negative sign on that side). This should be somewhat intuitive if you think about how the output changes if you increase or decrease each term.
- grad of abcdefg with respect to abcdefg = 1
- grad of abcdefg with respect to abcdef = 1
- grad of abcdefg with respect to g = -1
This is pretty easy, but now we get to the real meat. What about the derivative of abcdefg with respect to ef or abcd? To do this we first get the local derivatives: the derivative of abcdef with respect to abcd and to ef. Again this is just addition, so both are 1.
- grad of abcdef with respect to abcd = 1
- grad of abcdef with respect to ef = 1
Now the real magic is to apply the chain rule of calculus, which basically says that we multiply the gradient of the parent node by the local derivative to obtain the derivative with respect to the current node. So if we know "grad of abcdefg with respect to abcdef is 1" and "grad of abcdef with respect to abcd is 1", then the gradient of abcdefg with respect to abcd is 1 * 1, or just 1. The same applies for ef, which is also 1.
- grad of abcdefg with respect to abcd = 1
- grad of abcdefg with respect to ef = 1
Not bad. This was easy so far because it's addition and subtraction, but now let's try multiplication: abcdefg with respect to e and f. First, get the local derivatives: ef with respect to e and to f. From the calculus rules, the derivative of ef with respect to f is e, because e acts as a constant multiplier when we differentiate with respect to f. The same applies with respect to e, giving f.
- grad of ef with respect to f = e
- grad of ef with respect to e = f
So now with the chain rule we get:
- grad of abcdefg with respect to f = e
- grad of abcdefg with respect to e = f
Because both are multiplied by "grad of abcdefg with respect to ef", which is 1. Hopefully it's enough to see that breaking abcd into ab and cd works the same way.
- grad of abcdefg with respect to ab = cd
- grad of abcdefg with respect to cd = ab
Finally:
- grad of ab with respect to a = 1
- grad of ab with respect to b = 1
- grad of cd with respect to c = 1
- grad of cd with respect to d = 1
- grad of abcdefg with respect to a = cd
- grad of abcdefg with respect to b = cd
- grad of abcdefg with respect to c = ab
- grad of abcdefg with respect to d = ab
And we can plug our numbers back in:
- grad of abcdefg with respect to a = 9
- grad of abcdefg with respect to b = 9
- grad of abcdefg with respect to c = 5
- grad of abcdefg with respect to d = 5
- grad of abcdefg with respect to e = 7
- grad of abcdefg with respect to f = 6
- grad of abcdefg with respect to g = -1
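We can sanity-check these hand-derived gradients numerically with finite differences (a standalone check, not part of the framework):

```javascript
// (a + b) * (c + d) + e * f - g evaluated at our chosen point
const expr = ([a, b, c, d, e, f, g]) => (a + b) * (c + d) + e * f - g;
const point = [2, 3, 4, 5, 6, 7, 8];

// Nudge each input a little and measure how much the output moves
const h = 1e-5;
const grads = point.map((_, i) => {
	const bumped = point.slice();
	bumped[i] += h;
	return (expr(bumped) - expr(point)) / h;
});

console.log(grads.map(Math.round)); // [9, 9, 5, 5, 7, 6, -1]
```

This matches the list above, which is a handy trick for testing backward passes in general.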
Automating gradient calculations
To automate this we'll create a backward method on the Tensor that will propagate the gradient from a node to its children. So as long as we recursively call this over the graph we can get the derivative with respect to any node. The annoying part here is that it's very stateful and requires some external references. I couldn't find a super good way to clean it up, so basically we just need to be careful about mucking around with the gradient property and calling backward, because if it's not done as expected we'll get wrong answers.
//tensor.js
const backward = Symbol("backward");
const setGradient = Symbol("setGradient");
//class Tensor { ...
[backward] = () => {};
#gradient;
constructor(){
	//...
	this.#gradient = new Float32Array(totalLength);
}
set [setGradient](value){
	this.#gradient = value;
}
get gradient(){
	return this.#gradient;
}
add(other){
	if(other.totalLength != this.totalLength) throw new Error(`Tensor not the right length, argument was ${other.totalLength}, needs to be ${this.totalLength}`);
	const values = new Float32Array(this.totalLength);
	for(let i = 0; i < this.totalLength; i++){
		values[i] = this.#values[i] + other.values[i];
	}
	const result = new Tensor({
		values,
		shape: this.#shape,
		children: [this, other],
		op: "+"
	});
	result[backward] = () => {
		for (let i = 0; i < this.totalLength; i++) {
			this.gradient[i] += result.gradient[i];
			other.gradient[i] += result.gradient[i];
		}
	};
	return result;
}
To add a little bit of protection so that users know it's private, I've added the symbol const backward = Symbol("backward") so that you cannot access backward outside of the Tensor module, but it's needed as an external prop because we need a reference to the output tensor to hook it up. gradient needs to be public, but it's modifiable so we need to be careful. Keep in mind that we need a gradient per element, so the length and shape of gradient should be the same as values (I'm not adding guards for this because it all happens internally, so presumably we won't screw it up). I also want each Tensor to start with its own gradient array allocated. This is because we'll need to modify it later, and for performance reasons that will be in-place modification, but each tensor still needs its own array because we don't want to modify the parent's gradient array.
Also note that the gradient distributions use +=. This is for when a node appears more than once, such as a + a = b. In this case we need to add up the gradient contribution from each term. In such a trivial case we can see a + a = 2a, and the derivative of 2a with respect to a is 2, but in a more complicated case it may not simplify so nicely. Had we not accumulated, we'd get 1 instead of 2. As long as we always add the contributions up, this works.
The above shows add. sub is similar but we need to account for the asymmetry:
//Tensor.js - sub method
sub(other){
	if (other.totalLength != this.totalLength) throw new Error(`Tensor not the right length, argument was ${other.totalLength}, needs to be ${this.totalLength}`);
	const values = new Float32Array(this.totalLength);
	for (let i = 0; i < this.totalLength; i++) {
		values[i] = this.#values[i] - other.values[i];
	}
	const result = new Tensor({
		values,
		shape: this.#shape,
		children: [this, other],
		op: "-"
	});
	result[backward] = () => {
		for(let i = 0; i < this.totalLength; i++){
			this.gradient[i] += result.gradient[i];
			other.gradient[i] += -result.gradient[i];
		}
	};
	return result;
}
The only real change is that the contribution to the subtracted operand is negated. For mul we need to reach into the values of the other operand when distributing the gradient:
//Tensor.js - mul method
mul(other) {
	if (other.totalLength != this.totalLength) throw new Error(`Tensor not the right length, argument was ${other.totalLength}, needs to be ${this.totalLength}`);
	const values = new Float32Array(this.totalLength);
	for (let i = 0; i < this.totalLength; i++) {
		values[i] = this.#values[i] * other.values[i];
	}
	const result = new Tensor({
		values,
		shape: this.#shape,
		children: [this, other],
		op: "*"
	});
	result[backward] = () => {
		for(let i = 0; i < this.totalLength; i++){
			this.gradient[i] += other.values[i] * result.gradient[i];
			other.gradient[i] += this.values[i] * result.gradient[i];
		}
	};
	return result;
}
Be sure to note that here we're using the values and not the gradients. div can also be a bit trickier because it's not symmetrical.
//tensor.js - div method
div(other) {
	if (other.totalLength != this.totalLength) throw new Error(`Tensor not the right length, argument was ${other.totalLength}, needs to be ${this.totalLength}`);
	const values = new Float32Array(this.totalLength);
	for (let i = 0; i < this.totalLength; i++) {
		values[i] = this.#values[i] / other.values[i];
	}
	const result = new Tensor({
		values,
		shape: this.#shape,
		children: [this, other],
		op: "/"
	});
	result[backward] = () => {
		for (let i = 0; i < this.totalLength; i++) {
			this.gradient[i] += 1 / other.values[i] * result.gradient[i];
			other.gradient[i] += -(this.values[i] / other.values[i] ** 2) * result.gradient[i];
		}
	};
	return result;
}
Once we've dealt with all of our operations, we need to add a method to start the backpropagation process. I call this backward (the public method, not the symbol backward, which is the internal implementation). It looks like this:
//tensor.js - backward method
//requires: import { topologicalSort } from "./topological-sort.js";
backward(){
	this.#gradient = new Float32Array(this.totalLength).fill(1);
	const sortedDependencies = topologicalSort(this, x => x.children).reverse();
	for(const node of sortedDependencies){
		node[backward]();
	}
}
Like we did before, it starts with a gradient of 1 and then goes back through the graph calling the backward implementations, which fill in the gradient properties of each node the starting node depends on (in practice we will probably not call backward on anything but the final result node). We need the topological sort, which just depth-first traverses the tree and adds the nodes into a list. The order is actually backwards, with the top-most node at the end. If we reverse it, then we have a proper order from the top down, and the traversal order means we'll never try to process a node that doesn't already have its result gradient filled in.
The topological sort looks like this:
//topological-sort.js
export function topologicalSort(startingNode, childAccessor) {
	const visited = new Set();
	const topology = [];
	function visit(node) {
		if (!visited.has(node)) {
			visited.add(node);
			const children = childAccessor(node);
			for (const child of children) {
				visit(child);
			}
			topology.push(node);
		}
	}
	visit(startingNode);
	return topology;
}
The only interesting thing here is the childAccessor
which helps make the algorithm generic. This is a function that selects the child nodes so that we're not tied to one implementation of a tree (you don't actually need to be this generic if you're not planning on using it elsewhere).
Some more ops
We've done some basic arithmetic but machine learning requires some non-linear and unary ops too so let's add a few.
Tanh
The hyperbolic tangent operator.
tanh(){
	const values = new Float32Array(this.totalLength);
	for (let i = 0; i < this.totalLength; i++) {
		values[i] = Math.tanh(this.#values[i]);
	}
	const result = new Tensor({
		values,
		shape: this.#shape,
		children: [this],
		op: "tanh"
	});
	result[backward] = () => {
		for (let i = 0; i < this.totalLength; i++) {
			this.gradient[i] += (1 - Math.tanh(this.values[i]) ** 2) * result.gradient[i];
		}
	};
	return result;
}
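A quick standalone check of the derivative formula used in that backward pass, d/dx tanh(x) = 1 − tanh(x)², against a finite difference:

```javascript
const x = 0.5;
const analytic = 1 - Math.tanh(x) ** 2;

const h = 1e-6;
const numeric = (Math.tanh(x + h) - Math.tanh(x)) / h;

console.log(analytic, numeric); // both ≈ 0.7864
```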
Pow
The power operator x ** y. Note that we could have a static version that just takes a constant, but this is a full version that differentiates back through the exponent as well.
pow(other) {
	if (other.totalLength != this.totalLength) throw new Error(`Tensor not the right length, argument was ${other.totalLength}, needs to be ${this.totalLength}`);
	const values = new Float32Array(this.totalLength);
	for (let i = 0; i < this.totalLength; i++) {
		values[i] = Math.pow(this.#values[i], other.values[i]);
	}
	const result = new Tensor({
		values,
		shape: this.#shape,
		children: [this, other],
		op: "pow"
	});
	result[backward] = () => {
		for (let i = 0; i < this.totalLength; i++) {
			this.gradient[i] += other.values[i] * Math.pow(this.values[i], other.values[i] - 1) * result.gradient[i];
			other.gradient[i] += Math.log(this.values[i]) * Math.pow(this.values[i], other.values[i]) * result.gradient[i];
		}
	};
	return result;
}
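Both partial derivatives used above (y · x^(y−1) toward the base, ln(x) · x^y toward the exponent) can be checked numerically:

```javascript
const x = 2, y = 3;
const h = 1e-6;

// Partial with respect to the base: y * x^(y-1) = 3 * 4 = 12
const dBase = (Math.pow(x + h, y) - Math.pow(x, y)) / h;

// Partial with respect to the exponent: ln(x) * x^y = ln(2) * 8
const dExp = (Math.pow(x, y + h) - Math.pow(x, y)) / h;

console.log(dBase, dExp); // ≈ 12 and ≈ 5.545
```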
Exp
The natural exponential operator, e ** x.
exp() {
	const values = new Float32Array(this.totalLength);
	for (let i = 0; i < this.totalLength; i++) {
		values[i] = Math.exp(this.#values[i]);
	}
	const result = new Tensor({
		values,
		shape: this.#shape,
		children: [this],
		op: "exp"
	});
	result[backward] = () => {
		for (let i = 0; i < this.totalLength; i++) {
			this.gradient[i] += Math.exp(this.values[i]) * result.gradient[i];
		}
	};
	return result;
}
Neg
Negation, e.g. -2.
neg() {
	const values = new Float32Array(this.totalLength);
	for (let i = 0; i < this.totalLength; i++) {
		values[i] = -this.#values[i];
	}
	const result = new Tensor({
		values,
		shape: this.#shape,
		children: [this],
		op: "neg"
	});
	result[backward] = () => {
		for (let i = 0; i < this.totalLength; i++) {
			this.gradient[i] += -1 * result.gradient[i];
		}
	};
	return result;
}
Reduction (sums)
Also, for a little extra, let's try a reduction operation: sum. This adds up the values in a tensor across an axis. We call it a "reduction" because it changes the shape of the tensor (it's the n-dimensional equivalent of an Array.reduce operation). This opens up a lot of API considerations: what does it mean to sum in multiple dimensions? To keep it simple, for now let's just assume we sum across a single dimension. So if you have a cube of 27 numbers (3x3x3), you select one dimension, 0, 1, or 2, and it will sum across it. Another consideration is what the shape of the output looks like. It could be 3x3 or 3x3x1. In pytorch and tensorflow there's a keepDim/keepDims option on reduction ops to choose which you want.
For example let's consider our number cube (we'll pretend the tensor is a structured js array to help visualize):
const input = [ //depth
[ //height
[1,2,3], //width
[4,5,6],
[7,8,9]
],
[
[10,11,12],
[13,14,15],
[16,17,18]
],
[
[19,20,21],
[22,23,24],
[25,26,27]
]
]
const sum0_a = input.sum({ dimension: 0, keepDims: true });
/*
[
[
[6],
[15],
[24]
],
[
[33],
[42],
[51]
],
[
[60],
[69],
[78]
]
]
*/
const sum0_b = input.sum({ dimension: 0, keepDims: false });
/*
[
[6,15,24],
[33,42,51],
[60,69,78]
]
*/
const sum1_a = input.sum({ dimension: 1, keepDims: true });
/*
[
[
[12],[15],[18]
],
[
[39],[42],[45]
],
[
[66],[69],[72]
]
]
*/
const sum1_b = input.sum({ dimension: 1, keepDims: false });
/*
[
[12,15,18],
[39,42,45],
[66,69,72]
]
*/
We want something a bit like this except it will actually operate on a flat array with a shape for performance reasons.
So the first issue is how do we select on a dimension? We have a shape and list of values, but since everything was element-wise we didn't really have to consider it.
We need to consider what order things are stored in and pick some conventions. I'm not really sure which is best in terms of performance, so for now it's down to intuition. Intuitively, I think of indices like [2,3,4] as x=2, y=3, z=4, which is column-major (the outermost dimension is last); this means that shape is also column-major. However, when I write the tensor values out on paper I use row-major ordering. In terms of a tensor, row-major means that the inner-most dimension, the one that varies the quickest, is typically last. This means that my intuition on indexing format and my intuition on storage are reversed. For debuggability, I think I'll keep both conventions and deal with the conversion (it just involves some array reversing), but there's maybe a bit to be gained in code size if the conventions matched.
//tensor-utils.js
export function getFlatIndex(colMajorIndices, colMajorShape) {
	if (colMajorIndices.length != colMajorShape.length) throw new Error(`Indices count must match shape. indices length was ${colMajorIndices.length}, shape has length ${colMajorShape.length}.`);
	const rowMajorShape = colMajorShape.toReversed();
	const rowMajorIndices = colMajorIndices.toReversed();
	let index = 0;
	for (let i = 0; i < rowMajorShape.length; i++) {
		index *= rowMajorShape[i];
		index += rowMajorIndices[i];
	}
	return index;
}
export function getDimensionalIndices(flatIndex, colMajorShape) {
	const indices = [];
	for (const size of colMajorShape) {
		indices.push(flatIndex % size);
		flatIndex = Math.floor(flatIndex / size);
	}
	return indices;
}
These convert from a multi-dimensional index to a flat index and back again. As explained, we need to reverse the shape and the indices to get them into the expected row-major format. Then we walk through the shape, accumulating offsets per dimension until we hit the end. The reverse process would normally use reversed arrays too, but because the modulo walk already proceeds from the fastest-varying dimension, it's not necessary. Be sure to note toReversed() and not reverse(): we do not want to modify the shape or indices passed in, as that would be an unexpected hazard.
Next, let's consider the new shape. It's the shape array with the dimension we selected removed.
const newShape = this.#shape.filter((_, i) => i !== dim);
We can take its length like before, which is the product of all dimension sizes:
const outputLength = newShape.reduce((prod, x) => prod * x, 1);
Then we'll iterate over each element of the output array. This is because we need to compute the reduction for each output element, and I find it much easier to think in terms of the output than in terms of the input tensor. We take the flat index and convert it into a tuple of dimensional indices; you can think of these as the same indices used to index into the original tensor, with one dimension removed. For example, if we have indices [1,2,3] and we're reducing on dimension 1, then we think of it like [1,x,2,3] into the original tensor, where x varies. To represent this we simply splice a value into the array at dimension, looping over the size of the selected dimension. So we run the loop over all possible values for dimension 1 and sum them up.
//tensor.js - Tensor class sum method
sum({ dimension }){
	const newShape = this.#shape.filter((_, i) => i !== dimension);
	const outputLength = newShape.reduce((prod, x) => prod * x, 1);
	const output = new Array(outputLength).fill(0);
	for (let i = 0; i < output.length; i++) {
		const newIndices = getDimensionalIndices(i, newShape);
		for (let j = 0; j < this.#shape[dimension]; j++) {
			const oldIndices = newIndices.toSpliced(dimension, 0, j);
			const oldFlatIndex = getFlatIndex(oldIndices, this.#shape);
			output[i] += this.#values[oldFlatIndex];
		}
	}
	const result = new Tensor({
		values: output,
		shape: newShape,
		children: [this],
		op: "sum"
	});
	return result;
}
However, we're still missing a couple of things. One is the keepDims param, which we'll add later (it's easy), but the bigger one is: what exactly do we do about gradient propagation? It's similar to add; we're just adding a lot of things within the tensor itself, so instead of propagating to other tensors, we'll be propagating to other elements.
Let's say we have our tensor
const t = new Tensor({ values: [1,2,3,4,5,6,7,8,9], shape: [3,3] })
const t2 = t.sum({ dimension: 0 });
Spelled out we're going to add up the rows like this:
r0 = t[0][0] + t[0][1] + t[0][2]
r1 = t[1][0] + t[1][1] + t[1][2]
r2 = t[2][0] + t[2][1] + t[2][2]
return [r0, r1, r2]
So what is the derivative of r0 with respect to t[0][0], t[0][1], and t[0][2]? Well, the derivative of r0 with respect to t[0][0] is just 1, because the other terms become constants and fall away. This means, at least for sums, the gradient is just the gradient of the result. But we do need to reverse-map which output element each input element contributed to. If we started with a [10,3] and summed over dimension 0, we get a size [3] output. This means that 10 inputs map to each output. So for each of those we get the corresponding output gradient and multiply it by one, or more simply, just pass it through.
//tensor.js - backward function for sum
result[backward] = () => {
	for (let i = 0; i < this.totalLength; i++) {
		const inputIndices = getDimensionalIndices(i, this.#shape);
		const outputIndices = inputIndices.toSpliced(dimension, 1);
		const outputFlatIndex = getFlatIndex(outputIndices, newShape);
		this.gradient[i] += result.gradient[outputFlatIndex];
	}
};
We basically do the reverse of the index conversion. We take the input's dimensional indices and delete the dimension we reduced over; this gets us the proper element of the output, and we map it back to a flat index to read the gradient.
Fixing the dimensions is easy: if keepDims is true then we simply splice a 1 in where the reduced dimension was.
const outShape = keepDims ? newShape.toSpliced(dimension, 0, 1) : newShape;
Here's the whole thing:
//tensor.js - sum method
sum({ dimension, keepDims }) {
    const newShape = this.#shape.filter((_, i) => i !== dimension);
    const outputLength = newShape.reduce((prod, x) => prod * x, 1);
    const output = new Array(outputLength).fill(0);
    for (let i = 0; i < output.length; i++) {
        const newIndices = getDimensionalIndices(i, newShape);
        for (let j = 0; j < this.#shape[dimension]; j++) {
            const oldIndices = newIndices.toSpliced(dimension, 0, j);
            const oldFlatIndex = getFlatIndex(oldIndices, this.#shape);
            output[i] += this.#values[oldFlatIndex];
        }
    }
    const result = new Tensor({
        values: output,
        shape: keepDims ? newShape.toSpliced(dimension, 0, 1) : newShape,
        children: [this],
        op: "sum"
    });
    result[backward] = () => {
        for (let i = 0; i < this.totalLength; i++) {
            const inputIndices = getDimensionalIndices(i, this.#shape);
            const outputIndices = inputIndices.toSpliced(dimension, 1);
            const outputFlatIndex = getFlatIndex(outputIndices, newShape);
            this.gradient[i] += result.gradient[outputFlatIndex];
        }
    };
    return result;
}
It took a while to wrap my head around the conceptual parts but the actual code produced isn't too bad. If you wanted to implement a product or an average function it's the same process of mapping indices; you just replace the operation and figure out how to get the partial derivative for each part of the operation.
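For example, here's a rough standalone sketch (plain arrays, 2D only, not the actual Tensor class) of a mean over dimension 0 and its backward rule. Each input element contributes with weight 1/rows, so the backward pass divides the output gradient by the row count instead of passing it straight through:

```javascript
// forward: mean over dimension 0 of a rows x cols matrix stored flat, row-major
function meanDim0(values, rows, cols) {
    const out = new Array(cols).fill(0);
    for (let r = 0; r < rows; r++) {
        for (let c = 0; c < cols; c++) {
            out[c] += values[r * cols + c] / rows;
        }
    }
    return out;
}

// backward: each input contributed with weight 1/rows, so its gradient is
// the matching output gradient divided by rows
function meanDim0Backward(outGradient, rows, cols) {
    const inGradient = new Array(rows * cols).fill(0);
    for (let r = 0; r < rows; r++) {
        for (let c = 0; c < cols; c++) {
            inGradient[r * cols + c] += outGradient[c] / rows;
        }
    }
    return inGradient;
}

console.log(meanDim0([1, 2, 3, 4, 5, 6], 2, 3));      // [2.5, 3.5, 4.5]
console.log(meanDim0Backward([1, 1, 1], 2, 3));       // six 0.5s
```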
Printing
Printing the tensor is actually worth implementing directly because tensors are full of nested values and look extremely ugly in the debugger, making them a pain to deal with. The easiest thing here is adding a toString method as well as [Symbol.for("Deno.customInspect")], which causes it to pretty-print when logged or inspected in Deno (which is what I'm using). We can also add a label option for the tensor so we can more easily identify it while debugging, since it's pretty much impossible to tell what we're looking at otherwise (this is added as an option in the Tensor constructor and not shown here).
//tensor.js - toString method
toString() {
    return `<${this.#label ? `${this.#label}:` : ""}${Array.from(this.#values).join(", ")}>`;
}
[Symbol.for("Deno.customInspect")]() {
    return this.toString();
}
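To see the formatting in isolation, here's a minimal standalone class (not the real Tensor) producing the same output. As an aside, Node uses Symbol.for("nodejs.util.inspect.custom") for the same purpose if you ever need it to pretty-print in both runtimes:

```javascript
// Minimal standalone sketch of the toString/custom-inspect pattern.
class MiniTensor {
    #values;
    #label;
    constructor(values, label) {
        this.#values = values;
        this.#label = label;
    }
    toString() {
        return `<${this.#label ? `${this.#label}:` : ""}${Array.from(this.#values).join(", ")}>`;
    }
    // Deno calls this symbol when logging/inspecting the object
    [Symbol.for("Deno.customInspect")]() {
        return this.toString();
    }
}

console.log(new MiniTensor([1, 2, 3], "t").toString()); // <t:1, 2, 3>
console.log(new MiniTensor([4, 5]).toString());         // <4, 5>
```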
There may be other optimizations to the output we do later especially when these things start getting big.
There's also an op option for tensors. This is also just a labeling thing so it's easier to track down tensors made from ops that don't have manually passed-in labels.
Random Numbers
One other thing to talk about is random numbers. I added a static method Tensor.random(shape, options) that creates a Tensor of the desired shape and fills it with random values.
static random(shape, options = {}) {
    const length = getTotalLength(shape);
    const values = new Float32Array(length);
    const generator = options.generator ?? getRandom(options.min, options.max, options.seed);
    values.set(generator.take(length).toArray(), 0);
    return new Tensor({ values, shape });
}
In a naïve implementation you would probably just call Math.random. However, one drawback is that it would be nice to have some replayability, either to compare implementations or just to be able to test with the same values. Javascript does not give us access to seeded randoms, so we have to build our own. The easiest way to do this is a Park-Miller (Lehmer) random number generator. Thankfully the implementation is simple (well, the code is; the math I don't quite get myself). Here's a version with some extra features.
export function* getRandom(min = 0, max = 1, seed = undefined) {
    const mod = 0x7fffffff;
    //seeds must be between 1 and mod - 1 (a state of 0 would stay stuck at 0)
    seed = seed ?? Math.floor(Math.random() * (mod - 2)) + 1;
    let state = seed % mod;
    const length = (max - min) / mod;
    while (true) {
        state = (16807 * state) % mod;
        yield (length * state) + min;
    }
}
The most important things here are those hardcoded numbers: the modulus 0x7fffffff is the Mersenne prime 2^31 - 1 and the multiplier 16807 is 7^5. Basically we're doing modular arithmetic with carefully chosen primes, and it so happens that this gives us pretty random-looking behavior. It's not truly random obviously, but it's good enough, especially for non-cryptographic purposes. We need to feed it a seed value and then it generates all future values from that initial seed. If a seed is not provided then we just pick a random one; the seed must be between 1 and mod - 1 to be valid. I've also added some scaling so that the value falls between min and max. Since it's a generator function it returns an iterator that can produce an infinite random sequence. Our random method above can take one of these iterators (I call the option "generator", which is a misnomer borrowed from the javascript nomenclature, but I feel it reads better), and in this way all random numbers are picked based on a single seed and are reproducible. In fact I even have tests for this method which are deterministic due to this.
Future
Just as a note: as of this post there is an in-progress standard called WebNN which does roughly the same thing this does, but is specifically designed to use the underlying hardware, whether that be CPU, GPU or TPU. Even with something like WebGPU we don't have access to all of the machine learning hardware in a modern SoC, so we still need things like this in the browser/runtime to unlock the full hardware potential. I think it's at least interesting to take a look at the API as they've designed it, and perhaps this is the way to go with this project in the future.
Op table
For convenience I've made a table of the ops implemented to reference. a is the current tensor, b is the additional tensor argument (if present), r is the result tensor, and grad_r is the gradient of the result. reduceDim (not actually abstracted in my code currently) takes a and reduces the dim using a function. mapIndex (not actually abstracted in my code currently) takes a flat index and maps it back to another flat index using the shape info and the reduced dimension.
| Op | forward | grad_a[i] | grad_b[i] |
|---|---|---|---|
| add | a[i] + b[i] | grad_r[i] | grad_r[i] |
| sub | a[i] - b[i] | grad_r[i] | -grad_r[i] |
| mul | a[i] * b[i] | grad_r[i] * b[i] | grad_r[i] * a[i] |
| div | a[i] / b[i] | 1 / b[i] * grad_r[i] | -a[i] / b[i] ** 2 * grad_r[i] |
| neg | -a[i] | -grad_r[i] | |
| pow | Math.pow(a[i], b[i]) | b[i] * Math.pow(a[i], b[i] - 1) * grad_r[i] | Math.log(a[i]) * Math.pow(a[i], b[i]) * grad_r[i] |
| exp | Math.exp(a[i]) | Math.exp(a[i]) * grad_r[i] | |
| tanh | Math.tanh(a[i]) | (1 - Math.tanh(a[i]) ** 2) * grad_r[i] | |
| sum | reduceDim(a, dim, (acc, x) => acc + x) | grad_r[mapIndex(i, inShape, outShape, dim)] | |
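A cheap way to sanity-check any row of this table is a central finite difference on the forward function. Here's a standalone check of the mul and tanh rows (my own verification snippet, not from the repo):

```javascript
// numeric derivative: (f(x + eps) - f(x - eps)) / (2 * eps)
const eps = 1e-4;
const a = 3;
const b = 5;
const gradR = 1;

// mul: grad_a should be grad_r * b
const mulAnalytic = gradR * b;
const mulNumeric = ((a + eps) * b - (a - eps) * b) / (2 * eps);
console.log(Math.abs(mulAnalytic - mulNumeric) < 1e-6); // true

// tanh: grad_a should be (1 - tanh(a) ** 2) * grad_r
const tanhAnalytic = (1 - Math.tanh(a) ** 2) * gradR;
const tanhNumeric = (Math.tanh(a + eps) - Math.tanh(a - eps)) / (2 * eps);
console.log(Math.abs(tanhAnalytic - tanhNumeric) < 1e-6); // true
```

The same trick scales up: perturb one input element of a tensor op, recompute the forward pass, and compare against the gradient the backward pass produced.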
Code
https://github.com/ndesmic/mljs/tree/v1.0
I've included the code for this here. In this repo there is both Tensor and Value. If you want to see an even simpler scalar-only version of the compute graph, take a look at Value.