Fully Automated Gradient Calculation on Expression Graph (As Explained By Karpathy)

Hi there! I'm Shrijith Venkatrama, founder of Hexmos. Right now, I’m building LiveAPI, a tool that makes generating API docs from your code ridiculously easy.

In the previous posts, we manually calculated gradient values for each node via backpropagation.

In this post, we will see how these manual calculations can be automated, step by step, to perform automatic backpropagation.

Implementing Automatic Backpropagation for Each Node

Object Model

We implement node-level gradient computation when each Value is created.

First, we do the "forward pass", that is, we transform the input(s) into an output using the operation.

Then, immediately, we differentiate the operation and calculate the grad for the input(s).

You can see in the code below that for each operation, we have added a custom _backward function which triggers the backpropagation calculation for that operation.

import math

class Value:
  def __init__(self, data, _children=(), _op='', label=''):
    self.data = data
    self._prev = set(_children)
    self._op = _op
    self.label = label
    self.grad = 0.0
    self._backward = lambda: None # by default doesn't do anything (for a leaf
                                  # node for ex)

  def __repr__(self):
    return f"Value(data={self.data})"

  def __add__(self, other):
    out = Value(self.data + other.data, (self, other), '+')

    # derivative of '+' is just distributing the grad of the output to inputs
    def backward():
      self.grad = 1.0 * out.grad
      other.grad = 1.0 * out.grad

    out._backward = backward

    return out

  def __mul__(self, other):
    out = Value(self.data * other.data, (self, other), '*')

    # derivative of `mul` is gradient of result multiplied by sibling's data
    def backward():
      self.grad = other.data * out.grad
      other.grad = self.data * out.grad

    out._backward = backward

    return out

  def tanh(self):
      x = self.data
      t = (math.exp(2*x) - 1) / (math.exp(2*x) + 1)
      out = Value(t, (self, ), 'tanh')

      # derivative of tanh = 1 - (tanh)^2
      def backward():
        self.grad = (1 - t**2) * out.grad

      out._backward = backward
      return out

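To see what a single _backward closure does, here is a minimal sketch (it only uses the Value class above; the names a, b and c are just for illustration):

a = Value(2.0, label='a')
b = Value(-3.0, label='b')
c = a * b; c.label = 'c'

# the output's grad must be seeded before its _backward is called
c.grad = 1.0
c._backward()

print(a.grad) # -3.0, i.e. b.data * c.grad
print(b.grad) # 2.0, i.e. a.data * c.grad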

To start with, we initialize the expression graph without doing any backprop, so grad in each node is 0:

# inputs x1, x2
x1 = Value(2.0, label='x1')
x2 = Value(0.0, label='x2')

# weights w1, w2
w1 = Value(-3.0, label='w1')
w2 = Value(1.0, label='w2')

# bias of the neuron
b = Value(6.8813735870195432, label='b')

x1w1 = x1 * w1; x1w1.label = 'x1*w1'
x2w2 = x2 * w2; x2w2.label = 'x2*w2'
x1w1x2w2 = x1w1 + x2w2; x1w1x2w2.label = 'x1*w1 + x2*w2'

n = x1w1x2w2 + b; n.label = 'n'

o = n.tanh(); o.label = 'o'

draw_dot(o)

Before Backprop
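As a quick check (a minimal sketch reusing the nodes defined above), every grad is still at its initial value at this point:

for node in [x1, x2, w1, w2, b, x1w1, x2w2, x1w1x2w2, n, o]:
    print(node.label, node.grad)  # each grad prints as 0.0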

Now we are ready to trigger backprop on each operation:

o.grad = 1.0 # base case - needs to be set explicitly
o._backward()
n._backward()
b._backward() # does nothing - leaf node
x1w1x2w2._backward()
x1w1._backward()
x2w2._backward()
draw_dot(o)

The result matches our manual calculations from the previous post:

After

Fully Automated BackPropagation

In the previous section, we got a sort of semi-automated backprop.

We still had to manually pick nodes from "last" to "first", and call _backward in that order.

So the first task is to get a list of nodes in the "last" to "first" order.

For this, we are going to implement topological sort in our program, so we always process nodes in the correct order.

Topological Sort


Code:

topo = []
visited = set()
def build_topo(v):
    # post-order DFS: a node is appended only after all of its children
    if v not in visited:
        visited.add(v)
        for child in v._prev:
            build_topo(child)
        topo.append(v)
build_topo(o)
topo

The above code gives us the nodes in the "first" to "last" order:

TopoResult
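To make that ordering concrete, here is a small sketch that prints the labels in the topo list (the exact order of siblings can vary between runs because _prev is a set, but o always comes last):

print([v.label for v in topo])
# leaves such as 'x1', 'w1', 'x2', 'w2' appear near the start, 'o' appears last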

Using Topological Sort Results to Automate Backprop For the Entire Graph

Code:

o.grad = 1.0

topo = []
visited = set()
def build_topo(v):
    if v not in visited:
        visited.add(v)
        for child in v._prev:
            build_topo(child)
        topo.append(v)
build_topo(o)

for node in reversed(topo):
    node._backward()

draw_dot(o)

And voila, we get the gradient value for each node in a fully automated manner, just like an autograd engine:

Automated Result

Integrating the automated backprop logic into the Value class

We add a new method backward to the Value class to perform backprop automatically from that node to all the previous nodes:

import math

class Value:
  def __init__(self, data, _children=(), _op='', label=''):
    self.data = data
    self._prev = set(_children)
    self._op = _op
    self.label = label
    self.grad = 0.0
    self._backward = lambda: None # by default doesn't do anything (for a leaf
                                  # node for ex)

  def __repr__(self):
    return f"Value(data={self.data})"

  def __add__(self, other):
    out = Value(self.data + other.data, (self, other), '+')

    # derivative of '+' is just distributing the grad of the output to inputs
    def backward():
      self.grad = 1.0 * out.grad
      other.grad = 1.0 * out.grad

    out._backward = backward

    return out

  def __mul__(self, other):
    out = Value(self.data * other.data, (self, other), '*')

    # derivative of `mul` is gradient of result multiplied by sibling's data
    def backward():
      self.grad = other.data * out.grad
      other.grad = self.data * out.grad

    out._backward = backward

    return out

  def tanh(self):
      x = self.data
      t = (math.exp(2*x) - 1) / (math.exp(2*x) + 1)
      out = Value(t, (self, ), 'tanh')

      # derivative of tanh = 1 - (tanh)^2
      def backward():
        self.grad = (1 - t**2) * out.grad

      out._backward = backward
      return out

  def backward(self):
    topo = []
    visited = set()
    def build_topo(v):
        if v not in visited:
            visited.add(v)
            for child in v._prev:
                build_topo(child)
            topo.append(v)
    build_topo(self)

    self.grad = 1.0
    for node in reversed(topo):
        node._backward()

Now I can just call the following to calculate gradients for all the nodes automatically:

o.backward()
draw_dot(o)

Final Result
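As an optional sanity check (not part of Karpathy's walkthrough), we can compare one of these gradients against a numerical estimate. The sketch below assumes the final Value class above and re-runs the forward pass with w1 nudged by a small h:

h = 0.0001

def forward(w1_val):
    x1 = Value(2.0); x2 = Value(0.0)
    w1 = Value(w1_val); w2 = Value(1.0)
    b = Value(6.8813735870195432)
    n = x1 * w1 + x2 * w2 + b
    return n.tanh().data

numerical = (forward(-3.0 + h) - forward(-3.0)) / h
print(numerical) # close to 1.0, which matches the w1.grad computed by o.backward()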

Reference

The spelled-out intro to neural networks and backpropagation: building micrograd - YouTube
