Artificial Neural Networks (ANNs) have revolutionized the fields of artificial intelligence and machine learning, serving as the backbone for a wide range of applications, from image recognition to natural language processing. While powerful frameworks like TensorFlow and PyTorch make building deep learning models accessible, there is tremendous educational value in implementing neural networks from scratch.
This guide aims to empower enthusiasts and researchers by demonstrating how to construct and train neural networks at a fundamental level using ANSI C and Python. By combining the low-level control of C with Python's simplicity, we will explore every detail of neural network theory and implementation. Whether you are an aspiring engineer or a seasoned programmer, this step-by-step guide will equip you with the knowledge to design, implement, and train your own neural networks, laying the foundation for tackling real-world problems and advancing your understanding of deep learning.
I wrote this guide because there are already plenty of courses and books on the market. However, most of them only explain what certain libraries and frameworks can do; they say very little about what happens behind the scenes, which amounts to rote learning. So let's try to understand what is actually going on underneath.
Theoretical Foundations of Neural Networks
Artificial Neural Networks (ANNs) are computational frameworks inspired by the structure and function of biological neural networks. Their theoretical foundation is deeply rooted in mathematics, computer science, and neuroscience, integrating concepts from linear algebra, calculus, and optimization. This section delves into the fundamental principles that govern neural networks, exploring their architecture, mathematical formulations, and learning dynamics.
1. Neural Network Architecture
The architecture of a neural network is defined by its layers: input, hidden, and output. Each layer comprises interconnected neurons, or nodes, where:
- Input Layer serves as the entry point, receiving raw data.
- Hidden Layers perform nonlinear transformations, extracting abstract features from input data.
- Output Layer provides the final predictions or classifications.
The connectivity and arrangement of these layers dictate the network's ability to model complex functions, with deeper networks enabling hierarchical feature extraction.
2. The Neuron Model
At the core of a neural network is the artificial neuron, a mathematical abstraction of biological neurons. The neuron first computes a weighted sum of its inputs:

$$z = \sum_{i=1}^{n} w_i x_i + b$$

Here:
- $x_i$: Input features.
- $w_i$: Weights assigned to inputs.
- $b$: Bias term, enhancing flexibility in decision boundaries.

The output, $y$, is obtained by applying an activation function, $f$, to $z$:

$$y = f(z)$$
3. Activation Functions
Activation functions introduce nonlinearity, enabling neural networks to approximate complex, nonlinear functions. Commonly used functions include:
- Sigmoid: Smooth, bounded between 0 and 1. Ideal for probabilities but prone to vanishing gradients.
- ReLU (Rectified Linear Unit): Efficient and computationally simple, mitigating the vanishing gradient problem.
- Tanh: Zero-centered, offering improved convergence compared to sigmoid.
The choice of activation function significantly impacts the network’s learning capability and convergence speed.
4. Forward Propagation
Forward propagation is the process of passing input data through the network to compute output predictions. Mathematically, this involves:
- Computing the weighted sum of inputs for each neuron.
- Applying the activation function to obtain the neuron’s output.
- Propagating the outputs layer by layer until the final prediction is produced.
This mechanism represents the evaluation of the network's hypothesis function, mapping inputs to outputs.
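For concreteness, this layer-by-layer evaluation can be written as a recursion over layers $l = 1, \dots, L$ (using the neuron notation introduced above):

$$\mathbf{a}^{(0)} = \mathbf{x}, \qquad \mathbf{a}^{(l)} = f\!\left(W^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}\right), \qquad \hat{\mathbf{y}} = \mathbf{a}^{(L)}$$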
5. Loss Functions
To train a neural network, a loss function quantifies the difference between predicted and actual outputs. Common loss functions include:
- Mean Squared Error (MSE): Used for regression tasks.

$$L_{\text{MSE}} = \frac{1}{N} \sum_{i=1}^{N} \left(y_i - \hat{y}_i\right)^2$$

- Cross-Entropy Loss: Employed in classification problems (for one-hot targets $y_i$).

$$L_{\text{CE}} = -\sum_{i} y_i \log \hat{y}_i$$
The choice of loss function determines the optimization objective, directly influencing the learning process.
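As a quick illustrative sketch (separate from the implementations later in this article; the values are made up, and the binary form of cross-entropy is used since the targets here are 0/1), both losses are near one-liners in NumPy:

```python
import numpy as np

y_true = np.array([1.0, 0.0, 1.0])   # targets
y_pred = np.array([0.9, 0.2, 0.7])   # network outputs

# mean squared error for regression
mse = np.mean((y_true - y_pred) ** 2)

# binary cross-entropy for 0/1 targets
ce = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

print(f"MSE: {mse:.4f}, Cross-Entropy: {ce:.4f}")
```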
6. Backpropagation and Gradient Descent
Backpropagation is the cornerstone of neural network training, enabling efficient computation of gradients. This process involves:
- Computing Gradients: Using the chain rule to calculate the derivative of the loss with respect to weights and biases.
- Updating Parameters: Employing gradient descent to minimize the loss function. The update rule is given by:

$$\theta \leftarrow \theta - \eta \frac{\partial L}{\partial \theta}$$

where $\theta$ denotes any weight or bias and $\eta$ is the learning rate.
Backpropagation ensures that errors are propagated backward through the network, systematically updating weights to optimize performance.
7. The Universal Approximation Theorem
The theoretical underpinning of neural networks lies in the Universal Approximation Theorem, which states:
A feedforward neural network with a single hidden layer containing a finite number of neurons can approximate any continuous function on compact subsets of $\mathbb{R}^n$, given appropriate activation functions.
This theorem underscores the expressive power of neural networks, emphasizing their capacity to model arbitrarily complex relationships.
8. Challenges and Limitations
Despite their theoretical strengths, neural networks face several challenges:
- Overfitting: Occurs when the model memorizes training data instead of generalizing.
- Vanishing/Exploding Gradients: Impedes effective learning in deep networks.
- Computational Complexity: Training deep networks requires significant computational resources.
Advanced techniques such as regularization, batch normalization, and residual connections have been developed to address these limitations.
Why Implement Neural Networks in ANSI C and Python?
Building neural networks from scratch in ANSI C and Python serves not only as an educational endeavor but also as a means to gain a profound understanding of the underlying mechanics of machine learning. This approach, while less common in production environments dominated by high-level frameworks, offers several compelling benefits and unique insights.
1. Pedagogical Value
Implementing neural networks from first principles allows practitioners to demystify the "black box" nature of deep learning models. By coding each layer, activation function, and backpropagation algorithm, one develops a granular understanding of:
- The flow of data through the network during forward and backward passes.
- The role of gradients in optimizing weights and biases.
- The mathematical basis of concepts such as loss functions and gradient descent.
This hands-on approach transforms theoretical knowledge into practical expertise.
2. Full Control and Customization
Frameworks like TensorFlow and PyTorch abstract away many implementation details, which can be a limitation when attempting to experiment with novel architectures or optimization techniques. Implementing neural networks in ANSI C and Python provides full control over:
- Memory allocation and management, critical in resource-constrained environments.
- Custom optimization strategies and activation functions.
- Specialized model designs tailored to unique datasets or applications.
This flexibility is invaluable for research and experimentation.
3. Performance Insights
ANSI C is known for its efficiency and low-level control over hardware, making it ideal for understanding performance bottlenecks in neural network computations. By implementing neural networks in C:
- One learns the impact of computational efficiency on training time.
- Optimization techniques, such as minimizing cache misses and efficient matrix operations, become apparent.
- It offers insights into the trade-offs between precision, speed, and memory usage.
Similarly, Python, with its simplicity and versatility, acts as an excellent complement, enabling rapid prototyping while maintaining an accessible interface for experimentation.
4. Bridging Theory and Practice
High-level frameworks often obscure the relationship between theoretical concepts and their practical implementation. Writing neural networks in C and Python bridges this gap by:
- Translating abstract mathematical equations into executable code.
- Allowing direct visualization of each computational step, such as weight updates and error propagation.
- Reinforcing the connections between theoretical models and real-world implementation.
This dual perspective strengthens one's ability to design and debug complex models effectively.
5. Cross-Language Expertise
Combining ANSI C and Python leverages the strengths of both languages:
- ANSI C provides unparalleled control, enabling the development of optimized, hardware-level applications suitable for embedded systems or performance-critical environments.
- Python offers simplicity, readability, and access to extensive scientific libraries for data preprocessing and visualization.
This synergy not only enhances coding versatility but also prepares practitioners for diverse real-world applications, from low-level embedded systems to high-level data analysis pipelines.
6. Learning from First Principles
Implementing neural networks from scratch encourages a first-principles approach, which is critical for innovation. This method involves:
- Identifying fundamental concepts and assumptions.
- Building a system piece by piece, gaining intuition about each component’s role.
- Exploring the boundaries of existing methods, which can lead to novel discoveries.
Practitioners who understand systems at this level are better equipped to create breakthroughs in machine learning.
7. Preparing for Edge Computing
With the growing trend of deploying machine learning models on edge devices, resource efficiency becomes paramount. ANSI C's lightweight nature makes it an excellent choice for training and deploying models on constrained devices. Understanding how to implement neural networks in C prepares practitioners to:
- Optimize models for low-power hardware, such as IoT devices and microcontrollers.
- Minimize memory usage and maximize computational efficiency.
- Develop portable solutions that integrate seamlessly with hardware systems.
8. Encouraging Innovation Beyond Frameworks
Relying solely on existing machine learning frameworks can limit creativity and restrict problem-solving approaches to the capabilities of those frameworks. By stepping outside these boundaries, practitioners can:
- Develop unconventional architectures and optimization strategies.
- Integrate machine learning directly into custom systems, bypassing unnecessary overhead.
- Push the limits of what is achievable with current tools.
Mathematical Foundations
The mathematical foundations of neural networks provide the framework for understanding their structure, functionality, and learning dynamics. Rooted in linear algebra, calculus, probability theory, and optimization, these concepts underpin the development of algorithms that enable networks to learn from data and make predictions. A deep understanding of these principles is essential for designing efficient and robust models, paving the way for advancements in artificial intelligence and machine learning.
Linear Algebra for Neural Networks
Linear algebra serves as the cornerstone of neural network operations, enabling efficient representation and computation of multidimensional data. By leveraging vectors, matrices, and tensors, linear algebra facilitates the formulation and optimization of neural network architectures, ensuring scalability and performance in high-dimensional spaces. This section explores the essential concepts and operations in linear algebra that underpin neural network design and functionality.
1. Representing Data with Vectors and Matrices
In neural networks, input data, weights, and activations are often represented using vectors and matrices:
- Vectors are one-dimensional arrays used to represent the features of a single data instance: $\mathbf{x} = [x_1, x_2, \ldots, x_n]^\top \in \mathbb{R}^n$.
- Matrices extend this concept to multiple instances, with rows corresponding to individual samples and columns to features: $X \in \mathbb{R}^{m \times n}$.
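As a minimal NumPy illustration (the values are arbitrary), a single instance maps to a 1-D array and a batch to a 2-D array:

```python
import numpy as np

x = np.array([0.5, -1.2, 3.0])        # one instance with n = 3 features
X = np.array([[0.5, -1.2, 3.0],
              [1.0,  0.0, 2.5]])      # m = 2 samples, n = 3 features

print(x.shape, X.shape)               # (3,) (2, 3)
```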
2. Weight Matrices and Linear Transformations
Weights in a neural network are represented as matrices that facilitate linear transformations of input data:

$$\mathbf{z} = W\mathbf{x} + \mathbf{b}$$

where:
- $W \in \mathbb{R}^{h \times n}$ is the weight matrix, connecting $n$-dimensional inputs to $h$-dimensional outputs.
- $\mathbf{b} \in \mathbb{R}^{h}$ is the bias vector, allowing for translation of the output space.
Linear transformations encapsulate the essence of feature extraction, enabling networks to learn meaningful representations of data.
3. Activation Computations Using Dot Products
The computation of activations in a neural network relies heavily on the dot product:

$$z_j = \mathbf{w}_j \cdot \mathbf{x} = \sum_{i=1}^{n} w_{ji} x_i$$

where $z_j$ is the pre-activation value of the $j$-th neuron, calculated as the weighted sum of inputs $x_i$ with corresponding weights $w_{ji}$. This operation is inherently parallelizable, making it efficient for large-scale neural networks.
4. Matrix Multiplication for Layer-Wide Computations
Matrix multiplication generalizes the computation of activations across entire layers:

$$Z = X W^\top + \mathbf{b}$$

where:
- $X \in \mathbb{R}^{m \times n}$ contains $m$ input samples with $n$ features each.
- $W^\top \in \mathbb{R}^{n \times h}$ is the transposed weight matrix for $h$-dimensional hidden units.
- $\mathbf{b}$ broadcasts the bias vector to match the sample size.
Efficient matrix multiplication algorithms are critical to the scalability of neural networks.
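A small NumPy sketch of this layer-wide computation (shapes and values chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
m, n, h = 4, 3, 5                  # samples, input features, hidden units

X = rng.normal(size=(m, n))        # input batch
W = rng.normal(size=(h, n))        # weight matrix (h x n)
b = np.zeros(h)                    # bias vector

Z = X @ W.T + b                    # broadcasting adds b to every row
print(Z.shape)                     # (4, 5): one h-dimensional row per sample
```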
5. Tensor Algebra in Deep Networks
For more complex architectures, such as convolutional neural networks (CNNs), data is often represented as tensors:
- A tensor is a generalization of vectors and matrices to higher dimensions, accommodating spatial and temporal relationships in data.
- Operations such as tensor contraction and broadcasting enable flexible manipulation of these multidimensional arrays.
For instance, a four-dimensional tensor representing an image batch might be expressed as:

$$\mathcal{T} \in \mathbb{R}^{B \times H \times W \times C}$$

where $B$ is the batch size, $H$ and $W$ are height and width, and $C$ represents the number of channels.
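In NumPy terms (an arbitrary illustrative shape):

```python
import numpy as np

# a batch of 32 RGB images, 28x28 pixels each: shape (B, H, W, C)
batch = np.zeros((32, 28, 28, 3))
print(batch.ndim, batch.shape)   # 4 (32, 28, 28, 3)
```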
6. Eigenvalues and Singular Value Decomposition (SVD)
Advanced techniques in linear algebra, such as eigenvalue decomposition and singular value decomposition, play a pivotal role in understanding and optimizing neural networks:
- Eigenvalues and Eigenvectors are used to analyze weight matrices, revealing insights into the network's stability and learning dynamics.
- SVD decomposes a matrix $A$ into:

$$A = U \Sigma V^\top$$

where $U$ and $V$ are orthogonal matrices and $\Sigma$ is a diagonal matrix of singular values.
7. Numerical Stability and Conditioning
Numerical stability is a critical consideration when implementing neural networks:
- Poorly conditioned matrices, characterized by a high condition number, can lead to instability in training.
- Techniques such as regularization and preconditioning are employed to mitigate these issues.
For example, the condition number of a matrix $A$ is defined as:

$$\kappa(A) = \frac{\sigma_{\max}}{\sigma_{\min}}$$

where $\sigma_{\max}$ and $\sigma_{\min}$ are the largest and smallest singular values, respectively.
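Both quantities are easy to inspect with NumPy; the following sketch uses an arbitrary random matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))

U, S, Vt = np.linalg.svd(A)        # singular value decomposition
cond = S.max() / S.min()           # condition number from singular values

# A is recovered from its decomposition (reduced form)
print(np.allclose(A, (U[:, :3] * S) @ Vt))        # True
# matches NumPy's built-in 2-norm condition number
print(np.isclose(cond, np.linalg.cond(A)))        # True
```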
Activation Functions
Activation functions are integral components of neural networks, determining how the weighted sum of inputs is transformed into an output for each neuron. These functions introduce non-linearity into the network, allowing it to model complex relationships between inputs and outputs. Without activation functions, a neural network would only be able to learn linear mappings, regardless of how many layers it had. Below, we explore the key activation functions used in neural networks, their properties, and their mathematical formulations.
1. Sigmoid Activation Function
The sigmoid function is one of the earliest activation functions used in neural networks. It outputs values in the range $(0, 1)$:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

where $x$ is the input to the neuron. The derivative of the sigmoid function is:

$$\sigma'(x) = \sigma(x)\left(1 - \sigma(x)\right)$$
While the sigmoid function was popular in early neural network architectures, it suffers from issues such as the vanishing gradient problem, which limits its ability to train deep networks effectively.
2. Hyperbolic Tangent (tanh)
The tanh function is similar to the sigmoid function but outputs values in the range $(-1, 1)$, making it zero-centered. This helps alleviate some of the issues seen with sigmoid, particularly in terms of the gradient during backpropagation. The mathematical form of the tanh function is:

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

The derivative of the function is:

$$\tanh'(x) = 1 - \tanh^2(x)$$
Despite its advantages over sigmoid, tanh can still suffer from the vanishing gradient problem, especially when the inputs are far from zero.
3. Rectified Linear Unit (ReLU)
The ReLU activation function has become the most widely used activation function due to its simplicity and ability to mitigate the vanishing gradient problem. It is defined as:

$$f(x) = \max(0, x)$$

where $x$ is the input. ReLU outputs zero for any negative input and the identity function for any positive input. The derivative of the ReLU function is:

$$f'(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x \le 0 \end{cases}$$
Although ReLU helps improve training speed and performance, it has some drawbacks, including the "dying ReLU" problem, where neurons can become inactive if they always output zero.
4. Leaky ReLU
To address the "dying ReLU" problem, the Leaky ReLU function introduces a small slope for negative inputs instead of setting them to zero. It is defined as:

$$f(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{if } x \le 0 \end{cases}$$

where $\alpha$ is a small constant, typically set to $0.01$. The derivative of the Leaky ReLU function is:

$$f'(x) = \begin{cases} 1 & \text{if } x > 0 \\ \alpha & \text{if } x \le 0 \end{cases}$$
This modification allows Leaky ReLU to maintain a gradient for negative inputs, helping prevent neurons from "dying" during training.
5. Parametric ReLU (PReLU)
The PReLU function is an extension of Leaky ReLU where the slope $\alpha$ for negative inputs is learned during training. This makes PReLU more flexible and adaptive compared to Leaky ReLU. The mathematical form is:

$$f(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{if } x \le 0 \end{cases}$$

where $\alpha$ is a learnable parameter. The derivative of PReLU is:

$$f'(x) = \begin{cases} 1 & \text{if } x > 0 \\ \alpha & \text{if } x \le 0 \end{cases}$$
PReLU is especially useful in deep networks where the optimal value of can vary between neurons, thus enhancing model flexibility.
6. Exponential Linear Unit (ELU)
The ELU function aims to address the vanishing gradient problem while allowing for negative activations. It is defined as:

$$f(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha\left(e^{x} - 1\right) & \text{if } x \le 0 \end{cases}$$

where $\alpha$ is a positive constant that controls the value of the output for negative inputs. The derivative of the ELU function is:

$$f'(x) = \begin{cases} 1 & \text{if } x > 0 \\ \alpha e^{x} & \text{if } x \le 0 \end{cases}$$
ELUs have been shown to speed up training and improve performance compared to ReLU, particularly in deep networks, by introducing non-zero outputs for negative inputs.
7. Softmax Function
The softmax activation function is used primarily in the output layer for multi-class classification problems. It converts the raw output of a network into a probability distribution across multiple classes. The mathematical form is:

$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

where $z_i$ is the output for class $i$, and $K$ is the total number of classes. Softmax ensures that the sum of the output probabilities is equal to 1, making them interpretable as probabilities.
8. Comparison of Activation Functions
| Function | Range | Non-linearity | Derivative | Strengths | Weaknesses |
|---|---|---|---|---|---|
| Sigmoid | $(0, 1)$ | Yes | $\sigma(x)(1-\sigma(x))$ | Smooth gradient, probabilistic output | Vanishing gradient problem |
| Tanh | $(-1, 1)$ | Yes | $1-\tanh^2(x)$ | Zero-centered, faster convergence | Vanishing gradient problem |
| ReLU | $[0, \infty)$ | Yes | $0$ or $1$ | Computationally efficient, no vanishing gradients | Dying ReLU problem |
| Leaky ReLU | $(-\infty, \infty)$ | Yes | $\alpha$ or $1$ | Solves dying ReLU problem | Requires tuning $\alpha$ |
| PReLU | $(-\infty, \infty)$ | Yes | $\alpha$ or $1$ | Learnable slope, adaptive | Computationally expensive |
| ELU | $(-\alpha, \infty)$ | Yes | $\alpha e^{x}$ or $1$ | Improves training speed, continuous negative values | Requires tuning $\alpha$ |
| Softmax | $(0, 1)$ | Yes | Jacobian matrix | Multi-class classification | Sensitive to outliers |
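To make these formulas concrete, here is a minimal NumPy sketch of several of the activation functions above and their derivatives (a didactic illustration, separate from the implementations later in the article):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_derivative(x):
    return 1.0 - np.tanh(x) ** 2

def relu(x):
    return np.maximum(0.0, x)

def relu_derivative(x):
    return (x > 0).astype(float)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def softmax(z):
    e = np.exp(z - np.max(z))   # shift inputs for numerical stability
    return e / e.sum()

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))            # negative inputs clipped to zero
print(softmax(x).sum())   # ~1.0: outputs form a probability distribution
```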
Backpropagation: Derivation and Explanation
Backpropagation is the cornerstone of learning in artificial neural networks. It is a supervised learning algorithm used for training multi-layer neural networks by minimizing the error in the network’s predictions. The process involves computing gradients of the error with respect to the network’s weights and adjusting them using an optimization algorithm, typically stochastic gradient descent (SGD). In this section, we derive the backpropagation algorithm step by step, and explain its significance in neural network training.
1. Problem Setup
Consider a neural network with one hidden layer. The network receives an input $x$ and produces an output $\hat{y}$. The target output is denoted as $y$. Let the weights connecting the input layer to the hidden layer be $W_1$, and the weights connecting the hidden layer to the output layer be $W_2$.
For simplicity, the activation functions in the hidden and output layers are denoted as $f_1$ and $f_2$, respectively.
2. Forward Propagation
During the forward pass, the input is passed through the network to obtain the predicted output. For a given input $x$, the output at the hidden layer is computed as:

$$h = f_1(z_1), \qquad z_1 = W_1 x + b_1$$

where $b_1$ is the bias term for the hidden layer. The output of the network is then computed as:

$$\hat{y} = f_2(z_2), \qquad z_2 = W_2 h + b_2$$

where $b_2$ is the bias term for the output layer.
3. Loss Function
The loss function measures the error between the predicted output $\hat{y}$ and the actual target $y$. A common choice for the loss function in regression problems is the Mean Squared Error (MSE):

$$L = \frac{1}{2}\left(y - \hat{y}\right)^2$$

The factor $\frac{1}{2}$ is used to simplify the derivative calculation later. For classification problems, other loss functions such as cross-entropy can be used.
4. Backpropagation: Computing Gradients
The goal of backpropagation is to calculate the gradients of the loss function with respect to the weights $W_1$ and $W_2$ by applying the chain rule of differentiation. We will compute the gradient step-by-step, starting with the gradient of the loss function with respect to the output $\hat{y}$.
4.1. Gradient of the Loss with respect to the Output Layer
We first compute the derivative of the loss function with respect to the output $\hat{y}$:

$$\frac{\partial L}{\partial \hat{y}} = -(y - \hat{y}) = \hat{y} - y$$

Next, we compute the gradient of the loss with respect to the weights $W_2$. The chain rule tells us that:

$$\frac{\partial L}{\partial W_2} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_2} \cdot \frac{\partial z_2}{\partial W_2}$$

Since the output $\hat{y} = f_2(z_2)$ with $z_2 = W_2 h + b_2$, we have:

$$\frac{\partial \hat{y}}{\partial z_2} = f_2'(z_2), \qquad \frac{\partial z_2}{\partial W_2} = h$$

Thus, the gradient of the loss with respect to $W_2$ is:

$$\frac{\partial L}{\partial W_2} = (\hat{y} - y)\, f_2'(z_2)\, h$$
4.2. Gradient of the Loss with respect to the Hidden Layer
Now, we compute the gradient with respect to the hidden layer. The chain rule gives us:

$$\frac{\partial L}{\partial h} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_2} \cdot \frac{\partial z_2}{\partial h} = (\hat{y} - y)\, f_2'(z_2)\, W_2$$

Since $h = f_1(z_1)$, we have:

$$\frac{\partial h}{\partial z_1} = f_1'(z_1)$$

Thus, the gradient with respect to $z_1$ is:

$$\frac{\partial L}{\partial z_1} = \frac{\partial L}{\partial h} \cdot f_1'(z_1)$$
4.3. Gradient of the Loss with respect to the Hidden Layer Weights
Next, we compute the gradient with respect to the weights $W_1$. We again use the chain rule:

$$\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial z_1} \cdot \frac{\partial z_1}{\partial W_1}$$

Since $z_1 = W_1 x + b_1$, we have:

$$\frac{\partial z_1}{\partial W_1} = x$$

Thus, the gradient with respect to $W_1$ is:

$$\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial z_1} \cdot x$$
5. Weight Updates
Finally, once we have computed the gradients, we update the weights using an optimization algorithm such as stochastic gradient descent (SGD). The update rule for each weight is given by:

$$W \leftarrow W - \eta \frac{\partial L}{\partial W}$$

where $\eta$ is the learning rate.
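To sanity-check the derivation, here is a minimal numeric sketch of the same one-hidden-layer network in NumPy (sigmoid for both $f_1$ and $f_2$, MSE loss, arbitrary made-up values); the printed loss should shrink with each step:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = np.array([0.0, 1.0])                  # input
y = 1.0                                   # target
W1 = rng.uniform(-0.1, 0.1, (3, 2))       # input -> hidden (3 hidden units)
b1 = np.zeros(3)
W2 = rng.uniform(-0.1, 0.1, 3)            # hidden -> output
b2 = 0.0
eta = 0.5                                 # learning rate

for step in range(3):
    # forward pass
    z1 = W1 @ x + b1
    h = sigmoid(z1)
    z2 = W2 @ h + b2
    y_hat = sigmoid(z2)
    print(f"step {step}: loss = {0.5 * (y - y_hat) ** 2:.6f}")

    # backward pass, following the chain-rule derivation above
    delta2 = (y_hat - y) * y_hat * (1 - y_hat)   # dL/dz2
    delta1 = delta2 * W2 * h * (1 - h)           # dL/dz1

    # gradient-descent updates
    W2 -= eta * delta2 * h
    b2 -= eta * delta2
    W1 -= eta * np.outer(delta1, x)
    b1 -= eta * delta1
```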
Weight Initialization Techniques
Weight initialization is a crucial aspect of training deep neural networks. Properly initialized weights can help the model converge faster, avoid issues such as vanishing and exploding gradients, and ensure better performance. This section discusses various weight initialization techniques that have been developed to address the challenges associated with initializing weights in neural networks.
Importance of Weight Initialization
In neural networks, the initial values of weights can significantly impact the optimization process. Poor initialization can lead to slow convergence, or even failure to converge at all. This is especially problematic in deep networks, where the gradients may either vanish (in very deep networks) or explode (when gradients are too large). Thus, proper weight initialization techniques are required to ensure that the network trains effectively.
Random Initialization
In the early days of neural networks, a common practice was to initialize weights randomly, often uniformly or with a Gaussian (normal) distribution. The general idea behind random initialization is to break the symmetry in the network, ensuring that each neuron learns different features and avoids the problem of neurons learning the same features.
For a given layer $l$, the weights might be initialized as:

$$W^{(l)} \sim \mathcal{U}(-\epsilon, \epsilon)$$

where $\mathcal{U}$ denotes a uniform distribution, and $\epsilon$ is a small constant (e.g., $\epsilon = 0.1$).
While random initialization is a simple approach, it can cause issues like slow convergence or poor generalization. This led to the development of more sophisticated initialization techniques.
Xavier/Glorot Initialization
The Xavier initialization (also known as Glorot initialization) was proposed to address the issue of gradients vanishing or exploding in deep networks. It takes into account the number of input and output units in a layer and scales the weights accordingly. The idea is to ensure that the variance of the activations and gradients is roughly the same across all layers.
For a layer with $n_{\text{in}}$ input neurons and $n_{\text{out}}$ output neurons, the weights are initialized from a distribution with zero mean and variance given by:

$$\mathrm{Var}(W) = \frac{2}{n_{\text{in}} + n_{\text{out}}}$$

In practice, this means that the weights are drawn from a uniform or normal distribution with mean zero and variance $\frac{2}{n_{\text{in}} + n_{\text{out}}}$. The Xavier initialization works well when using the sigmoid or tanh activation functions, which are commonly used in classical neural networks.
He Initialization
He initialization, proposed by Kaiming He and colleagues, improves upon Xavier initialization by scaling the weights based on the number of input units in each layer. This technique is particularly useful when ReLU or its variants (such as Leaky ReLU or Parametric ReLU) are used as activation functions, as these activations tend to have a different behavior than sigmoid or tanh.
The weights are initialized using a normal distribution with zero mean and variance:

$$\mathrm{Var}(W) = \frac{2}{n_{\text{in}}}$$

where $n_{\text{in}}$ is the number of input neurons to the layer. This initialization helps prevent the gradients from vanishing and exploding by ensuring that the activations and gradients stay in a reasonable range during the forward and backward passes.
LeCun Initialization
LeCun initialization is another initialization technique, which is specifically designed for deep networks using the sigmoid or tanh activation functions. It is similar to Xavier initialization but adjusts the variance based on the number of input neurons to prevent vanishing gradients.
The weights are initialized using a normal distribution with zero mean and variance:

$$\mathrm{Var}(W) = \frac{1}{n_{\text{in}}}$$

where $n_{\text{in}}$ is the number of input neurons. This technique helps to balance the signal propagation through layers and mitigates issues with vanishing gradients.
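A compact NumPy sketch of these three schemes (the layer sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
n_in, n_out = 256, 128

# Xavier/Glorot: Var(W) = 2 / (n_in + n_out)
w_xavier = rng.normal(0.0, np.sqrt(2.0 / (n_in + n_out)), (n_out, n_in))

# He: Var(W) = 2 / n_in (suited to ReLU-family activations)
w_he = rng.normal(0.0, np.sqrt(2.0 / n_in), (n_out, n_in))

# LeCun: Var(W) = 1 / n_in
w_lecun = rng.normal(0.0, np.sqrt(1.0 / n_in), (n_out, n_in))

for name, w in [("xavier", w_xavier), ("he", w_he), ("lecun", w_lecun)]:
    print(f"{name}: empirical variance = {w.var():.6f}")
```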
Uniform vs. Normal Distribution
In practice, weights can be initialized using either a uniform distribution or a normal distribution. The key difference lies in how the distribution is defined:
- Uniform Distribution: Weights are drawn from a uniform distribution between two values (e.g., $-a$ and $a$).
- Normal Distribution: Weights are drawn from a normal (Gaussian) distribution with a specified mean and variance.
Each distribution has its own advantages. The uniform distribution is often simpler to implement and may work well for certain problems, while the normal distribution ensures that the weights are centered around zero and are more likely to cover a wider range of values.
Bias Initialization
While the initialization of weights is crucial, the initialization of biases also plays an important role in training neural networks. Biases are typically initialized to small constant values, often zero. Initializing biases to zero is generally sufficient for most cases because the bias update during backpropagation does not suffer from the same issues as weight updates (i.e., vanishing or exploding gradients). In some cases, however, small positive values are used to avoid neurons starting in the “off” state (e.g., for ReLU activations).
Bias vs. Weight Initialization for Specific Activation Functions
Different activation functions can benefit from different weight and bias initialization strategies. The table below summarizes the recommended initialization strategies for common activation functions:
| Activation Function | Recommended Weight Initialization |
|---|---|
| Sigmoid | Xavier/Glorot |
| Tanh | Xavier/Glorot |
| ReLU | He Initialization |
| Leaky ReLU | He Initialization |
| Softmax | Xavier/Glorot |
Batch Normalization and Weight Initialization
Batch normalization (BN) is a technique that normalizes the output of each layer, which can alleviate the problem of poor initialization by ensuring that the inputs to each layer have zero mean and unit variance. With batch normalization, the impact of weight initialization becomes less critical, as the normalization step reduces the risk of vanishing and exploding gradients. However, proper initialization still helps the network learn faster and more efficiently.
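For reference, the batch normalization transform is conventionally written as follows (with batch mean $\mu_B$, batch variance $\sigma_B^2$, learnable scale $\gamma$ and shift $\beta$, and a small constant $\epsilon$ for numerical stability):

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta$$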
Python Implementation
In this section, we provide an implementation of a simple feedforward neural network in Python using NumPy. This implementation mirrors the structure and functionality of the neural network described in previous sections, but it is implemented using Python's powerful scientific computing library, NumPy.
The neural network we are implementing is a basic multi-layer perceptron (MLP) with one hidden layer. The network is trained using the backpropagation algorithm to solve three logical problems: XOR, OR, and AND.
Class Definition: NeuralNet
The `NeuralNet` class encapsulates the properties and behaviors of our neural network. It includes methods for training the network, performing predictions, and initializing the weights.
Initialization Method: __init__
The `__init__` method is used to initialize the network's properties. It takes the input size, hidden layer size, output size, and learning rate as parameters. The weights are initialized randomly using a uniform distribution between -0.1 and 0.1.
```python
import numpy as np

class NeuralNet:
    def __init__(self, input_size, hidden_size, output_size, learning_rate):
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.learning_rate = learning_rate
        self.weights_input_hidden = np.random.uniform(-0.1, 0.1, (input_size, hidden_size))
        self.weights_hidden_output = np.random.uniform(-0.1, 0.1, hidden_size)
```
Here, `self.weights_input_hidden` stores the weights connecting the input layer to the hidden layer, and `self.weights_hidden_output` stores the weights connecting the hidden layer to the output layer.
Activation Functions: `sigmoid` and `sigmoid_derivative`
The activation function used in this implementation is the sigmoid function, which is commonly used in neural networks due to its smooth gradient and the fact that its output is between 0 and 1, making it suitable for binary classification problems.
```python
    @staticmethod
    def sigmoid(x):
        return 1 / (1 + np.exp(-x))

    @staticmethod
    def sigmoid_derivative(x):
        return x * (1 - x)
```
The `sigmoid_derivative` function is used during backpropagation to compute the gradients of the error with respect to the weights. Note that it expects the sigmoid output rather than the raw input, since $\sigma'(x) = \sigma(x)(1 - \sigma(x))$.
Training Method: train
The `train` method implements the backpropagation algorithm to update the weights of the network. It loops over all training samples for a specified number of epochs, performs forward propagation, computes the error, and updates the weights using gradient descent.
```python
    def train(self, inputs, outputs, epochs):
        inputs = np.array(inputs)
        outputs = np.array(outputs)
        for epoch in range(epochs):
            for sample, target in zip(inputs, outputs):
                hidden_layer = self.sigmoid(np.dot(sample, self.weights_input_hidden))
                output = self.sigmoid(np.dot(hidden_layer, self.weights_hidden_output))
                error = target - output
                delta_output = error * self.sigmoid_derivative(output)
                delta_hidden = delta_output * self.weights_hidden_output * self.sigmoid_derivative(hidden_layer)
                self.weights_hidden_output += self.learning_rate * delta_output * hidden_layer
                self.weights_input_hidden += self.learning_rate * np.outer(sample, delta_hidden)
```
In this method:
- The input is passed through the input-to-hidden weights, and the hidden layer output is computed using the sigmoid function.
- The output of the network is then calculated by passing the hidden layer output through the hidden-to-output weights.
- The error is computed by subtracting the network's output from the actual target value.
- The error is backpropagated through the network by calculating the gradients and updating the weights accordingly.
Prediction Method: predict
The `predict` method performs forward propagation given an input sample and computes the network's output. It is used to make predictions after the network has been trained.
```python
    def predict(self, input):
        hidden_layer = self.sigmoid(np.dot(input, self.weights_input_hidden))
        output = self.sigmoid(np.dot(hidden_layer, self.weights_hidden_output))
        return output
```
Main Program
In the main program, we define the datasets for the XOR, OR, and AND logical operations. We then instantiate the neural network for each logical operation, train the network, and print the results after training.
```python
if __name__ == "__main__":
    Xor = [[0, 0], [0, 1], [1, 0], [1, 1]]
    y_xor = [0, 1, 1, 0]

    Or = [[0, 0], [0, 1], [1, 0], [1, 1]]
    y_or = [0, 1, 1, 1]

    And = [[0, 0], [0, 1], [1, 0], [1, 1]]
    y_and = [0, 0, 0, 1]

    net_xor = NeuralNet(input_size=2, hidden_size=12, output_size=1, learning_rate=0.05)
    net_or = NeuralNet(input_size=2, hidden_size=12, output_size=1, learning_rate=0.05)
    net_and = NeuralNet(input_size=2, hidden_size=12, output_size=1, learning_rate=0.05)

    net_xor.train(Xor, y_xor, epochs=200000)
    net_or.train(Or, y_or, epochs=200000)
    net_and.train(And, y_and, epochs=200000)

    print("Post-training results (XOR Problem):")
    for sample in Xor:
        output = net_xor.predict(sample)
        print(f"Input: {sample} -> Output: {output:.6f} (Prediction: {int(output > 0.5)})")

    print("\nPost-training results (OR Problem):")
    for sample in Or:
        output = net_or.predict(sample)
        print(f"Input: {sample} -> Output: {output:.6f} (Prediction: {int(output > 0.5)})")

    print("\nPost-training results (AND Problem):")
    for sample in And:
        output = net_and.predict(sample)
        print(f"Input: {sample} -> Output: {output:.6f} (Prediction: {int(output > 0.5)})")
```
Explanation of the Output
After training the neural network for 200,000 epochs on each of the logical datasets (XOR, OR, AND), the network is expected to output the correct predictions for each input sample. The output will be close to 0 or 1, with the network's prediction being classified as 1 if the output is greater than 0.5, and 0 otherwise.
```
Post-training results (XOR Problem):
Input: [0, 0] -> Output: 0.018405 (Prediction: 0)
Input: [0, 1] -> Output: 0.982032 (Prediction: 1)
Input: [1, 0] -> Output: 0.982620 (Prediction: 1)
Input: [1, 1] -> Output: 0.016189 (Prediction: 0)

Post-training results (OR Problem):
Input: [0, 0] -> Output: 0.010663 (Prediction: 0)
Input: [0, 1] -> Output: 0.993518 (Prediction: 1)
Input: [1, 0] -> Output: 0.993501 (Prediction: 1)
Input: [1, 1] -> Output: 0.999262 (Prediction: 1)

Post-training results (AND Problem):
Input: [0, 0] -> Output: 0.000000 (Prediction: 0)
Input: [0, 1] -> Output: 0.009322 (Prediction: 0)
Input: [1, 0] -> Output: 0.009421 (Prediction: 0)
Input: [1, 1] -> Output: 0.986279 (Prediction: 1)
```
ANSI C Implementation
In this section, we provide an implementation of a simple feedforward neural network in ANSI C. The network includes a single hidden layer and uses backpropagation to train on logical problems like XOR, OR, and AND. The code is designed to be efficient, utilizing C's low-level memory management and computational power.
Before following the tutorial, you can organize your project files as follows:
```
├── CMakeLists.txt
├── neural_net.c
├── neural_net.h
├── tests
│   ├── test_and.c
│   ├── test_or.c
│   └── test_xor.c
└── build
```
Header File: neural_net.h
The `neural_net.h` header defines the structure and function prototypes for the neural network. The structure `NeuralNet` holds the configuration and weights of the network, and the functions include methods for creating the network, training, predicting, and freeing resources.
```c
#ifndef NEURAL_NET_H
#define NEURAL_NET_H

#include <math.h>

typedef struct
{
    int input_size;
    int hidden_size;
    int output_size;
    double **weights_input_hidden;
    double *weights_hidden_output;
    double learning_rate;
} NeuralNet;

NeuralNet *create_neural_net(int input_size, int hidden_size, int output_size, double learning_rate);
void train(NeuralNet *net, double **inputs, double *outputs, int num_samples, int epochs);
double predict(NeuralNet *net, double *input);
void free_neural_net(NeuralNet *net);

#endif
```
- `input_size`, `hidden_size`, and `output_size` store the dimensions of the network.
- `weights_input_hidden` and `weights_hidden_output` store the weights of the network, where `weights_input_hidden` is a 2D array and `weights_hidden_output` is a 1D array.
- `learning_rate` determines the step size in the weight update during training.
The functions declared here are responsible for initializing the network, training it using the backpropagation algorithm, making predictions, and freeing dynamically allocated memory.
Source File: neural_net.c
The `neural_net.c` file contains the function definitions that implement the neural network's functionality, including the forward propagation and backpropagation algorithms.
Helper Functions
- `sigmoid`: The activation function used in the network.
- `sigmoid_derivative`: The derivative of the sigmoid function, which is used during backpropagation.
- `random_weight`: A helper function to initialize the weights with small random values.
```c
#include <stdlib.h>
#include <math.h>
#include "neural_net.h"

static double sigmoid(double x)
{
    return 1.0 / (1.0 + exp(-x));
}

/* Expects the sigmoid output, not the raw input: s'(x) = s(x) * (1 - s(x)). */
static double sigmoid_derivative(double x)
{
    return x * (1 - x);
}

/* Uniform random weight in [-0.1, 0.1]. */
static double random_weight()
{
    return ((double)rand() / RAND_MAX) * 0.2 - 0.1;
}
```
Neural Network Creation: create_neural_net
This function allocates memory for the neural network structure, initializes the weights randomly, and sets the network parameters (input size, hidden size, output size, and learning rate).
```c
NeuralNet *create_neural_net(int input_size, int hidden_size, int output_size, double learning_rate)
{
    NeuralNet *net = (NeuralNet *)malloc(sizeof(NeuralNet));
    net->input_size = input_size;
    net->hidden_size = hidden_size;
    net->output_size = output_size;
    net->learning_rate = learning_rate;

    net->weights_input_hidden = (double **)malloc(input_size * sizeof(double *));
    for (int i = 0; i < input_size; i++)
    {
        net->weights_input_hidden[i] = (double *)malloc(hidden_size * sizeof(double));
        for (int j = 0; j < hidden_size; j++)
        {
            net->weights_input_hidden[i][j] = random_weight();
        }
    }

    net->weights_hidden_output = (double *)malloc(hidden_size * sizeof(double));
    for (int i = 0; i < hidden_size; i++)
    {
        net->weights_hidden_output[i] = random_weight();
    }

    return net;
}
```
Training Function: train
The `train` function implements the backpropagation algorithm. It updates the weights iteratively over a specified number of epochs using the training data. The function calculates the error, propagates it backward, and adjusts the weights.
```c
void train(NeuralNet *net, double **inputs, double *outputs, int num_samples, int epochs)
{
    double *hidden_layer = (double *)malloc(net->hidden_size * sizeof(double));
    double output;

    for (int epoch = 0; epoch < epochs; epoch++)
    {
        for (int sample = 0; sample < num_samples; sample++)
        {
            /* Forward pass: input -> hidden layer */
            for (int j = 0; j < net->hidden_size; j++)
            {
                hidden_layer[j] = 0.0;
                for (int i = 0; i < net->input_size; i++)
                {
                    hidden_layer[j] += inputs[sample][i] * net->weights_input_hidden[i][j];
                }
                hidden_layer[j] = sigmoid(hidden_layer[j]);
            }

            /* Forward pass: hidden layer -> output */
            output = 0.0;
            for (int j = 0; j < net->hidden_size; j++)
            {
                output += hidden_layer[j] * net->weights_hidden_output[j];
            }
            output = sigmoid(output);

            /* Error and output delta */
            double error = outputs[sample] - output;
            double delta_output = error * sigmoid_derivative(output);

            /* Backpropagate to the hidden layer; update hidden->output weights */
            double *delta_hidden = (double *)malloc(net->hidden_size * sizeof(double));
            for (int j = 0; j < net->hidden_size; j++)
            {
                delta_hidden[j] = delta_output * net->weights_hidden_output[j] * sigmoid_derivative(hidden_layer[j]);
                net->weights_hidden_output[j] += net->learning_rate * delta_output * hidden_layer[j];
            }

            /* Update input->hidden weights */
            for (int i = 0; i < net->input_size; i++)
            {
                for (int j = 0; j < net->hidden_size; j++)
                {
                    net->weights_input_hidden[i][j] += net->learning_rate * delta_hidden[j] * inputs[sample][i];
                }
            }

            free(delta_hidden);
        }
    }

    free(hidden_layer);
}
```
This function works as follows:
- Forward Propagation: The input is propagated through the network to the output.
- Error Calculation: The difference between the network's prediction and the target output is computed.
- Backpropagation: The error is propagated backward to update the weights of the network.
Prediction Function: predict
The `predict` function performs forward propagation using a given input and returns the network's output.
```c
double predict(NeuralNet *net, double *input)
{
    double *hidden_layer = (double *)malloc(net->hidden_size * sizeof(double));

    for (int j = 0; j < net->hidden_size; j++)
    {
        hidden_layer[j] = 0.0;
        for (int i = 0; i < net->input_size; i++)
        {
            hidden_layer[j] += input[i] * net->weights_input_hidden[i][j];
        }
        hidden_layer[j] = sigmoid(hidden_layer[j]);
    }

    double output = 0.0;
    for (int j = 0; j < net->hidden_size; j++)
    {
        output += hidden_layer[j] * net->weights_hidden_output[j];
    }
    output = sigmoid(output);

    free(hidden_layer);
    return output;
}
```
Freeing Resources: free_neural_net
This function deallocates the memory used by the neural network, ensuring there are no memory leaks.
```c
void free_neural_net(NeuralNet *net)
{
    for (int i = 0; i < net->input_size; i++)
    {
        free(net->weights_input_hidden[i]);
    }
    free(net->weights_input_hidden);
    free(net->weights_hidden_output);
    free(net);
}
```
CMake Configuration: CMakeLists.txt
The CMake configuration is used to build the neural network project. It specifies the C standard, the source files, and links the math library for functions like `exp()`.
```cmake
cmake_minimum_required(VERSION 3.10)
project(NeuralNet)

set(CMAKE_C_STANDARD 99)

set(SOURCES
    neural_net.c
)

include_directories(${CMAKE_SOURCE_DIR})

add_executable(run_and ${SOURCES} tests/test_and.c)
add_executable(run_or ${SOURCES} tests/test_or.c)
add_executable(run_xor ${SOURCES} tests/test_xor.c)

target_link_libraries(run_and m)
target_link_libraries(run_or m)
target_link_libraries(run_xor m)

add_custom_command(TARGET run_and PRE_BUILD
    COMMAND ${CMAKE_COMMAND} -E make_directory ${CMAKE_BINARY_DIR})
add_custom_command(TARGET run_or PRE_BUILD
    COMMAND ${CMAKE_COMMAND} -E make_directory ${CMAKE_BINARY_DIR})
add_custom_command(TARGET run_xor PRE_BUILD
    COMMAND ${CMAKE_COMMAND} -E make_directory ${CMAKE_BINARY_DIR})
```
The CMake configuration defines three separate executables for testing the XOR, OR, and AND logic gates.
Logical Gates: XOR, AND, OR
In this section, we apply the neural network implementation to solve basic logical gate problems: XOR, AND, and OR. These problems are often used to test the ability of a neural network to learn non-linear patterns and generalize based on training data.
XOR Problem: test_xor.c
The XOR (exclusive OR) problem is a classic non-linear problem where the output is true (1) only when exactly one of the inputs is true (1). This is represented as follows:
| Input 1 | Input 2 | Output (XOR) |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
The neural network is trained on this data, and after training, the network should be able to predict the correct output for each combination of inputs.
XOR Test Code (`test_xor.c`)
```c
#include <stdio.h>
#include "neural_net.h"

int main()
{
    double Xor[4][2] = {{0, 0}, {0, 1}, {1, 0}, {1, 1}};
    double y_xor[4] = {0, 1, 1, 0};
    double *inputs[4] = {Xor[0], Xor[1], Xor[2], Xor[3]};

    NeuralNet *net = create_neural_net(2, 12, 1, 0.05);
    train(net, inputs, y_xor, 4, 200000);

    printf("Post-training results (XOR Problem):\n");
    for (int i = 0; i < 4; i++)
    {
        double output = predict(net, Xor[i]);
        printf("Input: %d, %d -> Output: %f (Prediction: %d)\n",
               (int)Xor[i][0], (int)Xor[i][1], output, (output > 0.5) ? 1 : 0);
    }

    free_neural_net(net);
    return 0;
}
```
Explanation:
- Inputs: The XOR data is stored in a 2D array `Xor[4][2]`, where each row represents an input pair.
- Expected Outputs: The target outputs are stored in `y_xor[4]`, which corresponds to the XOR of the two inputs.
- Training: The neural network is created with an input size of 2, a hidden layer size of 12, and an output size of 1. The learning rate is set to 0.05, and the network is trained for 200,000 epochs.
- Prediction: After training, the network predicts the output for each input pair, and the results are printed, where the output is compared to the expected prediction (0 or 1).
OR Problem: test_or.c
The OR problem is another simple logical operation where the output is true (1) if at least one of the inputs is true (1). This is represented as follows:
| Input 1 | Input 2 | Output (OR) |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 1 |
OR Test Code (`test_or.c`)
```c
#include <stdio.h>
#include "neural_net.h"

int main()
{
    double X_or[4][2] = {{0, 0}, {0, 1}, {1, 0}, {1, 1}};
    double y_or[4] = {0, 1, 1, 1};
    double *inputs[4] = {X_or[0], X_or[1], X_or[2], X_or[3]};

    NeuralNet *net = create_neural_net(2, 12, 1, 0.05);
    train(net, inputs, y_or, 4, 200000);

    printf("Post-training results (OR Problem):\n");
    for (int i = 0; i < 4; i++)
    {
        double output = predict(net, X_or[i]);
        printf("Input: %d, %d -> Output: %f (Prediction: %d)\n",
               (int)X_or[i][0], (int)X_or[i][1], output, (output > 0.5) ? 1 : 0);
    }

    free_neural_net(net);
    return 0;
}
```
Explanation:
- Inputs: The OR data is stored in a 2D array `X_or[4][2]`.
- Expected Outputs: The expected outputs for the OR operation are stored in `y_or[4]`.
- Training: Similar to the XOR problem, the neural network is trained for 200,000 epochs.
- Prediction: The predictions for each input pair are printed out.
AND Problem: test_and.c
The AND problem is a logical operation where the output is true (1) only when both inputs are true (1). This is represented as follows:
| Input 1 | Input 2 | Output (AND) |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 1 |
AND Test Code (`test_and.c`)
```c
#include <stdio.h>
#include "neural_net.h"

int main()
{
    double X_and[4][2] = {{0, 0}, {0, 1}, {1, 0}, {1, 1}};
    double y_and[4] = {0, 0, 0, 1};
    double *inputs[4] = {X_and[0], X_and[1], X_and[2], X_and[3]};

    NeuralNet *net = create_neural_net(2, 12, 1, 0.05);
    train(net, inputs, y_and, 4, 100000);

    printf("Post-training results (AND Problem):\n");
    for (int i = 0; i < 4; i++)
    {
        double output = predict(net, X_and[i]);
        printf("Input: %d, %d -> Output: %f (Prediction: %d)\n",
               (int)X_and[i][0], (int)X_and[i][1], output, (output > 0.5) ? 1 : 0);
    }

    free_neural_net(net);
    return 0;
}
```
Explanation:
- Inputs: The AND data is stored in `X_and[4][2]`.
- Expected Outputs: The expected outputs for the AND operation are stored in `y_and[4]`.
- Training: The network is trained for 100,000 epochs.
- Prediction: After training, the predictions for each input pair are printed.
If we look at the data we obtained from all our tests, we see that we were successful:
```
Post-training results (XOR Problem):
Input: 0, 0 -> Output: 0.019325 (Prediction: 0)
Input: 0, 1 -> Output: 0.981115 (Prediction: 1)
Input: 1, 0 -> Output: 0.981786 (Prediction: 1)
Input: 1, 1 -> Output: 0.017010 (Prediction: 0)

Post-training results (OR Problem):
Input: 0, 0 -> Output: 0.008416 (Prediction: 0)
Input: 0, 1 -> Output: 0.994724 (Prediction: 1)
Input: 1, 0 -> Output: 0.994704 (Prediction: 1)
Input: 1, 1 -> Output: 0.999842 (Prediction: 1)

Post-training results (AND Problem):
Input: 0, 0 -> Output: 0.000000 (Prediction: 0)
Input: 0, 1 -> Output: 0.017231 (Prediction: 0)
Input: 1, 0 -> Output: 0.017089 (Prediction: 0)
Input: 1, 1 -> Output: 0.975378 (Prediction: 1)
```
Comparing Python and ANSI C Implementations
In this section, we perform a comprehensive comparison between the neural network implementations in Python and ANSI C, focusing on their performance and execution efficiency. Although both implementations utilize the same fundamental structure and training algorithm, the differences in their underlying programming languages yield notable contrasts in performance and runtime characteristics.
Theoretical Foundations:
The neural network model implemented in both Python and ANSI C uses a single hidden layer with a sigmoid activation function. This model is trained using a simple backpropagation algorithm, where the weights are updated based on the gradient of the error with respect to the weights. Both implementations apply the same dataset (XOR, OR, and AND logical operations) and undergo identical training regimes, making this comparison a fair assessment of the performance characteristics of both languages in the context of neural network training.
Python Implementation
Python, being a high-level language, allows for rapid development and testing. Libraries such as NumPy provide highly optimized matrix operations, but the language itself introduces some overhead due to its dynamic typing and interpreted nature. This makes Python an excellent choice for prototyping and experimentation, but less suited for performance-critical applications where low latency is crucial.
ANSI C Implementation
ANSI C, on the other hand, offers low-level access to memory and a more direct way of manipulating data. The absence of an interpreter and the use of static typing allows C to execute faster than Python in many cases. Additionally, C gives the programmer complete control over memory management, which, when utilized effectively, can result in significantly improved performance.
Empirical Performance Comparison:
To evaluate the efficiency of both implementations, we conducted time tests on the XOR, OR, and AND logical gate problems. The following results were observed:
C Implementation Results
```
❯ time ./run_xor; time ./run_or; time ./run_and
Post-training results (XOR Problem):
Input: 0, 0 -> Output: 0.019325 (Prediction: 0)
Input: 0, 1 -> Output: 0.981115 (Prediction: 1)
Input: 1, 0 -> Output: 0.981786 (Prediction: 1)
Input: 1, 1 -> Output: 0.017010 (Prediction: 0)
./run_xor 0,39s user 0,00s system 99% cpu 0,398 total

Post-training results (OR Problem):
Input: 0, 0 -> Output: 0.008416 (Prediction: 0)
Input: 0, 1 -> Output: 0.994724 (Prediction: 1)
Input: 1, 0 -> Output: 0.994704 (Prediction: 1)
Input: 1, 1 -> Output: 0.999842 (Prediction: 1)
./run_or 0,39s user 0,00s system 99% cpu 0,387 total

Post-training results (AND Problem):
Input: 0, 0 -> Output: 0.000000 (Prediction: 0)
Input: 0, 1 -> Output: 0.017231 (Prediction: 0)
Input: 1, 0 -> Output: 0.017089 (Prediction: 0)
Input: 1, 1 -> Output: 0.975378 (Prediction: 1)
./run_and 0,19s user 0,00s system 99% cpu 0,196 total
```
Python Implementation Results
```
❯ time python ./run.py
Post-training results (XOR Problem):
Input: [0, 0] -> Output: 0.019555 (Prediction: 0)
Input: [0, 1] -> Output: 0.980941 (Prediction: 1)
Input: [1, 0] -> Output: 0.981516 (Prediction: 1)
Input: [1, 1] -> Output: 0.017204 (Prediction: 0)

Post-training results (OR Problem):
Input: [0, 0] -> Output: 0.009163 (Prediction: 0)
Input: [0, 1] -> Output: 0.994319 (Prediction: 1)
Input: [1, 0] -> Output: 0.994298 (Prediction: 1)
Input: [1, 1] -> Output: 0.999694 (Prediction: 1)

Post-training results (AND Problem):
Input: [0, 0] -> Output: 0.000000 (Prediction: 0)
Input: [0, 1] -> Output: 0.010318 (Prediction: 0)
Input: [1, 0] -> Output: 0.010156 (Prediction: 0)
Input: [1, 1] -> Output: 0.985612 (Prediction: 1)
python run.py 41,19s user 0,02s system 101% cpu 40,771 total
```
Analysis of Results
From the performance results, we can make several key observations:
- Execution Time
  - The C implementation demonstrates significantly faster execution times than the Python implementation. The C tests finish in roughly 0.4 seconds each for XOR and OR (and about 0.2 seconds for AND, which trains for half as many epochs), whereas the Python implementation takes around 40.77 seconds for the same set of tests. This stark difference arises primarily from the overhead introduced by the Python interpreter and the dynamic nature of the language.
  - The C version runs almost in real time, while the Python version, despite its convenient high-level operations, incurs considerable time penalties due to the interpreted execution environment.
- Optimization Potential
  - The C code's performance stems primarily from being compiled: each instruction is executed directly by the hardware. The absence of an interpreter and the ability to manipulate memory directly let C leverage hardware capabilities to a greater extent.
  - Python's ease of use, on the other hand, comes at the cost of slower execution, especially for computationally intensive tasks like training a neural network. This trade-off is acceptable for smaller-scale problems or for environments where developer productivity and ease of debugging matter more.
- Scalability
  - As the complexity of the neural network increases (e.g., adding more layers or increasing the dataset size), the performance gap between Python and C becomes even more pronounced. While Python can still be used effectively for small-scale models, C provides a far more efficient alternative for large-scale models or time-sensitive applications.
If you want to access the source code, you can contact me. I could upload it to a repo, but I'm pretty sure all you'd do is copy and paste the code from there and then forget how it's coded. So you should try to write it yourself; all the code is already in the article. But if you have any questions, especially if you run into compilation problems or errors, definitely contact me.