How a neural network works
Before we ask how much of our thinking can be done by a neural network, we need to examine how it works from a computational-model perspective. The idea is to build a computer that can program itself by simply looking at input-output pairs of the program it is supposed to run. Why would we want to do that? Suppose we have the problem of reading handwritten text. We know that there must be a program that can read handwritten text, because we can do it ourselves, and by the Church-Turing thesis any computation can be performed on a Turing machine. But we do not know how to write this program. We do, however, have samples of the same text in both handwritten and ASCII form; so we would like our machine to program itself by simply consuming these examples.
It is difficult to express such self-programming in Turing's model of computation, a machine that changes its state; it is easier with the circuit model. Any computation can be written as a Boolean circuit with NAND gates, or with AND and NOT gates, and we do not mind having more gate types at our disposal. Therefore, if there is a program that can read handwritten text, there is a corresponding Boolean circuit that performs the same task. Suppose the handwritten text is a two-color black-and-white image with one bit per pixel. Each pixel is an input wire into the circuit, and the output is a series of bits representing the text in binary. Some circuit performs this computation; we just need to find it. So we build a machine with programmable gates and programmable connections between them, similar to an FPGA (field-programmable gate array). We restrict ourselves to layered circuits, i.e., the gates are arranged in layers and the inputs of a layer can only come from the layer below it; any circuit can be converted into this form at the cost of extra gates, and the layered form is easier to make programmable. In our programmable layered circuit there is a possible link from every gate in a layer to the input of every gate in the layer above it, and these links can be switched on by programming. We also allow each gate itself to be programmed to any truth table. With these two types of programmability, we can produce any layered circuit.
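To make this concrete, here is a minimal sketch in Python of such a programmable layered circuit (the names and the encoding of gates are my own choices, not any standard library): each gate is programmed with a list of input wires from the layer below and a truth table.

```python
# A toy programmable layered circuit (illustrative sketch).
# Each gate is programmed as (input_indices, truth_table): it reads the
# chosen wires from the layer below and looks up its output bit.

NAND = {(0, 0): 1, (0, 1): 1, (1, 0): 1, (1, 1): 0}
WIRE = {(0,): 0, (1,): 1}  # identity gate, carries a wire up one layer

def run_circuit(layers, inputs):
    wires = list(inputs)
    for layer in layers:  # inputs of a layer come only from the layer below
        wires = [table[tuple(wires[i] for i in idxs)] for idxs, table in layer]
    return wires

# Program the circuit to compute XOR out of NAND gates alone:
# XOR(x, y) = NAND(NAND(x, NAND(x, y)), NAND(y, NAND(x, y)))
xor_program = [
    [((0,), WIRE), ((1,), WIRE), ((0, 1), NAND)],  # carry x, y; n = NAND(x, y)
    [((0, 2), NAND), ((1, 2), NAND)],              # NAND(x, n), NAND(y, n)
    [((0, 1), NAND)],                              # final NAND
]
for x in (0, 1):
    for y in (0, 1):
        print(x, y, "->", run_circuit(xor_program, (x, y))[0])
```

Reprogramming the truth tables and links yields any other layered circuit of the same shape.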
But automatically programming this discrete circuit is hard. A continuous function, on the other hand, can be optimized by gradient descent. So we relax our gates to be continuous and let the wires take any value between 0 and 1. Given two inputs, we can realize all the basic gates using the following formula:
\[
g(x,y) = f(ax+by+c)
\]
where the function \( f(z) \) is defined to be 0 for \( z < 0 \) and 1 for \( z \ge 0 \). The following table shows how to choose the constants \( a, b, c \) to get different Boolean gates.
| Gate | a | b | c |
|------|---|---|-----|
| AND | 1 | 1 | -2 |
| OR | 1 | 1 | -1 |
| NAND | -1 | -1 | 1.5 |
| NOT | -1 | 0 | 0.5 |
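A quick way to verify the table is to run each gate over all Boolean inputs with the step function defined above; a minimal Python sketch (function and variable names are mine):

```python
# Check the table: f(z) = 0 for z < 0, 1 for z >= 0, and g = f(ax + by + c).

def f(z):
    return 1 if z >= 0 else 0

gates = {"AND": (1, 1, -2), "OR": (1, 1, -1),
         "NAND": (-1, -1, 1.5), "NOT": (-1, 0, 0.5)}

for name, (a, b, c) in gates.items():
    outputs = [f(a * x + b * y + c) for x in (0, 1) for y in (0, 1)]
    print(name, outputs)
# AND  [0, 0, 0, 1]; OR [0, 1, 1, 1]; NAND [1, 1, 1, 0]
# NOT  [1, 1, 0, 0]: the second input y is ignored, as noted below.
```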
Note that for the NOT gate we simply ignore the second input, since a NOT gate has only one input. You can verify that the truth tables match for each gate; the sketch above does exactly this check. However, even this construct cannot be used for gradient descent, since \( f \) is not continuous. So we relax \( f(z) \) to the continuous function \( \frac{e^z}{1+e^z} \), called the sigmoid function. In a layered circuit with programmable connections, a gate must consider all the possible inputs, so let \( \mathbf{x}_i \) be the vector of all the wires in layer \( i \). Then we can write:
\[
g(\mathbf{x}) = f(\mathbf{a} \cdot \mathbf{x} + c)
\]
where \( \mathbf{a} \) is the vector of parameters and \( c \) is a parameter called the bias. If we always append a 1 to \( \mathbf{x} \) as an extra dimension, we can absorb \( c \) into \( \mathbf{a} \). Then we can write:
\[
g(\mathbf{x}) = f(\mathbf{a} \cdot \mathbf{x})
\]
If we consider all the gates in a layer, we have a vector \( \mathbf{a}_j \) for the \( j^{th} \) gate. Altogether, applying \( f \) elementwise, we can write:
\[
\mathbf{x}_{i+1} = f(\mathbf{A}_i\mathbf{x}_i)
\]
where \( \mathbf{A}_i \) is a matrix whose \( j^{th} \) row is \( \mathbf{a}_j \). We can now determine the matrices \( \mathbf{A}_i \) through gradient descent. The construction described above is exactly a feed-forward multi-layer perceptron (MLP). Now we can see why neural networks can, in principle, be applied to every problem: the only requirements are a sufficiently large network and reaching the global minimum during gradient descent.
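Here is a minimal numpy sketch of this forward pass (the layer sizes and random weights are placeholders of my choosing), including the always-1 extra dimension that absorbs the bias:

```python
import numpy as np

def sigmoid(z):
    # continuous relaxation of the step function: e^z / (1 + e^z)
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, matrices):
    """Forward pass of the MLP: x_{i+1} = f(A_i x_i), with a constant 1
    appended to each layer's input so the bias c lives inside A_i."""
    for A in matrices:
        x = np.append(x, 1.0)   # the always-1 extra dimension
        x = sigmoid(A @ x)      # every gate computes f(a . x)
    return x

rng = np.random.default_rng(0)
# Placeholder shapes: 4 input wires -> 3 hidden gates -> 1 output gate.
matrices = [rng.normal(size=(3, 5)), rng.normal(size=(1, 4))]
print(forward(np.array([0.0, 1.0, 1.0, 0.0]), matrices))
```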
Note that from this perspective, it does not matter much what your base model is. As long as it can (1) model all computation under some computational bound, and (2) be trained, i.e., its difference from the data can be minimized by an efficient algorithm, it can be made to compute exactly the same things. Two difficulties of training stop us from simply building such a perfect machine. Firstly, the optimization problem is NP-hard, so we take heuristic routes that only approximately optimize it. Secondly, we must have enough data to specify our target function in sufficient detail; too little data simply does not carry enough information to pin down the function we want to learn. Therefore, various activation functions, connection topologies, and other modifications, such as attention networks, have been employed to solve numerous learning problems. The important thing is whether a model can be trained efficiently, and all these modifications are geared towards making training more efficient in terms of both computation and data.
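As a small demonstration of the training side, here is a gradient-descent sketch in plain numpy that fits the two-layer network above to the XOR function (squared loss, hand-derived gradients; the layer width, learning rate, and step count are arbitrary choices of mine, and a different random seed may land in a poor local optimum, which is exactly the heuristic nature just described):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
A1 = rng.normal(size=(4, 3))   # hidden layer: 4 gates, inputs (x1, x2, 1)
A2 = rng.normal(size=(1, 5))   # output layer: 1 gate, inputs (h1..h4, 1)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)   # XOR truth table

lr = 1.0
for step in range(10000):
    Xb = np.hstack([X, np.ones((4, 1))])      # append the constant 1
    H = sigmoid(Xb @ A1.T)                    # hidden activations, (4, 4)
    Hb = np.hstack([H, np.ones((4, 1))])
    out = sigmoid(Hb @ A2.T)                  # predictions, (4, 1)
    # Squared loss; backpropagate through the sigmoids (s' = s(1 - s)).
    d_out = (out - Y) * out * (1 - out)
    grad_A2 = d_out.T @ Hb
    d_hid = (d_out @ A2[:, :4]) * H * (1 - H)
    grad_A1 = d_hid.T @ Xb
    A2 -= lr * grad_A2                        # plain gradient descent step
    A1 -= lr * grad_A1

print(out.round(2))   # usually close to [[0], [1], [1], [0]]
```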
Are there other models?
Yes, lots of them. The transformer models used in large language models are a prime example: the attention heads are a different model from the one described above, though LLMs contain MLPs as well. The point of using other models is to represent specific types of computation (such as next-word prediction, computer vision, or fingerprint recognition) with a smaller network than a multi-layer perceptron would need. A smaller network is much easier to train: it requires less memory, avoids many problems during training, and generalizes better, because its much smaller number of parameters keeps it from memorizing the minor fluctuations and errors in the training data. I will list a few more notable examples:
- Higher-order neural network: each layer applies a linear transformation to the tensor product of the input with itself. That is, we have a matrix \( W \) of size \( n \times (n+1)^2 \) such that \( x_l = W (\{x_{l-1}|1\} \otimes \{x_{l-1}|1\}) \), where \( | \) denotes concatenation (appending the constant 1) and \( l \) indexes the layer. Stacked over many layers, this computes a polynomial of the input. We can even use higher-order tensor products. The problem lies in the size of the matrix.
- Convolutional Neural Network: used for image processing, where it makes sense to first compute local features and then compose them into global features. A small network is applied to overlapping windows of the image, and the concatenation of its output vectors is then passed on to the network above (a minimal sketch follows this list).
- Recurrent Neural Network: long the standard for text processing, it applies the same network over and over, producing both an output and a hidden state at each step. The hidden state is fed back as part of the input at the next step, serving as working memory across applications of the same network. It is still used in conjunction with LLMs to extend their context window.
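As an illustration of the convolutional idea from the list above (a 1-D toy with made-up sizes; real CNNs work on 2-D images with many feature channels): the same small network slides over overlapping windows, so its few parameters are shared across all positions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv_layer(signal, w, b):
    """Apply one small 'local' network (weights w, bias b) to every
    overlapping window of a 1-D input. The outputs are concatenated and
    would feed the network above; reusing the same w at every position
    is what keeps the parameter count small."""
    k = len(w)
    return np.array([sigmoid(signal[i:i + k] @ w + b)
                     for i in range(len(signal) - k + 1)])

rng = np.random.default_rng(2)
signal = rng.random(16)          # stand-in for a row of image pixels
w, b = rng.normal(size=3), 0.0   # one 3-wide local feature detector
print(conv_layer(signal, w, b))  # 14 local feature values
```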
There are plenty of other such networks. The thing to understand is that all these modifications exist to facilitate optimization and to represent specific types of functions with fewer parameters. Any of them can learn any function given enough size, training data, and compute power.
What stops it from making AGI?
There are primarily three hurdles to overcome when approximating an arbitrary computational problem:
- Size of the model: any neural network is a bounded circuit, i.e., depending on the size of the computation, you might need bigger and bigger models. Bigger models can hold more intricate patterns, improving prediction, but they also require more computational resources to store and run. We keep hitting the limits of available computational power and keep pushing past them. No one is sure at what size a model becomes AGI; however, extrapolating the scaling laws against human performance suggests that, with current LLM architectures, the computational power required is trillions of times greater than what we have today.
- Size of the data: to specify an intricate function, you need a lot of data. Kolmogorov complexity matters here. The Kolmogorov complexity of a string is the minimum size of a program, written in a specific language (or for a specific Universal Turing Machine), that generates the string. A string with a simple structure, such as counting in binary \( 1, 10, 11, 100, \ldots \) for \( m \) terms, takes a very small program to generate. For a random string, on the other hand, the program would be slightly larger than the string itself (it would literally have to print the string). The Kolmogorov complexity of a function on a finite domain is the Kolmogorov complexity of the string that lists the function's outputs in order, like \( f(0), f(1), f(2), \ldots \) over the whole input domain; notice that for a simple logical function this is also small. Now, the model and the data are two different representations of the same function, so the data cannot specify a function whose Kolmogorov complexity exceeds the size of the data itself (a rough illustration using compression follows this list). In practice, the data required to train a model is several times larger than the model, because the data contains errors and thus needs redundancy to represent the function properly. This is why we are using the whole internet's data to train the frontier models.
- Optimization: optimizing a highly non-convex function is NP-hard, i.e., there is no known efficient algorithm that truly reaches the optimal parameters. We can, however, reach near-optimal or at least good-enough parameter values with variations of gradient descent. This is more a heuristic exercise than a provably correct algorithm, and we specialize models so that they can be optimized with the computational power available.
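Kolmogorov complexity itself is uncomputable, but an off-the-shelf compressor gives a crude upper bound; this sketch (zlib as the stand-in compressor, my choice) contrasts a structured string with a random one:

```python
import os
import zlib

structured = b"10" * 5000        # simple repeating pattern, 10,000 bytes
random_str = os.urandom(10000)   # incompressible with high probability

# Compressed size upper-bounds Kolmogorov complexity (up to the decoder):
# the structured string shrinks to a few dozen bytes, the random one
# stays at roughly its original size.
print(len(zlib.compress(structured)))
print(len(zlib.compress(random_str)))
```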
Summary
Neural networks are Turing equivalent, or more precisely bounded Turing equivalent, just like the computers we use in day-to-day life: they can perform all possible computation up to a specific size. A laptop's computational power, in terms of what it can possibly compute, is bounded by its memory (including its SSD); in the case of a neural network, it is bounded by the number of learnable parameters. This bound is a major issue in modeling all sorts of learning tasks. We are unsure how large a model an AGI needs, but it should be very large: with current methods, the scaling laws suggest an AGI would need LLMs trillions of times larger than the ones we use today, which is clearly infeasible. That is why researchers continually look for new model designs that learn more efficiently, in terms of both computational power and the data required.