“We don’t know what it is doing or how it arrives at the decision it does.”
This is a really common statement to hear from researchers, companies, and technology writers describing deep learning models. It seems paradoxical: how can we create something whose function we can't describe?
This characterization of neural networks can be a little theatrical. The reality is that neural networks mostly lack determinism, not understanding. In mathematics, computer science, and physics, a deterministic system is one in which no randomness is involved in the development of future states: a deterministic model will always produce the same output from a given starting condition or initial state. One simple example is converting from centimeters to inches. One centimeter is 0.393701 inches; the conversion is consistent and predictable. A non-deterministic system would be you deciding whether to eat tacos or burgers for dinner. That decision, for the most part, feels random and like something we can't model. Or can we?
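As a trivial illustration of what deterministic means in code (the function name and the rounding of the constant are just for this example):

```python
def cm_to_inches(cm: float) -> float:
    # Deterministic: the same input always produces the same output.
    return cm * 0.393701

print(cm_to_inches(10))  # always 3.93701, every time, on every run
```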
Neural networks lack determinism in this sense because the way they make decisions feels random; like us, they don't always arrive at a decision with a formula as simple as converting between units of measure. However, if one really wanted to, they could unpack the entire network, visualize it, and know exactly how it arrived at its decision. The problem is that even for a simple multi-node neural network, that could be a lifetime's work, and the results of that work would also be meaningless. So, let's find out why, and then let's explore how this simple base structure gave rise to the more advanced networks we observe in generative AI today.
To understand how a neural network works, we're going to work through a couple of concepts: how a network's structure of nodes, weights, and biases turns an input into an output, and how back propagation allows it to learn.
But before we do that, let's first understand the biological process that neural networks were designed to replicate: the human nerve cell, or neuron.
Here's a quick refresher on basic biology. Neurons are specialized cells in the human body, interconnected at both ends; they have a head, a body, and a tail. Neurons are arranged head to tail in chains, with multiple neurons connecting to each other, meaning they aren't just connected in a single line. The head of a neuron receives an electrical stimulus from the tails of any neurons that precede it. The preceding neurons only pass an electrical signal if they have been "activated"; once activated, they propagate the signal down their tails to the next neuron connected at the head. This transmission of electrical signals around our body is what enables us to think, run, see, and basically live. Your body is essentially a live wire with electricity constantly flowing around it. But if that's so, why is it that our muscles aren't constantly contracting and our brains aren't paralyzed with stimulus? It's because neurons need to receive a certain level of electrical impulse to "activate."
In the body, sodium and potassium ions are moved in and out of the cell to create what's called an electrochemical gradient. This imbalance of molecules creates an electrical charge that can be used to activate a neuron. The activation is triggered by what's called an "action potential," which initiates an electrical impulse in the cell. Whenever a neuron receives electrical stimulation strong enough to push the membrane's voltage from its resting potential (about -70mV) past its activation threshold, the neuron becomes activated and the voltage surges upward (to about +30mV), creating a wave that propagates down the body and tail and on to the dendrites of the next connected neuron. What can be observed from the graph of an action potential is that this signal is all or nothing: it's on or off, a 1 or a 0. If the stimulus coming from the previous nerve cells isn't strong enough, then nothing happens. For those who understand how transistors work, this concept should be very familiar.
Another thing to note is that this electrical wave weakens the further it travels down the chain, just like electricity in a wire. Even though the signal has a consistent peak voltage, the stimulus provided by a single neuron isn't always enough on its own to activate the next neuron. This is why it can take a few neurons connected to the head of the next neuron in the chain to produce a stimulus strong enough to trigger that neuron's action potential.
We can describe this transmission process mathematically. First of all, we need to think of neurons as being arranged in layers. Let's say there were 4 neurons connected in a linear sequence; each individual neuron represents a layer in the network with just one node in it, so this would be a 4-layer network. Since multiple neurons can connect to many other neurons, a single layer can have multiple nodes in it. Picture it like apartment buildings standing next to each other with varying numbers of floors: the floors represent the nodes, and each apartment building is a single layer. The floors are connected to each other between buildings with lines that can transmit a signal, but only in the forward direction. It's kind of like a game of telephone where calls move from one apartment building to the next via an apartment on each floor. This is how the signal moves from input to output in a neural network.
Neurons that precede the current cell we're trying to activate transmit a value that we call a weight (this is a number) to the next node (w1 … wn), and again, only if they've previously been activated. Because more than one neuron can be connected in the chain, we use what's called a vector to represent all the weights being passed from the previous layer to the receiving node. A vector is just a set of numbers; in this case it holds the weights being passed from previously activated nodes to the next node. If a previous node is active, its weight goes into its position in the vector; if not, a 0 is put there to signal it's inactive, because a node that isn't activated doesn't propagate its weight forward. Once a node receives this vector, an activation function performs a mathematical operation on it (typically summing the incoming values) to produce a single numeric value. This value is then compared to the activation threshold of the node, which is called its bias. If the numeric output of the activation function exceeds the node's bias, the node becomes active and will propagate its own weight to the next node in the next layer. In a nerve cell, this is the process of receiving electrical stimulus from a bunch of other nerve cells (weights being passed): the cell sums up the strength of the entire signal received, and if it's strong enough (exceeds its bias value), it becomes active and propagates its electrical signal forward to the next nerve cell.
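To make that concrete, here is a minimal sketch of a single node following the simplified description above. The function name, the numbers, and the threshold-style treatment of the bias are illustrative assumptions for this article, not any particular library's API.

```python
import numpy as np

def node_fires(incoming, weights, bias):
    """Return (fired, stimulus) for one node.

    incoming: 1 if the corresponding upstream node fired, else 0
    weights:  the value each upstream connection passes forward
    bias:     the activation threshold this node must exceed
    """
    incoming = np.asarray(incoming, dtype=float)
    weights = np.asarray(weights, dtype=float)
    stimulus = float(np.dot(incoming, weights))  # sum only the signals that actually arrived
    return stimulus > bias, stimulus

fired, stimulus = node_fires(incoming=[1, 0, 1], weights=[0.4, 0.9, 0.3], bias=0.5)
print(fired, round(stimulus, 2))  # True 0.7 -> this node activates and passes its signal on
```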
So how does this simple structure of lighting up nodes in layers combine to do something useful? Let's look at an example to understand it better.
Let's say I have a neural network designed to identify a dog. My simple network has a few layers, where each layer is responsible for detecting some specific features of a dog en route to determining that the input picture I gave it is a dog. Maybe the first layer lights up nodes for a section of colour that is somewhat consistent, signalling that a shape is here (at this stage it could be a chair, a car, an animal, etc.). The next layer lights up nodes for what represents the edge outline of something roughly the shape of an animal, the following layer sees four legs and a tail, and the next layer gets even more specific, lighting up nodes for a snout and floppy ears. As the signal moves through the network, fewer and fewer nodes are activated in each layer as the elements in the image become more and more specific. Eventually, as it gets to the end, only a single output node lights up for… "dog." This whole process is represented with numbers in a neural network: the values of the weights and the biases of the nodes.
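Here is a toy sketch of a signal moving forward through a few layers in that spirit. The layer sizes and the random weights are entirely made up; real image classifiers use smooth activation functions and learned weights, so treat this only as an illustration of the "light up or stay dark" picture described above.

```python
import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [8, 6, 4, 1]   # e.g. colour patches -> edges -> body parts -> "dog"
weights = [rng.uniform(0, 1, (m, n)) for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [rng.uniform(0.5, 2.0, n) for n in layer_sizes[1:]]

signal = np.ones(layer_sizes[0])            # pretend every input node fired
for W, b in zip(weights, biases):
    stimulus = signal @ W                   # each node sums what arrived from the previous layer
    signal = (stimulus > b).astype(float)   # a node fires only if the stimulus exceeds its bias
    print(int(signal.sum()), "nodes active in this layer")
```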
So hopefully now you can visualize the structure of a neural network and what it is doing, but how does it learn? This is where back propagation comes into play.
Back propagation is a technique described in a 1986 paper by David Rumelhart, Geoffrey Hinton, and Ronald Williams. This is one of the reasons Hinton is often credited as being the "godfather of AI."
This technique is the part of training that alters the values of the weights and biases in a neural network. When training a neural network using supervised learning, we can compare the output of the network with the expected output from our training set (i.e. we show it a picture of a dog that is labeled "dog," so we know what output to expect: a dog).
If the output is what we desired (i.e. the image of a dog was predicted to be a dog), we back propagate along the path taken to produce that output and "strengthen" the weights and biases assigned to it. This might mean increasing the weights transmitted from some nodes to others, or lowering the bias of nodes so that they activate more easily in that specific chain. This is what's happening when we train a neural network: back propagation adjusts the values of the weights and biases of the nodes, while we separately choose the architecture, i.e. the number of layers and the number of nodes in each layer.
When the output isn't correct, we devalue the signals on the path taken to produce that output. Over time, we strengthen the pathways for desired outputs as a result. This is very similar to the experience of learning as a human: the more you do something, the better you seem to get at it, because your body is strengthening the neural connections on the pathway it took to produce that result.
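In practice, "strengthening" and "devaluing" happens with numbers: back propagation computes how much each weight and bias contributed to the error, and each one is nudged in the direction that reduces that error. Below is a minimal sketch of that idea with a single weight and bias, a squared-error loss, and made-up values; real networks do the same thing across millions or billions of parameters at once.

```python
w, b = 0.2, 0.0          # one learnable weight and one learnable bias
x, target = 1.0, 1.0     # an input and its labelled ("correct") output
learning_rate = 0.1

for step in range(20):
    prediction = w * x + b
    error = prediction - target
    # Gradients of the squared error with respect to w and b
    grad_w = 2 * error * x
    grad_b = 2 * error
    w -= learning_rate * grad_w   # strengthen or weaken the weight
    b -= learning_rate * grad_b   # raise or lower the bias

print(round(w * x + b, 3))  # the prediction has moved toward the target of 1.0
```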
This is why you often hear: “We don’t know what it’s doing”—this is because we don’t bother to observe the change in these values as we train. However, that doesn’t mean you can’t go node by node, layer by layer, and observe the values and the activation pathway.
For a simple network with two input nodes, two hidden layers, and two output nodes, this is very easy to do. The problem is when you have billions of parameters (like GPT-3, which has 175 billion parameters spread across 96 layers) after millions of training rounds; it becomes futile to make these observations just to feel some sense of understanding… but if we wanted to, we could. We simply trust that the foundations we set up will lead us to the outcomes we want.
The concept of neural networks and many of the techniques and concepts outlined below existed even before the transistor, but they didn't find much use until compute power caught up in the 2000s.
From there, a flurry of tools such as TensorFlow and PyTorch helped make neural network algorithms more accessible by removing significant barriers to entry (i.e. hand-coding a neural network and the various cost functions used for training).
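As an illustration of how much those frameworks abstract away, a network shaped like the toy examples above can be declared in a few lines of PyTorch (the layer sizes here are arbitrary); the loss function and the back propagation machinery come for free.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(8, 6),   # 8 input nodes -> 6 nodes in the first hidden layer
    nn.ReLU(),
    nn.Linear(6, 4),
    nn.ReLU(),
    nn.Linear(4, 1),   # a single output node
)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
```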
Along the way, many adaptations to the base network infrastructure were introduced that helped both improve and specialize some of its functions. Let’s outline some of these because they’re important fundamentals in understanding the current AI wave we’re experiencing today.
Recurrent neural networks (RNNs) first started to see commercial applications in the mid-2000s in natural language processing (NLP). RNNs allow connections between the output of a network and its input, creating a feedback loop. This lets the network itself, not just its nodes, be applied again and again in sequence, carrying information from one step to the next; this is in contrast to a purely feed-forward structure, where signals only ever move from input to output.
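A minimal sketch of that recurrent idea is below: the same small network is applied to each item in a sequence, and its output (the "hidden state") is fed back in alongside the next item. The sizes and random weights are arbitrary illustrations, not a trained model.

```python
import numpy as np

rng = np.random.default_rng(1)
W_in = rng.normal(size=(3, 4))    # input -> hidden
W_rec = rng.normal(size=(4, 4))   # hidden -> hidden (the feedback loop)
hidden = np.zeros(4)

sequence = [rng.normal(size=3) for _ in range(5)]   # e.g. five words as small vectors
for x in sequence:
    hidden = np.tanh(x @ W_in + hidden @ W_rec)     # the new state depends on the old one

print(hidden)  # a summary of everything the network has "seen" so far
```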
Long short-term memory (LSTM) networks, which can be considered a subset of RNNs and were first invented in 1997, brought significant improvements to speech recognition. These networks introduced the ability for neural networks to remember information across a sequence and were foundational to most NLP work of the 2000s and early 2010s. Some of the applications of RNNs were grammar correction tools, tools for improved writing, and simple language recognition.
Though conceived in the 1980s, convolutional neural networks (CNNs) get their name because they use a process called a convolution to analyze images. Think of a convolution as a filter that slides across an image, gathering important information about the pixels as it goes. This technique was a significant leap forward in computer vision: not only did it enhance edge detection, it also made it possible to identify objects with higher precision.
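Here is a small sketch of that sliding-filter idea: at each position, the filter produces one number summarizing the pixels underneath it. The image is made up, and the filter is a classic vertical-edge detector; real CNNs learn their filters during training.

```python
import numpy as np

def convolve2d(image, kernel):
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Slide the filter over the image and summarize each patch as one number
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.zeros((6, 6))
image[:, 3:] = 1.0                        # left half dark, right half bright
edge_filter = np.array([[1, 0, -1],
                        [1, 0, -1],
                        [1, 0, -1]])
print(convolve2d(image, edge_filter))     # strong responses only where the vertical edge is
```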
One of the most influential CNNs was AlexNet, designed by Alex Krizhevsky along with Ilya Sutskever (CSO at OpenAI) and Hinton. Its significant success at image classification resulted in one of the most widely cited papers in AI. It was also one of the first times GPUs were used to train an image recognition model. CNNs would go on to advance the fields of self-driving cars and autonomous robots with their superior ability to recognize objects in real time.
Generative adversarial networks (GANs), introduced in 2014 by Ian Goodfellow, gave neural networks the ability to generate media instead of just understanding it. Unlike previous neural network innovations, GANs represented a network of networks. The two primary components were a generator, which produces candidate images from random noise, and a discriminator, which tries to tell the generated images apart from real ones; the two are trained against each other until the generator's output can fool the discriminator.
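A hedged, toy-scale sketch of that adversarial loop is below. Instead of images, the generator learns to mimic samples from a simple 1-D distribution; the architectures and hyperparameters are arbitrary choices made for brevity, not anything from the GAN paper.

```python
import torch
import torch.nn as nn

generator = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))
discriminator = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2000):
    real = torch.randn(64, 1) * 0.5 + 3.0       # "real" data: mean 3, std 0.5
    fake = generator(torch.randn(64, 1))        # generated data, starting from random noise

    # Train the discriminator to label real samples 1 and generated samples 0
    d_loss = bce(discriminator(real), torch.ones(64, 1)) + \
             bce(discriminator(fake.detach()), torch.zeros(64, 1))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Train the generator to fool the discriminator into outputting 1
    g_loss = bce(discriminator(fake), torch.ones(64, 1))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()

# The generated samples should drift toward the real mean of ~3.0
# (GAN training is notoriously unstable, so exact results vary).
print(generator(torch.randn(1000, 1)).mean().item())
```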
The progressive GAN (ProGAN) paper published by NVIDIA researchers in 2017 demonstrated a significant advance of the technique by generating photorealistic images of faces.
Diffusion models are foundational to modern solutions such as Midjourney, DALL-E, and Stable Diffusion. A major research milestone was published in 2015 (Sohl-Dickstein et al.), with additional advances in the technique published in 2020. These models generate images from random noise, removing the noise in incremental steps while slowly converging on the desired image; the name comes from the "diffusion" process used during training, in which noise is gradually added to real images so the network can learn to reverse it.
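The sketch below illustrates only the forward (noising) half of that process on a 1-D signal standing in for an image; no model is trained here, and the noise schedule is an arbitrary illustration. Generation would run in the other direction: start from pure noise and let a trained network estimate and remove a little noise at each step.

```python
import numpy as np

rng = np.random.default_rng(0)
steps = 10
betas = np.linspace(1e-3, 0.3, steps)        # how much noise to add at each step

x = np.sin(np.linspace(0, 2 * np.pi, 32))    # a clean "image" (here just a 1-D signal)
for t, beta in enumerate(betas):             # forward diffusion: add noise step by step
    x = np.sqrt(1 - beta) * x + np.sqrt(beta) * rng.normal(size=x.shape)
    print(f"step {t}: the clean signal is progressively drowned in noise")
```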
Transformers are at the center of the latest AI innovations we're experiencing today. First introduced by Google researchers in a paper titled "Attention Is All You Need," transformers also represented a network of networks. Among the many features they introduced, the one they're primarily known for is how they enable neural networks to understand the context of an entire sentence at once, like a human would, rather than strictly sequentially (one word at a time) like RNNs did.
What this means is that transformers could relate parts of the input data to other parts not linearly connected. As a human, you don’t understand a sentence one word at a time. You understand the sentence by understanding the relationship between words in all parts of a sentence to form meaning—this is “paying attention” to the words and their relationships.
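Here is a minimal sketch of the attention operation that makes this possible. Each word is represented by a small vector and decides how much to "pay attention" to every other word; the vectors and the five-word "sentence" are made-up stand-ins for real embeddings.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # how relevant is each word to each other word
    weights = softmax(scores)                # one row of attention weights per word
    return weights @ V, weights              # each word becomes a weighted blend of the others

rng = np.random.default_rng(0)
n_words, dim = 5, 8                          # e.g. a five-word sentence
Q = rng.normal(size=(n_words, dim))
K = rng.normal(size=(n_words, dim))
V = rng.normal(size=(n_words, dim))

output, weights = attention(Q, K, V)
print(weights.round(2))  # every word attends to every other word, not just its neighbour
```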
This is why they have driven significant advances in language models, though they are also being applied with great success to computer vision.
BERT (Bidirectional Encoder Representations from Transformers) was first introduced in October 2018 by Google and predominantly improved the encoding side of a transformer (the part responsible for understanding).
GPT (Generative Pre-Trained Transformers) were introduced by OpenAI in June 2018, and they’re primarily responsible for improving the decoding side of a transformer (the part that is responsible for generating). Both are foundational to Large Language Models.
NeRF
Neural radiance fields (NeRF), like the rest of the predecessors, is a network of networks. This technique, first published in 2020 by researchers at UC Berkeley and Google, is able to generate 3D representations of objects and scenes from a set of 2D images. NeRF has helped push forward the emerging field of digital twins for infrastructure by generating computer assets from real-life objects.
Contrastive Language-Image Pretraining (CLIP) was first published by OpenAI researchers in January 2021. It trains an image encoder and a text encoder so that matching image-caption pairs land close together in a shared embedding space, which is what lets a model connect language to pictures. When a text-to-image model is described as multimodal, it is often pairing a CLIP-style text encoder with an image generator such as a diffusion model (for example, Stable Diffusion).
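A toy sketch of that shared-embedding idea follows. The "encoders" here are random stand-ins rather than real models; the point is only that a matching caption should sit close to its image in vector space while an unrelated caption should not.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity: 1.0 means the vectors point the same way, 0.0 means unrelated
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
image_embedding = rng.normal(size=512)                              # pretend image-encoder output
caption_embedding = image_embedding + 0.1 * rng.normal(size=512)    # a matching caption lands nearby
unrelated_embedding = rng.normal(size=512)                          # an unrelated caption does not

print(round(cosine(image_embedding, caption_embedding), 2))    # close to 1.0
print(round(cosine(image_embedding, unrelated_embedding), 2))  # close to 0.0
```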
There will be more networks to come, but what can be observed from this timeline is how much more complicated each new advance becomes. From developing simple single networks with specific adaptations, researchers have moved to producing complex networks of specialized neural networks to execute the many natural human functions we aim to replicate.
Not only are these networks exceedingly complicated to develop, it also takes significantly more researchers to do so. From AlexNet (which was one individual supervised by two others, albeit significant individuals in AI) to CLIP (which was published by an organization and co-authored by 12 people), the advances we seek are increasingly difficult to acquire. Note that we are only discussing networks here, not even fine-tuning or training techniques.
Today is an exciting time: language models are like watching your newborn learn to talk for the first time. The future is coming at a fervent pace, but understanding the basic building blocks of a neural network will help ground how you process each new innovation.