A few months ago, I started to study deep learning, a branch of computer science heavily inspired by biology that allows programs to learn arbitrary concepts without explicit programming.
How good is deep learning? Well, good enough to play Go with super-human performance, a significant milestone in artificial intelligence research, since the techniques used 20 years beforehand to beat the Chess world champion Garry Kasparov were not good enough for Go despite both being turn-based strategy games with complete information. (For reference, there are fewer than 5 × 10^46 legal Chess board positions, while there are more than 2 × 10^170 legal Go board positions.)
To my surprise, I found that the basic concepts of deep learning were relatively simple. Even more surprisingly, I felt that this young science seemed to contain some major flaws that could be improved upon. Considering this, I decided to fully understand them, and try to reinvent this science in my own way to see if my approaches would be better.
For my first post on this topic, I will explain what deep learning is, how it works and why it is successful in developing powerful forms of artificial intelligence for specific tasks that weren't possible until recently.
Oh, and don't expect a popularized version of the concept with some pretty metaphors. We're about to go into the nitty-gritty details here with all the juicy mathematics that comes with it. Expect some linear algebra, calculus and a little bit of statistics.
The general problem
Let's say we have a function that takes as input a vector of real values of a predefined length, and returns as output a vector of real values also with a predefined length.
However, the definition of this function is unknown. The only information that we have about it is observed data in the form of a list of examples of inputs and their corresponding outputs, with the possible additional restriction that we may not be able to remember the whole list due to limited memory.
The objective is to come up with a new function that will attempt to simulate the unknown function as best as possible, and to possibly improve its definition as new examples become known.
The main issues with tackling this problem are as follows:
- How should the simulated function be modeled to approximate an arbitrary function?
- How should the simulated function be modified to optimize its results with full information of past examples?
- How should the simulated function be modified to optimize its results with incomplete information of past examples?
Simulated function model
Instead of using a complicated function model as a base for our simulated function, deep learning uses a simple function model, then combines many of them to achieve a highly complex result.
Let's start by having our simulated function as a series of simpler, embedded vector functions of real numbers.
Note that the number of embedded functions, and the number of real variables in each vector connecting embedded functions, are completely arbitrary. Depending on the type of problem, we can simply add or remove them to change the efficiency and/or accuracy of the simulated function. For example, we may want to simulate an identity function and use an intermediate vector of a smaller length than the input/output to force data compression and find correlations in the input.
The issue now is to define a good function model for our embedded functions to be configurable and powerful enough to scale in this way.
In terms of configuration, an easy solution would be to multiply the vector elements with a matrix of configurable real numbers, called weights, and have the dimensions of this matrix match the desired input and output lengths.
To ensure that the product is not necessarily a zero vector when the input is a zero vector, we could also append to the input vector an extra element that is always equal to 1, called the bias, and expand the weights matrix accordingly.
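As a minimal sketch of one such embedded function before any activation is applied, the snippet below multiplies a weights matrix by the input vector augmented with a bias of 1. The name `affine` and the choice to store the bias weight in the last column are assumptions made for illustration only.

```python
def affine(weights, inputs):
    """Multiply a weights matrix by the input vector augmented with a bias of 1."""
    # Assumption: the bias element is appended at the end of the input,
    # so the last column of the matrix holds the bias weights.
    augmented = list(inputs) + [1.0]
    return [sum(w * x for w, x in zip(row, augmented)) for row in weights]


# 3 inputs and 2 outputs: each row has 3 weights followed by 1 bias weight.
weights = [
    [0.5, -1.0, 2.0, 0.1],
    [1.5, 0.0, -0.5, -0.2],
]

print(affine(weights, [1.0, 2.0, 3.0]))  # roughly [4.6, -0.2]
print(affine(weights, [0.0, 0.0, 0.0]))  # only the bias column remains: [0.1, -0.2]
```

The second call shows the point of the bias: a zero input no longer forces a zero output.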
However, those are not sufficient. The problem is that we cannot do better than a linear combination, as the whole stack of embedded functions could be reduced into a single one. This is not powerful enough to simulate arbitrary functions for our needs. Fortunately, a way to fix this is to apply an additional activation function immediately before returning the output to introduce non-linearity.
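To illustrate why stacking purely linear functions gains nothing, here is a small sketch in which two weight matrices applied one after the other give the same result as a single pre-multiplied matrix. The helper names `matmul` and `matvec` are hypothetical, and the bias is left out for brevity (the same reduction holds when it is included).

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def matmul(a, b):
    """Multiply matrix a by matrix b."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


first = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]  # 2 inputs -> 3 intermediate values
second = [[1.0, 0.0, 2.0], [-1.0, 1.0, 0.0]]   # 3 intermediate values -> 2 outputs
combined = matmul(second, first)               # a single 2x2 matrix

x = [4.0, -2.0]
stacked = matvec(second, matvec(first, x))     # two functions, one after the other
collapsed = matvec(combined, x)                # one equivalent function

print(stacked, collapsed)  # [20.0, 2.0] [20.0, 2.0]
```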
This is where the theory gets fuzzy. Just the Wikipedia page about activation functions lists more than 20 possibilities previously considered in scientific research.
Interestingly, as of this writing, one of the most popular activation functions appears to be the rectified linear unit, which returns the input if it is positive and 0 otherwise. Despite forcing values to be non-negative and having weight improvements disrupted when reaching the constant range, this simple activation function breaks linearity, is quick to compute and achieves very good experimental results.
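For reference, a rectified linear unit applied element by element can be sketched as follows. The function names are my own, and treating the slope at exactly 0 as 0 is a common convention rather than a requirement.

```python
def relu(values):
    """Return each value if it is positive, and 0 otherwise."""
    return [max(0.0, v) for v in values]


def relu_gradient(values):
    """Slope of the rectified linear unit at each value (taken as 0 at exactly 0)."""
    return [1.0 if v > 0 else 0.0 for v in values]


print(relu([-2.0, 0.0, 3.5]))           # [0.0, 0.0, 3.5]
print(relu_gradient([-2.0, 0.0, 3.5]))  # [0.0, 0.0, 1.0]
```

The zero slope over the constant range is what disrupts weight improvements there: no correction signal passes through an element whose input was negative.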
In order to optimize our simulated function, we need a way to measure errors against the results of the original function so that we can attempt to minimize them. We will need to define an error function in order to do so, using the outputs of the original mystery function and corresponding simulated function as inputs of this error function.
Defining said error function is surprisingly not trivial. For now I would like to at least mention that such functions are generally based on the distance between the real and simulated output vectors.
A common error function is the mean squared error, which is traditionally defined as the sum of the squares of those distances divided by 2 (strictly speaking, a mean would also divide by the number of examples considered). The division by 2 is optional but cancels the factor of 2 that appears when differentiating a square, which simplifies calculations later on.
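As a rough sketch of this error function for a single example, assuming the distance is taken element by element between the two output vectors, the code below computes the halved sum of squares together with its derivative with respect to each simulated output. The names are hypothetical.

```python
def half_squared_error(expected, simulated):
    """Sum of squared differences between the two output vectors, divided by 2."""
    return sum((e - s) ** 2 for e, s in zip(expected, simulated)) / 2


def error_gradient(expected, simulated):
    """Derivative of the error with respect to each simulated output.

    The division by 2 cancels the 2 produced by differentiating the square,
    leaving the plain difference.
    """
    return [s - e for e, s in zip(expected, simulated)]


print(half_squared_error([1.0, 0.0], [0.5, 0.5]))  # 0.25
print(error_gradient([1.0, 0.0], [0.5, 0.5]))      # [-0.5, 0.5]
```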
In any case, after settling on an error function, we are then able to start thinking about how to optimize it. Obviously we would like it to be equal to 0, but this is not possible in the general case, so what we are going to attempt instead is to be as close to 0 as possible. Assuming no structural changes will be made in our final function model, we would be able to attempt this by changing its weights.
Fortunately, calculus gives us a way to simplify the work. The local minima of the error function can only be found where its gradient in relation to the configurable variables is undefined or equal to 0. Similarly, gradients point in the direction of the steepest ascent, so by following the function's curve in the opposite direction, one can eventually find a local minimum.
This allows for a technique called gradient descent, where the value of the error function is minimized by updating the weights in the direction opposite to the gradient of the error with respect to those weights, scaled by a predefined learning factor, which may be a constant or a function of the number of observations. A higher learning factor means fewer examples are required, but the resulting simulated function has a higher risk of being more unstable and less accurate. As for the initialization of weights prior to the first gradient descent, it is traditionally random although close to 0.
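To make the update rule concrete, here is a minimal sketch on a toy one-variable error function, (w - 3)^2, chosen purely for illustration. The function names and the learning factors 0.1 and 1.1 are assumptions, picked to show a stable run and an unstable one.

```python
def gradient_descent(gradient, start, rate, steps):
    """Repeatedly move a value against its gradient, scaled by the learning factor."""
    value = start
    for _ in range(steps):
        value -= rate * gradient(value)
    return value


def toy_error_gradient(w):
    """Derivative of the toy error function (w - 3)^2, whose minimum is at w = 3."""
    return 2.0 * (w - 3.0)


settled = gradient_descent(toy_error_gradient, start=0.0, rate=0.1, steps=100)
unstable = gradient_descent(toy_error_gradient, start=0.0, rate=1.1, steps=100)

print(settled)   # very close to 3
print(unstable)  # far away from 3: each step overshoots more than the last
```

With a factor of 0.1 the distance to the minimum shrinks by a fifth at every step, while with 1.1 it grows by a fifth, which is the instability mentioned above taken to its extreme.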
One point to consider is that gradient descent does not require remembering all previously-observed data. Indeed, we can estimate the error function by considering only a subset of pairs of real and simulated outputs, which is good enough to apply this technique. I won't go into the details, but one such common approximation is called stochastic gradient descent, which picks random data points instead of using the entire set. This solves the limited memory problem.
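As an illustrative sketch of stochastic gradient descent, the code below fits a line y = w * x + b by looking at a single randomly picked example per update. The target line, the learning factor, the number of steps and all names are assumptions made for the example.

```python
import random


def sgd_fit(examples, rate=0.1, steps=2000, seed=0):
    """Fit y = w * x + b using one randomly chosen example per update."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(steps):
        x, y = rng.choice(examples)  # only this example is needed for the update
        error = (w * x + b) - y      # derivative of the halved squared error
        w -= rate * error * x
        b -= rate * error
    return w, b


# Hypothetical observations of the unknown function y = 2x + 1.
data_rng = random.Random(1)
examples = [(x, 2.0 * x + 1.0)
            for x in (data_rng.uniform(-1.0, 1.0) for _ in range(200))]

w, b = sgd_fit(examples)
print(w, b)  # both should end up close to 2 and 1
```

The list of examples is kept in memory here only for convenience; since each update reads one pair, the pairs could just as well be consumed as they arrive and then discarded.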
Another point to consider is that when using gradient descent with the simulated function model described above, we do not need to optimize its error function directly. Instead, we can optimize the error function for each embedded function separately, which simplifies calculations. The idea is to first perform forward propagation, i.e. calculate the simulated output normally for a given example input, then perform backward propagation, which is done in the following manner:
1. Perform gradient descent for the topmost embedded function
2. Calculate the input that should have been passed to the embedded function to achieve the expected output
3. Perform gradient descent for the embedded function immediately underneath by using the result of the previous step as the expected output for it
4. If the last tweaked embedded function is not the lowermost one, go to step 2
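The steps above can be sketched for a chain of two embedded functions. Note that this sketch uses the conventional chain-rule formulation, which passes the gradient of the error with respect to each function's input down the chain, rather than literally computing the input that should have been passed. The network size, learning factor, toy target (the sum of the two inputs) and every name are assumptions made for illustration.

```python
import random


def affine(weights, inputs):
    """Weights matrix times the input augmented with a bias of 1."""
    return [sum(w * x for w, x in zip(row, inputs + [1.0])) for row in weights]


def forward(w_hidden, w_output, inputs):
    """Forward propagation: rectified hidden function, then a linear output one."""
    pre = affine(w_hidden, inputs)
    hidden = [max(0.0, p) for p in pre]
    return pre, hidden, affine(w_output, hidden)


def gradients(w_hidden, w_output, inputs, expected):
    """Backward propagation: gradient of the halved squared error per weight."""
    pre, hidden, output = forward(w_hidden, w_output, inputs)
    # Topmost function: error derivative with respect to its outputs.
    d_output = [o - e for o, e in zip(output, expected)]
    g_output = [[d * x for x in hidden + [1.0]] for d in d_output]
    # Pass the signal down to the function underneath, through the activation.
    d_hidden = [sum(d * row[j] for d, row in zip(d_output, w_output))
                for j in range(len(hidden))]
    d_pre = [d if p > 0 else 0.0 for d, p in zip(d_hidden, pre)]
    g_hidden = [[d * x for x in inputs + [1.0]] for d in d_pre]
    return g_hidden, g_output


def step(weights, grads, rate):
    """Gradient descent update, applied in place."""
    for row, grad_row in zip(weights, grads):
        for j in range(len(row)):
            row[j] -= rate * grad_row[j]


def total_error(w_hidden, w_output, examples):
    return sum(
        sum((o - e) ** 2 for o, e in zip(forward(w_hidden, w_output, x)[2], t)) / 2
        for x, t in examples)


rng = random.Random(0)  # small random initial weights
w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(3)]
w_output = [[rng.uniform(-0.5, 0.5) for _ in range(4)]]

examples = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
            ([1.0, 0.0], [1.0]), ([1.0, 1.0], [2.0])]
error_before = total_error(w_hidden, w_output, examples)
for _ in range(500):
    for x, t in examples:
        g_hidden, g_output = gradients(w_hidden, w_output, x, t)
        step(w_output, g_output, 0.05)
        step(w_hidden, g_hidden, 0.05)
print(error_before, total_error(w_hidden, w_output, examples))
```

Both gradients are computed from the same forward pass before either set of weights is changed, so the lower function is corrected against the output function it was actually paired with.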
Note that this technique may not work properly while weights are equal as they would remain equal afterwards. One simple workaround for this is to initialize weights with random values to break symmetry.
In addition, there is the possibility that the gradient may suddenly become so large in a certain direction that backward propagation would cause the weights to diverge instead of converging. One way to mitigate this issue is to make sure that the size of the gradient is capped before updating the weights, although that can also decrease the gradient's efficiency if the cap is set too low.
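One possible way to cap the size of a gradient, assuming "size" means its Euclidean length, is to rescale it whenever that length exceeds a chosen maximum. This is only a sketch; the name `clip_by_norm` and the cap of 1.0 are hypothetical.

```python
import math


def clip_by_norm(gradient, max_norm):
    """Rescale the gradient so that its length never exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= max_norm:
        return list(gradient)  # small enough: left untouched
    scale = max_norm / norm
    return [g * scale for g in gradient]  # same direction, shorter length


print(clip_by_norm([3.0, 4.0], 1.0))  # roughly [0.6, 0.8]
print(clip_by_norm([0.3, 0.4], 1.0))  # unchanged: [0.3, 0.4]
```

Rescaling keeps the direction of the update and only limits how far a single step can move the weights.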
Unfortunately, despite all of this, there are still non-negligible drawbacks of using gradient descent and backward propagation in this manner:
- It requires a very large amount of computational power.
- It is possible that the optimization will tend to a sub-optimal local minimum instead of the global minimum.
- The number of iterations required for improving an embedded function may be different depending on its position in the chain and the activation function used.
- A ridiculously high number of examples is required to achieve good results. (I've heard claims by some notable researchers in the field of approximately 5,000 examples per input category for acceptable results and 10,000,000 examples per input category for super-human results.)
Where's the biology in all that?
I find that getting inspiration from biology for deep learning was a good idea, but that one should not be stuck following its model as it prevents creative improvements from emerging, hence why I'm only talking about it at the very end.
From my understanding (I'm not a biology expert, so don't quote me on that), biological neural networks behave similarly to what I described above. Neurons are stacked in layers, with all neurons of a specific layer connected to neurons of adjacent layers. During forward propagation, these neurons shoot electric pulses at a specific frequency within a range from nothing to some physical maximum, based on the electric pulses received as input. During backward propagation, the strength of these connections is corrected to better match the real output.
This background may help you understand some of the terminology that you may find when researching deep learning. Here is a summary of such common terms vs what I used in this post:
- Neural network: Simulated function
- Input layer: Original input augmented with bias
- Hidden layer: Embedded function
- Output layer: Simulated output
- Neuron: Function that returns a single element of the output vector of an embedded function
I implemented the above model in Python with PyTorch, which you can read about in more detail in my follow-up post Deep Learning in Python With PyTorch - Tutorial and Demo!
The main introductory resources I've consulted before writing this post are:
- Stanford University - Deep Learning Tutorial
- Nicholas McCrea - An Introduction to Machine Learning Theory and Its Applications: A Visual Tutorial with Examples
- Ivan Vasilev - A Deep Learning Tutorial: From Perceptrons to Deep Networks
- Xiu-Shen Wei - Must Know Tips/Tricks in Deep Neural Networks
For those interested in a full in-depth course on deep learning, I highly recommend: