A few months ago, I started studying deep learning, a branch of computer science, heavily inspired by biology, that allows programs to learn arbitrary concepts without being explicitly programmed.

How good is deep learning? Well, good enough to play Go with super-human performance, a significant milestone in artificial intelligence research, since the techniques used 20 years earlier to beat the Chess world champion Garry Kasparov were not good enough for Go, despite both being turn-based strategy games with complete information. (For reference, there are fewer than 5 × 10^46 legal Chess board positions, while there are more than 2 × 10^170 legal Go board positions.)

To my surprise, I found that the basic concepts of deep learning were relatively simple. Even more surprisingly, I felt that this young science contained some major flaws that could be improved upon. Considering this, I decided to fully understand these concepts and try to reinvent the field in my own way, to see if my approaches would be better.

For my first post on this topic, I will explain what deep learning is, how it works and why it is successful in developing powerful forms of artificial intelligence for specific tasks that weren't possible until recently.

Oh, and don't expect a popularized account with some pretty metaphors. We're about to get into the nitty-gritty details here, with all the juicy mathematics that comes with it. Expect some linear algebra, calculus and a little bit of statistics.

The general problem

Let's say we have a function that takes as input a vector of real values of a predefined length, and returns as output a vector of real values also with a predefined length.

\vec{y} = f(\vec{x})

However, the definition of this function is unknown. The only information that we have about it is observed data in the form of a list of examples of inputs and their corresponding outputs, with the possible additional restriction that we may not be able to remember the whole list due to limited memory.

The objective is to come up with a new function that will attempt to simulate the unknown function as best as possible, and to possibly improve its definition as new examples become known.

s(\vec{x}) \approx f(\vec{x})
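To make the setup concrete before diving in, here is a minimal sketch in Python with numpy (which I'll use for illustrations throughout). The mystery function f below is only a hypothetical stand-in: in reality we never see its definition, only the example pairs it produced.

```python
import numpy as np

# A hypothetical "mystery" function. In reality its definition is unknown;
# we only ever observe input/output pairs that it produced.
def f(x):
    return np.array([x[0] + x[1], x[0] * x[1]])

# Observed data: a list of (input, output) examples.
rng = np.random.default_rng(0)
examples = [(x, f(x)) for x in rng.uniform(-1.0, 1.0, size=(100, 2))]

# The goal is to build a function s such that s(x) ≈ f(x), even on unseen inputs.
```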

The main issues with tackling this problem are as follows:

  • How should the simulated function be modeled to approximate an arbitrary function?
  • How should the simulated function be modified to optimize its results with full information of past examples?
  • How should the simulated function be modified to optimize its results with incomplete information of past examples?

Simulated function model

Instead of using a complicated function model as the basis for our simulated function, deep learning uses a simple function model, but combines many of them to achieve a highly complex result.

Let's start by defining our simulated function as a series of simpler, embedded vector functions of real numbers.

s(\vec{x}) = d_l(d_{l-1}(\dots(d_2(d_1(\vec{x})))))

Note that the number of embedded functions, and the number of real variables in each vector connecting embedded functions, are completely arbitrary. Depending on the type of problem, we can simply add or remove them to change the efficiency and/or accuracy of the simulated function. For example, we may want to simulate an identity function and use an intermediate vector of a smaller length than the input/output to force data compression and find correlations in the input.
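As a rough illustration of this composition, and assuming each embedded function d_i is some Python callable to be defined in the following sections, the simulated function simply applies them in order:

```python
from functools import reduce

def s(x, embedded_functions):
    # Apply d_1 first, then d_2 to its result, and so on up to d_l.
    return reduce(lambda value, d: d(value), embedded_functions, x)
```

The lengths of the intermediate vectors are entirely up to us, which is what allows tricks like the compression example above.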

The issue now is to define a good function model for our embedded functions to be configurable and powerful enough to scale in this way.

d_i(\vec{x}) = ?

In terms of configuration, an easy solution would be to multiply the input vector by a matrix of configurable real numbers, called weights, with the dimensions of this matrix matching the desired input and output lengths.

W_i = \begin{bmatrix} w_{i,1,1} & w_{i,1,2} & \cdots \\ w_{i,2,1} & w_{i,2,2} & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}
d_i(\vec{x}) = W_i \, \vec{x}^{\top}
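As a quick sketch of what this multiplication looks like in code, here is a hypothetical 2-by-3 weight matrix mapping an input of length 3 to an output of length 2 (the numbers are arbitrary):

```python
import numpy as np

# 2x3 weight matrix: maps an input vector of length 3 to an output of length 2.
W_i = np.array([[0.1, -0.3, 0.5],
                [0.7,  0.2, -0.4]])

x = np.array([1.0, 2.0, 3.0])
output = W_i @ x   # matrix-vector product, length 2
```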

To ensure that the product does not necessarily equal the zero vector when the input is a zero vector, we can also append an extra element to the input vector, always equal to 1, called the bias, and expand the weight matrix accordingly.

d_i(\vec{x}) = W_i \begin{bmatrix} \vec{x}^{\top} \\ 1 \end{bmatrix}

However, this is still not sufficient. The problem is that we cannot do better than a linear combination, since the whole stack of embedded functions could be reduced to a single one. This is not powerful enough to simulate arbitrary functions for our needs. Fortunately, a way to fix this is to apply an additional activation function immediately before returning the output, in order to introduce non-linearity.

d_i(\vec{x}) = A\left(W_i \begin{bmatrix} \vec{x}^{\top} \\ 1 \end{bmatrix}\right)
A(\vec{x}) = \begin{bmatrix} a(x_1) & a(x_2) & a(x_3) & \dots \end{bmatrix}

This is where the theory gets fuzzy. The Wikipedia page on activation functions alone lists more than 20 possibilities that have been considered in scientific research.

a(x) = ?

Interestingly, as of this writing, one of the most popular activation functions appears to be the rectified linear unit, which returns the input if it is positive and 0 otherwise. Despite forcing outputs to be non-negative and stalling weight updates whenever the input falls in the constant (zero) range, this simple activation function breaks linearity, is quick to compute and achieves very good experimental results.

a(x) = \mathrm{ReLU}(x) = \max(0, x)
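Putting the pieces together, a single embedded function appends the bias, multiplies by the weights and applies the activation element-wise. Here is a minimal sketch with arbitrary example weights, just to show the shapes involved, not a tuned network:

```python
import numpy as np

def relu(v):
    # Element-wise: keep the input where positive, 0 otherwise.
    return np.maximum(0.0, v)

def embedded_function(x, W):
    # Append the constant bias element 1, multiply by the weights,
    # then apply the activation element-wise.
    x_with_bias = np.append(x, 1.0)
    return relu(W @ x_with_bias)

# Example: 2 inputs (+ bias) -> 3 outputs, with arbitrary weights.
W = np.array([[ 0.2, -0.5, 0.1],
              [-0.3,  0.8, 0.0],
              [ 0.6,  0.4, -0.2]])
print(embedded_function(np.array([1.0, -2.0]), W))
```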

Minimizing errors

In order to optimize our simulated function, we need a way to measure errors against the results of the original function so that we can attempt to minimize them. To do so, we will need to define an error function that takes as inputs the outputs of the original mystery function and of the corresponding simulated function.

e(f, s) = ?

Defining such an error function is surprisingly non-trivial. I will go more in-depth on this matter in my next post on machine learning, but for now I would like to at least mention that error functions are based around the distance between the real and simulated output vectors, and that a common choice is to sum the squares of those distances and divide by 2.

e(f, s) = \sum_{i=1}^{n} \frac{\left\lVert s(\vec{x}_i) - f(\vec{x}_i) \right\rVert^2}{2}
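In code, this error function is a one-liner. Here is a minimal sketch, assuming the real and simulated outputs have been stacked into numpy arrays of the same shape:

```python
import numpy as np

def error(simulated_outputs, real_outputs):
    # Sum of squared distances between simulated and real outputs, divided by 2.
    diffs = simulated_outputs - real_outputs
    return np.sum(diffs ** 2) / 2.0
```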

In any case, after settling on an error function, we can start thinking about how to optimize it. Obviously we would like it to be equal to 0, but this is not possible in the general case, so what we are going to attempt instead is to get as close to 0 as possible. Assuming no structural changes will be made to our final function model, we can only attempt this by changing its weights.

Fortunately, calculus gives us a way to simplify the work. The local minima of the error function can only be found where its gradient with respect to the configurable variables is undefined or equal to 0. Moreover, the gradient points in the direction of steepest ascent, so by following the function's curve in the opposite direction, one can eventually reach a local minimum.

This allows for a technique called gradient descent, where the value of the error function is minimized by updating the weights in the direction opposite to the gradient with respect to the weights, scaled by a predefined learning factor. A higher learning factor means fewer examples are required, but the resulting simulated function has a higher risk of being unstable and less accurate. As for the initialization of the weights prior to the first gradient descent, it is traditionally random, though close to 0.

\nabla e(f,s) = \begin{bmatrix} \dfrac{\partial e(f,s)}{\partial w_1} \\ \dfrac{\partial e(f,s)}{\partial w_2} \\ \vdots \end{bmatrix}
\vec{w}\,' = \vec{w} - l \, \nabla e(f,s)
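A single gradient-descent update then looks like the sketch below. To keep the gradient easy to write by hand, I'm assuming a single linear embedded function with no activation, for which the gradient of the squared error with respect to the weights has a simple closed form; the shapes, example data and learning factor are arbitrary:

```python
import numpy as np

def gradient_descent_step(W, x, y, learning_factor):
    # For e = ||W x - y||^2 / 2, the gradient with respect to W
    # is the outer product (W x - y) x^T.
    prediction = W @ x
    gradient = np.outer(prediction - y, x)
    return W - learning_factor * gradient

# Weights are traditionally initialized randomly, close to 0.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(2, 3))
x, y = np.array([1.0, 2.0, 3.0]), np.array([0.5, -1.0])
for _ in range(100):
    W = gradient_descent_step(W, x, y, learning_factor=0.05)
```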

One point to consider is that gradient descent does not require remembering all previously-observed data. Indeed, we just need to consider one pair of real and simulated outputs per application of this technique. This solves the limited memory problem.

Another point to consider is that when using gradient descent with the simulated function model described above, we do not need to optimize its error function directly. Instead, we can optimize the error function for each embedded function separately, which simplifies the calculations. The idea is to first perform forward propagation, i.e. calculate the simulated output normally for a given example input, then perform backward propagation, which is done in the following manner (a minimal code sketch follows the list):

  1. Perform gradient descent for the topmost embedded function
  2. Calculate the input that should have been passed to the embedded function to achieve the expected output
  3. Perform gradient descent for the embedded function immediately underneath by using the result of the previous step as the expected output for it
  4. If the last tweaked embedded function is not the lowermost one, go to step 2
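Here is the sketch promised above: forward and backward propagation over a small stack of two embedded functions, using the squared-error and ReLU choices from earlier. The layer sizes, learning factor and toy target function are arbitrary, and the gradient formulas simply follow the chain rule; this is an illustration, not an optimized implementation.

```python
import numpy as np

def relu(v):
    return np.maximum(0.0, v)

def forward(layers, x):
    # Forward propagation: apply each embedded function in order,
    # remembering each layer's input for the backward pass.
    inputs = []
    for W in layers:
        inputs.append(x)
        x = relu(W @ np.append(x, 1.0))
    return x, inputs

def backward(layers, inputs, output, target, learning_factor):
    # Backward propagation: walk the stack from the topmost embedded
    # function down, updating weights and passing the error signal
    # to the function underneath.
    delta = output - target                      # d(error)/d(output)
    for W, x in zip(reversed(layers), reversed(inputs)):
        x_b = np.append(x, 1.0)
        pre_activation = W @ x_b
        delta = delta * (pre_activation > 0)     # back through ReLU
        gradient = np.outer(delta, x_b)          # d(error)/d(W)
        delta = (W.T @ delta)[:-1]               # error for the layer below (drop bias row)
        W -= learning_factor * gradient          # gradient descent on this layer

# Example: learn to approximate a tiny toy function of two inputs.
rng = np.random.default_rng(0)
layers = [rng.normal(scale=0.1, size=(4, 3)),    # 2 inputs + bias -> 4
          rng.normal(scale=0.1, size=(2, 5))]    # 4 inputs + bias -> 2
for _ in range(1000):
    x = rng.uniform(0.0, 1.0, size=2)
    target = np.array([x[0] + x[1], x[0] * x[1]])
    output, inputs = forward(layers, x)
    backward(layers, inputs, output, target, learning_factor=0.05)
```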

Unfortunately, there are non-negligible drawbacks to using gradient descent and backward propagation in this manner:

  • It requires a very large amount of computational power.
  • It is possible that the optimization will tend to a sub-optimal local minimum instead of the global minimum.
  • The number of iterations required for improving an embedded function may be different depending on its position in the chain and the activation function used.
  • A ridiculously high number of examples is required to achieve good results. (I've heard claims by some notable researchers in the field of approximately 5,000 examples per input category for acceptable results and 10,000,000 examples per input category for super-human results.)

Where's the biology in all that?

I find that getting inspiration from biology for deep learning was a good idea, but that one should not get stuck following its model, as that prevents creative improvements from emerging, which is why I'm only talking about it at the very end.

From my understanding (I'm not a biology expert, so don't quote me on this), biological neural networks behave similarly to what I described above. Neurons are stacked in layers, with all neurons of a given layer connected to the neurons of adjacent layers. During forward propagation, these neurons fire electric pulses at a frequency ranging from nothing to some physical maximum, based on the electric pulses received as input. During backward propagation, the strength of these connections is corrected to better match the real output.

This background may help you understand some of the terminology you will find when researching deep learning. Here is a summary of those common terms and their equivalents in this post:

  • Neural network: Simulated function
  • Input layer: Original input augmented with bias
  • Hidden layer: Embedded function
  • Output layer: Simulated output
  • Neuron: Function that returns a single element of the output vector of an embedded function

But can we do better?

As I mentioned earlier, one of the main reasons I did this exercise of understanding deep learning is to try to improve on its concepts. At some point in the future, I will explain on my blog my methodology in attempting to do so along with preliminary results. I hope you will be looking forward to it!

References

The main introductory resources I've consulted before writing this post are:

For those interested in a full in-depth course on deep learning, I highly recommend: