Neural Networks Breakdown Part III
To start this section, we have to go back a bit.
You should know how the whole neural network is initialized, so that we can start all the computations, such as forward propagation calculation and back-propagation weight adjustments.
All the weights between the nodes are randomly initialized before the first iteration. The whole neural network is then computed for the first time, and because the initialization was random, the output is very unlikely to equal the expected output. Hence, the weights must be adjusted to bring the output closer to the expected one. A quick example: say the expected output is 1 and the output from the neural network is 0.5. The loss in this particular case is <b>(1 - 0.5)<sup>2</sup></b> (the loss function from Part II), which equals 0.25. For the next iteration, the weights are adjusted so that the loss is reduced below 0.25.
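The loss computation above can be sketched in a couple of lines of Python, using the squared-error loss from Part II:

```python
# Squared-error loss from Part II: loss = (expected - predicted)^2
expected = 1.0
predicted = 0.5  # output produced by the randomly initialized network

loss = (expected - predicted) ** 2
print(loss)  # 0.25
```

In the next iteration, the weights are nudged in whatever direction makes this number shrink.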
The question is, how are the weights adjusted so as to reduce the loss? This is where derivatives come in. Before we step into that, we have to learn a few more basics to make the journey smoother. I know we haven't got to this yet, and have had to re-route a couple of times, but that is just the way it is!
<img src="/static/nnp3ex.png" alt="Neural network explanation"></img>
The image shows how all the layers are acting as inputs for the next layer, and in the end giving the final output. The first and last layers are called input and output layers respectively. The layers in between are called the hidden layers.
In this network, we have two hidden layers. For the first hidden layer, x<sub>1</sub> to x<sub>4</sub> are the inputs. a<sub>11</sub> to a<sub>15</sub>, the outputs of the corresponding nodes of the first hidden layer, then act as inputs for hidden layer 2, and so on. If there are more than 5 nodes, the general output can be represented as a<sub>1i</sub>, where i indicates the i<sup>th</sup> node in the first hidden layer. The same notation carries over to the other hidden layers.
The W1, W2, and W3 are the weights. W1 corresponds to the connections between the input layer and the first hidden layer. W2 corresponds to connections between the first hidden layer and the second hidden layer. W3 corresponds to the connections between the second hidden layer and the output layer. There are multiple connections from layer to layer. Hence, all these values are stored in matrices. So W1, W2, and W3 are matrices which hold the values of each weight.
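As a rough sketch of what those weight matrices look like, here is a random initialization in NumPy. The layer sizes (4 inputs, two hidden layers of 5 nodes, 1 output) are an assumption read off the diagram; each matrix has one row per node in the next layer and one column per input coming from the previous layer:

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for reproducibility

# Assumed layer sizes from the figure: 4 inputs, 5 + 5 hidden nodes, 1 output.
W1 = rng.standard_normal((5, 4))  # input layer      -> hidden layer 1
W2 = rng.standard_normal((5, 5))  # hidden layer 1   -> hidden layer 2
W3 = rng.standard_normal((1, 5))  # hidden layer 2   -> output layer

print(W1.shape, W2.shape, W3.shape)  # (5, 4) (5, 5) (1, 5)
```

Storing the weights this way means a whole layer's summations can be computed at once as a matrix-vector product.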
So, as stated in part I, the output of any node is calculated by doing the following:
<center><font size="+2"><b>z<sub>11</sub> = Σw<sub>1i</sub>x<sub>i</sub></b></font></center>
where i runs from 1 to 4, as there are 4 inputs in the first layer. This is just the weighted summation, which we call z<sub>11</sub>, for the first node of the first hidden layer. The node's output is calculated by passing z, the summed-up value, through an activation function. A common activation function is the sigmoid function.
<center><img src="/static/sgm.png" alt="Sigmoid function"></img></center>
The x in the function will be z, the summed value, which enters the function to give the final output of the node, represented by a<sub>11</sub> in this example. All the other outputs are calculated in the same manner.
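Putting the summation and the sigmoid together, the computation for a single node looks like the sketch below. The input and weight values are made up purely for illustration:

```python
import math

def sigmoid(z):
    # Sigmoid activation: squashes any real number into (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical inputs x1..x4 and the weights feeding node 1 of hidden layer 1.
x = [0.5, -1.0, 0.25, 2.0]
w = [0.1, 0.4, -0.3, 0.2]

z11 = sum(wi * xi for wi, xi in zip(w, x))  # the weighted sum from the formula
a11 = sigmoid(z11)                          # the node's final output
print(z11, a11)
```

Repeating this for every node in a layer, and then feeding those outputs forward, is exactly the forward propagation described at the start of this section.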
Even though we didn't get to the update step itself, this groundwork will make that step much easier to understand. The next part continues over <a href="/post/37">here</a>.
- Shubham Anuraj, 3:14AM, 20 July, 2018