Activation Functions in Neural Networks
An activation function is an important part of a neural network. It decides whether, and how strongly, a neuron should be activated based on the weighted sum of its inputs. There are a few important and commonly used activation functions such as sigmoid, tanh, and ReLU. But how do we decide which activation function to use for our neural network?
Here we will discuss the definitions of different activation functions and their advantages and disadvantages relative to one another. We won't be discussing neural networks in detail here because we already have blogs on that which can be read here.
<h2><strong><center>Different Activation Functions</center></strong></h2>
There are many activation functions, but the common ones that we will discuss are ReLU, sigmoid, and tanh. The activation function used in the hidden layers can be different from the one used in the output layer. One of the most widely used output activation functions is softmax, which we will discuss as well.
A sigmoid function is an S-shaped curve. It returns a value between 0 and 1. It was quite popular earlier, but now there are better alternatives. Sigmoid is defined as:
<img src="https://i.ibb.co/N6ZVxhq/sigmoid.png" alt="sigmoid" border="0" />
Sigmoid has the smallest gradient among the functions we will discuss here (its derivative never exceeds 0.25), which is why networks that use sigmoid activations are slow to "learn". Sigmoid output is also not zero-centered, since it is always positive, which can make gradient updates zig-zag instead of heading straight towards the minimum. The derivative of the sigmoid is bell-shaped, and hence a NN with the sigmoid activation function can also suffer from the <strong>vanishing gradient problem</strong>. For inputs far from 0 the function saturates, so even a large change in the input causes only a small change in the output. Hence, the derivative becomes small, and because backpropagation multiplies these small derivatives across layers, it becomes hard to train the neural network.
<img src="https://i.ibb.co/M8vfz6w/vanishing-gradient.png" alt="vanishing-gradient" border="0" />
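To make the saturation concrete, here is a minimal sketch of the sigmoid and its derivative using only the Python standard library. The function names `sigmoid` and `sigmoid_grad` are illustrative choices, not taken from any particular framework.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid 1 / (1 + e^(-x)), arranged to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sigmoid_grad(x: float) -> float:
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


# The gradient peaks at x = 0 and shrinks quickly as |x| grows.
for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"x={x:6.1f}  sigmoid={sigmoid(x):.5f}  gradient={sigmoid_grad(x):.5f}")
```

The gradient is largest at x = 0, where it equals 0.25, and is close to zero at x = 10 or x = -10, which is the saturation described above.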
The hyperbolic tangent function, a.k.a. the tanh function, is another type of activation function (AF). It is a smooth, zero-centered function with a range between -1 and 1. The tanh function is represented by:
<img src="https://i.ibb.co/jvJDRpx/tanh.png" alt="tanh" border="0" />
Tanh has been observed to perform better than sigmoid because of its steeper gradient, and hence it is used more than sigmoid nowadays. Tanh also doesn't have the problem of non-zero-centered output, which is another advantage it has over sigmoid. However, tanh too suffers from the vanishing gradient problem, as it also squashes a very large domain into the range between -1 and 1.
ReLU stands for Rectified Linear Unit. It is the most commonly used activation function in modern neural networks. ReLU is computationally cheaper than sigmoid and tanh and has been observed to give better results in many cases. It is defined as:
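As a rough illustration of the points above, the sketch below evaluates tanh and its derivative with Python's built-in `math.tanh`. The helper name `tanh_grad` is an assumption made for this example.

```python
import math


def tanh_grad(x: float) -> float:
    """Derivative of tanh: 1 - tanh(x)^2."""
    t = math.tanh(x)
    return 1.0 - t * t


# Outputs are centered on 0; the gradient is 1 at x = 0 but still vanishes for large |x|.
for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"x={x:5.1f}  tanh={math.tanh(x):+.5f}  gradient={tanh_grad(x):.5f}")
```

The peak gradient of 1 at x = 0 is four times the sigmoid's peak of 0.25, but the gradient still falls towards zero once the input moves a few units away from 0.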
<img src="https://i.ibb.co/DghZwf1/relu.png" alt="relU" border="0" />
Apart from being computationally efficient, ReLU is much less susceptible to the vanishing gradients that can occur in the case of a sigmoid. The derivative of ReLU is either 0 (for negative inputs) or 1 (for positive inputs), so the gradient does not shrink as it passes through active neurons. As the gradient of ReLU does not saturate for positive inputs, learning is typically faster as well.
The softmax activation function is different from the other functions here. It is typically used only in the output layer. It takes the raw scores of all categories together and generates outputs that each lie between 0 and 1 and sum to 1, so they can be read as probabilities. The softmax function is represented as follows:
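A minimal sketch of ReLU and its derivative is shown below. The names `relu` and `relu_grad` are illustrative, and treating the derivative at exactly 0 as 0 is a common convention assumed here, since ReLU is not differentiable at that point.

```python
def relu(x: float) -> float:
    """Rectified Linear Unit: max(0, x)."""
    return max(0.0, x)


def relu_grad(x: float) -> float:
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise (0 at x = 0 by convention)."""
    return 1.0 if x > 0 else 0.0


# Unlike sigmoid or tanh, the gradient stays at 1 however large the positive input is.
for x in (-3.0, -0.5, 0.5, 3.0, 100.0):
    print(f"x={x:6.1f}  relu={relu(x):6.1f}  gradient={relu_grad(x):.1f}")
```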
<img src="https://i.ibb.co/KjX657P/softmax.png" alt="softmax" border="0" />
The reason why we use it for the output layer is that it gives the output as a probability distribution over the categories, rather than an independent number between 0 and 1 for each category. It tells more clearly which category has the highest chance of being the correct output (according to the neural network).
One problem that might occur in ReLU-based neural networks is dead neurons: ReLU sets every negative weighted sum to 0, so a neuron whose input is always negative outputs 0, receives zero gradient, and stops learning, even though it might have something important to contribute to the network. There is a solution to this problem, though. We can use Leaky ReLU, which keeps a small non-zero slope for negative inputs, to mitigate it.
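The behaviour described above can be sketched in a few lines of plain Python. The function name `softmax` and the example scores are assumptions for illustration; subtracting the maximum score before exponentiating is a standard numerical-stability trick that does not change the result.

```python
import math


def softmax(scores):
    """Convert a list of raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for three categories.
probs = softmax([2.0, 1.0, 0.1])
print(probs)       # the largest score receives the largest probability
print(sum(probs))  # the probabilities sum to 1 (up to floating-point rounding)
```

Because every output depends on all the scores, raising the score of one category lowers the probabilities of the others, which is what makes the result read as a single distribution over categories.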
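A minimal sketch of Leaky ReLU follows. The names `leaky_relu` and `leaky_relu_grad` are illustrative, and the slope `alpha=0.01` is an assumed default for this example; the slope is a hyperparameter that can be tuned.

```python
def leaky_relu(x: float, alpha: float = 0.01) -> float:
    """Leaky ReLU: x for positive inputs, alpha * x otherwise."""
    return x if x > 0 else alpha * x


def leaky_relu_grad(x: float, alpha: float = 0.01) -> float:
    """Derivative of Leaky ReLU: 1 for positive inputs, alpha otherwise."""
    return 1.0 if x > 0 else alpha


# Negative inputs still produce a small output and a small, non-zero gradient.
for x in (-5.0, -1.0, 1.0, 5.0):
    print(f"x={x:5.1f}  leaky_relu={leaky_relu(x):+.3f}  gradient={leaky_relu_grad(x):.2f}")
```

Since the gradient for negative inputs is `alpha` rather than 0, a neuron that currently receives negative weighted sums can still be updated during training.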
<h2><strong><center>When to Use Which?</center></strong></h2>
The most common approach in modern-day neural networks is to use ReLU activation functions for the hidden layers and softmax for getting the output. Most <strong>convolutional neural networks and ANNs use the ReLU activation function</strong>. That doesn't mean the sigmoid and hyperbolic tangent functions are not used nowadays, though.
The tanh function has mostly been used in <strong>recurrent neural networks for natural language processing and speech recognition tasks</strong>. Though sigmoid was used in the hidden layers of earlier models, it is hard to find it there in modern neural networks because of its slow rate of learning. Sigmoid and tanh are still sometimes preferred for classifiers, and they are generally used in recurrent neural networks (RNNs).
Some models use different activation functions for different layers. There are other activation functions as well. Google researchers introduced a newer activation function called <strong>Swish</strong>, which has been reported to perform better than ReLU on some tasks. It is defined by x * sigmoid(x). A linear function can also be used as an activation function.
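The Swish definition given above, x * sigmoid(x), can be sketched directly in plain Python. The function names `sigmoid` and `swish` are illustrative choices for this example.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid, arranged to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def swish(x: float) -> float:
    """Swish activation: x * sigmoid(x)."""
    return x * sigmoid(x)


# For large positive x Swish is close to x (like ReLU); for negative x it dips slightly below 0.
for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"x={x:5.1f}  swish={swish(x):+.5f}")
```

Unlike ReLU, Swish is smooth everywhere and lets small negative values through instead of cutting them off at exactly 0.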
- Ojas Srivastava, 10:32 PM, 08 Jun, 2022