Deep Learning - ANN - Artificial Neural Network - Weight Initialization Techniques Tutorial
How to Initialize Weights?
1] Initialize the parameters
2] Choose an optimization algorithm
3] Repeat these steps (sketched in code right after this list):
- Forward propagate an input
- Compute the cost function
- Compute the gradients of the cost with respect to parameters using backpropagation.
- Update each parameter using the gradients, according to the optimization algorithm
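Taken together, these steps look roughly like the following plain-NumPy sketch for a single-layer network. The toy data, sigmoid activation, squared-error loss, and learning rate here are all assumptions made only for illustration.

```python
import numpy as np

# Toy data and hyper-parameters (assumed values, only for illustration)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])  # 4 samples, 2 features
y = np.array([[0.], [1.], [1.], [0.]])                  # targets
lr = 0.1                                                # learning rate (eta)

# 1] Initialize the parameters (the whole tutorial is about this choice)
rng = np.random.default_rng(42)
W = rng.normal(size=(2, 1)) * 0.1
b = np.zeros(1)

# 2] Optimization algorithm: plain gradient descent
for epoch in range(100):
    # Forward propagate an input
    Z = X @ W + b
    A = 1.0 / (1.0 + np.exp(-Z))            # sigmoid activation

    # Compute the cost function (mean squared error here)
    loss = np.mean((A - y) ** 2)

    # Compute the gradients of the cost w.r.t. the parameters (backpropagation)
    dA = 2.0 * (A - y) / len(X)
    dZ = dA * A * (1.0 - A)                 # sigmoid derivative
    dW = X.T @ dZ
    db = dZ.sum(axis=0)

    # Update each parameter using the gradients
    W -= lr * dW
    b -= lr * db
```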
If the weights are initialized incorrectly, the following problems can occur:
1] Vanishing Gradient Problem
2] Exploding Gradient Problem
3] Slow Convergence
What not to do?
1] Do not initialize weights with zero
Case 1] Let's take A11 and A12 with the ReLU activation function
\(A_{11} = \max(0, Z_{11})\)
\(Z_{11} = W^1_{11}X_1 + W^1_{21}X_2 + b_{11}\)
\(A_{12} = \max(0, Z_{12})\)
\(Z_{12} = W^1_{12}X_1 + W^1_{22}X_2 + b_{12}\)
Now assume W = 0 and b = 0. Then \(Z_{11} = 0\) and \(Z_{12} = 0\), so both A11 and A12 become zero.
Because A11 and A12 are zero, the gradient \(\frac{\delta L}{\delta W^1_{11}}\) also becomes zero.
\(W^1_{11}(new) = W^1_{11}(old) - \eta\frac{\delta L}{\delta W^1_{11}}\)
Therefore \(W^1_{11}(new) = W^1_{11}(old)\), and no training will take place.
Zero Initialization ReLU Practical
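A minimal Keras sketch of this case, assuming TensorFlow is installed; the toy data and layer sizes are arbitrary choices. With every weight and bias set to zero and ReLU in the hidden layer, the hidden-layer kernel is still all zeros after training:

```python
import numpy as np
import tensorflow as tf

# Toy data (assumed, just to have something to fit)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=np.float32)
y = np.array([[0], [1], [1], [0]], dtype=np.float32)

# All weights and biases initialized to zero
model = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),
    tf.keras.layers.Dense(4, activation='relu',
                          kernel_initializer='zeros', bias_initializer='zeros'),
    tf.keras.layers.Dense(1, activation='sigmoid',
                          kernel_initializer='zeros', bias_initializer='zeros'),
])
model.compile(optimizer='sgd', loss='binary_crossentropy')
model.fit(X, y, epochs=50, verbose=0)

# The hidden-layer kernel is still all zeros: the gradient never reached it
print(model.layers[0].get_weights()[0])
```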
Case 2] Let's take A11 and A12 with the Tanh activation function
A11 = \(\frac{e^{Z_{11}} - e^{-Z_{11}}}{e^{Z_{11}} + e^{-Z_{11}}}\)
Since \(Z_{11} = 0\), A11 becomes zero. Similarly, A12 will also become zero.
Because A11 is zero, the gradient \(\frac{\delta L}{\delta W^1_{11}}\) also becomes zero.
\(W^1_{11}(new) = W^1_{11}(old) - \eta\frac{\delta L}{\delta W^1_{11}}\)
Therefore \(W^1_{11}(new) = W^1_{11}(old)\), and no training will take place.
Case 3] Let's take A11 and A12 with the Sigmoid activation function
\(A_{11} = \sigma(Z_{11}) = 0.5\)
The sigmoid of zero is 0.5, so A11 = 0.5. Similarly, A12 will also become 0.5.
Here every hidden neuron receives identical inputs and identical gradient updates, so \(W^1_{11}\) and \(W^1_{12}\) stay equal to each other, \(W^1_{21}\) and \(W^1_{22}\) stay equal to each other, and the whole layer behaves like one neuron. In short, even with 1000 neurons in the layer, all 1000 act as a single neuron.
Therefore the network cannot capture non-linearity and behaves like a perceptron.
\(W^1_{11}(new) = W^1_{11}(old) - \eta\frac{\delta L}{\delta W^1_{11}}\)
Therefore every weight receives the same update; the symmetry is never broken and no effective training takes place.
Zero Initialization Sigmoid Practical
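And the matching sketch for the sigmoid case (same assumed toy data). Here training does move the weights, but every hidden neuron gets exactly the same updates, so all columns of the hidden kernel stay identical and the layer behaves like a single neuron:

```python
import numpy as np
import tensorflow as tf

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=np.float32)
y = np.array([[0], [1], [1], [0]], dtype=np.float32)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),
    tf.keras.layers.Dense(4, activation='sigmoid',
                          kernel_initializer='zeros', bias_initializer='zeros'),
    tf.keras.layers.Dense(1, activation='sigmoid',
                          kernel_initializer='zeros', bias_initializer='zeros'),
])
model.compile(optimizer='sgd', loss='binary_crossentropy')
model.fit(X, y, epochs=200, verbose=0)

# All 4 columns of the hidden kernel are identical: symmetry is never broken
print(model.layers[0].get_weights()[0])
```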
2] Do not initialize weights with a constant non-zero value
Let's assume all weights and biases are the same non-zero constant, e.g. 0.5.
\(A_{11} = \max(0, Z_{11})\) → Non-Zero
\(Z_{11} = W^1_{11}X_1 + W^1_{21}X_2 + b_{11}\) → Non-Zero
\(A_{12} = \max(0, Z_{12})\) → Non-Zero
\(Z_{12} = W^1_{12}X_1 + W^1_{22}X_2 + b_{12}\) → Non-Zero
Therefore \(Z_{11} = Z_{12}\) and \(A_{11} = A_{12}\).
In this case, no matter how many neurons you put in a hidden layer, they all compute the same output and act as a single neuron, so the layer cannot capture non-linearity and the model remains effectively linear.
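A quick NumPy check of this symmetry argument, with an assumed 2-feature input and a 4-neuron hidden layer; every neuron computes exactly the same pre-activation and activation when all weights and biases share the same constant:

```python
import numpy as np

X = np.array([0.3, -1.2])      # an arbitrary 2-feature input (assumed)
W = np.full((2, 4), 0.5)       # every weight = 0.5 (constant initialization)
b = np.full(4, 0.5)            # every bias = 0.5

Z = X @ W + b                  # all four pre-activations are identical
A = np.maximum(0.0, Z)         # ReLU: all four activations are identical

print(Z)   # [0.05 0.05 0.05 0.05]
print(A)   # the 4 neurons collapse into one effective neuron
```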
3] Do not initialize weights with very small random values (like 0.0007)
If we take very small random values for the weights and biases, such as 0.0007, then Z also becomes small, and the activations and gradients shrink closer and closer to zero as the number of hidden layers increases.
This causes the vanishing gradient problem. It is most extreme with Tanh; with Sigmoid and ReLU, training becomes very slow.
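A rough NumPy illustration of the shrinking signal (the depth of 10 layers, width of 100 units, and Tanh activation are arbitrary assumptions); the mean activation magnitude collapses toward zero layer by layer, and the gradients vanish in the same way during backpropagation:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=(1, 100))                  # an arbitrary input batch

for layer in range(10):
    W = rng.normal(size=(100, 100)) * 0.0007   # very small random weights
    a = np.tanh(a @ W)                         # signal shrinks at every layer
    print(layer, np.abs(a).mean())             # mean |activation| -> 0
```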
4] Do not initialize weights with large random values (e.g. uniformly between 0 and 1)
If we take large random values for the weights and biases, say between 0 and 1, then Z becomes very large and is pushed into the saturating region of the activation, and the effect compounds as the number of hidden layers increases.
This causes slow convergence, and in the worst case a vanishing gradient problem. It is most extreme with Tanh and Sigmoid, which saturate for large inputs; ReLU does not saturate in the positive direction.
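The mirror-image check for large weights (same arbitrary depth and width as above); the pre-activations become large, Tanh saturates near ±1, and its derivative \(1-\tanh^2(z)\) collapses toward zero, which is what slows convergence:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=(1, 100))

for layer in range(5):
    W = rng.uniform(0.0, 1.0, size=(100, 100))   # large random weights in [0, 1]
    z = a @ W                                     # pre-activations blow up
    a = np.tanh(z)                                # outputs saturate near +/- 1
    grad = 1.0 - a ** 2                           # tanh derivative at those values
    print(layer, np.abs(z).mean(), grad.mean())   # derivative -> ~0
```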
Xavier/Glorot and He Weight Initialization
What can be done to avoid problems in weight initialization?
We need random weights in a suitable range, with a variance that is neither too small nor too large. Two standard schemes achieve this:
- Xavier/Glorot Initialization
- He Weight Initialization
Xavier Initialization, or Glorot Initialization - When working with Tanh or Sigmoid
It is an initialization scheme for neural networks. Biases are initialized to 0 and the weights \(W_{ij}\) at each layer are initialized as:
\(W_{ij}\sim U[-\frac{1}{\sqrt n},\frac{1}{\sqrt n}]\)
where U is a uniform distribution and n is the number of inputs coming to the node.
The main idea behind Xavier Initialization is to set the initial values of the neural network's weights in such a way that they facilitate the training process and help prevent issues like vanishing or exploding gradients. Proper weight initialization is essential because it can significantly affect the convergence and performance of a neural network.
The Xavier Initialization method sets the initial weights for a given layer with a uniform or normal distribution based on the size of the layer's input and output. The specific formulas for Xavier Initialization differ between the uniform and normal distributions:
For a normal distribution:
The weight initialization formula for the Xavier Initialization normal distribution is np.random.randn(no. of inputs, no. of nodes) × \(\sqrt\frac{1}{n}\), i.e. \(\sqrt\frac{1}{fan-in}\)
Here n is the number of inputs coming to the node, i.e. fan-in.
If the number of inputs n is large, the weights are scaled down; if n is small, the weights are scaled up. This keeps the variance of the activations balanced from layer to layer.
For a uniform distribution:
The Xavier Initialization uniform distribution samples weights in the range [-x, x], where x is \(\sqrt\frac{6}{fan-in + fan-out}\)
fan_in is the number of input units to the layer.
fan_out is the number of output units from the layer.
Xavier Weight Initialization Practical
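A sketch of both routes, assuming TensorFlow/Keras is available (the layer sizes are arbitrary): first the manual NumPy formulas from above, then the equivalent built-in Keras initializers.

```python
import numpy as np
import tensorflow as tf

fan_in, fan_out = 50, 30   # assumed layer sizes

# Manual Xavier/Glorot normal: scale standard-normal weights by sqrt(1/fan_in)
W_normal = np.random.randn(fan_in, fan_out) * np.sqrt(1.0 / fan_in)

# Manual Xavier/Glorot uniform: sample from [-x, x] with x = sqrt(6/(fan_in+fan_out))
x = np.sqrt(6.0 / (fan_in + fan_out))
W_uniform = np.random.uniform(-x, x, size=(fan_in, fan_out))

# The same idea via Keras built-in initializers
model = tf.keras.Sequential([
    tf.keras.Input(shape=(fan_in,)),
    tf.keras.layers.Dense(fan_out, activation='tanh',
                          kernel_initializer='glorot_normal'),
    tf.keras.layers.Dense(1, activation='sigmoid',
                          kernel_initializer='glorot_uniform'),  # the Keras default
])
```

Note that the built-in glorot_normal initializer in Keras uses a standard deviation of \(\sqrt\frac{2}{fan-in + fan-out}\), a closely related but slightly different scaling from the \(\sqrt\frac{1}{fan-in}\) formula above.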
Xavier Initialization aims to keep the weights in a range that balances the signal flowing forward and backward through the network. It helps in addressing the challenges of training deep neural networks by ensuring that the weights are initialized with suitable values, reducing the likelihood of vanishing or exploding gradients during backpropagation.
Keep in mind that Xavier Initialization is just one of several weight initialization techniques available, and the choice of initialization method may depend on the architecture and activation functions used in your neural network. It's often recommended to experiment with different initialization methods to find the one that works best for your specific task.
He/Kaiming Initialization - When working with ReLU
It is an initialization method for neural networks that takes into account the non-linearity of activation functions such as ReLU.
For a normal distribution:
The weight initialization formula for the He Initialization normal distribution is np.random.randn(no. of inputs, no. of nodes) × \(\sqrt\frac{2}{n}\), i.e. \(\sqrt\frac{2}{fan-in}\)
n is the number of inputs coming to the nodes
For a uniform distribution:
The He Initialization uniform distribution samples weights in the range [-x, x], where x is \(\sqrt\frac{6}{fan-in}\)
fan_in is the number of input units to the layer.
Keras Kernel Initializers-
Four kernel initializers are he_normal, he_uniform, glorot_normal, and glorot_uniform (the default in Keras).
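A minimal Keras usage sketch with He initialization on the ReLU layers (the input size of 784 and the layer widths are just assumptions for the example):

```python
import tensorflow as tf

# He initialization for the ReLU hidden layers; the sigmoid output layer
# keeps glorot_uniform, which is what Keras uses when nothing is specified.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    tf.keras.layers.Dense(128, activation='relu', kernel_initializer='he_normal'),
    tf.keras.layers.Dense(64, activation='relu', kernel_initializer='he_uniform'),
    tf.keras.layers.Dense(1, activation='sigmoid', kernel_initializer='glorot_uniform'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```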