Notes on: Klambauer, G\"unter, Unterthiner, T., Mayr, A., & Hochreiter, S. (2017): Self-Normalizing Neural Networks

Table of Contents


Q & A

Why does this solve the issue of vanishing gradient and exploding weights?

  • Exploding gradient => large variance, but this ensures unit-variance hence we cannot have exploding gradients

Consider the update to a single weight:


which leads to the following update in the output of the following layer:


Now, if we're assuming that we can preserve the zero-mean across a layer, i.e.


then the variance of the layer is given by


We now observe that the change in variance due to the update of the weights gives us


from this we can see that gradient of each of the weights are constrained by the equation above. This sum will contain positive terms involving the square of the gradient, thus "regularizing" the gradient update.

This is not a proper "proof", but rather a justification which helps convince myself that this is true.