Notes on: Klambauer, Günter, Unterthiner, T., Mayr, A., & Hochreiter, S. (2017): Self-Normalizing Neural Networks

Table of Contents

Video

Q & A

Why does this solve the issue of vanishing and exploding gradients?

  • Exploding gradients go hand in hand with a growing variance of the layer activations; the SELU mapping pushes activations back towards zero mean and unit variance, so the variance cannot blow up and neither can the gradients (see the sketch below).
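
As a quick sanity check of the self-normalizing claim, here is a small sketch of my own (not from the paper's code): standard-normal inputs are pushed through a stack of random linear layers followed by SELU, using the paper's constants λ ≈ 1.0507 and α ≈ 1.6733 and weights drawn from N(0, 1/n). The mean and variance of the activations should stay close to 0 and 1 at every depth.

```python
import numpy as np

# SELU constants from Klambauer et al. (2017).
LAMBDA = 1.0507009873554805
ALPHA = 1.6732632423543772


def selu(x):
    """SELU: lambda * x for x > 0, lambda * alpha * (exp(x) - 1) otherwise."""
    return LAMBDA * np.where(x > 0, x, ALPHA * np.expm1(x))


rng = np.random.default_rng(0)
n, batch, depth = 512, 10_000, 16

# Zero-mean, unit-variance inputs.
x = rng.standard_normal((batch, n))

for layer in range(depth):
    # Weights drawn from N(0, 1/n), the initialization recommended in the paper.
    w = rng.normal(0.0, np.sqrt(1.0 / n), size=(n, n))
    x = selu(x @ w)
    print(f"layer {layer + 1:2d}: mean = {x.mean():+.4f}, var = {x.var():.4f}")
```

With a saturating activation such as tanh the variance tends to shrink layer by layer, and with ReLU and too-large weights it grows; SELU with this initialization keeps it pinned near one, which is the property the derivation below relies on.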

Consider the SGD update to a single weight $w_j$ of a unit in the next layer that computes $z = \sum_j w_j x_j$ from the activations $x_j$ of the previous layer:

$$ w_j \leftarrow w_j + \Delta w_j, \qquad \Delta w_j = -\eta \frac{\partial L}{\partial w_j}, $$

which leads to the following update in the output of the following layer:

$$ z + \Delta z = \sum_j (w_j + \Delta w_j)\, x_j = z + \sum_j \Delta w_j\, x_j. $$

Now, if we assume that the zero mean is preserved across the layer, i.e.

$$ \mathbb{E}[z] = \mathbb{E}\Big[ \sum_j w_j x_j \Big] = 0, $$

then the variance of the layer is given by

$$ \mathrm{Var}(z) = \mathbb{E}\big[z^2\big] = \mathbb{E}\Big[ \Big( \sum_j w_j x_j \Big)^2 \Big]. $$
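
As a side note (my own addition, under the extra assumption that the inputs $x_j$ are independent, zero-mean, and of unit variance), this expression connects directly to the weight initialization recommended in the paper:

$$ \mathrm{Var}(z) = \mathbb{E}\Big[ \Big( \sum_j w_j x_j \Big)^2 \Big] = \sum_j w_j^2\, \mathbb{E}\big[x_j^2\big] = \sum_j w_j^2, $$

so drawing the weights from $N(0, 1/n)$ gives $\sum_j w_j^2 \approx 1$ and hence roughly unit variance at initialization, which is the regime in which the SELU fixed point at zero mean and unit variance applies.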

We now observe that the variance after the weight update becomes

$$ \mathrm{Var}(z + \Delta z) = \mathbb{E}\Big[ \Big( \sum_j (w_j + \Delta w_j)\, x_j \Big)^2 \Big] = \mathrm{Var}(z) + 2\, \mathbb{E}\Big[ z \sum_j \Delta w_j x_j \Big] + \mathbb{E}\Big[ \Big( \sum_j \Delta w_j x_j \Big)^2 \Big]. $$

From this we can see that the gradients of the weights are constrained by the equation above. Substituting $\Delta w_j = -\eta\, \partial L / \partial w_j$, the last term becomes

$$ \mathbb{E}\Big[ \Big( \sum_j \Delta w_j x_j \Big)^2 \Big] = \eta^2\, \mathbb{E}\Big[ \Big( \sum_j \frac{\partial L}{\partial w_j} x_j \Big)^2 \Big] \ge 0, $$

i.e. a sum of non-negative contributions involving the squares of the gradients. If the self-normalizing property keeps the variance at one before and after the update, these squared-gradient terms cannot grow without bound, thus "regularizing" the gradient update.

This is not a proper "proof", but rather a justification that helps convince me that the claim is true.
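
To further convince myself, here is a small Monte Carlo sketch of my own (the gradient vector g is an arbitrary stand-in for $\partial L / \partial w_j$, since the decomposition holds for any fixed update): it checks the expansion of $\mathrm{Var}(z + \Delta z)$ numerically and makes the non-negative squared-gradient term explicit.

```python
import numpy as np

rng = np.random.default_rng(0)
n, samples, eta = 128, 50_000, 0.1

# Fixed weights and a fixed, hypothetical gradient vector g_j = dL/dw_j.
w = rng.normal(0.0, np.sqrt(1.0 / n), size=n)
g = rng.standard_normal(n)        # stand-in for the gradients dL/dw_j
dw = -eta * g                     # weight update: Delta w_j = -eta * dL/dw_j

# Zero-mean, unit-variance, independent inputs; expectations are over their distribution.
x = rng.standard_normal((samples, n))

z = x @ w                         # z       = sum_j w_j x_j
dz = x @ dw                       # Delta z = sum_j Delta w_j x_j

# The decomposition: E[(z + dz)^2] = E[z^2] + 2 E[z dz] + E[dz^2].
print("Var(z + dz)               :", np.mean((z + dz) ** 2))
print("Var(z) + cross + sq. term :", np.mean(z**2) + 2 * np.mean(z * dz) + np.mean(dz**2))

# The last term is non-negative and, for independent unit-variance inputs,
# reduces to eta^2 * sum_j (dL/dw_j)^2 -- the squared-gradient term above.
print("E[(Delta z)^2]            :", np.mean(dz**2))
print("eta^2 * sum_j g_j^2       :", eta**2 * np.sum(g**2))
```

The first two printed lines agree by construction; the interesting part is the last pair, which shows the update contributing exactly the non-negative squared-gradient term discussed above.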