Neural Networks


If you ever want to deduce the backpropagation-algorithm yourself, DO NOT ATTEMPT TO DO IT USING MATRIX- AND VECTOR-NOTATION!!!!

It makes it sooo much harder. If you check out the Summary you can see the equations in matrix- and vector-form, but these were deduced elementwise and then rewritten in that notation, because matrix and vector operations are much more efficient than elementwise ones.

Backpropagation

Notation

  • $y^L$ is our prediction / final output
  • $y$ is what it's supposed to be, i.e. the real output
  • $y^l = a(z^l)$ the output of the $l^\text{th}$ layer
  • $a(z^l) = \sigma(z^l)$, in this case, not necessary in general
  • $\sigma(z^l) = \frac{1}{1 + e^{-z^l}}$
  • $z^l = W^l \cdot a^{l-1} + b^l$
  • $W^l$ weight matrix between the $(l - 1)^\text{th}$ and $l^\text{th}$ layer
  • $W_j^l$ is the $j^\text{th}$ row of $W^l$ s.t. $z_j^l = W_j^l \cdot a^{l - 1} + b_j^l$

Method

This, my friends, is backpropagation.

We have some loss-function $\mathcal{L}$, which we set to be the least-squares loss just to be concrete:

\begin{equation*}
\mathcal{L} = \frac{1}{2} (y^L - y)^2
\end{equation*}

We're interested in how our loss-function for some prediction changes wrt. to the weights in all the different layers, right?

We start at the back, with the error of the last layer $\delta^L$:

\begin{equation*}
\begin{split}
\delta_j^L &= \frac{\partial \mathcal{L}}{\partial z_j^L} \\
&= (y_j^L - y_j) \frac{\partial y_j^L}{\partial z_j^L} \\
&= (y_j^L - y_j) \frac{\partial a_j^L}{\partial \sigma(z_j^L)} \frac{\partial \sigma(z_j^L)}{\partial z_j^L} \\
&= (y_j^L - y_j) \sigma' (z_j^L)
\end{split}
\end{equation*}

We then let $\nabla_a \mathcal{L} = (y^L - y)$ (dropping the subscript, as we now consider a vector of outputs), and write

\begin{equation*}
\delta^L = \nabla_a \mathcal{L} \odot \sigma' (z^L)
\end{equation*}

where the Hadamard product $\odot$ is simply an element-wise product.

Next, we need to obtain an expression for the error in the $l^{\text{th}}$ layer as a function of the next layer, i.e. the $(l+1)^{th}$ layer.

Consider the error $\delta_j^l$ for the $j^{\text{th}}$ activation / neuron in the $l^{\text{th}}$ layer.

\begin{equation*}
\begin{split}
\delta_j^l &= \frac{\partial \mathcal{L}}{\partial z_j^l} \\
&= \sum_k \frac{\partial \mathcal{L}}{\partial {z_k^{l+1}}} \frac{\partial z_k^{l+1}}{\partial z_j^l} \\
&= \sum_k \delta_k^{l+1} \frac{\partial z_k^{l+1}}{\partial z_j^l}
\end{split}
\end{equation*}

With this recursive relationship we can start from the back, since we already know $\delta^L$, and work our way to the actual input to the entire network.

We still need to obtain an expression for $\frac{\partial z_k^{l+1}}{\partial z_j^l}$! $z_k^{l+1}$ is simply given by

\begin{equation*}
\begin{split}
z_k^{l+1} = \sum_i w_{ki}^{l+1} a_i^l + b_k^{l+1}
\end{split}
\end{equation*}

Taking the derivative of this wrt. $z_j^l$

\begin{equation*}
\frac{\partial z_k^{l+1}}{\partial z_j^l} = w_{kj}^{l+1} \sigma' (z_j^l)
\end{equation*}

since $a_i^l = \sigma(z_i^l)$, the bias term is a constant wrt. $z_j^l$, and $\frac{\partial}{\partial z_j^l} \big( w_{ki}^{l+1} a_i^l (z_i^l) \big) = 0, \quad \forall i \ne j$.

Substituting back into the expression for $\delta_j^l$

\begin{equation*}
\delta_j^l = \sum_k \delta_k^{l+1} w_{kj}^{l+1} \sigma'(z_j^l)
\end{equation*}

And finally rewriting in matrix-form:

\begin{equation*}
\delta^l = ((W^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)
\end{equation*}

So, we now have the following expressions:

\begin{equation*}
\begin{split}
\delta^L &= \nabla_a \mathcal{L} \odot \sigma' (z^L) \\
\delta^l &= ((W^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) \\
\end{split}
\end{equation*}

We have our recursive relationship between the errors in the layers, and the error in the final layer, allowing us to compute the errors in the preceding layers using the recursion.

But the entire reason for why all this is interesting is that we want to obtain an expression for how to update the weights $W^l$ and biases $b^l$ in each layer to improve (i.e. reduce) these errors!

That is; we want some expressions for $\frac{\partial \mathcal{L}}{\partial w_{jk}^l}$ and $\frac{\partial{ \mathcal{L}}}{\partial b_j^l}$.

\begin{equation*}
\begin{split}
\frac{\partial \mathcal{L}}{\partial w_{jk}^l} &= \frac{\partial \mathcal{L}}{\partial z_j^l} \frac{\partial z_j^l}{\partial w_{jk}^l} \\
&= \delta_j^l \frac{\partial }{\partial w_{jk}^l} \Big( \sum_i w_{ji}^l a_i^{l-1} + b_j^l \Big) \\
&= \delta_j^l a_k^{l-1}
\end{split}
\end{equation*}

Let's turn this into vector-notation for each row in $W^l$.

\begin{equation*}
\frac{\partial \mathcal{L} }{\partial w_j^l} = 
\begin{bmatrix}
\frac{\partial \mathcal{L}}{\partial w_{j1}^l} & \dots & \frac{\partial \mathcal{L}}{\partial w_{jK}^l }
\end{bmatrix}
= \delta_j^l (a^{l - 1})^T, \quad \text{which is } {1 \times K}
\end{equation*}

And finally a full-blown matrix-notation:

\begin{equation*}
\frac{\partial \mathcal{L} }{\partial W^l} = 
\begin{bmatrix}
\frac{\partial \mathcal{L}}{\partial w_{11}^l} & \dots & \frac{\partial \mathcal{L}}{\partial w_{1K}^l } \\
\frac{\partial \mathcal{L}}{\partial w_{21}^l} & \dots & \frac{\partial \mathcal{L}}{\partial w_{2K}^l } \\
\vdots \\
\frac{\partial \mathcal{L}}{\partial w_{J1}^l} & \dots & \frac{\partial \mathcal{L}}{\partial w_{JK}^l } \\
\end{bmatrix}
=
\begin{bmatrix}
\delta_1^l a_1^{l-1} & \dots & \delta_1^l a_K^{l-1} \\
\delta_2^l a_1^{l-1} & \dots & \delta_2^l a_K^{l-1} \\
\vdots \\
\delta_J^l a_1^{l-1} & \dots & \delta_J^l a_K^{l-1} \\
\end{bmatrix}_{J \times K}
= \delta^l (a^{l-1})^T
\end{equation*}

And from the second line of the derivation of $\frac{\partial \mathcal{L}}{\partial w_{jk}^l}$ above, we see that if we instead take the partial derivative wrt. $b_j^l$ we obtain

\begin{equation*}
\frac{\partial \mathcal{L}}{\partial b^l} = \delta^l
\end{equation*}

Summary

And we end up with the following equations, using matrix notation:

    \begin{equation*}
    \begin{split}
\delta^L &= \nabla_a \mathcal{L} \odot \sigma' (z^L) \\
\delta^l &= ((W^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) \\
    \frac{\partial \mathcal{L}}{\partial W^l} &= \delta^l (a^{l-1})^T \\
    \frac{\partial \mathcal{L}}{\partial b^l} &= \delta^l \\
    \end{split}
    \end{equation*}
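A minimal NumPy sketch of these four equations for a single training example, using the sigmoid activation and least-squares loss from above (the layer sizes and values are made up just for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop(x, y, weights, biases):
    """Return (dL/dW^l, dL/db^l) for every layer, for one example."""
    # Forward pass, storing z^l and a^l for every layer.
    a, activations, zs = x, [x], []
    for W, b in zip(weights, biases):
        z = W @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)

    # delta^L = grad_a L (*) sigma'(z^L), with L = 1/2 (y^L - y)^2
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    grads_W = [delta @ activations[-2].T]   # dL/dW^L = delta^L (a^{L-1})^T
    grads_b = [delta]                       # dL/db^L = delta^L

    # delta^l = ((W^{l+1})^T delta^{l+1}) (*) sigma'(z^l), working backwards
    for l in range(2, len(weights) + 1):
        delta = (weights[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
        grads_W.insert(0, delta @ activations[-l - 1].T)
        grads_b.insert(0, delta)
    return grads_W, grads_b

# Made-up example: a 3 -> 4 -> 2 network.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
biases  = [rng.standard_normal((4, 1)), rng.standard_normal((2, 1))]
gW, gb = backprop(rng.standard_normal((3, 1)), rng.standard_normal((2, 1)), weights, biases)
print([g.shape for g in gW])  # [(4, 3), (2, 4)], same shapes as the weight matrices
```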

Convolutional Neural Networks (CNN)

Local connectivity

Each hidden unit looks at a small part of the image, i.e. we have a small window / "receptive field" which each hidden unit ("neuron") looks at. In image-recognition, each neuron looks at a different "square" of the image.

local_connectivity.png

This means that each hidden unit will have one weight / connection for each pixel or data-point in the receptive field.

In the case where each pixel or data-point is of multiple dimensions, we will have a connection from each of these dimensions, i.e. for N-dimensional data-points we have N * (area of receptive field) weights / connections.
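For example (made-up numbers): with an RGB input ($N = 3$) and a $5 \times 5$ receptive field, each hidden unit has

\begin{equation*}
3 \times (5 \times 5) = 75
\end{equation*}

weights, plus its bias.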

Why

  • Fully connected hidden layer would have an unmanageable number of parameters
  • Computing the linear activations of the hidden units would be computationally very expensive

Parameter sharing

Share a matrix of parameters / weights across certain hidden units. That is; we define a feature map which is a set of hidden units that share parameters. All hidden units in the same feature map are looking at separate parts of the image. Typically the hidden units in a feature map together cover the entire image. This is also referred to as a filter, I believe.

Notation

  • channel is the "data-point" which can be of multiple dimensions. Used because of the typical input for an image being RGB-channels.
  • $i^{\text{th}}$ input channel, specifies which dimension of the input / "data-point" we're considering
  • $j^{\text{th}}$ feature map
  • $x_i$ is the $i^{\text{th}}$ input channel
  • $W_{ij}$ is the matrix connecting the $i^{\text{th}}$ input channel to the $j^{\text{th}}$ feature map, i.e. with RGB-channels $W_{12}$ corresponds to the red-channel ($i = 1$) and 2nd feature map ($j = 2$)
  • $k_{ij}$ is the convolution kernel (matrix)
  • $\overset{\sim}{X}$ means $X$ with rows and columns flipped
  • $k = \overset{\sim}{W}$
  • $g_j$ is the learning factor
  • $y_j$ is the hidden layer
  • $*$ convolution operation
  • $\underline{*}$ convolution operation with zero-padding
  • $a$ is a (usually non-linear) activation function, e.g. sigmoid, ReLU and tanh.
  • $y_j = g_j a \Big(\underset{i}{\sum} k_{ij} * x_i\Big)$, $g_j$ is not always used

Why

  • Reduces the number of parameters further
  • Each feature map or filter will extract the same feature at each different position in the image. Features are equivariant.

Discrete convolution

Why do we use it in a Convolution Network?

We have a connection between each input (channel) and hidden unit in a feature map. We want to compute the element-wise multiplication between the input matrix and the weight-matrix, then sum all the entries in the resulting matrix.

If we flip the rows and columns of the weight-matrix, this operation corresponds to taking the convolution operation between this flipped matrix and the inputs.

Why do we want to do that? Efficiency. The convolution operation is something which is heavily used in signal processing, and so we can easily take advantage of previous techniques for computing this product.

This is really why we use the convolution operation, and why it's called, well, a convolutional network.
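A small NumPy/SciPy sketch of this relationship (input and weights are made up): the "window-wise multiply and sum" that the hidden units compute is a cross-correlation, and it equals a convolution with the row- and column-flipped weight matrix $k = \overset{\sim}{W}$:

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 5))   # toy input "image"
W = rng.standard_normal((3, 3))   # weights of one receptive field / filter

# What the hidden units actually compute: element-wise product + sum per window.
out = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        out[i, j] = np.sum(x[i:i + 3, j:j + 3] * W)

# The same thing expressed as a cross-correlation, or as a convolution with
# the row- and column-flipped weight matrix.
assert np.allclose(out, correlate2d(x, W, mode="valid"))
assert np.allclose(out, convolve2d(x, W[::-1, ::-1], mode="valid"))
```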

Pooling / subsampling hidden units

  • Performed in non-overlapping neighborhoods (subsampling)
  • Aggregate results from this neighborhood

Maximum pooling

Take the maximum value found in this neighborhood

Average pooling

Compute the average of the neighborhood.
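A minimal NumPy sketch of both max- and average-pooling over non-overlapping neighborhoods (this assumes the input size is divisible by the pool size):

```python
import numpy as np

def pool(x, size=2, op=np.max):
    """Aggregate non-overlapping size x size neighborhoods with `op`."""
    h, w = x.shape
    # Reshape so each neighborhood gets its own pair of axes, then reduce over them.
    blocks = x.reshape(h // size, size, w // size, size)
    return op(blocks, axis=(1, 3))

x = np.arange(16.0).reshape(4, 4)
print(pool(x, op=np.max))   # maximum pooling
print(pool(x, op=np.mean))  # average pooling
```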

Subsampling

  • Generalization of average pooling
  • Average pooling with learnable weights for each filter map

Back-propagation

For some loss-function $l$ we can use back-propagation to compute the gradients of the loss for a prediction.

Here we are only working on a single input at a time. Generalizing to multiple inputs would simply be to also sum over all the inputs, yeah?

For a convolutional layer we have the following:

\begin{equation*}
\mathbf{\nabla_{x_i}} l = \underset{j}{\sum} \mathbf{\nabla}_{f_j} l * W_{ij}, \quad \text{where} \quad f_j = x_i * k_{ij}
\end{equation*}

describes the change in the loss wrt. the input channel, and

\begin{equation*}
\mathbf{\nabla_{W_{ij}}} l = \mathbf{\nabla}_{f_j} l * \overset{\sim}{x}
\end{equation*}

Clearer deduction

Here one might consider the explicit sums instead of looking at the convolution operation.

Consider the equations for forward propagation:

\begin{equation*}
x_{ij}^\ell = \overset{m - 1}{\underset{a=0}{\sum}} \ \overset{m - 1}{\underset{b=0}{\sum}} y_{(i+a)(j+b)}^{\ell-1} w_{ab}
\end{equation*}

where:

  • $x_{ij}^\ell$ is the pre-activations or pre-non-linearities used by the $\ell^{\text{th}}$ layer, which is a convolutional layer.
  • $w_{ab}$ is an entry in the weight-matrix for the corresponding feature-map or filter
  • $y_{(i+a), (j+b)}^{\ell-1}$ is the activation or non-linearity from the previous layer, which can be any type of layer (pooling, convolutional, etc.)

Then the activation of the $j^{\text{th}}$ feature-map / filter in the $\ell^{\text{th}}$ layer (a convolutional layer) is:

\begin{equation*}
y_{j}^{\ell} = a \Big(\sum_i x_{ij}^\ell \Big)
\end{equation*}

Could also have the factor $g_j$ from the notation above multiplying $a$.

Now, for backward propagation we have:

\begin{equation*}
\frac{\partial \mathcal{L}}{\partial w_{ab}} = \overset{N-m}{\underset{i=0}{\sum}} \ \overset{N-m}{\underset{j=0}{\sum}} \frac{\partial \mathcal{L}}{\partial x_{ij}^\ell} \frac{\partial x_{ij}^{\ell}}{\partial w_{ab}}
= \overset{N-m}{\underset{i=0}{\sum}} \ \overset{N-m}{\underset{j=0}{\sum}} \frac{\partial \mathcal{L}}{\partial x_{ij}^\ell} y_{(i+a)(j+b)}^{\ell-1}
\end{equation*}

for each entry in the weight matrix for each feature-map.

Note the following:

  • This double sum corresponds to accumulating the loss for the weight
  • Sum over all $x_{ij}^\ell$ expressions in which $w_{ab}$ occurs (corresponds to weight-sharing)

And since the above expression depends on $\frac{\partial \mathcal{L}}{\partial x_{ij}^\ell}$, we need to compute that!

\begin{equation*}
\frac{\partial \mathcal{L}}{\partial x_{ij}^\ell} = \frac{\partial \mathcal{L}}{\partial y_{ij}^\ell} \frac{\partial y_{ij}^\ell}{\partial x_{ij}^\ell} = 
\frac{\partial \mathcal{L}}{\partial y_{ij}^\ell} \frac{\partial }{\partial x_{ij}^\ell}\left(a(x_{ij}^\ell)\right) = 
\frac{\partial \mathcal{L}}{\partial y_{ij}^\ell} a'(x_{ij}^\ell)
\end{equation*}

There you go! And we already know the error $\frac{\partial \mathcal{L}}{\partial y_{ij}^\ell}$ on the $\ell^\text{th}$ layer's output, so we're good!

And when doing back-propagation we also need to express the error for some layer $\ell - 1$ wrt. the next layer, $\ell$:

\begin{equation*}
\frac{\partial \mathcal{L}}{\partial y_{ij}^{\ell -1}} = \sum_{a=0}^{m-1} \sum_{b=0}^{m-1} \frac{\partial \mathcal{L}}{\partial x_{(i-a)(j-b)}^\ell} \frac{\partial x_{(i-a)(j-b)}^\ell}{\partial y_{ij}^{\ell-1}}
= \sum_{a=0}^{m-1} \sum_{b=0}^{m-1} \frac{\partial \mathcal{L}}{\partial x_{(i-a)(j-b)}^\ell} w_{ab}
\end{equation*}

where we note that:

  • $w_{ab}$ came from the definition for the forward-propagation
  • the expression looks almost like it could be expressed using a convolution, but instead of having $x_{(i+a)(j+b)}^\ell$ we have $x_{(i-a)(j-b)}^\ell$.
  • the expression only makes sense for points that are at least $m$ away from the top and left edges, because otherwise $i - a$ or $j - b$ would be negative

We solve these problems by:

  • pad the top and left edges with zeros
  • then flip axes of $w$

and then we can express this using the convolution operation! (which I'm not showing, because I couldn't figure out how to do it. I was tired, mkay?!)
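To make the sums concrete, here is a minimal NumPy/SciPy sketch of them for a single $m \times m$ filter on an $N \times N$ input (sizes and values are made up; $\frac{\partial \mathcal{L}}{\partial x^\ell}$ is assumed to already be known from the step above). It also checks that the gradient wrt. the previous layer's output is exactly the zero-padded, kernel-flipped ("full") convolution just described:

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

rng = np.random.default_rng(0)
N, m = 6, 3
y_prev = rng.standard_normal((N, N))                   # y^{l-1}, output of the previous layer
w = rng.standard_normal((m, m))                        # filter weights w_{ab}
grad_x = rng.standard_normal((N - m + 1, N - m + 1))   # dL/dx^l, assumed already computed

# dL/dw_{ab} = sum_{i,j} dL/dx_{ij} * y_{(i+a)(j+b)}  (accumulate over all positions)
grad_w = np.zeros((m, m))
for a in range(m):
    for b in range(m):
        grad_w[a, b] = np.sum(grad_x * y_prev[a:a + N - m + 1, b:b + N - m + 1])

# dL/dy_{ij} = sum_{a,b} dL/dx_{(i-a)(j-b)} * w_{ab}
# = zero-pad dL/dx and convolve with the flipped kernel, i.e. a "full" convolution.
grad_y = convolve2d(grad_x, w, mode="full")

# Sanity checks: the weight gradient is itself a cross-correlation, and the
# input gradient has the same shape as the previous layer's output.
assert np.allclose(grad_w, correlate2d(y_prev, grad_x, mode="valid"))
assert grad_y.shape == y_prev.shape
```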

Q & A

DONE Why is the factor of $W_{ij}$ in the derivative of the loss-function wrt. $x_i$ for a convolutional layer not with axes swapped?

Have a look at the derivation here. (based on this blog post) Basically, it's easier to see what's going on if you consider the actual sums, instead of looking at the kernel operation, in my opinion.

DONE View on filters / feature-maps and weight- or parameter-sharing

First we ignore the entire concept of feature-maps / filters.

You can view the weight- or parameter-sharing in two ways:

  1. We have one neuron / hidden unit for each window, i.e. every time you move the window you are using a new neuron / hidden unit to view the pixels / data-points inside the window. Then you think about all of this aaand:
    • There is in fact nothing special with a neuron / hidden unit, but rather the weights it uses for its computation (assuming these neurons have the same activation function).
    • If we then make all these different neurons use the same weights, voilà! We have our weight-sharing!
  2. We have one neuron / hidden unit with its weight-matrix for its receptive field / window. As we slide over, we simply move the neuron and its connections with us.

In the 1st "view", the feature-map / filter corresponds to all these separate neurons / hidden-units which use the same weight-matrix, and having multiple feature-maps / filters corresponds to having multiple such sets of neurons / hidden units with their corresponding weight-matrix.

In the 2nd "view", the feature-map / filter is just a specific weight-matrix, and having multiple independent weight-matrices corresponds to having multiple feature-maps / filters.

Generative Adversarial Networks (GAN)

Notation

  • $G$ - generative model that captures the data distribution, a mapping to the input space
  • $D$ - discriminative model that estimates the probability that a sample came from the data rather than $G$
  • $D(\mathbf{x}; \theta_d)$ - single value representing the probability that $\mathbf{x}$ came from the data (i.e. is "real") rather than generated by $G$
  • $\theta_g$ - parameter for $G$
  • $\theta_d$ - parameter for $D$
  • $p_g$ - distribution estimated by the generator $G$
  • $p_{data}$ or $p_r$ - distribution over the real data
  • $p_z(\mathbf{z})$ - distribution from where we sample inputs to the generative model $G$, i.e. the output-sample from $G$ is $G(\mathbf{z} ; \theta_g)$, i.e. the distribution over the noise used by the generator

Overview

  • Goal is to train $G$ to be so good at generating samples that $D$ really can't tell whether or not the input $\mathbf{x}$ came from $G$ or is "real"
    • Example: inputs are pictures of dogs → $G$ learns to generate pictures of dogs so well that $D$ can't tell if it's actually a "real" picture of a dog or one generated by $G$
  • In the space of arbitrary functions $G$ and $D$, a unique solution exists, with $G$ recovering the training data distribution ($p_g = p_{data}$)

Cost

Kullback-Leibler Divergence

In other words, $D$ and $G$ play the following two-player minimax game with value function $V(G, D)$:

\begin{equation*}
  \min_G \max_D V(D, G) = 
    \mathbb{E}_{\mathbf{x} \sim p_{\text{data}} (\mathbf{x})} [\log D(\mathbf{x})] +
    \mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}(\mathbf{z})} [\log (1 - D(G(\mathbf{z})))]
\end{equation*}

Remember that this actually is optimizing over $\theta_g$ and $\theta_d$, the parameters of the models.

Jensen-Shannon Divergence

I wasn't aware of this when I first wrote the notes on GANs, hence there might be some changes which need to be made in the rest of the document to accommodate it (especially to the algorithm section, as this uses the derivative of the KL-divergence).

There is a "problem" with the KL-divergence; it's asymmetric. That is, if $p(x)$ is close to zero, but $q(x)$ is significantly non-zero, the effect of $q$ is disregarded.

Jensen-Shannon divergence is another measure of similarity between two distributions, which has the following properties:

  • bounded by $[0, 1]$
  • symmetric
  • smooth(er than KL-divergence)
\begin{equation*}
D_{JS} \big( p \ || \ q \big) = \frac{1}{2} D_{KL} \bigg( p \ || \ \frac{p + q}{2} \bigg) + \frac{1}{2} D_{KL} \bigg( q \ || \ \frac{p + q}{2} \bigg)
\end{equation*}
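A small NumPy sketch of $D_{KL}$ and $D_{JS}$ for two made-up discrete distributions, just to illustrate the asymmetry of the former and the symmetry (and $[0, 1]$ bound, when using $\log_2$) of the latter:

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) for discrete distributions, in bits (log base 2)."""
    mask = p > 0  # terms with p(x) = 0 contribute nothing
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

def js(p, q):
    """D_JS(p || q) = 1/2 KL(p || m) + 1/2 KL(q || m), with m = (p + q) / 2."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.8, 0.15, 0.05])
print(kl(p, q), kl(q, p))  # asymmetric
print(js(p, q), js(q, p))  # symmetric, and in [0, 1]
```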

Training

  • $D$ and $G$ are playing a minimax game
  • Optimizing $D$ to completion in the inner loop of training is computationally prohibitive and on finite datasets would lead to overfitting
  • Solution: alternate between $k$ steps of optimizing $D$ and one step optimizing $G$
    • $D$ is maintained near its optimal solution, as long as $G$ converges slowly enough

Optimal value for $D$, the discriminator

The loss function is given by

\begin{equation*}
L(G, D) = \int_{x} \bigg( p_r(x) \log \big( D(x) \big) + p_g(x) \log \big( 1 - D(x) \big) \bigg) \ dx
\end{equation*}

We're currently interested in maximizing $L(G, D)$ wrt. $D(x)$, thus

\begin{equation*}
\frac{\partial L}{\partial D} = \int_{x} p_r(x) \bigg( \frac{\partial}{\partial D} \log D \bigg) + p_g(x) \bigg( \frac{\partial}{\partial D} \log (1 - D) \bigg) \ dx
\end{equation*}

where we've assumed it's alright to interchange the integration and derivative. This gives us

\begin{equation*}
\frac{\partial L}{\partial D} = \int_{x} \frac{p_r(x)}{D} -  \frac{p_g(x)}{1 - D} \ dx
\end{equation*}

Setting equal to zero, we get

\begin{equation*}
\frac{\partial L}{\partial D} = 0 \implies D = \frac{p_r(x)}{p_r(x) + p_g(x)}
\end{equation*}

If we then assume that the generator is trained to optimality, then $p_g \approx p_r$, thus

\begin{equation*}
D^* \approx \frac{1}{2}
\end{equation*}

is the optimal value wrt. $D$ alone.

In this case, the loss is given by

\begin{equation*}
L(G, D^*) = - 2 \log 2
\end{equation*}
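which follows from substituting $D^* = \frac{1}{2}$ back into $L(G, D)$ and using that $p_r$ and $p_g$ each integrate to 1:

\begin{equation*}
L(G, D^*) = \int_{x} \bigg( p_r(x) \log \frac{1}{2} + p_g(x) \log \frac{1}{2} \bigg) \ dx = - \log 2 - \log 2 = - 2 \log 2
\end{equation*}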

Algorithm

Minibatch SGD training of GANs. The number of steps to apply to the discriminator in the inner loop, $k$, is a hyperparameter. Least expensive option is $k = 1$.

for number of training iterations do

  • for $k$ steps do
    • Sample minibatch of $m$ noise samples $\{ \mathbf{z}^{(1)}, ..., \mathbf{z}^{(m)} \}$ from the noise prior $p_z(\mathbf{z})$
    • Sample minibatch of $m$ examples $\{ \mathbf{x}^{(1)}, ..., \mathbf{x}^{(m)} \}$ from data distribution $p_\text{data}(\mathbf{x})$
    • Update $D$ by ascending its stochastic gradient:

      \begin{equation*}
  \theta_d \leftarrow \theta_d + \nabla_{\theta_d} \frac{1}{m} \sum_{i=1}^m \Big[ \log D(\mathbf{x}^{(i)}) + \log \Big( 1 - D \big( G(\mathbf{z}^{(i)}) \big) \Big) \Big]
\end{equation*}
  • end for
  • Sample minibatch of $m$ noise samples $\{ \mathbf{z}^{(1)}, ..., \mathbf{z}^{(m)} \}$ from the noise prior $p_z(\mathbf{z})$
  • Update the generator by descending its stochastic gradient:

    \begin{equation*}
    \theta_g \leftarrow \theta_g - \nabla_{\theta_g} \frac{1}{m} \sum_{i=1}^m \log \Big( 1 - D \big( G(\mathbf{z}^{(i)}) \big) \Big)
\end{equation*}

end for

This demonstrates a standard SGD, but we can replace the update steps with any stochastic gradient-based optimization methods (SGD with momentum, etc.).
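A minimal PyTorch sketch of this loop (not from the paper; the 1-D toy data, network sizes, learning rates and iteration counts are made-up choices just to show the structure of the updates):

```python
import torch
import torch.nn as nn

# Toy setup: the "real" data are samples from N(3, 1); the generator maps
# 1-D noise z ~ N(0, 1) to a 1-D sample.
G = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
opt_d = torch.optim.SGD(D.parameters(), lr=0.05)
opt_g = torch.optim.SGD(G.parameters(), lr=0.05)
m, k = 64, 1

for it in range(2000):
    for _ in range(k):                        # k steps on the discriminator
        x = 3.0 + torch.randn(m, 1)           # minibatch from p_data
        z = torch.randn(m, 1)                 # minibatch from the noise prior
        # Ascend log D(x) + log(1 - D(G(z))) by descending its negative.
        loss_d = -(torch.log(D(x)) + torch.log(1 - D(G(z).detach()))).mean()
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    z = torch.randn(m, 1)
    # Descend log(1 - D(G(z))) for the generator.
    loss_g = torch.log(1 - D(G(z))).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

print(G(torch.randn(1000, 1)).mean().item())  # ideally drifts towards 3
```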

Implementation

Issues

Nash equilibrium: hard to achieve

  • Updating procedure is normally executed by updating both models using their respective gradients jointly
  • If $D$ and $G$ are updated in completely opposite directions, there's a real danger of getting oscillating behavior, and if you're unlucky, this might even diverge

Low dimensional supports

  • The dimensionality of many real-world datasets, as represented by $p_r$, only appears to be high
  • Most datasets concentrate in a lower-dimensional manifold
  • $p_g$ also lies in a low-dimensional manifold because the random noise used by $G$ is (usually) low-dimensional
  • both $p_g$ and $p_r$ are low-dimensional manifolds => almost certainly disjoint
    • In fact, when they have disjoint supports (the sets where the densities are non-zero), we're always capable of finding a perfect discriminator that separates real and fake samples 100% correctly

Vanishing gradient

  • If the discriminator $D$ is perfect => $L$ goes to zero => no gradient of the loss

Thus, we have dilemma:

  • If $D$ behaves badly, the generator does not have accurate feedback and the loss function cannot represent reality
  • If $D$ does a great job, gradient of the loss function drops too rapidly, slowing down training

Mode collapse

  • $G$ may collapse to a setting where it always produces same outputs

Attention

Activation functions

Notation

  • $\mathbf{z}$ is the pre-activation, with $z_j$ being the $j^{\text{th}}$ component

Sigmoid

Definition

\begin{equation*}
\text{sigmoid}(z_j) = \frac{1}{1 + e^{-z_j}}
\end{equation*}

Rectified Linear Unit (ReLU)

Definition

\begin{equation*}
\text{ReLU}(z_j) = \max(0, z_j)
\end{equation*}

Pros & cons

  • No vanishing gradient
  • But we might have exploding gradients
  • Sparsity

Notes

  • Exploding gradients can be mitigated by "clipping" the gradients, i.e. setting an upper- and lower-limit for the value of the gradient
  • There are multiple variants of ReLU, where most of them include a small non-zero gradient when the unit is not active

Softmax

Definition

\begin{equation*}
\text{softmax} (z_j) = \frac{e^{z_j}}{\sum_k e^{z_k}}
\end{equation*}

Pros & cons

  • Provides a normalized probability-distribution over the activations
  • When viewed in a cross-entropy cost-model, the gradients of the loss-function are computationally cheap and numerically stable

Notes

  • Usually only used in the top (output) layer

Exponential Linear Unit (ELU)

Definition

\begin{equation*}
\text{elu}(z_j) = 
  \begin{cases}
  z_j, & \text{if}\ z_j \ge 0 \\
  a(e^{z_j} - 1), & \text{otherwise}
  \end{cases}
\end{equation*}

where $a \ge 0$ is a hyper-parameter.

Pros & cons

  • Attempts to make the mean activations closer to zero which speeds up learning
  • All the pros of ReLU

Notes

  • Shown to have improved performance compared to standard ReLU

Scaled Exponential Linear Unit (SELU) [NEW]

Definition

\begin{equation*}
\text{selu}(z_j) = \lambda
  \begin{cases}
    z_j, & \text{if}\ z_j > 0 \\
    \alpha e^{z_j} - \alpha, & \text{otherwise}
  \end{cases}
\end{equation*}

Pros & cons

  • Allows us to construct a Self-normalizing Neural Network (SNN), which attempts to make the mean activations closer to zero and the variance of the activations close to 1. This is supposed to (and experiments show) greatly increase the stability and efficiency of training.

Notes
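A minimal NumPy sketch of the activation functions defined above (the ELU hyper-parameter $a$ is set to 1 just for illustration; the SELU constants are the approximate values reported in the SELU paper):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    # Subtracting the max is for numerical stability; it doesn't change the result.
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def elu(z, a=1.0):
    return np.where(z >= 0, z, a * (np.exp(z) - 1.0))

def selu(z, alpha=1.6733, lam=1.0507):  # approximate constants from the SELU paper
    return lam * np.where(z > 0, z, alpha * np.exp(z) - alpha)

z = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
for f in (sigmoid, relu, softmax, elu, selu):
    print(f.__name__, f(z))
```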

Loss functions

Softmax

Notation

  • $i$ - the $i^{\text{th}}$ training sample
  • $L_i$ - the loss for the $i^{\text{th}}$ training sample

Definition

  • Corresponds to cross-entropy loss
\begin{equation*}
p_k = \frac{e^{f_k}}{\sum_j e^{f_j}}
\end{equation*}

Derivative of the cross-entropy loss

Cross-entropy loss function:

\begin{equation*}
	L_i = - \log ( p_{y_i} ) 
\end{equation*}

With $p_{y_i}$ being a softmax: $L_i = - \log (p_{y_i}) = \log \Big( \sum_j e^{f_j} \Big) - f_{y_i}$

And thus,

\begin{equation*}
\begin{split}
\frac{\partial L_i}{\partial f_{k}} &= \frac{\partial}{\partial f_k} \log \Big( \sum_j e^{f_j} \Big) - \frac{\partial f_{y_i}}{\partial f_{k}} \\
&= \frac{e^{f_k}}{\sum_j e^{f_j}} - \frac{\partial f_{y_i}}{\partial f_k} \\
&= p_k - \frac{\partial f_{y_i}}{\partial f_k} \\
\end{split}
\end{equation*}

Thus,

\begin{equation*}
\frac{\partial L_i}{\partial f_{k}} = 
\begin{cases}
p_k - 1, & \quad \text{if } k = y_i \\
p_k, & \quad \text{otherwise}
\end{cases}
\end{equation*}

And a bit more compactly,

\begin{equation*}
\frac{\partial L_i}{\partial f_k} = p_k - 1_{k = y_i}
\end{equation*}
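A small NumPy sketch checking this result against a finite-difference approximation of $\frac{\partial L_i}{\partial f_k}$ (the scores $f$ and the true class $y_i$ are made up):

```python
import numpy as np

def softmax(f):
    e = np.exp(f - np.max(f))
    return e / np.sum(e)

def loss(f, y):
    """Cross-entropy loss L_i = -log p_{y_i} with p = softmax(f)."""
    return -np.log(softmax(f)[y])

f = np.array([1.0, -2.0, 0.5, 3.0])  # made-up class scores f_k
y = 2                                # made-up true class y_i

# Analytic gradient: dL_i/df_k = p_k - 1_{k = y_i}
grad = softmax(f).copy()
grad[y] -= 1.0

# Finite-difference check of the analytic gradient.
eps = 1e-6
num = np.array([(loss(f + eps * np.eye(4)[k], y) - loss(f - eps * np.eye(4)[k], y)) / (2 * eps)
                for k in range(4)])
assert np.allclose(grad, num, atol=1e-5)
print(grad)
```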

Resources