# Neural Networks

## Table of Contents

## Neural Networks

If you ever want to deduce the backpropagation-algorithm yourself, DO NOT ATTEMPT TO DO IT USING MATRIX- AND VECTOR-NOTATION!!!!

It makes it sooo much harder. If you check out the Summary you can
see the equations in matrix- and vector-form, but these were
deduced *elementwise* and *then* transformed into this notation
due to the efficient nature of these operations rather than
elementwise operations.

### Backpropagation

#### Notation

- is our prediction / final output
- is what it's
*supposed*to be, i.e. the*real*output - the output of the layer
- ,
*in this case*, not necessary in general - weight matrix between the and layer
- is the row of s.t.

#### Method

This my friends, is *backpropagation*.

We have some loss-function , which we set to be the
*least-squares* loss just to be concrete:

We're interested in how our loss-function for some prediction changes wrt. to the weights in all the different layers, right?

We start at the back, with the error of the last layer :

We then let (dropping the subscript, as we now consider a vector of outputs), and write

Where the Hadamar-product is simply a element-wise product.

Next, we need to obtain an expression for the error in the
layer as a function of the *next* layer,
i.e. the layer.

Consider the error for the activation / neuron in the layer.

With this recursive relationship we can start from the back, since we already know , and work our way to the actual input to the entire network.

Still got to obtain an expression for ! is simply given by all the

Taking the derivative of this wrt.

due to .

Substituting back into the expression for

And finally rewriting in matrix-form:

So, we now have the following expressions:

We have our recursive relationship between the errors in
the layers, *and* the error in the final layer, allowing us
to compute the errors in the preceding layers using the
recursion.

But the entire reason for why all this is interesting is
that we want to obtain an expression for how to *update*
the weights and biases in each layer to improve
(i.e. reduce) these errors!

That is; we want some expressions for and .

Let's turn this into vector-notation for each row in .

And finally a full-blown matrix-notation:

And from the second-to-last line in the previous equation we see that if we instead take the partial-derivative wrt. we obtain

#### Summary

And we end up with the following equations, using matrix notation:

## Convolutional Neural Networks (CNN)

Have the following properties:

### Local connectivity

Each hidden unit look at small part of image, i.e. we have small window / "receptive field" which each hidden unit ("neuron") looks at. In image-recognition, each neuron looks at a different "square" of the image.

This means that each hidden unit will have one weight / connection for each pixel or data-point in the receptive field.

In the case where each pixel or data-point is of multiple dimensions,
we will have a connection from each of these dimension, i.e. for `N`

dimensional data-points we have `N * (area of receptive field)`

weights / connections.

#### Why

- Fully connected hidden layer would have an unmanageable number of parameters
- Computing the linear activations of the hidden units would be computational very expensive

### Parameter sharing

Share matrix of parameters / weights across *certain* hidden units.
That is; we define a *feature map* which is a set of hidden units
that share parameters. All hidden units in the same *feature map*
are looking at separate parts of the image. Typically the hidden units
in a feature map together covers the entire image.
This is also referred to as a *filter* I believe.

#### Notation

*channel*is the "data-point" which can be of multiple dimensions. Used because of the typical input for an image being RGB-channels.- input channel, specifies which dimension of the input / "data-point" we're considering
- feature map
- is the input channel
- is the
**matrix**connecting the input channel to the feature map, i.e. with RGB-channels corresponds to the red-channel () and 2nd feature map () - is the convolution kernel (
**matrix**) - means with rows and columns
*flipped* - is the learning factor
- is the hidden layer
- convolution operation
- convolution operation with zero-padding
- is a (usually non-linear)
*activation function*, e.g.`sigmoid`

,`ReLU`

and`tanh`

. - , is not always used

#### Why

- Reduces the number of parameters further
- Each feature map or filter will extract the same feature at each
different position in the image. Feature are
*equivariant*.

#### Discrete convolution

##### Why do we use it in a Convolution Network?

We have a connection between each input (channel)
and hidden unit in a feature map. We want to
compute the element-wise multiplication between the
input matrix and the weight-matrix, then *sum* all
the entries in the resulting matrix.

If we flip the rows and columns of the weight-matrix,
this operation corresponds to taking the *convolution*
operation between this flipped matrix and the inputs.

Why do we want to do that? Efficiency. The convolution operation is something which is heavily used in signal processing, and so we can easily take advantage of previous techniques for computing this product.

This is really why we use the convolution operation,
and why it's called, well, a *convolutional* network.

### Pooling / subsampling hidden units

- Performed in non-overlapping neighborhoods (subsampling)
- Aggregate results from this neighborhood

#### Maximum pooling

Take the maximum value found in this neighborhood

#### Average pooling

Compute the average of the neighborhood.

#### Subsampling

- Generalization of average pooling
- Average pooling with learnable weights for each filter map

### Back-propagation

For some loss-function we can use back-propogation to compute the total loss for a prediction.

Here we are only working on a single input at the time. Generalizing to multiple inputs would simply be to also sum over all , yeah?

For a convolutional layer we have the following:

describes the change in the loss wrt. the input channel, and

#### Clearer deduction

Instead one might consider the explicit sums instead of looking at the convolution operation.

Consider the equations for **forward** propagation:

where:

- is the
*pre-activations*or*pre-non-linearities**used*by the layer, which is a*convolutional layer*. - is an entry in the weight-matrix for the correspondig
*feature-map*or*filter* - is the
*activation*or*non-linearity*from the previous layer, which can be any type of layer (pooling, convolutional, etc.)

Then the activation of the feature-map / filter in the layer (a convolutional layer) is:

Could also have some learning rate multiplied by .

Now, for **backward** propagation we have:

for each entry in the weight matrix for each feature-map.

Note the following:

- This double sum corresponds to accumulating the loss for the weight
- Sum over all expressions in which occurs (corresponds to weight-sharing)

And since the above expression depends on we need to compute that!

There you go! And we already know the error on the layer, so we're good!

And when doing back-propagation we need to describe the loss for some layer
wrt. to the *next* layer, :

where we note that:

- came from the definition for the forward-propagation
- expression looks
*slightly*like it could be expressed using convolution, but instead of having we have . - expression only makes sense for points that are
*at least*away from the top and left edges (because and mate)

We solve these problems by:

- pad the top and left edges with zeros
- then flip axes of

and then we can express this using the convolution operation! (which I'm not showing, because I couldn't figure out how to it. I was tired, mkay?!)

### Q & A

#### DONE Why is the factor of in the derivative of the loss-function wrt. for a convolutional layer *not* with axes swapped?

Have a look at the derivation here. (based on this blog post) Basically, it's easier to see what's going on if you consider the actual sums, instead of looking at the kernel operation, in my opinion.

#### DONE View on filters / feature-maps and weight- or parameter-sharing

First we ignore the entire concept of feature-maps / filters.

You can view the weight- or parameter-sharing in two ways:

- We have one neuron / hidden unit for
*each window*, i.e. everytime you move the window you are using a new neuron / hidden unit to view pixels / data-points inside the window. Then you think about all of this aaand:- There is in fact nothing special with a neuron / hidden unit, but rather the
*weights*it uses for it's computation (assuming these neurons have the same activation function). - If we then are to make all these different neurons use the same weights, voilá! We have our weight-sharing!

- There is in fact nothing special with a neuron / hidden unit, but rather the
- We have one neuron / hidden unit with its weight-matrix for its receptive field / window. As we slide over, we simply move the neuron and it's connections with us.

In the 1st "view", the feature-map / filter corresponds to all these separate neurons / hidden-units which use the same weight-matrix, and having multiple feature-maps / filters corresponds to having multiple such sets of neurons / hidden units with their corresponding weight-matrix.

In the 2nd "view", the feature-map / filter is just a specific weight-matrix, and having multiple independent weight-matrices corresponds to having multiple feature-maps / filters.

## Generative Adversarial Networks (GAN)

### Notation

- - generative model that captures the data distribution, a mapping to the input space
- - discriminative model that estimates the probability that a sample
came from the
*data*rather than - - single value representing the probability that came from the data (i.e. is "real") rather than generated by
- - parameter for
- - parameter for
- - distribution estimated by the
*generator* - or - distribution over the
*real*data - - distribution from where we sample inputs to the generative model , i.e. the output-sample from is , i.e. the distribution over the
*noise*used by the generator

### Overview

**Goal**is to train to be so good at generating samples that really can't tell whether or not the input came from or is "real"*Example:*inputs are pictures of dogs → learns to generate pictures of dogs so well that can't tell if it's actually a "real" picture of a dog or one generated by

- In the space of arbitrary functions and , a unique solution exists, with recovering the training data distribution ()

### Cost

#### Kullback-Leibner Divergence

In other words, and play the following two-player minimax game with value function :

Remember that this *actually* is optimizing over and ,
the parameters of the models.

#### Jensen-Shannon Divergence

I wasn't aware of this when I first wrote the notes on GANs, hence there might be some changes which needs to be made in the rest of the document to accomodate (especially to the algorithm section, as this uses the derivative of KL-divergence).

There is a "problem" with the KL-divergence; it's **asymmetric**. That is, if is close to zero, but is signficantly non-zero, the effect of is *disregarded*.

**Jensen-Shannon divergence** is another measure of similiarity between two distributions, which have the following properties:

*bounded*by $[0, 1]*symmetric*- smooth(er than KL-divergence)

### Training

- and are playing a
**minimax game** - Optimizing in completion in the inner loop of training is computationally prohibitive and on finite datasets would lead to overfitting
**Solution:**alternate between steps of optimizing and one step optimizing- is maintained near its optimal solution, as long as converges slowly enough

#### Optimal value for , the discriminator

The loss function is given by

We're currently interested in maximizing wrt. , thus

where we've assumed it's alright to interchange the integration and derivative. This gives us

Setting equal to zero, we get

If we then assume that the generator is trained to optimality, then , thus

is the optimal value wrt. alone.

In this case, we the loss is given by

### Algorithm

Minibatch SGD training of GANs. The number of steps to apply to the discriminator in the inner loop, , is a hyperparameter. Least expensive option is .

**for** number of training iterations **do**

**for**steps**do**- Sample minibatch of noise samples from noise prior
- Sample minibatch of examples from data distribution
Update by

*ascending*its stochastic gradient:23

**end for**- Sample minibatch of noise samples from noise prior
Update the generator by

*descending*its stochastic gradient:24

**end for**

This demonstrates a standard SGD, but we can replace the update steps with any stochastic gradient-based optimization methods (SGD with momentum, etc.).

### Implementation

### Issues

#### Nash equilibrium: hard to achieve

- Updating procedure is normally executed by updating both models using their respective gradients jointly
- If and are updated in completely opposite directions, there's a real danger of getting oscillating behavior, and if you're unlucky, this might even diverge

#### Low dimensional supports

- Dimensions of many real-world datasets, as represented by , only
*appear*to be artificially high - Most datasets concentrate in a lower-dimensional manifold
- also lies in low dimensional manifold due to (usually) the random nuber used by is low-dimensional
*both*and are low-dimensional manifolds => almost certainly disjoint- In fact, when they have disjoint support (set not mapped to zero), we're always capable of finding perfect discriminator that separates real and fake samples 100% correctly

#### Vanishing gradient

- If discriminator is perfect => goes to zero => no gradient loss

Thus, we have **dilemma**:

- If behaves badly, generator doesn not have accurate feedback and the loss function cannot represent reality
- If does a great job, gradient of the loss function drops
*too*rapidly, slowing down training

#### Mode collapse

- may collapse to a setting where it always produces same outputs

### Improving GAN training

### Wasserstein GAN (WGAN)

## Attention

## Activation functions

### Notation

- is the pre-activation, with being the
*jth*component

### Sigmoid

#### Definition

### Rectified Linear Unit (ReLu)

#### Definition

#### Pros & cons

- No vanishing gradient
- But we might have exploding gradients
- Sparsity

#### Notes

- Exploding gradients can be mitigated by "clipping" the gradients, i.e. setting an upper- and lower-limit for the value of the gradient
- There are multiple variants of ReLU, where most of them include
a small non-zero gradient when the unit is
*not*active

### Softmax

#### Definition

#### Pros & cons

- Provides a normalized probability-distribution over the activations
- When viewed in a cross-entropy cost-model, the gradients for the loss-function is cheap computationally and numerical stable

#### Notes

- Usually only used in the top (output) layer

### Exponential Linear Unit (ELU)

#### Definition

where and is a hyper-parameter.

#### Pros & cons

- Attempts to make the mean activations closer to zero which speeds up learning
- All the pros of ReLU

#### Notes

- Shown to have improved performance compared to standard ReLU

### Scaled Exponential Linear Unit (SELU) [NEW]

#### Defintion

#### Pros & cons

- Allows us to construct a
*Self-normalizing Neural Network*(SNN), which attempts to make the mean activations closer to zero*and*the variance of the activiations close to 1. This is supposed to (and experiments show) greatly increase the stability and efficiency of training.

#### Notes

- Extension of ELU
- Recent discovery, see Self-Normalizing Neural Networks paper

## Loss functions

### Softmax

#### Notation

- -
*ith*training sample - -

#### Definition

- Corresponds to cross-entropy loss

#### Derivtive wrt. cross-entropy loss

Cross-entropy loss function:

With being a softmax:

And thus,

Thus,

And a bit more compactly,

## Resources

- Amazing overview of everything regarding Feed-forward and convolutional networks. Also includes implementation tips, a range of regularization methods and optimization methods.
- CS231 course-notes from Stanford Provides a fantastic overview of everything from activiaton-, loss-functions and different neural network architectures.