Neural Networks


If you ever want to deduce the backpropagation algorithm yourself, DO NOT ATTEMPT TO DO IT USING MATRIX- AND VECTOR-NOTATION!!!!

It makes it so much harder. If you check out the Summary you can see the equations in matrix- and vector-form, but these were deduced elementwise and then converted to this notation because matrix operations are more efficient than their elementwise counterparts.



  • neural_networks_ba895132c86b7db0f2feefbfcfda9a856823dfb9.png is our prediction / final output
  • neural_networks_a3a7f43f807b9e381fc50e0fab140c0df0a03e17.png is what it's supposed to be, i.e. the real output
  • neural_networks_d51fb53733ad2d22c670e58a9a810fccdd781141.png the output of the neural_networks_552d8b44ac2608742a5bf60aabbe6f862b0acddc.png layer
  • neural_networks_4b660c5ae42acd9f5902c0959eb3a0cc1bb76ea0.png, in this case, not necessary in general
  • neural_networks_95812024acc36276c1e3e1aa026833338166ebc0.png
  • neural_networks_8c3efe310880970ab8b05354335090c6083fd7d8.png
  • neural_networks_8ca1e4b9e5ad9a44f682e29df3c6c1d587af853a.png weight matrix between the neural_networks_399fa921acdb69c40e8d812d005b0edbb662cc18.png and neural_networks_552d8b44ac2608742a5bf60aabbe6f862b0acddc.png layer
  • neural_networks_c321fcc6ece4c32034dca4d982f6625e0a5dc30e.png is the neural_networks_da35fe4f391f750d7b1845ec55f7d46d813a5ef1.png row of neural_networks_8ca1e4b9e5ad9a44f682e29df3c6c1d587af853a.png s.t. neural_networks_4b5452797da8af472bb2907954f37f9a7e8c647b.png


This, my friends, is backpropagation.

We have some loss-function neural_networks_4c8645cdaa1e14abf0ddb246a95777de85158726.png, which we set to be the least-squares loss just to be concrete:


We're interested in how our loss-function for some prediction changes wrt. the weights in all the different layers, right?

We start at the back, with the error of the last layer neural_networks_3835b42f20d6a653807effb5194cc2d560256746.png:


We then let neural_networks_993241d79feceeaf013e9e93404c4b2a59869181.png (dropping the subscript, as we now consider a vector of outputs), and write


where the Hadamard-product is simply an element-wise product.

Next, we need to obtain an expression for the error in the neural_networks_1fc002a89d7e1a069ff2abf955822e38b685bb85.png layer as a function of the next layer, i.e. the neural_networks_9f8109db533b0fb215d9d0cdcde2baf4e3ac68f8.png layer.

Consider the error neural_networks_738f3c388a3e8b31c5b44cd12acbf1067fc7132d.png for the neural_networks_494f69b962b78c1246de9860610fa54782c36233.png activation / neuron in the neural_networks_1fc002a89d7e1a069ff2abf955822e38b685bb85.png layer.


With this recursive relationship we can start from the back, since we already know neural_networks_35ccf1c403eba37307c812ecb01eb8218b6c4a52.png, and work our way to the actual input to the entire network.

We still need to obtain an expression for neural_networks_7ce8c85dad3eed5e8346d9c803fb65e56fc9b4d0.png! neural_networks_8bf28ac6544cea582d9992130198223518539252.png is simply given by all the


Taking the derivative of this wrt. neural_networks_1e6351b32be2cbb743c335ca6322d555383ed8f0.png


due to neural_networks_d353529c149c9cff0311d1451743898bc98a1a0c.png.

Substituting back into the expression for neural_networks_738f3c388a3e8b31c5b44cd12acbf1067fc7132d.png


And finally rewriting in matrix-form:


So, we now have the following expressions:


We have our recursive relationship between the errors in the layers, and the error in the final layer, allowing us to compute the errors in the preceding layers using the recursion.

But the entire reason for why all this is interesting is that we want to obtain an expression for how to update the weights neural_networks_8ca1e4b9e5ad9a44f682e29df3c6c1d587af853a.png and biases neural_networks_69fce79e2c057dbd258ea2b1d3da1600cdd0477b.png in each layer to improve (i.e. reduce) these errors!

That is; we want some expressions for neural_networks_2222f6f2490f22a115f13971d9d3c97524434931.png and neural_networks_3800ca1882bf0f52c8d6098300e7fac567525018.png.


Let's turn this into vector-notation for each row in neural_networks_8ca1e4b9e5ad9a44f682e29df3c6c1d587af853a.png.


And finally a full-blown matrix-notation:


And from the second-to-last line in the previous equation we see that if we instead take the partial-derivative wrt. neural_networks_83f26075efd4c456d789e0dac84583700f843fe6.png we obtain



And we end up with the following equations, using matrix notation:


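As a sanity check on the derivation above, here is a minimal numpy sketch of the matrix-form equations for a small sigmoid network with the least-squares loss (all names, shapes, and toy values are my own choices, not from any library):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Ws, bs):
    """Return the pre-activations z^l and activations a^l of every layer."""
    a, zs, activations = x, [], [x]
    for W, b in zip(Ws, bs):
        z = W @ a + b
        a = sigmoid(z)
        zs.append(z)
        activations.append(a)
    return zs, activations

def backprop(x, y, Ws, bs):
    """Matrix-form backpropagation for the loss C = 1/2 ||a^L - y||^2."""
    zs, activations = forward(x, Ws, bs)
    sig_prime = lambda z: sigmoid(z) * (1 - sigmoid(z))
    # Error in the last layer: delta^L = (a^L - y) * sigma'(z^L)  (Hadamard product)
    delta = (activations[-1] - y) * sig_prime(zs[-1])
    dWs, dbs = [None] * len(Ws), [None] * len(bs)
    for l in range(len(Ws) - 1, -1, -1):
        dWs[l] = delta @ activations[l].T   # dC/dW^l = delta^l (a^{l-1})^T
        dbs[l] = delta                      # dC/db^l = delta^l
        if l > 0:
            # The recursion: delta^{l-1} = ((W^l)^T delta^l) * sigma'(z^{l-1})
            delta = (Ws[l].T @ delta) * sig_prime(zs[l - 1])
    return dWs, dbs

# Toy 3 -> 4 -> 2 network
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
bs = [rng.normal(size=(4, 1)), rng.normal(size=(2, 1))]
x, y = rng.normal(size=(3, 1)), rng.normal(size=(2, 1))
dWs, dbs = backprop(x, y, Ws, bs)
```

The gradients agree with finite differences, which is a quick way to convince yourself the elementwise derivation was transcribed into matrix form correctly.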
Convolutional Neural Networks (CNN)

Local connectivity

Each hidden unit looks at a small part of the image, i.e. we have a small window / "receptive field" which each hidden unit ("neuron") looks at. In image-recognition, each neuron looks at a different "square" of the image.


This means that each hidden unit will have one weight / connection for each pixel or data-point in the receptive field.

In the case where each pixel or data-point is of multiple dimensions, we will have a connection from each of these dimensions, i.e. for N-dimensional data-points we have N * (area of receptive field) weights / connections.


  • Fully connected hidden layer would have an unmanageable number of parameters
  • Computing the linear activations of the hidden units would be computationally very expensive

Parameter sharing

Share a matrix of parameters / weights across certain hidden units. That is, we define a feature map, which is a set of hidden units that share parameters. All hidden units in the same feature map look at separate parts of the image. Typically the hidden units in a feature map together cover the entire image. This is also referred to as a filter, I believe.


  • channel is the "data-point", which can be of multiple dimensions. The name is used because the typical input for an image consists of RGB-channels.
  • neural_networks_debedb7f7748a9279d98de01a251be322c096645.png input channel, specifies which dimension of the input / "data-point" we're considering
  • neural_networks_494f69b962b78c1246de9860610fa54782c36233.png feature map
  • neural_networks_e9f4d218474bc10aa94958ec30139aee865c0173.png is the neural_networks_debedb7f7748a9279d98de01a251be322c096645.png input channel
  • neural_networks_485943dd128848be56b354aa3e60a8203e5c4d60.png is the matrix connecting the neural_networks_debedb7f7748a9279d98de01a251be322c096645.png input channel to the neural_networks_494f69b962b78c1246de9860610fa54782c36233.png feature map, i.e. with RGB-channels neural_networks_b78256fb239f2c0bf5c2ad3b391c38bda59ecbc9.png corresponds to the red-channel (neural_networks_969aee66e9997ecfc4e1fd03823fca7fbe3f5532.png) and 2nd feature map (neural_networks_6eb2fb2d0384008dd6e19b219c776533ec9202f8.png)
  • neural_networks_54cac9faf0168f174427686f81150cf1619ed42e.png is the convolution kernel (matrix)
  • neural_networks_7d87f882b572f185cf8a13e5c0ce346d458e14b7.png means neural_networks_0207be880056b9a69e22e729dd37bced29cd174a.png with rows and columns flipped
  • neural_networks_67b47ea58fc87150c1049a69fc2b0c30a027e961.png
  • neural_networks_581f6d00fec281764db7cd99508f435d86844f6d.png is the learning factor
  • neural_networks_917470114d02b6d88d59e1010f45bcbe2403ebad.png is the hidden layer
  • neural_networks_7083259f29256d7f0845c7bc3801223eabf76ba4.png convolution operation
  • neural_networks_cd432cf2277b055fb63cd74cdd1f88cc245da804.png convolution operation with zero-padding
  • neural_networks_1f36e68578b771d42cf836dbdf0a6923a809fd5c.png is a (usually non-linear) activation function, e.g. sigmoid, ReLU and tanh.
  • neural_networks_38c883b7b054b7100710fa7aa9b06847132c0c3f.png, neural_networks_581f6d00fec281764db7cd99508f435d86844f6d.png is not always used


  • Reduces the number of parameters further
  • Each feature map or filter will extract the same feature at each position in the image. Features are equivariant.

Discrete convolution

Why do we use it in a Convolution Network?

We have a connection between each input (channel) and hidden unit in a feature map. We want to compute the element-wise multiplication between the input matrix and the weight-matrix, then sum all the entries in the resulting matrix.

If we flip the rows and columns of the weight-matrix, this operation corresponds to taking the convolution operation between this flipped matrix and the inputs.

Why do we want to do that? Efficiency. The convolution operation is something which is heavily used in signal processing, and so we can easily take advantage of previous techniques for computing this product.

This is really why we use the convolution operation, and why it's called, well, a convolutional network.
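A quick numpy illustration of this point in 1-D (the values are toy data, nothing canonical): the sliding multiply-and-sum a CNN layer computes is cross-correlation, and it equals the convolution of the input with the flipped kernel.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # toy input
w = np.array([0.5, 1.0, -1.0])            # toy weight vector ("kernel")

# What the layer computes: slide w over x, multiply elementwise, sum.
corr = np.array([np.dot(w, x[i:i + 3]) for i in range(3)])

# np.convolve flips its kernel internally, so convolving with the
# flipped w recovers exactly the same sliding dot-products.
conv = np.convolve(x, w[::-1], mode="valid")

assert np.allclose(corr, conv)   # identical results
```

The 2-D case works the same way, with both rows and columns of the weight-matrix flipped.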

Pooling / subsampling hidden units

  • Performed in non-overlapping neighborhoods (subsampling)
  • Aggregate results from this neighborhood

Maximum pooling

Take the maximum value found in this neighborhood

Average pooling

Compute the average of the neighborhood.


  • Generalization of average pooling
  • Average pooling with learnable weights for each filter map
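Both pooling operations can be sketched in a few lines of numpy (the function and variable names are my own):

```python
import numpy as np

def pool(x, size=2, mode="max"):
    """Aggregate non-overlapping size x size neighborhoods of a 2-D feature map."""
    H, W = x.shape
    blocks = x[:H - H % size, :W - W % size].reshape(
        H // size, size, W // size, size)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

fmap = np.array([[1.0, 2.0, 5.0, 6.0],
                 [3.0, 4.0, 7.0, 8.0],
                 [0.0, 0.0, 1.0, 1.0],
                 [0.0, 4.0, 1.0, 1.0]])

pool(fmap, mode="max")   # → [[4., 8.], [4., 1.]]
pool(fmap, mode="avg")   # → [[2.5, 6.5], [1., 1.]]
```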


For some loss-function neural_networks_51d63cb7c481df34149becc8c883f9f9a2d2136c.png we can use back-propagation to compute the gradient of the loss for a prediction.

Here we are only working on a single input at a time. Generalizing to multiple inputs would simply mean also summing over all neural_networks_97eb714dfbd8abb06c6ee1fb2cb049cdaa7defd1.png, yeah?

For a convolutional layer we have the following:


describes the change in the loss wrt. the input channel, and


Clearer deduction

One might instead consider the explicit sums rather than the convolution operation.

Consider the equations for forward propagation:



  • neural_networks_8e05ceda238d03ad4f30caf2fd2a62d6c948af87.png is the pre-activations or pre-non-linearities used by the neural_networks_387b1cc7b59ca9e041fb256f7bff8dec9b52a466.png layer, which is a convolutional layer.
  • neural_networks_bbbd0508b0d1096c96fe25e00bb5ac1fa455fb08.png is an entry in the weight-matrix for the corresponding feature-map or filter
  • neural_networks_5ff47658a715ebcc3a57bdcaae37bbd0b5d18a12.png is the activation or non-linearity from the previous layer, which can be any type of layer (pooling, convolutional, etc.)

Then the activation of the neural_networks_494f69b962b78c1246de9860610fa54782c36233.png feature-map / filter in the neural_networks_387b1cc7b59ca9e041fb256f7bff8dec9b52a466.png layer (a convolutional layer) is:


Could also have some learning rate neural_networks_581f6d00fec281764db7cd99508f435d86844f6d.png multiplied by neural_networks_1f36e68578b771d42cf836dbdf0a6923a809fd5c.png.

Now, for backward propagation we have:


for each entry in the weight matrix for each feature-map.

Note the following:

  • This double sum corresponds to accumulating the loss for the weight
  • Sum over all neural_networks_8e05ceda238d03ad4f30caf2fd2a62d6c948af87.png expressions in which neural_networks_bbbd0508b0d1096c96fe25e00bb5ac1fa455fb08.png occurs (corresponds to weight-sharing)

And since the above expression depends on neural_networks_8e05ceda238d03ad4f30caf2fd2a62d6c948af87.png we need to compute that!


There you go! And we already know the error on the neural_networks_552d8b44ac2608742a5bf60aabbe6f862b0acddc.png layer, so we're good!

And when doing back-propagation we need to describe the loss for some layer neural_networks_51d63cb7c481df34149becc8c883f9f9a2d2136c.png wrt. the next layer, neural_networks_0fc728e835b61daa077100f24d2da3e03533b13a.png:


where we note that:

  • neural_networks_bbbd0508b0d1096c96fe25e00bb5ac1fa455fb08.png came from the definition for the forward-propagation
  • expression looks slightly like it could be expressed using convolution, but instead of having neural_networks_f7486784a1c3e6e32c286c394239a43ed3b65f0b.png we have neural_networks_95558ae822d35a92f08f253f4e95c6a58702a043.png.
  • expression only makes sense for points that are at least neural_networks_fe52c58a96549528260fe0d7d04caf390dd47865.png away from the top and left edges (because neural_networks_2abf8bf661f3786b6c9af27b79a224eb1ebde7fd.png and neural_networks_b752c96945468b8e5e9d19b3ec0c0cfca1a419bb.png mate)

We solve these problems by:

  • pad the top and left edges with zeros
  • then flip axes of neural_networks_fe480f547555026d210b55b5d4ef758235f32832.png

and then we can express this using the convolution operation! (which I'm not showing, because I couldn't figure out how to do it. I was tired, mkay?!)
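The step the text above stops short of can still be checked numerically in 1-D: the gradient wrt. the input, accumulated from the explicit sums, matches the "pad with zeros, then flip the kernel" recipe (toy values and variable names of my own choosing):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # input to the conv layer
w = np.array([0.5, -1.0, 2.0])            # kernel, length k = 3
delta = np.array([0.1, -0.2, 0.3])        # dL/dz at the 3 valid positions

# Direct accumulation from the explicit sums: z[i] depends on x[i + a],
# so delta[i] contributes delta[i] * w[a] to dL/dx[i + a].
k = len(w)
dx_direct = np.zeros_like(x)
for i in range(len(delta)):
    for a in range(k):
        dx_direct[i + a] += delta[i] * w[a]

# Same thing as "pad delta with k-1 zeros on both ends, correlate with flipped w":
padded = np.pad(delta, (k - 1, k - 1))
dx_conv = np.array([np.dot(padded[j:j + k], w[::-1]) for j in range(len(x))])

assert np.allclose(dx_direct, dx_conv)
```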

Q & A

DONE Why is the factor of neural_networks_485943dd128848be56b354aa3e60a8203e5c4d60.png in the derivative of the loss-function wrt. neural_networks_e9f4d218474bc10aa94958ec30139aee865c0173.png for a convolutional layer not with axes swapped?

Have a look at the derivation here. (based on this blog post) Basically, it's easier to see what's going on if you consider the actual sums, instead of looking at the kernel operation, in my opinion.

DONE View on filters / feature-maps and weight- or parameter-sharing

First we ignore the entire concept of feature-maps / filters.

You can view the weight- or parameter-sharing in two ways:

  1. We have one neuron / hidden unit for each window, i.e. every time you move the window you are using a new neuron / hidden unit to view the pixels / data-points inside the window. Then you think about all of this aaand:
    • There is in fact nothing special about a neuron / hidden unit, but rather the weights it uses for its computation (assuming these neurons have the same activation function).
    • If we then make all these different neurons use the same weights, voilà! We have our weight-sharing!
  2. We have one neuron / hidden unit with its weight-matrix for its receptive field / window. As we slide over, we simply move the neuron and its connections with us.

In the 1st "view", the feature-map / filter corresponds to all these separate neurons / hidden-units which use the same weight-matrix, and having multiple feature-maps / filters corresponds to having multiple such sets of neurons / hidden units with their corresponding weight-matrix.

In the 2nd "view", the feature-map / filter is just a specific weight-matrix, and having multiple independent weight-matrices corresponds to having multiple feature-maps / filters.

Generative Adversarial Networks (GAN)


  • neural_networks_2de98136973021abb46a5a3fc1e4318bafb84264.png - generative model that captures the data distribution, a mapping to the input space
  • neural_networks_b689cba8d7566f6adaf605a844e193a27e155078.png - discriminative model that estimates the probability that a sample came from the data rather than neural_networks_2de98136973021abb46a5a3fc1e4318bafb84264.png
  • neural_networks_450fc1fe984ee475607981209d8cf151d5d9e10f.png - single value representing the probability that neural_networks_ed39d9a397196f8f0ce6388b0ea4e0c1dd8becee.png came from the data (i.e. is "real") rather than generated by neural_networks_2de98136973021abb46a5a3fc1e4318bafb84264.png
  • neural_networks_5f0220656e701a2edb3bacf602133f44cd5d5e6a.png - parameter for neural_networks_2de98136973021abb46a5a3fc1e4318bafb84264.png
  • neural_networks_6f5be2bb6f78420f7efe2dbf0e2b8efcd1f5c963.png - parameter for neural_networks_b689cba8d7566f6adaf605a844e193a27e155078.png
  • neural_networks_50861d9ecb45ec87d68fb67db39f0cc471069974.png - distribution estimated by the generator neural_networks_2de98136973021abb46a5a3fc1e4318bafb84264.png
  • neural_networks_dca3bf13a6b4356487afb58e4693ebd770bb078d.png or neural_networks_e8b5eece4e04e00714d1e86bce39bfe5a225f3b8.png - distribution over the real data
  • neural_networks_589da73adc6883d5ed2825ba176aae4fb8b9d2a5.png - distribution from where we sample inputs to the generative model neural_networks_2de98136973021abb46a5a3fc1e4318bafb84264.png, i.e. the output-sample from neural_networks_2de98136973021abb46a5a3fc1e4318bafb84264.png is neural_networks_2f2d54cb911afc0c838fef8bd75136df3d8705da.png, i.e. the distribution over the noise used by the generator


  • Goal is to train neural_networks_2de98136973021abb46a5a3fc1e4318bafb84264.png to be so good at generating samples that neural_networks_b689cba8d7566f6adaf605a844e193a27e155078.png really can't tell whether or not the input neural_networks_ed39d9a397196f8f0ce6388b0ea4e0c1dd8becee.png came from neural_networks_2de98136973021abb46a5a3fc1e4318bafb84264.png or is "real"
    • Example: inputs are pictures of dogs → neural_networks_2de98136973021abb46a5a3fc1e4318bafb84264.png learns to generate pictures of dogs so well that neural_networks_b689cba8d7566f6adaf605a844e193a27e155078.png can't tell if it's actually a "real" picture of a dog or one generated by neural_networks_2de98136973021abb46a5a3fc1e4318bafb84264.png
  • In the space of arbitrary functions neural_networks_2de98136973021abb46a5a3fc1e4318bafb84264.png and neural_networks_b689cba8d7566f6adaf605a844e193a27e155078.png, a unique solution exists, with neural_networks_2de98136973021abb46a5a3fc1e4318bafb84264.png recovering the training data distribution (neural_networks_c06cf2d33c4e84aab81ba00c8af708ba1eb2b0f1.png)


Kullback-Leibler Divergence

In other words, neural_networks_b689cba8d7566f6adaf605a844e193a27e155078.png and neural_networks_2de98136973021abb46a5a3fc1e4318bafb84264.png play the following two-player minimax game with value function neural_networks_9505c69167dc98583bfd62e28008cb50c8676feb.png:


Remember that this actually is optimizing over neural_networks_5f0220656e701a2edb3bacf602133f44cd5d5e6a.png and neural_networks_6f5be2bb6f78420f7efe2dbf0e2b8efcd1f5c963.png, the parameters of the models.

Jensen-Shannon Divergence

I wasn't aware of this when I first wrote the notes on GANs, hence there might be some changes which need to be made in the rest of the document to accommodate it (especially in the algorithm section, as it uses the derivative of the KL-divergence).

There is a "problem" with the KL-divergence; it's asymmetric. That is, if neural_networks_e89168996a065100b69f75f3fc121549ab9f209d.png is close to zero, but neural_networks_5bb7b4fad988877f2f7f7683c9c9bcc657905305.png is significantly non-zero, the effect of neural_networks_ab437e1f9b3376761b155efe111c9860607c4b86.png is disregarded.

Jensen-Shannon divergence is another measure of similarity between two distributions, which has the following properties:

  • bounded by [0, 1]
  • symmetric
  • smooth(er than the KL-divergence)
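Both properties are easy to see numerically for discrete distributions; log base 2 is used below so the JS divergence lands in [0, 1] (helper names are my own):

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) in bits, for discrete distributions with q > 0 where p > 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def js(p, q):
    """Jensen-Shannon divergence: symmetrized KL against the mixture m."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.8, 0.2])
q = np.array([0.5, 0.5])

kl(p, q), kl(q, p)    # asymmetric: the two values differ
js(p, q) == js(q, p)  # symmetric: always equal
```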



  • neural_networks_b689cba8d7566f6adaf605a844e193a27e155078.png and neural_networks_2de98136973021abb46a5a3fc1e4318bafb84264.png are playing a minimax game
  • Optimizing neural_networks_b689cba8d7566f6adaf605a844e193a27e155078.png in completion in the inner loop of training is computationally prohibitive and on finite datasets would lead to overfitting
  • Solution: alternate between neural_networks_094b02afce734f4ce51933d0093ef3d2da9f8123.png steps of optimizing neural_networks_b689cba8d7566f6adaf605a844e193a27e155078.png and one step optimizing neural_networks_2de98136973021abb46a5a3fc1e4318bafb84264.png
    • neural_networks_b689cba8d7566f6adaf605a844e193a27e155078.png is maintained near its optimal solution, as long as neural_networks_2de98136973021abb46a5a3fc1e4318bafb84264.png converges slowly enough

Optimal value for neural_networks_b689cba8d7566f6adaf605a844e193a27e155078.png, the discriminator

The loss function is given by


We're currently interested in maximizing neural_networks_e05af9258f5653518cd33b8916926770a0d9362d.png wrt. neural_networks_2eecadd1b897c65f752d99a7c25376bf8239441e.png, thus


where we've assumed it's alright to interchange the integration and derivative. This gives us


Setting equal to zero, we get


If we then assume that the generator is trained to optimality, then neural_networks_9b9040409be82dfd83f3cf1aba06fcc949ea1b4c.png, thus


is the optimal value wrt. neural_networks_b689cba8d7566f6adaf605a844e193a27e155078.png alone.
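A quick numeric sketch of this result (the densities below are made-up toy values):

```python
import numpy as np

# Toy densities of the real data and the generator at three sample points
p_data = np.array([0.5, 0.3, 0.2])
p_g    = np.array([0.2, 0.3, 0.5])

# Optimal discriminator: D*(x) = p_data(x) / (p_data(x) + p_g(x))
d_star = p_data / (p_data + p_g)   # → [0.714..., 0.5, 0.285...]

# If the generator is trained to optimality (p_g = p_data), D* = 1/2 everywhere
assert np.allclose(p_data / (p_data + p_data), 0.5)
```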

In this case, the loss is given by



Minibatch SGD training of GANs. The number of steps to apply to the discriminator in the inner loop, neural_networks_094b02afce734f4ce51933d0093ef3d2da9f8123.png, is a hyperparameter. The least expensive option is neural_networks_76ad4212542b94774b27bea1e8dceb373f0f448c.png.

for number of training iterations do

  • for neural_networks_094b02afce734f4ce51933d0093ef3d2da9f8123.png steps do
    • Sample minibatch of neural_networks_fe52c58a96549528260fe0d7d04caf390dd47865.png noise samples neural_networks_1dd762b4e90c7a91ed4cf7060568bec2cc9fe790.png from noise prior neural_networks_fba2f45e6ac74c479e094ca1b81bcf8dfee897c7.png
    • Sample minibatch of neural_networks_fe52c58a96549528260fe0d7d04caf390dd47865.png examples neural_networks_f0a186be07803510907cc9fe24fefc07d07d6255.png from data distribution neural_networks_9cdc7b784c2acb8c0ea2f71549d0f4be6868656c.png
    • Update neural_networks_b689cba8d7566f6adaf605a844e193a27e155078.png by ascending its stochastic gradient:


  • end for
  • Sample minibatch of neural_networks_fe52c58a96549528260fe0d7d04caf390dd47865.png noise samples neural_networks_1dd762b4e90c7a91ed4cf7060568bec2cc9fe790.png from noise prior neural_networks_fba2f45e6ac74c479e094ca1b81bcf8dfee897c7.png
  • Update the generator by descending its stochastic gradient:


end for

This demonstrates standard SGD, but we can replace the update steps with any stochastic gradient-based optimization method (SGD with momentum, etc.).
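A minimal runnable sketch of the alternating loop on a 1-D toy problem: real data from N(4, 1), an affine generator, and a logistic discriminator, with the gradients written out by hand. Every name, constant, and initialization here is illustrative only, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda u: 1.0 / (1.0 + np.exp(-u))

a, b = 1.0, 0.0          # generator g(z) = a*z + b
w, c = 0.0, 0.0          # discriminator D(x) = sigmoid(w*x + c)
lr, m, k = 0.05, 64, 1   # learning rate, minibatch size, D steps per G step

for _ in range(200):
    for _ in range(k):
        # Ascend D's stochastic gradient of log D(x) + log(1 - D(g(z)))
        x = rng.normal(4.0, 1.0, m)       # minibatch from p_data
        z = rng.normal(0.0, 1.0, m)       # minibatch from the noise prior
        g = a * z + b
        s_real, s_fake = sigmoid(w * x + c), sigmoid(w * g + c)
        w += lr * (np.mean((1 - s_real) * x) - np.mean(s_fake * g))
        c += lr * (np.mean(1 - s_real) - np.mean(s_fake))
    # Descend G's stochastic gradient of log(1 - D(g(z)))
    z = rng.normal(0.0, 1.0, m)
    g = a * z + b
    s_fake = sigmoid(w * g + c)
    a -= lr * np.mean(-s_fake * w * z)
    b -= lr * np.mean(-s_fake * w)
```

With enough iterations the generated samples drift toward the data mean; in practice the non-saturating generator objective (maximizing log D(g(z)) instead) is often used because it gives stronger gradients early in training.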



Nash equilibrium: hard to achieve

  • Updating procedure is normally executed by updating both models using their respective gradients jointly
  • If neural_networks_b689cba8d7566f6adaf605a844e193a27e155078.png and neural_networks_2de98136973021abb46a5a3fc1e4318bafb84264.png are updated in completely opposite directions, there's a real danger of getting oscillating behavior, and if you're unlucky, this might even diverge

Low dimensional supports

  • The dimensionality of many real-world datasets, as represented by neural_networks_e8b5eece4e04e00714d1e86bce39bfe5a225f3b8.png, is only artificially high
  • Most datasets concentrate in a lower-dimensional manifold
  • neural_networks_50861d9ecb45ec87d68fb67db39f0cc471069974.png also lies in a low-dimensional manifold, because the random number used by neural_networks_2de98136973021abb46a5a3fc1e4318bafb84264.png is (usually) low-dimensional
  • both neural_networks_50861d9ecb45ec87d68fb67db39f0cc471069974.png and neural_networks_e8b5eece4e04e00714d1e86bce39bfe5a225f3b8.png are low-dimensional manifolds => almost certainly disjoint
    • In fact, when they have disjoint supports (the sets on which the distributions are non-zero), we're always capable of finding a perfect discriminator that separates real and fake samples 100% correctly

Vanishing gradient

  • If the discriminator neural_networks_b689cba8d7566f6adaf605a844e193a27e155078.png is perfect => neural_networks_9dfb630910b0b3d2a61cb5f1673b9adaa08fe150.png goes to zero => there is no gradient for the loss

Thus, we have dilemma:

  • If neural_networks_b689cba8d7566f6adaf605a844e193a27e155078.png behaves badly, the generator does not have accurate feedback and the loss function cannot represent reality
  • If neural_networks_b689cba8d7566f6adaf605a844e193a27e155078.png does a great job, gradient of the loss function drops too rapidly, slowing down training

Mode collapse

  • neural_networks_2de98136973021abb46a5a3fc1e4318bafb84264.png may collapse to a setting where it always produces the same outputs


Activation functions


  • neural_networks_de12a97aeed74fd9dfc7765b2af33c8777b0bbb1.png is the pre-activation, with neural_networks_98045cefc3acb2408227e174793cfe5dcd0e4306.png being the jth component




Rectified Linear Unit (ReLU)



Pros & cons

  • No vanishing gradient
  • But we might have exploding gradients
  • Sparsity


  • Exploding gradients can be mitigated by "clipping" the gradients, i.e. setting an upper- and lower-limit for the value of the gradient
  • There are multiple variants of ReLU, where most of them include a small non-zero gradient when the unit is not active
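A few-line numpy sketch of ReLU, one such variant (leaky ReLU), and gradient clipping (the function names are my own):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    """Variant with a small non-zero slope alpha when the unit is inactive."""
    return np.where(z > 0, z, alpha * z)

def clip_gradient(grad, limit=5.0):
    """Mitigate exploding gradients by capping each component to [-limit, limit]."""
    return np.clip(grad, -limit, limit)

z = np.array([-2.0, 0.0, 3.0])
relu(z)                                       # → [0., 0., 3.]
leaky_relu(z)                                 # → [-0.02, 0., 3.]
clip_gradient(np.array([-10.0, 1.0, 7.0]))    # → [-5., 1., 5.]
```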


Softmax

Pros & cons

  • Provides a normalized probability-distribution over the activations
  • When viewed in a cross-entropy cost-model, the gradients of the loss-function are computationally cheap and numerically stable


  • Usually only used in the top (output) layer
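A minimal sketch of a numerically stable softmax; subtracting max(z) before exponentiating changes nothing mathematically but avoids overflow:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # shift by max(z) for numerical stability
    return e / np.sum(e)

p = softmax(np.array([1.0, 2.0, 3.0]))
assert np.isclose(np.sum(p), 1.0)                                # normalized distribution
assert np.all(np.isfinite(softmax(np.array([1000.0, 1000.0]))))  # no overflow
```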

Exponential Linear Unit (ELU)



where neural_networks_17d13b37eb97f5258609b9c4f50c8e4a147f4da2.png is a hyper-parameter.

Pros & cons

  • Attempts to make the mean activations closer to zero which speeds up learning
  • All the pros of ReLU


  • Shown to have improved performance compared to standard ReLU

Scaled Exponential Linear Unit (SELU) [NEW]



Pros & cons

  • Allows us to construct a Self-normalizing Neural Network (SNN), which attempts to make the mean activations closer to zero and the variance of the activations close to 1. This is supposed to (and experiments show it does) greatly increase the stability and efficiency of training.
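A numpy sketch of ELU and its scaled variant; the α and λ constants below are the fixed values reported in the SELU paper (quoted from memory, so worth double-checking against the paper itself):

```python
import numpy as np

def elu(z, alpha=1.0):
    """Identity for z > 0, alpha * (exp(z) - 1) otherwise."""
    return np.where(z > 0, z, alpha * (np.exp(z) - 1.0))

# Fixed constants from the SELU paper (Klambauer et al.)
ALPHA, LAMBDA = 1.6732632423543772, 1.0507009873554805

def selu(z):
    return LAMBDA * elu(z, ALPHA)

z = np.array([-1.0, 0.0, 2.0])
elu(z)    # negative inputs saturate toward -alpha, pulling mean activations toward 0
selu(z)
```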


Loss functions



  • neural_networks_97eb714dfbd8abb06c6ee1fb2cb049cdaa7defd1.png - ith training sample
  • neural_networks_23796db635a31d9c3fe202779289c1ad33e299ad.png -


  • Corresponds to cross-entropy loss


Derivative of the cross-entropy loss

Cross-entropy loss function:


With neural_networks_0e2cefee83c1cebb77e23c50a71cf6762f5aeaf5.png being a softmax: neural_networks_678ceda8d86758c2148c80e27e16ff047dae6678.png

And thus,




And a bit more compactly,
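That compact form is the well-known result dL/dz_j = softmax(z)_j - y_j for a one-hot target y; a finite-difference check confirms it (the variable names here are my own):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def cross_entropy(z, y):
    """L = -sum_j y_j log softmax(z)_j"""
    return -float(np.sum(y * np.log(softmax(z))))

z = np.array([0.5, -1.0, 2.0])
y = np.array([0.0, 0.0, 1.0])           # one-hot target

analytic = softmax(z) - y               # the compact gradient

eps = 1e-6
numeric = np.array([
    (cross_entropy(z + eps * np.eye(3)[j], y)
     - cross_entropy(z - eps * np.eye(3)[j], y)) / (2 * eps)
    for j in range(3)])

assert np.allclose(analytic, numeric, atol=1e-5)
```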