Notes on: Rahimi, A., & Recht, B. (2008): Uniform approximation of functions with random bases

Notation

  • $\alpha_k$ and $\alpha(\theta)$ refer to the discrete and continuous coefficients for the expansions
  • $\|f\|_p$ represents the "$p$-norm", which (for $f(x) = \int_\Theta \alpha(\theta)\, \phi(x; \theta)\, d\theta$) is defined as

    $$\|f\|_p = \sup_{\theta \in \Theta} \frac{|\alpha(\theta)|}{p(\theta)}$$

  • $\hat{R}[f]$ refers to the empirical risk of the function $f$ on some dataset, which is given by

    $$\hat{R}[f] = \frac{1}{N} \sum_{i=1}^{N} c\big( f(x_i),\, y_i \big)$$

    for some loss function $c$

  • $f^*$ is used to refer to the function (usually in the space of continuous functions) which is the risk minimizer, i.e. given the data at hand, $f^*$ is the function which minimizes the empirical risk
  • $\mathcal{F}_p$ denotes the set of all functions with finite "$p$-norm"

Approximation with Random Features

  • $\phi(x; \theta)$ is a family of functions on a compact set $\mathcal{X}$, parametrized over the set $\Theta$
  • $f(x) = \int_{\Theta} \alpha(\theta)\, \phi(x; \theta)\, d\theta \;\approx\; \sum_{k=1}^{K} \alpha_k\, \phi(x; \theta_k)$

We also assume that $|\phi(x; \theta)| \le 1$ for all $x$ and $\theta$.

Then there are a couple of theorems stating upper bounds on the pointwise convergence of the approximating functions.

Examples of random bases

Remember, we're just creating features here! In the paper, they end up performing linear optimization on the random features to produce a linear decision boundary, which is then used for the actual classification. Basically, a linear SVM with the randomized features as the input vectors.
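A minimal sketch of that pipeline, assuming random cosine features (of the kind developed in the next section) and plain gradient descent on squared loss standing in for the paper's linear SVM; the target $\sin(3x)$ and all constants are illustrative choices of mine, not from the paper:

```python
import math, random

random.seed(0)

# Random cosine features: z(x)_k = sqrt(2/K) * cos(omega_k * x + b_k).
# The nonlinearity parameters (omega_k, b_k) are sampled once and frozen;
# only the linear weights on top are ever optimized.
K = 100
omegas = [random.gauss(0.0, 3.0) for _ in range(K)]
phases = [random.uniform(0.0, 2.0 * math.pi) for _ in range(K)]

def features(x):
    return [math.sqrt(2.0 / K) * math.cos(w * x + b)
            for w, b in zip(omegas, phases)]

xs = [i / 25.0 - 2.0 for i in range(100)]   # grid on [-2, 2]
ys = [math.sin(3.0 * x) for x in xs]        # nonlinear target
zs = [features(x) for x in xs]

weights = [0.0] * K

def mse():
    return sum((sum(wi * zi for wi, zi in zip(weights, z)) - y) ** 2
               for z, y in zip(zs, ys)) / len(xs)

mse_before = mse()
for _ in range(300):                        # gradient descent on the linear weights only
    grads = [0.0] * K
    for z, y in zip(zs, ys):
        err = sum(wi * zi for wi, zi in zip(weights, z)) - y
        for k in range(K):
            grads[k] += 2.0 * err * z[k] / len(xs)
    weights = [wi - 0.5 * g for wi, g in zip(weights, grads)]
mse_after = mse()
```

A plain linear model on the raw $x$ could never fit $\sin(3x)$; on the random features, the purely linear fit drives the error way down.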

Fourier Random Features

Let $k(x, y) = k(x - y)$ be a shift-invariant, positive definite kernel function. Then,

$$k(x - y) = \int_{\mathbb{R}^d} p(\omega)\, e^{i \omega \cdot (x - y)}\, d\omega$$

So, let's walk through this properly:

  • First we write the kernel as an inverse Fourier transform, where $p(\omega)$ is the Fourier transform of $k$
    • There's a theorem (Bochner's theorem) which says that the Fourier transform of a positive definite function is non-negative, and thus we can treat this $p(\omega)$ as a pseudo-probability distribution
    • A pseudo-probability distribution because I doubt it integrates to one, but to sample from it we only need to know the "relative probability" of each $\omega$, such as when using MCMC algorithms
  • We then approximate the kernel by sampling bases from the "frequency domain"
    • Observe that by randomly sampling $\omega$ from $p(\omega)$ as a distribution, we "empirically" approximate the coefficient of each basis function $e^{i \omega \cdot x}$
  • Then finally we rewrite the sum as a dot-product between two vectors
  1. Take vectors $x$ and $y$
  2. Project $x$ onto a random direction $\omega$ by taking the dot product with this random vector
  3. Take the complex exponential of it, "wrapping it around the unit circle in the complex plane": $e^{i \omega \cdot x}$
  4. Repeat steps 2 and 3 for $y$
  5. Repeat steps 2-4 $K$ times
  6. Take the dot product between $z(x)$ and $z(y)$ (where the dot product of two complex vectors $a$ and $b$ is defined as $\langle a, b \rangle = \sum_k a_k \overline{b_k}$)

Here $x$ and $\omega$ are both vectors, thus $\omega \cdot x$ represents the dot product between the vectors $\omega$ and $x$.
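The steps above can be sketched directly; a minimal example assuming the Gaussian kernel $k(x - y) = e^{-\|x - y\|^2 / 2}$, whose Fourier transform $p(\omega)$ is a standard Gaussian (so "sampling from the frequency domain" is just drawing $\omega \sim \mathcal{N}(0, I)$). The particular points and $K$ are illustrative:

```python
import cmath, math, random

random.seed(0)

d, K = 3, 20000
# Step 5: draw K random directions omega ~ p(omega) = N(0, I)
omegas = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(K)]

def z(x):
    # Steps 2-3: project x onto each omega_k, wrap around the unit circle,
    # and scale by 1/sqrt(K) so the final dot product averages over samples
    return [cmath.exp(1j * sum(wi * xi for wi, xi in zip(w, x))) / math.sqrt(K)
            for w in omegas]

x = [0.3, -0.5, 0.8]
y = [0.1, 0.2, -0.4]

# Step 6: complex dot product <z(x), z(y)> = sum_k z(x)_k * conj(z(y)_k)
approx = sum(a * b.conjugate() for a, b in zip(z(x), z(y)))
exact = math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / 2)
```

With $K$ this large, `approx` sits very close to `exact`, and its imaginary part is close to zero (in expectation it vanishes, since $p(\omega)$ is symmetric).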


For a given $x$ and $y$,

$$\Pr\Big[\, \big| \langle z(x), z(y) \rangle - k(x, y) \big| \ge \varepsilon \,\Big] \le 2 \exp\left( -\frac{K \varepsilon^2}{4} \right)$$

where $\varepsilon$ is just some arbitrary constant in the range $(0, 1)$.

This is due to Hoeffding's inequality: each term of the empirical average is bounded, so the average concentrates around its expectation exponentially fast in $K$.

I'm not 100% sure that the use of $\varepsilon$ is correct here, due to the horrible quality of the lecture I'm watching, but I REALLY believe it's just a constant which we can set arbitrarily.
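As a quick sanity check on the shape of that claim, here is a hypothetical Monte Carlo experiment (all constants are mine, not from the paper): over many independent draws of the $K$ frequencies, count how often the empirical kernel estimate misses $k(x - y)$ by at least $\varepsilon$, and compare against the claimed bound.

```python
import math, random

random.seed(2)

K, trials, eps = 100, 200, 0.5
t = 0.7                                  # work in 1-D, so k(t) = exp(-t^2 / 2)
exact = math.exp(-t ** 2 / 2)

misses = 0
for _ in range(trials):
    # Re<z(x), z(y)> reduces to (1/K) sum_k cos(omega_k * (x - y)),
    # an unbiased estimate of the Gaussian kernel value
    est = sum(math.cos(random.gauss(0.0, 1.0) * t) for _ in range(K)) / K
    misses += abs(est - exact) >= eps

empirical = misses / trials
bound = 2 * math.exp(-K * eps ** 2 / 4)  # the claimed Hoeffding-style bound
```

The empirical failure rate stays below the bound (here it is essentially zero, since $\varepsilon = 0.5$ is many standard deviations of the estimate).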

Similar to Chebyshev's inequality.


For a finite set of $T$ sample pairs $(x, y)$,

$$\Pr\left[ \max_{(x, y)} \big| \langle z(x), z(y) \rangle - k(x, y) \big| \ge \varepsilon \right] \le 2 T \exp\left( -\frac{K \varepsilon^2}{4} \right)$$

which comes from the "union bound" and thm:fourier-random-feature-lower-bound-probability.

I'm not 100% sure whether that's supposed to be the infinity-norm / max-norm, but I'm PRETTY sure it is!

$$\Pr\left[ \sup_{x, y \in \mathcal{X}} \big| \langle z(x), z(y) \rangle - k(x, y) \big| \ge \varepsilon \right] \le 2^8 \left( \frac{\sigma_p \operatorname{diam}(\mathcal{X})}{\varepsilon} \right)^2 \exp\left( -\frac{K \varepsilon^2}{4(d + 2)} \right)$$

  • $\sigma_p^2 = \mathbb{E}_p\big[\|\omega\|^2\big]$ is the second moment of the Fourier transform of $k$
  • $d$ is the dimension of the space
  • $\operatorname{diam}(\mathcal{X})$ is the diameter of the compact set $\mathcal{X}$ (it reads like "diag", but it is "diam" for diameter, not a diagonal matrix)

This last theorem is basically saying that if the dimensionality of your random feature vector is large enough (i.e. you sample from the Fourier transform enough, referring to $K$ in the equation), all pairs of points in your space will approximate the kernel function $k$ arbitrarily well. Thus, we have uniform convergence for large $K$ (with high probability).

All of these results are theorems where we need to keep the implicit "with high probability" qualifier in mind, but it's still awesomesauce stuff we're looking at here, guys!

Random Binning Features

This is also a pretty dope way of creating random features, which provides the same guarantees as the Fourier Random Features.

This method is based on randomly sampling shifted grids over the input space, where we construct the random vectors as one-hot encodings of the bins where $x$ and $y$ "land".

We do this by sampling the grid pitch $\delta$ and shift $u$ from the following

$$k(x, y) = \int_0^\infty k_{\text{hat}}\big(|x - y|;\, \delta\big)\; p(\delta)\; d\delta, \qquad p(\delta) = \delta\, \ddot{k}(\delta), \qquad u \sim \text{Uniform}[0, \delta]$$

where $k_{\text{hat}}(t; \delta) = \max\big(0,\, 1 - \tfrac{t}{\delta}\big)$ is a delta-like "hat" function.

Check out the paper or the lecture at 19:40 for a better explanation, as it is so much simpler to grasp with some graphics included.
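Still, a minimal 1-D sketch is possible without graphics. Assume the Laplacian kernel $k(x, y) = e^{-|x - y|}$, for which $p(\delta) = \delta\, \ddot{k}(\delta) = \delta e^{-\delta}$ is a Gamma(2, 1) distribution (a sum of two exponentials). Instead of materializing the one-hot vectors, we use the fact that their dot product is 1 exactly when $x$ and $y$ land in the same bin, and average over grids:

```python
import math, random

random.seed(1)

P = 20000                                # number of random grids
x, y = 0.3, 0.8

matches = 0
for _ in range(P):
    # Pitch delta ~ Gamma(2, 1): the mixture weight that recovers the
    # Laplacian kernel from "hat" kernels; sampled as a sum of two
    # unit-rate exponentials.
    delta = random.expovariate(1.0) + random.expovariate(1.0)
    u = random.uniform(0.0, delta)       # random shift of the grid
    # One-hot dot product == 1 iff x and y fall into the same bin
    matches += math.floor((x - u) / delta) == math.floor((y - u) / delta)

approx = matches / P                     # fraction of grids where they collide
exact = math.exp(-abs(x - y))            # Laplacian kernel value
```

The probability of landing in the same bin, averaged over pitches and shifts, reproduces the kernel value, so `approx` converges to `exact` as $P$ grows.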

Difference Between Fourier and Binning

  • Fourier Random Features works well for smooth surfaces
  • Random Binning Features works well for "nearest-neighbors"-surfaces (i.e. surfaces where we would believe a "nearest-neighbors"-approach would work well)
  • In fact, we can combine the random features extracted from both methods and use these as new feature-representations by simple concatenation! This gives us a nice combination, which might be attractive for certain problems


Usually people go like this:

$$\min_{w,\, \theta}\; \frac{1}{N} \sum_{i=1}^{N} c\!\left( \sum_{k=1}^{K} w_k\, \phi(x_i; \theta_k),\; y_i \right)$$

i.e. jointly optimize both the linear weights $w$ and the nonlinearity parameters $\theta$, but the Random Kitchen Sink approach is simply to do:

$$\theta_1, \ldots, \theta_K \sim p(\theta), \qquad \min_{w}\; \frac{1}{N} \sum_{i=1}^{N} c\!\left( \sum_{k=1}^{K} w_k\, \phi(x_i; \theta_k),\; y_i \right)$$

i.e. sample the nonlinearity parameters randomly once, then only optimize the linear weights.