# Notes on: Rahimi, A., & Recht, B. (2008): Uniform approximation of functions with random bases

## Notation

• $\alpha_k$ and $\alpha(\omega)$ refer to the discrete and continuous coefficients for the expansions, respectively
• $\|\cdot\|_p$ represents the "p-norm", which is defined as $\|f\|_p = \big( \int |f(x)|^p \, dx \big)^{1/p}$
• $\hat{R}[f]$ refers to the empirical risk of the function $f$ on some dataset $\{(x_i, y_i)\}_{i=1}^m$, which is given by $\hat{R}[f] = \frac{1}{m} \sum_{i=1}^m c\big(f(x_i), y_i\big)$ for some loss function $c$
• $f^*$ is used to refer to the function (usually in the space of continuous functions) which is the risk-minimizer, i.e. given the data at hand, $f^*$ is the function which minimizes the empirical risk
• $L_p$ denotes the set of all functions with finite $p$-norm (for $p = 2$, the square-integrable functions)

## Approximation with Random Features

• $\phi(x; \omega)$ is a family of functions on a compact set $\mathcal{X} \subseteq \mathbb{R}^d$, parametrized over the set $\Omega$
• We also assume that $|\phi(x; \omega)| \le 1$ for all $x \in \mathcal{X}$ and $\omega \in \Omega$

Then there are a couple of theorems stating upper bounds on the pointwise convergence of the functions: roughly, any $f(x) = \int_\Omega \alpha(\omega) \phi(x; \omega) \, d\omega$ with suitably bounded coefficients can be approximated by a finite sum $\hat{f}(x) = \sum_{k=1}^{K} \alpha_k \phi(x; \omega_k)$ with randomly sampled $\omega_k$, with error shrinking like $O(1 / \sqrt{K})$.

## Examples of random bases

Remember, we're just creating features here! In the paper, they end up performing linear optimization on the random features to produce a linear decision boundary which is then used for the actual classification. Basically, a linear SVM with the randomized features as the input vectors.
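That pipeline fits in a few lines. A minimal sketch, assuming a Gaussian-kernel random feature map (the real-valued cosine variant) and plain regularized least squares standing in for the linear SVM; the toy two-rings dataset is my own invention for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary problem: two concentric rings (not linearly separable as-is).
n = 400
theta = rng.uniform(0, 2 * np.pi, n)
r = np.where(np.arange(n) < n // 2, 1.0, 2.0)
X = np.c_[r * np.cos(theta), r * np.sin(theta)] + 0.05 * rng.normal(size=(n, 2))
y = np.where(r == 1.0, -1.0, 1.0)

# Random Fourier features for a Gaussian kernel (real cosine variant).
D = 200                                   # number of random features
W = rng.normal(scale=2.0, size=(2, D))    # random frequencies omega_k
b = rng.uniform(0, 2 * np.pi, D)          # random phases

def features(X):
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

# Any linear method works on top; here, ridge regression on +/-1 labels
# stands in for the linear SVM used in the paper.
Z = features(X)
w = np.linalg.solve(Z.T @ Z + 1e-3 * np.eye(D), Z.T @ y)
acc = np.mean(np.sign(Z @ w) == y)
print(f"train accuracy: {acc:.2f}")
```

The point is that all the nonlinearity lives in the fixed random feature map; the learning step itself is a cheap convex linear fit.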

### Fourier Random Features

Let $k(x, y) = k(x - y)$ be a shift-invariant positive definite kernel function. Then, by Bochner's theorem,

$$k(x - y) = \int p(\omega) \, e^{i \omega^\top (x - y)} \, d\omega \approx \frac{1}{K} \sum_{k=1}^{K} e^{i \omega_k^\top x} \, e^{-i \omega_k^\top y}, \qquad \omega_k \sim p$$

So, let's walk through this properly:

• First we write the kernel as an inverse Fourier transform, where $p(\omega)$ is the Fourier transform of $k$
• There's a theorem (Bochner's) which says that the "Fourier transform of a positive definite function is also positive", and thus we can treat $p(\omega)$ as a pseudo-probability distribution
• Pseudo-probability distribution because I doubt it sums to one in general (though if $k(0) = 1$ it actually does), but to sample from this we only need to know the "relative probability" of each of the $\omega$, such as when using MCMC algorithms
• We then approximate the kernel by sampling bases $\omega_1, \dots, \omega_K$ from the "frequency domain", i.e. from $p(\omega)$
• Observe that by random sampling from $p(\omega)$ as a distribution, we will "empirically" approximate the coefficient of each basis $e^{i \omega^\top (x - y)}$
• Then finally we rewrite the sum as a dot-product between two vectors:
  1. Take vectors $x$ and $y$
  2. Project $x$ onto a random direction $\omega_k$ by taking the dot-product $\omega_k^\top x$ with this random vector
  3. Take the exponential of it, $e^{i \omega_k^\top x}$, "wrapping it around the unit-circle in the complex plane"
  4. Repeat steps 2 and 3 for $y$
  5. Repeat steps 2-4 $K$ times, giving $z(x) = \frac{1}{\sqrt{K}} \big[ e^{i \omega_1^\top x}, \dots, e^{i \omega_K^\top x} \big]$ and similarly $z(y)$
  6. Take the dot-product between $z(x)$ and $z(y)$ (where the dot-product of two complex vectors is defined as $\langle a, b \rangle = \sum_k a_k \overline{b_k}$)

Here both $z(x)$ and $z(y)$ are vectors in $\mathbb{C}^K$, thus $\langle z(x), z(y) \rangle$ represents the dot-product between the vectors $z(x)$ and $z(y)$.
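As a sanity check of the walkthrough above, here is a minimal numpy sketch using the complex features $z(x)_k = e^{i \omega_k^\top x} / \sqrt{K}$ and the Gaussian kernel (whose Fourier transform is a standard normal density, so sampling is easy); the seed and sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
d, K = 3, 5000

# Gaussian kernel k(x - y) = exp(-||x - y||^2 / 2): its Fourier transform
# is a standard normal density, so sample omega_k ~ N(0, I).
omegas = rng.normal(size=(K, d))

def z(x):
    # One complex feature per sampled frequency; 1/sqrt(K) makes the
    # dot-product an average over the K samples.
    return np.exp(1j * omegas @ x) / np.sqrt(K)

x, y = rng.normal(size=d), rng.normal(size=d)
approx = np.vdot(z(y), z(x))        # <z(x), z(y)> = sum_k z(x)_k * conj(z(y)_k)
exact = np.exp(-np.dot(x - y, x - y) / 2)
print(abs(approx - exact))          # small, shrinking like 1/sqrt(K)
```

Note that `np.vdot` conjugates its first argument, which is exactly the complex dot-product convention defined above.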

#### Guarantees

For a given $x$ and $y$,

$$\Pr\Big[ \big| \langle z(x), z(y) \rangle - k(x, y) \big| \ge \epsilon \Big] \le 2 \exp\!\left( - \frac{K \epsilon^2}{4} \right)$$

where $\epsilon$ is just some arbitrary constant in the range $(0, 1)$.

This is due to Hoeffding's inequality.

I'm not 100% sure that the use of $\epsilon$ is correct here, due to the lecture that I'm looking at having horrible quality, but I REALLY believe it's just a constant which we can arbitrarily set.

Similar to Chebyshev's inequality. The uniform bound below comes from the "union bound" (taken over an $\epsilon$-net covering the compact set) combined with thm:fourier-random-feature-lower-bound-probability.

I'm not 100% sure if that's supposed to be the infinity-norm / max-norm (i.e. the supremum over all pairs of points), but I'm PRETTY sure!

$$\Pr\Big[ \sup_{x, y \in \mathcal{M}} \big| \langle z(x), z(y) \rangle - k(x, y) \big| \ge \epsilon \Big] \le 2^8 \left( \frac{\sigma_p \, \mathrm{diam}(\mathcal{M})}{\epsilon} \right)^2 \exp\!\left( - \frac{K \epsilon^2}{4 (d + 2)} \right)$$

where:

• $\sigma_p^2 = \mathbb{E}_p[\omega^\top \omega]$ is the second moment of the Fourier transform of $k$
• $d$ is the dimension of the space
• $\mathrm{diam}(\mathcal{M})$ is the diameter of the compact set $\mathcal{M}$ (pretty sure it's the diameter, not a diagonal matrix)

This last theorem is basically saying that if the dimensionality of your random feature-vector is large enough (i.e. you sample from the Fourier transform enough, referring to $K$ in the equation), all the points in your space will approximate the kernel function arbitrarily well. Thus, we have uniform convergence for large $K$ (with high probability).
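To see this uniform convergence empirically, here's a small sketch (my own, not from the paper) that measures the worst-case approximation error over a grid of 1-D point pairs, for a small and a large number of features:

```python
import numpy as np

rng = np.random.default_rng(2)

# Exact Gaussian kernel values over all pairs from a 1-D grid.
xs = np.linspace(-2, 2, 50)
diffs = xs[:, None] - xs[None, :]
exact = np.exp(-diffs**2 / 2)

def max_err(K):
    # Real cosine features: E[2 cos(wx+b) cos(wy+b)] = k(x - y) for b ~ U[0, 2pi).
    w = rng.normal(size=K)
    b = rng.uniform(0, 2 * np.pi, K)
    Z = np.sqrt(2.0 / K) * np.cos(np.outer(xs, w) + b)
    return np.max(np.abs(Z @ Z.T - exact))   # sup over all pairs on the grid

errs = [max_err(K) for K in (100, 10000)]
print(errs)   # the max error with K = 10000 should be much smaller
```

This is only a grid approximation of the true supremum, but it shows the $K$-dependence the theorem promises.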

All of these results are theorems where we need to keep the "with high probability" qualifier implicit, but it's still awesomesauce stuff we're looking at here guys!

### Random Binning Features

This is also a pretty dope way of creating random features, which provides the same guarantees as the Fourier Random Features.

This method is based on randomly sampling grids (a pitch $\delta$ and a shift $u$), where we construct the random vectors as one-hot encodings of the cells where $x$ and $y$ "land".

We do this by sampling from the following decomposition of the kernel,

$$k(x, y) = \mathbb{E}_{\delta \sim p}\big[ \hat{k}(x, y; \delta) \big], \qquad p(\delta) \propto \delta \, k''(\delta)$$

where $\hat{k}(x, y; \delta) = \max\big(0, 1 - \frac{|x - y|}{\delta}\big)$ is a delta-like ("hat") function: it equals the probability that $x$ and $y$ land in the same bin of a uniformly shifted grid with pitch $\delta$.

Check out the paper or the lecture at 19:40 for a better explanation, as this is sooo much simpler to grasp with some graphics included.
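Even without the graphics, the mechanics fit in a few lines. A sketch for the 1-D Laplacian kernel $k(x, y) = e^{-|x - y|}$, for which the pitch density $\delta e^{-\delta}$ is a Gamma(2, 1) distribution (the kernel choice and sizes are mine, for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

# Random binning for k(x, y) = exp(-|x - y|): pitch delta has density
# delta * k''(delta) = delta * exp(-delta), i.e. Gamma(shape=2, scale=1),
# and each grid gets a uniform shift in [0, delta).
P = 20000
deltas = rng.gamma(shape=2.0, scale=1.0, size=P)
shifts = rng.uniform(0, deltas)

def bins(x):
    # Index of the cell x falls into, for each of the P random grids.
    return np.floor((x - shifts) / deltas)

x, y = 0.3, 1.1
# z(x) one-hot encodes the bin per grid, so the (normalized) dot-product
# z(x) . z(y) is just the fraction of grids where x and y collide.
approx = np.mean(bins(x) == bins(y))
exact = np.exp(-abs(x - y))
print(approx, exact)    # close for large P
```

The one-hot vectors are never materialized here; the collision fraction is their dot-product.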

### Difference Between Fourier and Binning

• Fourier Random Features works well for smooth surfaces
• Random Binning Features works well for "nearest-neighbors"-surfaces (i.e. surfaces where we would believe a "nearest-neighbors"-approach would work well)
• In fact, we can combine the random features extracted from both methods and use these as new feature-representations by simple concatenation! This gives us a nice combination, which might be attractive for certain problems
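The concatenation trick from the last bullet is literally just stacking the two feature maps into one longer vector. A hypothetical 1-D sketch (the bucket-hashing of bin ids is my own device to keep the binning vector a fixed length):

```python
import numpy as np

rng = np.random.default_rng(4)
D, P = 100, 100

# Fourier part: random frequencies and phases.
w = rng.normal(size=D)
b = rng.uniform(0, 2 * np.pi, D)
# Binning part: random pitches and shifts (Laplacian-kernel choice).
deltas = rng.gamma(2.0, 1.0, P)
shifts = rng.uniform(0, deltas)

def z_fourier(x):
    return np.sqrt(2.0 / D) * np.cos(w * x + b)

def z_binning(x, n_buckets=64):
    # One-hot over hashed bin ids so the vector has a fixed length.
    ids = np.floor((x - shifts) / deltas).astype(int) % n_buckets
    Z = np.zeros((P, n_buckets))
    Z[np.arange(P), ids] = 1.0
    return Z.ravel() / np.sqrt(P)

z = np.concatenate([z_fourier(0.5), z_binning(0.5)])
print(z.shape)   # one combined feature vector of length D + P * n_buckets
```

Any linear learner downstream then gets to weigh the "smooth" and the "nearest-neighbors" features against each other.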

## Summary

Usually people go like this:

$$\min_{w, \, \theta} \; \hat{R}\Big[ \sum_{k=1}^{K} w_k \, \phi(x; \theta_k) \Big]$$

i.e. jointly optimize both the weights and the feature parameters, but the Random Kitchen Sink approach is simply to do:

$$\omega_1, \dots, \omega_K \sim p(\omega), \qquad \min_{w} \; \hat{R}\Big[ \sum_{k=1}^{K} w_k \, \phi(x; \omega_k) \Big]$$

i.e. sample the feature parameters at random and then solve only the (now convex) problem over the weights $w$.