# Notes on: Rahimi, A., & Recht, B. (2008): Uniform approximation of functions with random bases

## Notation

- $\alpha_k$ and $\alpha(\omega)$ refer to the discrete and continuous coefficients for the expansions
- $\| f \|_p$ represents the "p-norm", which is defined as $\| f \|_p = \left( \int | f(x) |^p \, dx \right)^{1/p}$
- $\mathbf{R}[f]$ refers to the *empirical risk* of the function $f$ on some dataset, which is given by $\mathbf{R}[f] = \frac{1}{m} \sum_{i=1}^{m} c\big( f(x_i), y_i \big)$ for a loss function $c$
- $f^*$ is used to refer to the function (usually in the space of continuous functions) which is the *risk-minimizer*, i.e. given the data at hand, $f^*$ is the function which minimizes the empirical risk
- $\mathcal{L}_p$ denotes the set of all functions which are integrable under the "p-norm" ($\mathcal{L}_2$ being the square-integrable ones)

## Approximation with Random Features

- $\phi(x; \omega)$ is a family of functions on a compact set $\mathcal{X} \subset \mathbb{R}^d$, parametrized over the set $\Omega$

We also assume that $| \phi(x; \omega) | \leq 1$ for all $x \in \mathcal{X}$ and $\omega \in \Omega$.

Then there are a couple of theorems stating upper bounds on the pointwise convergence of such random expansions.

## Examples of random bases

Remember, we're just creating *features* here! In the paper, they end up performing linear optimization on the random features to produce a linear decision boundary, which is then used for the actual classification. Basically, a linear SVM with the randomized features as the input vectors.
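As a rough sketch of that pipeline in numpy: random cosine features (the Fourier features derived below) followed by a plain regularized least-squares classifier standing in for the linear SVM. The toy data and all parameter choices here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-class data: an inner circle vs an outer ring (not from the paper).
n = 100
angles = rng.uniform(0, 2 * np.pi, 2 * n)
radii = np.concatenate([rng.normal(0.5, 0.1, n), rng.normal(1.5, 0.1, n)])
X = np.stack([radii * np.cos(angles), radii * np.sin(angles)], axis=1)
y = np.concatenate([-np.ones(n), np.ones(n)])

# Random cosine features, then a *linear* model on top -- the "linear SVM on
# random features" idea, with ridge least squares in place of the SVM solver.
D = 200
W = rng.normal(0, 2.0, (2, D))       # random frequencies omega_j
b = rng.uniform(0, 2 * np.pi, D)     # random phases
Z = np.sqrt(2.0 / D) * np.cos(X @ W + b)

w = np.linalg.solve(Z.T @ Z + 1e-6 * np.eye(D), Z.T @ y)
accuracy = np.mean(np.sign(Z @ w) == y)
```

A purely linear classifier separates this radially symmetric data easily once it sees the randomized features instead of the raw coordinates.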

### Fourier Random Features

Let $k(x, y) = k(x - y)$ be a **shift-invariant**, positive definite kernel function. Then, by Bochner's theorem,

$$k(x - y) = \int_{\mathbb{R}^d} p(\omega) \, e^{i \omega^\top (x - y)} \, d\omega \approx \frac{1}{D} \sum_{j=1}^{D} e^{i \omega_j^\top (x - y)}, \qquad \omega_j \sim p(\omega)$$

So, let's walk through this properly:

- First we write the kernel as an *inverse Fourier transform*, where $p(\omega)$ is the Fourier transform of $k$
- There's a theorem (Bochner's) which says that the Fourier transform of a positive definite function is non-negative, and thus we can treat $p(\omega)$ as a *pseudo-probability distribution*: "pseudo" because I doubt it sums to one, but to sample from it we only need to know the "relative probability" of each $\omega$, such as when using MCMC algorithms
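A tiny sketch of that last point: a Metropolis sampler only ever evaluates the *unnormalized* density, so relative probabilities are enough. The target here is a made-up 1-D example (proportional to a standard normal), not any particular kernel's transform.

```python
import numpy as np

rng = np.random.default_rng(0)

# Unnormalized target density: we never compute its normalizing constant.
p_tilde = lambda w: np.exp(-w * w / 2)

samples = []
w = 0.0
for _ in range(50_000):
    proposal = w + rng.normal(0.0, 1.0)   # symmetric random-walk proposal
    # Accept with probability min(1, p(proposal) / p(current)): only the
    # *ratio* of densities is ever needed.
    if rng.uniform() < p_tilde(proposal) / p_tilde(w):
        w = proposal
    samples.append(w)
samples = np.array(samples[5_000:])       # drop burn-in
```

Since the target is proportional to a standard normal, the chain's mean should sit near 0 and its variance near 1.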

- We then approximate the kernel by *sampling* $D$ bases $\omega_1, \dots, \omega_D$ from the "frequency domain"
- Observe that by randomly sampling from $p(\omega)$ as a distribution, we "empirically" approximate the coefficient attached to each basis function $e^{i \omega^\top (x - y)}$

- Then finally we rewrite the sum as a dot product between two vectors $z(x)$ and $z(y)$:
    1. Take vectors $x$ and $y$
    2. Project $x$ onto a random direction $\omega_j$ by taking the dot product $\omega_j^\top x$ with this random vector
    3. Take the exponential of it, $e^{i \omega_j^\top x}$, "wrapping it around the unit circle in the complex plane"
    4. Repeat steps 2 and 3 for $y$
    5. Repeat steps 2-4 $D$ times
    6. Take the dot product between $z(x)$ and $z(y)$ (where the dot product of two complex vectors $a$ and $b$ is defined as $\langle a, b \rangle = \sum_j a_j \overline{b_j}$)

Here both $z(x)$ and $z(y)$ are $D$-dimensional vectors, so $\langle z(x), z(y) \rangle$ represents the dot product between the vectors $z(x)$ and $z(y)$.
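The steps above, sketched in numpy. I'm using the real-valued variant $z_j(x) = \sqrt{2/D} \cos(\omega_j^\top x + b_j)$ with $b_j \sim U[0, 2\pi]$ (the form the papers use in practice, which avoids complex arithmetic), for the Gaussian kernel $k(x, y) = e^{-\|x - y\|^2 / 2}$, whose Fourier transform is a standard normal:

```python
import numpy as np

rng = np.random.default_rng(0)
d, D = 5, 5_000

x = rng.normal(size=d)
y = rng.normal(size=d)
k_exact = np.exp(-np.sum((x - y) ** 2) / 2)   # Gaussian kernel, bandwidth 1

# Sample D frequencies from p(omega) = N(0, I), the Fourier transform of k.
W = rng.normal(size=(D, d))                   # step 2: random directions
b = rng.uniform(0, 2 * np.pi, D)              # random phases

# Steps 3-5 in one map: project, "wrap" via cosine, repeat D times.
z = lambda v: np.sqrt(2.0 / D) * np.cos(W @ v + b)

k_approx = z(x) @ z(y)                        # step 6: a plain dot product
```

The dot product of the two feature vectors is a Monte Carlo estimate of the kernel, with error shrinking like $1/\sqrt{D}$.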

#### Guarantees

For a given pair $x$ and $y$,

$$\Pr\Big[\, \big| z(x)^\top z(y) - k(x, y) \big| \geq \epsilon \,\Big] \leq 2 \exp\!\left( -\frac{D \epsilon^2}{4} \right)$$

where $\epsilon$ is just some arbitrary constant in the range $(0, 1)$.

This is due to Hoeffding.

I'm not 100% sure that the use of $\epsilon$ is correct here, due to the lecture I'm looking at having horrible audio quality, but I REALLY believe it's just a constant which we can arbitrarily set.

Similar to Chebyshev's inequality.

This comes from the "union bound" and thm:fourier-random-feature-lower-bound-probability.

$$\Pr\left[\, \sup_{x, y \in \mathcal{M}} \big| z(x)^\top z(y) - k(x, y) \big| \leq \epsilon \,\right] \geq 1 - 2^8 \left( \frac{\sigma_p \, \operatorname{diam}(\mathcal{M})}{\epsilon} \right)^2 \exp\!\left( -\frac{D \epsilon^2}{4(d + 2)} \right)$$

I'm not 100% sure that's supposed to be the infinity-norm / max-norm (the $\sup$ over all pairs), but I'm PRETTY sure!

where:

- $\sigma_p^2 = \mathbb{E}_p[\omega^\top \omega]$ is the second moment of the Fourier transform of $k$
- $d$ is the dimension of the space
- $\operatorname{diam}(\mathcal{M})$ is the diameter of the compact set $\mathcal{M}$

This last theorem is basically saying that if the dimensionality of your random feature vector is large enough (i.e. you sample from the Fourier transform enough times, referring to $D$ in the equation), **all the points in your space** will approximate the kernel function arbitrarily well. Thus, we have *uniform* convergence for large $D$ (with high probability).

All of these results are theorems where the **with high probability** needs to be kept implicit, but it's still awesomesauce stuff we're looking at here, guys!
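A quick empirical check of the uniform-convergence story (the point cloud, dimensions, and values of $D$ are arbitrary choices of mine): measure the *maximum* kernel-approximation error over a whole set of points and watch it shrink as $D$ grows.

```python
import numpy as np

rng = np.random.default_rng(1)

def max_kernel_error(D, n_points=40, d=2):
    """Max |z(x)'z(y) - k(x,y)| over all pairs in a random point cloud."""
    X = rng.uniform(-1, 1, (n_points, d))
    W = rng.normal(size=(D, d))               # omega_j ~ N(0, I)
    b = rng.uniform(0, 2 * np.pi, D)
    Z = np.sqrt(2.0 / D) * np.cos(X @ W.T + b)
    # Exact Gaussian kernel matrix for comparison.
    sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
    K = np.exp(-sq / 2)
    return np.max(np.abs(Z @ Z.T - K))

err_small = max_kernel_error(D=10)
err_large = max_kernel_error(D=4_000)
```

With only 10 random features the worst pair is badly approximated; with a few thousand, even the *worst* pair in the cloud is close, which is the uniform-convergence claim in miniature.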

### Random Binning Features

This is also a pretty dope way of creating random features, which provides the same guarantees as the Fourier Random Features.

This method is based on randomly sampling grids over the input space, where we construct the random vectors as one-hot encodings of the bins in which $x$ and $y$ "land".

We do this by sampling the grid pitch $\delta$ from the following distribution:

$$p(\delta) = \delta \, \ddot{k}(\delta)$$

where $\ddot{k}$, the second derivative of the kernel, is a delta-like function (for the "hat" kernel).

Check out the paper, or the lecture at 19:40, for a better explanation, as it is so much simpler to grasp with some graphics included.
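Here's a minimal numpy sketch of one such grid ensemble, for the 1-D "hat" kernel $k(x, y) = \max(0, 1 - |x - y|)$. For this kernel the pitch distribution $p(\delta) = \delta \, \ddot{k}(\delta)$ collapses to a point mass at $\delta = 1$ (that's the delta-like function), so each grid only needs a random shift; all variable names are mine.

```python
import numpy as np

rng = np.random.default_rng(0)

# P independent grids, each with pitch 1 and a random shift u ~ Uniform[0, 1).
P = 20_000
x, y = 0.2, 0.5

shifts = rng.uniform(0, 1, P)
bins_x = np.floor(x - shifts)   # bin index of x on each grid
bins_y = np.floor(y - shifts)

# The one-hot feature vectors collide exactly when the bin indices agree, so
# their normalized dot product is the fraction of grids sharing a bin.
k_approx = np.mean(bins_x == bins_y)
k_exact = max(0.0, 1 - abs(x - y))
```

The probability that a randomly shifted unit grid puts $x$ and $y$ in the same bin is exactly $1 - |x - y|$, so the collision fraction estimates the hat kernel.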

### Difference Between Fourier and Binning

- Fourier Random Features works well for smooth surfaces
- Random Binning Features works well for "nearest-neighbors"-surfaces (i.e. surfaces where we would believe a "nearest-neighbors"-approach would work well)
- In fact, we can combine the random features extracted from both methods and use these as new feature-representations by simple concatenation! This gives us a nice combination, which might be attractive for certain problems

## Summary

Usually people go like this:

data → carefully optimize over *both* the basis functions and their weights (kernel machines, boosting, etc.)

but the Random Kitchen Sink approach is simply to do:

data → *sample* the basis functions at random → optimize only the linear weights on top

## Sources

- Lecture held by Rahimi regarding the "next-level" Random Kitchen Sinks, which are explained in a later paper. But "Random Kitchen Sinks" are based on the stuff in this paper, so he walks through a lot of the material from this paper in the referenced lecture. Therefore, it's a useful resource for understanding this paper too :)