Timeseries modelling

Autoregressive (AR)

Idea

Output variable depends linearly on its own previous values and on a stochastic term (accounting for the variance in the data).

Model

AR(p) refers to an autoregressive model of order p, and is written

\begin{equation*}
X_t = c + \overset{p}{\underset{i=1}{\sum}} \varphi_i X_{t - i} + \varepsilon_t
\end{equation*}

where

  • $\varphi_1, \dots, \varphi_p$ are parameters
  • $c$ is a constant
  • $\varepsilon_t$ is white noise (Gaussian rv.), i.e. $\varepsilon_t \sim \mathcal{N}(0, \sigma^2)$

The unknowns are:

  • constant $c$
  • the weights / dependence $\varphi_i$ of the previous observations
  • the variance of the white noise $\sigma^2$

Estimating the parameters

  • Ordinary Least Squares (OLS)
  • Method of moments (through Yule-Walker equations)

There is a direct correspondence between the parameters $\varphi_i$ and the covariance of the process, and this relationship can be "inverted" to determine the parameters from the autocorrelation function itself.

Yule-Walker equations

\begin{equation*}
\gamma_m = \overset{p}{\underset{k=1}{\sum}} \varphi_k \gamma_{m - k} + \sigma_{\varepsilon}^2 \delta_{m, 0}
\end{equation*}

where

  • $\gamma_m$ is the autocovariance function of $X_t$
  • $\sigma_\varepsilon$ is the std. dev. of the input noise process
  • $\delta_{m,0}$ is the Kronecker delta, which is non-zero only when $m = 0$

Now, if we only consider $m > 0$, we can ignore the last term, and can simply write the above in matrix form:
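
Concretely, stacking these equations for $m = 1, \dots, p$ (and using $\gamma_{-k} = \gamma_k$) gives the standard Yule-Walker system:

\begin{equation*}
\begin{bmatrix} \gamma_1 \\ \gamma_2 \\ \vdots \\ \gamma_p \end{bmatrix}
=
\begin{bmatrix}
\gamma_0 & \gamma_1 & \dots & \gamma_{p-1} \\
\gamma_1 & \gamma_0 & \dots & \gamma_{p-2} \\
\vdots & \vdots & \ddots & \vdots \\
\gamma_{p-1} & \gamma_{p-2} & \dots & \gamma_0
\end{bmatrix}
\begin{bmatrix} \varphi_1 \\ \varphi_2 \\ \vdots \\ \varphi_p \end{bmatrix}
\end{equation*}

Solving this linear system (with the $\gamma_m$ replaced by sample autocovariances) yields the $\varphi_k$, and the $m = 0$ equation then gives $\sigma_{\varepsilon}^2 = \gamma_0 - \overset{p}{\underset{k=1}{\sum}} \varphi_k \gamma_k$.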

My thinking

Our plan is as follows:

  1. Consider each sequence of $p$ steps (the order of the AR)
  2. Deduce distribution for $X_t$ conditioned on the parameters of the model and the $p$ previous observations, $X_{t - p}, \dots, X_{t - 1}$.
  3. Obtain conditional log-likelihood function for this conditional distribution
  4. Get gradient of said conditional log-likelihood function
  5. Maximize said conditional log-likelihood wrt. parameters $c, \sigma, \varphi_1, \dots, \varphi_p$
  6. ???
  7. PROFIT!!!

Initial observation sequence

We start by collecting the first $p$ observations into the sample $(X_1, ..., X_p)$, writing it as $X_{1:p}$, which has mean vector $\mathbf{\mu}$ where

\begin{equation*}
\mu_k = \frac{c}{1 - \overset{p}{\underset{i=1}{\sum}} \varphi_i}, \quad \forall k = 1, ..., p
\end{equation*}

Why? Well, we start out by taking the expectation of $X_p$ wrt. the random variables, i.e. the $X_t$:

    \begin{equation*}
    E[X_{p}] = E[c + \overset{p}{\underset{i=1}{\sum}} \varphi_i X_{p - i} + \varepsilon_{p}]
\end{equation*}

Then we start expanding:

    \begin{equation*}
    \begin{split}
    E[X_p] &= E[c] + E[\overset{p}{\underset{i=1}{\sum}} \varphi_i X_{p - i}] + E[\varepsilon_p] \\
    &= c + \overset{p}{\underset{i=1}{\sum}} \varphi_i E[X_{p - i}] + 0, \quad E[\varepsilon_p] = 0 \\
    &= c + \overset{p}{\underset{i=1}{\sum}} \varphi_i \mu \\
    &= c + \mu \overset{p}{\underset{i=1}{\sum}} \varphi_i
    \end{split}
\end{equation*}

And since $E[X_p] = \mu$, we rearrange to end up with:

    \begin{equation*}
    \mu = \frac{c}{1 - \overset{p}{\underset{i=1}{\sum}} \varphi_i}
\end{equation*}

Voilà!
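
As a quick sanity check: for an AR(1) with $c = 1$ and $\varphi_1 = 0.5$, this gives a stationary mean of $\mu = \frac{1}{1 - 0.5} = 2$.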

and the variance-covariance matrix is given by

\begin{equation*}
\sigma^2 \Sigma = \begin{bmatrix}
\gamma_0 & \gamma_1 & \gamma_2 & \dots & \gamma_{p - 1} \\
\gamma_1 & \gamma_0 & \gamma_1 & \dots & \gamma_{p - 2} \\
\gamma_2 & \gamma_1 & \gamma_0 & \dots & \gamma_{p - 3} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\gamma_{p-1} & \gamma_{p-2} & \gamma_{p-3} & \dots & \gamma_0
\end{bmatrix}
\end{equation*}

where $\gamma_\tau = \text{Cov}(X_s, X_t)$ with $\tau = |s - t|$ for any two steps $s$ and $t$. This is due to the process being weak-sense stationary.

The reason the covariance matrix has this structure is that, by our assumption of weak-sense stationarity (is that a word? at least it is now), the covariance depends only on the lag between observations.

Conditional distribution

First we note the following:

\begin{equation*}
X_t = c + \overset{p}{\underset{i=1}{\sum}} \varphi_i X_{t - i} + \varepsilon_t, \quad \varepsilon_t \sim \mathcal{N}(0, \sigma^2)
\end{equation*}

is equivalent to

    \begin{equation*}
    X_t | X_{t - p: t - 1} \sim \mathcal{N} \Big(c + \overset{p}{\underset{i=1}{\sum}} \varphi_i X_{t - i}, \sigma^2 \Big)
\end{equation*}

We then observe that if we condition $X_t$ on the previous $p$ observations, it is independent of everything before that. That is,

    \begin{equation*}
    p(X_t | X_1, \dots, X_{t-1}, c, \varphi, \sigma) = p(X_t | X_{t - p}, \dots, X_{t - 1}, c, \varphi, \sigma)
\end{equation*}

where $\varphi = (\varphi_1, ..., \varphi_p)$.

Therefore,

    \begin{equation*}
    p(X_t | X_{1:t-1}, c, \varphi, \sigma) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \Bigg( - \frac{1}{2 \sigma^2} \bigg( X_t - c - \overset{p}{\underset{i=1}{\sum}} \varphi_i X_{t - i} \bigg)^2 \Bigg)
\end{equation*}

For notational convenience, we note that we can write it in the following form:

    \begin{equation*}
    \begin{split}
    p(X_t | X_{1:t-1}, c, \varphi, \sigma) &= \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \Bigg( - \frac{1}{2 \sigma^2} \bigg(c + \overset{p}{\underset{i=1}{\sum}} \varphi_i X_{t - i} + \varepsilon_t - c - \overset{p}{\underset{i=1}{\sum}} \varphi_i X_{t - i} \bigg)^2 \Bigg) \\
    &= \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \Big( - \frac{\varepsilon_t^2}{2 \sigma^2} \Big)
    \end{split}
\end{equation*}

We can write the probability of the remaining observations, conditioned on the first $p$ observations and the parameters, as a product of these conditionals:

    \begin{equation*}
    p(X_{p + 1}, ..., X_m | X_{1:p}, c, \varphi, \sigma) = p(X_m | X_{m - p: m - 1}, c, \varphi, \sigma) \dots p(X_{p + 1} | X_{1:p}, c, \varphi, \sigma)
\end{equation*}

Which, when we substitute in the normal densities obtained for each $X_t$, becomes:

    \begin{equation*}
    \begin{split}
    p(X_{p + 1}, ..., X_m | X_{1:p}, c, \varphi, \sigma) &= \frac{1}{(2 \pi \sigma^2)^{\frac{m - p}{2}}} \exp \Big(- \frac{\varepsilon_m^2}{2 \sigma^2} \Big) \dots \exp \Big(- \frac{\varepsilon_{p + 1}^2}{2 \sigma^2} \Big) \\
    &= \frac{1}{(2 \pi \sigma^2)^{\frac{m - p}{2}}} \exp \Big( - \frac{1}{2 \sigma^2} \overset{m}{\underset{t=p + 1}{\sum}} \varepsilon_t^2 \Big)
    \end{split}
\end{equation*}

Conditional log-likelihood

Taking the logarithm of the previous expression, we obtain the conditional log-likelihood:

    \begin{equation*}
    \mathcal{L}(c, \varphi, \sigma) = - \frac{m - p}{2} \big(\log(2 \pi) + \log(\sigma^2) \big) - \overset{m}{\underset{t=p + 1}{\sum}} \frac{\varepsilon_t^2}{2 \sigma^2}
\end{equation*}

And just to remind you: what we want to do is find the parameters which maximize the conditional log-likelihood. If we frame it as a minimization problem, simply by taking the negative of the log-likelihood, we end up with the objective function:

    \begin{equation*}
    \underset{c, \varphi, \sigma}{\text{argmin }} - \mathcal{L}(c, \varphi, \sigma) = \underset{c, \varphi, \sigma}{\text{argmin }} \frac{m - p}{2} \big(\log(2 \pi) + \log(\sigma^2) \big) + \frac{1}{2 \sigma^2} \overset{m}{\underset{t=p + 1}{\sum}} \Big( X_t - (c + \overset{p}{\underset{i=1}{\sum}} \varphi_i X_{t - i}) \Big)^2
\end{equation*}
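
This also takes care of step 4 of the plan. Writing $\varepsilon_t = X_t - c - \overset{p}{\underset{i=1}{\sum}} \varphi_i X_{t - i}$ and treating $\sigma^2$ as the free parameter, the gradients of this objective are:

\begin{equation*}
\begin{split}
\frac{\partial (- \mathcal{L})}{\partial c} &= - \frac{1}{\sigma^2} \overset{m}{\underset{t=p + 1}{\sum}} \varepsilon_t \\
\frac{\partial (- \mathcal{L})}{\partial \varphi_j} &= - \frac{1}{\sigma^2} \overset{m}{\underset{t=p + 1}{\sum}} \varepsilon_t X_{t - j} \\
\frac{\partial (- \mathcal{L})}{\partial \sigma^2} &= \frac{m - p}{2 \sigma^2} - \frac{1}{2 \sigma^4} \overset{m}{\underset{t=p + 1}{\sum}} \varepsilon_t^2
\end{split}
\end{equation*}

Setting the last of these to zero is exactly what gives the estimator $\hat{\sigma}^2$ quoted further down.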

Now, we optimize!

Optimization

If we only consider the objective function wrt. $c$ and $\varphi_1, \dots, \varphi_p$, we end up with:

    \begin{equation*}
    \underset{c, \varphi}{\text{argmin }} - \mathcal{L}(c, \varphi, \sigma) = \underset{c, \varphi}{\text{argmin }} \overset{m}{\underset{t=p + 1}{\sum}} \Big( X_t - (c + \overset{p}{\underset{i=1}{\sum}} \varphi_i X_{t - i}) \Big)^2
\end{equation*}

This is what I arrived at independently, but I initially thought it was a huge issue because $X_t$ is defined as a function of the other variables, so we have a recursive relationship!

But then I realized, after confirming that the above is in fact what you want, that the $X\text{s}$ in this equation are observed values, not random variables, duh! It might be easier to think of it as $\hat{X}_t = c + \overset{p}{\underset{i=1}{\sum}} \varphi_i X_{t - i}$, so we're looking at $(X_t - \hat{X}_t)^2$, since that last term is in fact our estimate for $X_t$.

Disclaimer: I think…

Thus, the conditional MLE of these parameters can be obtained from an OLS regression of $X_t$ on a constant $c$ and $p$ of its own lagged values.

The conditional MLE estimator of $\sigma^2$ turns out to be:

\begin{equation*}
\hat{\sigma}^2 = \frac{1}{m - p} \overset{m}{\underset{t = p + 1}{\sum}} \Big( X_t - (\hat{c} + \overset{p}{\underset{i=1}{\sum}} \hat{\varphi}_i X_{t - i}) \Big)^2
\end{equation*}

This can be solved with iterative methods like:

  • Gauss-Newton algorithm
  • Gradient Descent algorithm

Or by finding the exact solution using linear algebra, though the naive approach is quite expensive. There are definitely libraries which do this efficiently, but nothing I can implement myself, I think.
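
To make the OLS route concrete, here is a minimal numpy sketch (the helper name fit_ar, the simulated AR(2) series, and the parameter values are my own choices for illustration, not from any particular library):

    import numpy as np

    def fit_ar(x, p):
        """Conditional MLE / OLS fit of an AR(p) model.

        Regresses X_t on a constant and its p lagged values, then
        estimates sigma^2 from the residuals, as derived above.
        """
        x = np.asarray(x, dtype=float)
        m = len(x)
        # Design matrix: one row per t = p+1, ..., m with columns [1, X_{t-1}, ..., X_{t-p}]
        X = np.column_stack([np.ones(m - p)] + [x[p - i:m - i] for i in range(1, p + 1)])
        y = x[p:]
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # (c_hat, phi_hat_1, ..., phi_hat_p)
        c_hat, phi_hat = beta[0], beta[1:]
        residuals = y - X @ beta
        sigma2_hat = residuals @ residuals / (m - p)   # conditional MLE of sigma^2
        return c_hat, phi_hat, sigma2_hat

    # Sanity check on a simulated AR(2): X_t = 1 + 0.5 X_{t-1} - 0.3 X_{t-2} + eps_t
    rng = np.random.default_rng(0)
    x = np.zeros(5000)
    for t in range(2, len(x)):
        x[t] = 1.0 + 0.5 * x[t - 1] - 0.3 * x[t - 2] + rng.normal()
    print(fit_ar(x, p=2))  # estimates should land near (1.0, [0.5, -0.3], 1.0)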

Moving average (MA)

Autoregressive Moving average (ARMA)

ARMA(p, q) refers to the model with $p$ autoregressive terms and $q$ moving-average terms. This model contains the AR(p) and MA(q) models as special cases,

\begin{equation*}
X_t = c + \varepsilon_t + \sum_{i=1}^{p} \varphi_i X_{t - i} + \sum_{i=1}^{q} \theta_i \varepsilon_{t - i}
\end{equation*}
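
As a concrete illustration of this recursion, here is a minimal simulation sketch (the helper name simulate_arma and the parameter values are arbitrary choices of mine, not from any particular library):

    import numpy as np

    def simulate_arma(c, phi, theta, sigma, n, seed=0):
        """Simulate X_t = c + eps_t + sum_i phi_i X_{t-i} + sum_j theta_j eps_{t-j}."""
        rng = np.random.default_rng(seed)
        p, q = len(phi), len(theta)
        x = np.zeros(n)
        eps = rng.normal(0.0, sigma, size=n)
        for t in range(n):
            ar = sum(phi[i] * x[t - 1 - i] for i in range(p) if t - 1 - i >= 0)
            ma = sum(theta[j] * eps[t - 1 - j] for j in range(q) if t - 1 - j >= 0)
            x[t] = c + eps[t] + ar + ma
        return x

    x = simulate_arma(c=0.1, phi=[0.6], theta=[0.4], sigma=1.0, n=1000)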

Autoregressive Conditional Heteroskedasticity (ARCH)

Notation

  • Heteroskedastic refers to sub-populations of a collection of rvs. having different variability

Stuff

  • Describes variance of current error term as a function of the actual sizes of the previous time periods' error terms
  • If we assume the error variance is described by an ARMA model, then we have a Generalized Autoregressive Conditional Heteroskedasticity (GARCH) model

Let $\varepsilon_t$ denote a real-valued discrete-time stochastic process, and let $\psi_t$ be the information set of all information up to time $t$, i.e. $\psi_t = \sigma \left\{ \dots, \varepsilon_{t - 2}, \varepsilon_{t - 1}, \varepsilon_t \right\}$.

The process $\{ \varepsilon_t \}$ is said to follow an Autoregressive Conditional Heteroskedasticity model, or ARCH(q), whenever

\begin{equation*}
\mathbb{E}[\varepsilon_t \mid \psi_{t - 1}] = 0 \quad \text{and} \quad \text{Var}(\varepsilon_t \mid \psi_{t - 1}) = h_t
\end{equation*}

with

\begin{equation*}
h_t = \alpha_0 + \sum_{i = 1}^{q} \alpha_i \varepsilon_{t - i}^2, \quad \alpha_i \ge 0
\end{equation*}

The conditional variance can generally be expressed as

\begin{equation*}
h_t = h(\varepsilon_{t - 1}, \varepsilon_{t - 2}, \dots, \varepsilon_{t - q}, \boldsymbol{\alpha})
\end{equation*}

where $h(\cdot)$ is a nonnegative function of its arguments and $\boldsymbol{\alpha} = (\alpha_0, \alpha_1, \dots, \alpha_q)^T$ is a vector of unknown parameters.

The above is sometimes represented as

\begin{equation*}
\varepsilon_t = h_t^{1 / 2} z_t
\end{equation*}

with $\mathbb{E}[z_t] = 0$ and $\text{Var}(z_t) = 1$, where the $z_t$ are uncorrelated.
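
This representation is also the natural way to simulate the process. Here is a minimal sketch (the helper name simulate_arch and the parameter values are my own choices for illustration):

    import numpy as np

    def simulate_arch(alpha, n, seed=0):
        """Simulate ARCH(q) via eps_t = sqrt(h_t) * z_t with z_t ~ N(0, 1)."""
        rng = np.random.default_rng(seed)
        alpha0, alphas = alpha[0], alpha[1:]
        q = len(alphas)
        eps = np.zeros(n)
        for t in range(n):
            past = [eps[t - i] ** 2 for i in range(1, q + 1) if t - i >= 0]
            h_t = alpha0 + sum(a * e2 for a, e2 in zip(alphas, past))
            eps[t] = np.sqrt(h_t) * rng.normal()
        return eps

    eps = simulate_arch(alpha=[0.2, 0.5], n=2000)  # ARCH(1) with alpha_0 = 0.2, alpha_1 = 0.5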

We can estimate an ARCH(q) model using OLS.
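
One common way to set up that regression is to regress the squared series on its own lags; a minimal sketch, assuming we already have a residual series eps in hand (the helper name fit_arch_ols is my own):

    import numpy as np

    def fit_arch_ols(eps, q):
        """OLS for ARCH(q): regress eps_t^2 on a constant and eps_{t-1}^2, ..., eps_{t-q}^2."""
        e2 = np.asarray(eps, dtype=float) ** 2
        n = len(e2)
        X = np.column_stack([np.ones(n - q)] + [e2[q - i:n - i] for i in range(1, q + 1)])
        y = e2[q:]
        alpha, *_ = np.linalg.lstsq(X, y, rcond=None)
        return alpha  # (alpha_0_hat, alpha_1_hat, ..., alpha_q_hat)

    # E.g. on the series simulated above, fit_arch_ols(eps, q=1) should land roughly near (0.2, 0.5).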

The Generalized Autoregressive Conditional Heteroskedasticity (GARCH) model or GARCH(p, q) is simply an ARCH(q) in combination with an autoregressive component of order $p$ on the conditional variance $h_t$ itself, i.e.

\begin{equation*}
\begin{split}
  h_t &= \alpha_0 + \sum_{i=1}^{q} \alpha_i \varepsilon_{t - i}^2 + \sum_{j=1}^{p} \beta_j h_{t - j} \\
  &= \alpha_0 + \alpha(L) \varepsilon_t^2 + \beta(L) h_t
\end{split}
\end{equation*}

where $\alpha(L)$ and $\beta(L)$ are lag operators such that

\begin{equation*}
\begin{split}
  \alpha(L) &= \alpha_1 L + \alpha_2 L^2 + \dots + \alpha_q L^q \\
  \beta(L) &= \beta_1 L + \beta_2 L^2 + \dots + \beta_p L^p
\end{split}
\end{equation*}

Or equivalently,

\begin{equation*}
\varepsilon_t^2 = \bigg( \alpha_0 + \sum_{i=1}^{q} \alpha_i \varepsilon_{t - i}^2 \bigg) + \bigg( \sum_{j=1}^{p} \beta_j \varepsilon_{t - j}^2 \bigg) - \bigg( \sum_{j=1}^{p} \beta_j \nu_{t - j} \bigg) + \nu_t
\end{equation*}

where

\begin{equation*}
\nu_t = \varepsilon_t^2 - h_t = (z_t^2 - 1)h_t
\end{equation*}

which simply ensures that $\nu_t$ has zero conditional mean given $\psi_{t - 1}$, i.e. $\{ \nu_t \}$ is a martingale difference sequence, so the equation above is an ARMA-type representation of $\varepsilon_t^2$.

Integrated GARCH (IGARCH)

Fractional Integrated GARCH (FIGARCH)

TODO Markov Switching Multifractal (MSM)

Useful tricks

Autocorrelation using Fourier Transform

Observe that the autocorrelation of a function $C$ can be written as the correlation of $C$ with a lagged copy of itself:

\begin{equation*}
(C \star C)(t) = \int C(s) C(s + t) \ ds
\end{equation*}

since this is just summing over the interactions between the series and a copy of itself shifted by the lag $t$. For a real-valued $C$, this is the autocorrelation (up to normalization), as wanted.

Now, observe that the Fourier transform has the property (the correlation theorem):

\begin{equation*}
\mathcal{F} \left\{ C \star C \right\} = \overline{\mathcal{F}\left\{ C \right\}} \, \mathcal{F} \left\{ C \right\} = \Big| \mathcal{F} \left\{ C \right\} \Big|^2
\end{equation*}

Hence,

\begin{equation*}
(C \star C)(t) = \mathcal{F}^{-1} \left\{ \big| \mathcal{F} \left\{ C \right\} \big|^2 \right\}
\end{equation*}

Which means that, given a dataset, we can use the efficient Fast Fourier Transform (FFT) to compute $\mathcal{F} \left\{ C \right\}$, take its squared magnitude, and then take the inverse FFT, to obtain $(C \star C)(t)$, i.e. the autocorrelation!
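
A minimal sketch of this trick on a discrete sample (the mean-removal, zero-padding, and normalization are my own additions, to avoid circular wrap-around and to get the usual normalized autocorrelation):

    import numpy as np

    def autocorrelation_fft(x):
        """Autocorrelation of a 1-D series via the FFT."""
        x = np.asarray(x, dtype=float)
        n = len(x)
        x = x - x.mean()                             # work with deviations from the mean
        f = np.fft.rfft(x, n=2 * n)                  # zero-pad to avoid circular correlation
        acov = np.fft.irfft(f * np.conj(f))[:n] / n  # inverse FFT of the squared magnitude
        return acov / acov[0]                        # normalize so lag 0 equals 1

    rng = np.random.default_rng(0)
    rho = autocorrelation_fft(rng.normal(size=1000))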

This is super-dope.

Appendix A: Vocabulary

autocorrelation
also known as serial correlation, and is the correlation of a signal with a delayed copy of itself as a function of delay. Informally, similarity between observations as a function of the time lag between them.
wide-sense stationary
also known as weak stationarity; it only requires that the mean (1st moment) and autocovariance do not vary wrt. time. This is in contrast to strictly stationary processes, where we require the joint distribution to not vary wrt. time.