Notes on: Osogami, T. (2017): Boltzmann machines for time-series

1 Notation

  • \(\mathbf{x} = \Big( \mathbf{x}^{[t]} \Big)_{t = 0}^T\) denotes multi-dimensional time-series
  • \(\mathbf{x}^{[s, t]}\) denotes the time-series of the patterns from time \(s\) to \(t\)
  • By the chain rule, the log-likelihood then factorizes as (see the evaluation sketch at the end of this section)

    \begin{equation*} f(\theta) = \log \mathbb{P}_{\theta}(\mathbf{x}) = \sum_{t=0}^{T} \log \mathbb{P}_{\theta} \Big( \mathbf{x}^{[t]} \mid \mathbf{x}^{[0, t - 1]} \Big) \end{equation*}
  • Energy-based models

    \begin{equation*} \mathbb{P}_{\theta}(\mathbf{x}) = \sum_{\tilde{\mathbf{h}}} \mathbb{P}_{\theta}(\mathbf{x}, \tilde{\mathbf{h}}) \end{equation*}


    \begin{equation*} \mathbb{P}_{\theta}(\mathbf{x}, \mathbf{h}) = \frac{\exp \Big( - E_{\theta}(\mathbf{x}, \mathbf{h}) \Big)}{\sum_{\tilde{\mathbf{x}}}^{} \sum_{\tilde{\mathbf{h}}}^{} \exp \Big( - E_{\theta}(\tilde{\mathbf{x}}, \tilde{\mathbf{h}}) \Big)} \end{equation*}
  • \(\mathbf{X}\) denotes the random time-series
  • \(\mathbf{H}\) denotes the random hidden values
  • \(\mathbb{E}_{\theta}\) denotes the expectation wrt. the model distribution \(\mathbb{P}_{\theta}\)
  • \(\mathbb{E}_{\text{target}}\) denotes the expectation wrt. the target distribution
  • Gradient of the log-likelihood is (see the enumeration sketch at the end of this section)

    \begin{equation*} \nabla f(\theta) = - \mathbb{E}_{\text{target}} \Big[ \mathbb{E}_{\theta} \big[ \nabla E_{\theta} (\mathbf{X}, \mathbf{H}) \mid \mathbf{X} \big] \Big] + \mathbb{E}_{\theta} \big[ \nabla E_{\theta}(\mathbf{X}, \mathbf{H}) \big] \end{equation*}

    while using a single time-series \(\mathbf{x}\) as the target gives

    \begin{equation*} \nabla f(\theta) = - \mathbb{E}_{\theta} \Big[ \nabla E_{\theta}(\mathbf{x}, \mathbf{H}) \Big] + \mathbb{E}_{\theta} \Big[ \nabla E_{\theta}(\mathbf{X}, \mathbf{H}) \Big] \end{equation*}
  • When maximizing the log-likelihood of the time-series step by step over time, we get (for a single \(\mathbf{x}\))

    \begin{equation*} \nabla f_t(\theta) = - \mathbb{E}_{\theta} \Big[ \nabla E_{\theta} \Big( \mathbf{x}^{[t]}, \mathbf{H} \mid \mathbf{x}^{[0, t-1]} \Big) \Big] + \mathbb{E}_{\theta} \Big[ \nabla E_{\theta} \Big( \mathbf{X}^{[t]}, \mathbf{H} \mid \mathbf{x}^{[0, t-1]} \Big) \Big] \end{equation*}
  • \(D\) denotes that we're aiming to model a \(D\)-th order Markov model, i.e. one where

    \begin{equation*} \mathbb{P}_{\theta} \Big( \mathbf{x}^{[t]} \mid \mathbf{x}^{[0, t - 1]} \Big) = \mathbb{P}_{\theta} \Big( \mathbf{x}^{[t]} \mid \mathbf{x}^{[t - D, t - 1]} \Big) \end{equation*}
  • \(\mathbf{H}^{[< t]}(w)\) denotes the hidden units at previous time-steps, weighted by some weights \(w\)
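  • A minimal Python sketch (not from the paper) of evaluating the chain-rule factorization above by accumulating per-step conditional log-probabilities; log_cond_prob is a hypothetical callable returning \(\log \mathbb{P}_{\theta} \big( \mathbf{x}^{[t]} \mid \mathbf{x}^{[0, t - 1]} \big)\)

    def log_likelihood(theta, x, log_cond_prob):
        """log P(x) = sum over t of log P(x[t] | x[0:t-1]); x has shape (T + 1, n)."""
        total = 0.0
        for t in range(len(x)):
            # history x[0:t] (empty at t = 0), current pattern x[t]
            total += log_cond_prob(theta, x[:t], x[t])
        return total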
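  • To make the energy-based definition and the positive/negative-phase gradient above concrete: a hedged Python sketch for a toy model, assuming an RBM-style energy \(E(\mathbf{x}, \mathbf{h}) = - \mathbf{b}^{\top} \mathbf{x} - \mathbf{c}^{\top} \mathbf{h} - \mathbf{x}^{\top} W \mathbf{h}\) over binary units (this energy form is an illustrative assumption, not taken from the paper); it normalizes \(\exp(-E)\) by exhaustive enumeration and uses \(\partial E / \partial W_{ij} = - x_i h_j\)

    import itertools
    import numpy as np

    def energy(x, h, W, b, c):
        # assumed RBM-style energy: E(x, h) = -b.x - c.h - x.W.h
        return -(b @ x + c @ h + x @ W @ h)

    def joint_distribution(W, b, c):
        """Normalize exp(-E) over all binary configurations (tractable only for toy sizes)."""
        nx, nh = W.shape
        configs = [(np.array(x), np.array(h))
                   for x in itertools.product([0, 1], repeat=nx)
                   for h in itertools.product([0, 1], repeat=nh)]
        unnorm = np.array([np.exp(-energy(x, h, W, b, c)) for x, h in configs])
        return configs, unnorm / unnorm.sum()

    def grad_W_log_likelihood(x_data, W, b, c):
        """d log P(x_data) / dW = E[x h^T | x = x_data] - E[X H^T]
        (positive phase minus negative phase), since dE/dW_ij = -x_i h_j."""
        configs, p = joint_distribution(W, b, c)
        negative = sum(pi * np.outer(x, h) for (x, h), pi in zip(configs, p))
        clamped = [((x, h), pi) for (x, h), pi in zip(configs, p)
                   if np.array_equal(x, x_data)]
        z_clamped = sum(pi for _, pi in clamped)
        positive = sum(pi / z_clamped * np.outer(x, h) for (x, h), pi in clamped)
        return positive - negative

    # e.g. nx = nh = 2:
    # grad_W_log_likelihood(np.array([1, 0]), np.zeros((2, 2)), np.zeros(2), np.zeros(2))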

2 Temporal Restricted Boltzmann Machines (TRBMs)

  • A TRBM with parameters \(\theta\) defines the probability distribution

    \begin{equation*} \mathbb{P}_{\theta}(\mathbf{x}) = \prod_{t = 0}^{T} \sum_{\tilde{\mathbf{h}}^{[t]}}^{} \mathbb{P}_{\theta} \Big( \mathbf{x}^{[t]}, \tilde{\mathbf{h}}^{[t]} \mid \mathbf{x}^{[t - D, t - 1]}, \mathbf{r}^{[t - D, t - 1]} \Big) \end{equation*}
  • \(\mathbf{r}^{[t]}\) is the expectation of the hidden values given the patterns up to time \(t\)

    \begin{equation*} \mathbf{r}^{[t]} = \mathbb{E}_{\theta} \Big[ \mathbf{H}^{[t]} \mid \mathbf{x}^{[0, t]} \Big] \end{equation*}
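  • A hedged Python sketch of computing \(\mathbf{r}^{[t]}\): assuming binary hidden units and an RBM-style conditional at each step whose hidden bias is shifted by the last \(D\) visible patterns and the last \(D\) hidden expectations, the expectation reduces to a logistic sigmoid; the matrices A and B and this exact parameterization are illustrative assumptions, not necessarily the paper's

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def hidden_expectation(x_t, x_hist, r_hist, W, c, A, B):
        """r[t] = E[H[t] | x[0:t]] for binary hidden units (assumed parameterization).

        x_t:    current visible pattern, shape (nx,)
        x_hist: last D visible patterns, shape (D, nx)
        r_hist: last D hidden expectations, shape (D, nh)
        W:      visible-to-hidden weights, shape (nx, nh)
        c:      static hidden bias, shape (nh,)
        A, B:   assumed history-to-hidden weights, shapes (nh, D * nx) and (nh, D * nh)
        """
        dynamic_bias = c + A @ x_hist.ravel() + B @ r_hist.ravel()  # history-shifted bias
        return sigmoid(dynamic_bias + W.T @ x_t)  # mean of each Bernoulli hidden unit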

3 Dynamic RBM (DyRBM) with hidden units