Notes on: Osogami, T. (2017): Boltzmann machines for time-series

1 Notation

• $$\mathbf{x} = \Big( \mathbf{x}^{[t]} \Big)_{t = 0}^T$$ denotes a multi-dimensional time-series
• $$\mathbf{x}^{[s, t]}$$ denotes the time-series of the patterns from time $$s$$ to $$t$$
• Log-likelihood is then

\begin{equation*} f(\theta) = \log \mathbb{P}_{\theta}(\mathbf{x}) = \sum_{t=0}^{T} \log \mathbb{P}_{\theta} \Big( \mathbf{x}^{[t]} \mid \mathbf{x}^{[0, t - 1]} \Big) \end{equation*}
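The chain-rule decomposition above can be checked numerically on a toy first-order Markov chain (illustrative values, not from the paper): the log of the joint probability equals the sum of the conditional log-probabilities.

```python
import numpy as np

# Hypothetical 2-state first-order Markov chain (illustrative values).
pi = np.array([0.6, 0.4])            # initial distribution P(x^[0])
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])           # P[i, j] = P(x^[t] = j | x^[t-1] = i)

x = [0, 0, 1, 1]                     # an observed time-series x^[0..3]

# Joint probability computed directly ...
joint = pi[x[0]] * np.prod([P[x[t - 1], x[t]] for t in range(1, len(x))])

# ... equals the sum of conditional log-probabilities (the chain rule above).
log_f = np.log(pi[x[0]]) + sum(np.log(P[x[t - 1], x[t]]) for t in range(1, len(x)))

assert np.isclose(np.log(joint), log_f)
```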
• Energy-based models

\begin{equation*} \mathbb{P}_{\theta}(\mathbf{x}) = \sum_{\tilde{\mathbf{h}}} \mathbb{P}_{\theta}(\mathbf{x}, \tilde{\mathbf{h}}) \end{equation*}

where

\begin{equation*} \mathbb{P}_{\theta}(\mathbf{x}, \mathbf{h}) = \frac{\exp \Big( - E_{\theta}(\mathbf{x}, \mathbf{h}) \Big)}{\sum_{\tilde{\mathbf{x}}} \sum_{\tilde{\mathbf{h}}} \exp \Big( - E_{\theta}(\tilde{\mathbf{x}}, \tilde{\mathbf{h}}) \Big)} \end{equation*}
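A minimal sketch of this energy-based definition, assuming an RBM-style bilinear energy (the weights and sizes below are illustrative, not from the paper): the partition function is computed by brute-force enumeration over all binary configurations, and marginalizing the hidden units recovers a properly normalized $$\mathbb{P}_{\theta}(\mathbf{x})$$.

```python
import itertools
import numpy as np

# Toy energy-based model with 2 visible and 1 hidden binary unit
# (illustrative parameters, not from the paper).
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 1))
b = rng.normal(size=2)
c = rng.normal(size=1)

def energy(x, h):
    # Assumed RBM-style energy: E(x, h) = -b.x - c.h - x^T W h
    return -(b @ x + c @ h + x @ W @ h)

V = [np.array(v) for v in itertools.product([0, 1], repeat=2)]
H = [np.array(h) for h in itertools.product([0, 1], repeat=1)]

# Partition function: sum of exp(-E) over all (x, h) configurations.
Z = sum(np.exp(-energy(v, h)) for v in V for h in H)

def p_joint(x, h):
    return np.exp(-energy(x, h)) / Z

def p_marginal(x):
    # P(x) = sum over hidden configurations of P(x, h)
    return sum(p_joint(x, h) for h in H)

# The marginals over all visible configurations sum to one.
total = sum(p_marginal(v) for v in V)
assert np.isclose(total, 1.0)
```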
• $$\mathbf{X}$$ denotes the random time-series
• $$\mathbf{H}$$ denotes the random hidden values
• $$\mathbb{E}_{\theta}$$ denotes the expectation wrt. the model distribution $$\mathbb{P}_{\theta}$$
• $$\mathbb{E}_{\text{target}}$$ denotes the expectation wrt. the target distribution

\begin{equation*} \nabla f(\theta) = - \mathbb{E}_{\text{target}} \Big[ \mathbb{E}_{\theta} \big[ \nabla E_{\theta} (\mathbf{X}, \mathbf{H}) \mid \mathbf{X} \big] \Big] + \mathbb{E}_{\theta} \big[ \nabla E_{\theta}(\mathbf{X}, \mathbf{H}) \big] \end{equation*}

while taking a single time-series $$\mathbf{x}$$ as the target gives

\begin{equation*} \nabla f(\theta) = - \mathbb{E}_{\theta} \Big[ \nabla E_{\theta}(\mathbf{x}, \mathbf{H}) \Big] + \mathbb{E}_{\theta} \Big[ \nabla E_{\theta}(\mathbf{X}, \mathbf{H}) \Big] \end{equation*}
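This two-phase gradient (clamped minus free expectations of $$\nabla E_{\theta}$$) can be verified exactly on a tiny model. For the visible bias $$\mathbf{b}$$ of an assumed RBM-style energy, $$\nabla_{\mathbf{b}} E = -\mathbf{x}$$, so the formula reduces to $$\mathbf{x} - \mathbb{E}_{\theta}[\mathbf{X}]$$; the sketch below checks this against a finite-difference gradient of the exact log-likelihood.

```python
import itertools
import numpy as np

# Tiny RBM: 2 visible, 1 hidden binary units (illustrative parameters).
rng = np.random.default_rng(1)
W = rng.normal(size=(2, 1))
b = rng.normal(size=2)
c = rng.normal(size=1)

def energy(x, h, b):
    return -(b @ x + c @ h + x @ W @ h)

V = [np.array(v) for v in itertools.product([0, 1], repeat=2)]
H = [np.array(h) for h in itertools.product([0, 1], repeat=1)]

def log_p(x, b):
    # Exact log-likelihood via brute-force enumeration.
    logZ = np.log(sum(np.exp(-energy(v, h, b)) for v in V for h in H))
    return np.log(sum(np.exp(-energy(x, h, b)) for h in H)) - logZ

x = np.array([1, 0])

# grad_b E = -x, so the two-phase formula gives:
#   grad_b f = -E_theta[-x | clamped] + E_theta[-X] = x - E_theta[X]
Z = sum(np.exp(-energy(v, h, b)) for v in V for h in H)
EX = sum(v * np.exp(-energy(v, h, b)) for v in V for h in H) / Z
grad_formula = x - EX

# Finite-difference check against the exact log-likelihood.
eps = 1e-6
grad_fd = np.array([
    (log_p(x, b + eps * np.eye(2)[i]) - log_p(x, b - eps * np.eye(2)[i])) / (2 * eps)
    for i in range(2)])
assert np.allclose(grad_formula, grad_fd, atol=1e-5)
```

The first (clamped) term is what contrastive-divergence-style training approximates with data, and the second (free) term with model samples; here both are computed exactly because the state space is tiny.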
• When we want to maximize the log-likelihood of the time-series step by step, we get (for a single $$\mathbf{x}$$)

\begin{equation*} \nabla f_t(\theta) = - \mathbb{E}_{\theta} \Big[ \nabla E_{\theta} \Big( \mathbf{x}^{[t]}, \mathbf{H} \mid \mathbf{x}^{[0, t-1]} \Big) \Big] + \mathbb{E}_{\theta} \Big[ \nabla E_{\theta} \Big( \mathbf{X}^{[t]}, \mathbf{H} \mid \mathbf{x}^{[0, t-1]} \Big) \Big] \end{equation*}
• $$D$$ denotes that we're aiming to model a $$D$$-th order Markov model, i.e. one where

\begin{equation*} \mathbb{P}_{\theta} \Big( \mathbf{x}^{[t]} \mid \mathbf{x}^{[0, t - 1]} \Big) = \mathbb{P}_{\theta} \Big( \mathbf{x}^{[t]} \mid \mathbf{x}^{[t - D, t - 1]} \Big) \end{equation*}
• $$\mathbf{H}^{[< t]}(w)$$ denotes the hidden units at some previous time-step weighted by some $$w$$

2 Temporal Restricted Boltzmann Machines (TRBMs)

• A TRBM with parameters $$\theta$$ defines the probability distribution

\begin{equation*} \mathbb{P}_{\theta}(\mathbf{x}) = \prod_{t = 0}^{T} \sum_{\tilde{\mathbf{h}}^{[t]}} \mathbb{P}_{\theta} \Big( \mathbf{x}^{[t]}, \tilde{\mathbf{h}}^{[t]} \mid \mathbf{x}^{[t - D, t - 1]}, \mathbf{r}^{[t - D, t - 1]} \Big) \end{equation*}
• $$\mathbf{r}^{[t]}$$ denotes the expectation of the hidden values given the observations up to time $$t$$

\begin{equation*} \mathbf{r}^{[t]} = \mathbb{E}_{\theta} \Big[ \mathbf{H}^{[t]} \mid \mathbf{x}^{[0, t]} \Big] \end{equation*}
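A hedged sketch of how this recursion could be computed: for binary hidden units, the conditional expectation is a sigmoid of an effective input built from the hidden bias, the last $$D$$ visible patterns, the last $$D$$ hidden expectations $$\mathbf{r}^{[t-d]}$$, and the current visible pattern. All weight names and shapes below are assumptions for illustration, not the paper's parameterization.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed sizes and weights (illustrative, not from the paper).
D, n_vis, n_hid, T = 2, 3, 4, 6
rng = np.random.default_rng(2)
W_in = rng.normal(size=(D, n_vis, n_hid)) * 0.1   # visible history -> hidden
W_rr = rng.normal(size=(D, n_hid, n_hid)) * 0.1   # past r -> hidden
W_cur = rng.normal(size=(n_vis, n_hid)) * 0.1     # current visible -> hidden
c = np.zeros(n_hid)                               # hidden bias

x = rng.integers(0, 2, size=(T, n_vis))           # observed binary series
r = np.zeros((T, n_hid))                          # r^[t] = E[H^[t] | x^[0, t]]

for t in range(T):
    z = c.copy()
    for d in range(1, D + 1):
        if t - d >= 0:
            z += x[t - d] @ W_in[d - 1]           # conditioning on x^[t-d]
            z += r[t - d] @ W_rr[d - 1]           # conditioning on r^[t-d]
    z += x[t] @ W_cur                             # current observation x^[t]
    r[t] = sigmoid(z)                             # mean-field hidden expectation
```

Because each $$\mathbf{r}^{[t]}$$ depends only on $$\mathbf{x}^{[0,t]}$$ and previously computed expectations, the whole sequence is computed in a single forward pass, which is what makes conditioning on $$\mathbf{r}^{[t-D, t-1]}$$ in the TRBM distribution tractable.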