Notes on: Osogami, T. (2017): Boltzmann machines for time-series

1 Notation

• $$\mathbf{x} = \Big( \mathbf{x}^{[t]} \Big)_{t = 0}^T$$ denotes a multi-dimensional time-series
• $$\mathbf{x}^{[s, t]}$$ denotes the time-series of the patterns from time $$s$$ to $$t$$
• Log-likelihood is then

\begin{equation*} f(\theta) = \log \mathbb{P}_{\theta}(\mathbf{x}) = \sum_{t=0}^{T} \log \mathbb{P}_{\theta} \Big( \mathbf{x}^{[t]} \mid \mathbf{x}^{[0, t - 1]} \Big) \end{equation*}
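The chain-rule decomposition above can be checked numerically on a toy first-order Markov chain (illustrative values, not from the paper): the log of the joint probability equals the sum of the conditional log-probabilities.

```python
import numpy as np

# Hypothetical 2-state first-order Markov chain (illustrative values).
pi = np.array([0.6, 0.4])            # initial distribution P(x^[0])
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])           # P[i, j] = P(x^[t] = j | x^[t-1] = i)

x = [0, 0, 1, 1]                     # an observed time-series x^[0..3]

# Joint probability computed directly ...
joint = pi[x[0]] * np.prod([P[x[t - 1], x[t]] for t in range(1, len(x))])

# ... equals the sum of conditional log-probabilities (the chain rule above).
log_f = np.log(pi[x[0]]) + sum(np.log(P[x[t - 1], x[t]]) for t in range(1, len(x)))

assert np.isclose(np.log(joint), log_f)
```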
• Energy-based models

\begin{equation*} \mathbb{P}_{\theta}(\mathbf{x}) = \sum_{\tilde{\mathbf{h}}} \mathbb{P}_{\theta}(\mathbf{x}, \tilde{\mathbf{h}}) \end{equation*}

where

\begin{equation*} \mathbb{P}_{\theta}(\mathbf{x}, \mathbf{h}) = \frac{\exp \Big( - E_{\theta}(\mathbf{x}, \mathbf{h}) \Big)}{\sum_{\tilde{\mathbf{x}}} \sum_{\tilde{\mathbf{h}}} \exp \Big( - E_{\theta}(\tilde{\mathbf{x}}, \tilde{\mathbf{h}}) \Big)} \end{equation*}
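A minimal sketch of this energy-based definition, assuming an RBM-style bilinear energy (the weights and sizes below are illustrative, not from the paper): the partition function is computed by brute-force enumeration over all binary configurations, and marginalizing the hidden units recovers a properly normalized $$\mathbb{P}_{\theta}(\mathbf{x})$$.

```python
import itertools
import numpy as np

# Toy energy-based model with 2 visible and 1 hidden binary unit
# (illustrative parameters, not from the paper).
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 1))
b = rng.normal(size=2)
c = rng.normal(size=1)

def energy(x, h):
    # Assumed RBM-style energy: E(x, h) = -b.x - c.h - x^T W h
    return -(b @ x + c @ h + x @ W @ h)

V = [np.array(v) for v in itertools.product([0, 1], repeat=2)]
H = [np.array(h) for h in itertools.product([0, 1], repeat=1)]

# Partition function: sum of exp(-E) over all (x, h) configurations.
Z = sum(np.exp(-energy(v, h)) for v in V for h in H)

def p_joint(x, h):
    return np.exp(-energy(x, h)) / Z

def p_marginal(x):
    # P(x) = sum over hidden configurations of P(x, h)
    return sum(p_joint(x, h) for h in H)

# The marginals over all visible configurations sum to one.
total = sum(p_marginal(v) for v in V)
assert np.isclose(total, 1.0)
```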
• $$\mathbf{X}$$ denotes the random time-series
• $$\mathbf{H}$$ denotes the random hidden values
• $$\mathbb{E}_{\theta}$$ denotes the expectation wrt. the model distribution $$\mathbb{P}_{\theta}$$
• $$\mathbb{E}_{\text{target}}$$ denotes the expectation wrt. the target distribution

\begin{equation*} \nabla f(\theta) = - \mathbb{E}_{\text{target}} \Big[ \mathbb{E}_{\theta} \big[ \nabla E_{\theta} (\mathbf{X}, \mathbf{H}) \mid \mathbf{X} \big] \Big] + \mathbb{E}_{\theta} \big[ \nabla E_{\theta}(\mathbf{X}, \mathbf{H}) \big] \end{equation*}

while taking a single time-series $$\mathbf{x}$$ as the target gives

\begin{equation*} \nabla f(\theta) = - \mathbb{E}_{\theta} \Big[ \nabla E_{\theta}(\mathbf{x}, \mathbf{H}) \Big] + \mathbb{E}_{\theta} \Big[ \nabla E_{\theta}(\mathbf{X}, \mathbf{H}) \Big] \end{equation*}
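This two-phase gradient (clamped minus free expectations of $$\nabla E_{\theta}$$) can be verified exactly on a tiny model. For the visible bias $$\mathbf{b}$$ of an assumed RBM-style energy, $$\nabla_{\mathbf{b}} E = -\mathbf{x}$$, so the formula reduces to $$\mathbf{x} - \mathbb{E}_{\theta}[\mathbf{X}]$$; the sketch below checks this against a finite-difference gradient of the exact log-likelihood.

```python
import itertools
import numpy as np

# Tiny RBM: 2 visible, 1 hidden binary units (illustrative parameters).
rng = np.random.default_rng(1)
W = rng.normal(size=(2, 1))
b = rng.normal(size=2)
c = rng.normal(size=1)

def energy(x, h, b):
    return -(b @ x + c @ h + x @ W @ h)

V = [np.array(v) for v in itertools.product([0, 1], repeat=2)]
H = [np.array(h) for h in itertools.product([0, 1], repeat=1)]

def log_p(x, b):
    # Exact log-likelihood via brute-force enumeration.
    logZ = np.log(sum(np.exp(-energy(v, h, b)) for v in V for h in H))
    return np.log(sum(np.exp(-energy(x, h, b)) for h in H)) - logZ

x = np.array([1, 0])

# grad_b E = -x, so the two-phase formula gives:
#   grad_b f = -E_theta[-x | clamped] + E_theta[-X] = x - E_theta[X]
Z = sum(np.exp(-energy(v, h, b)) for v in V for h in H)
EX = sum(v * np.exp(-energy(v, h, b)) for v in V for h in H) / Z
grad_formula = x - EX

# Finite-difference check against the exact log-likelihood.
eps = 1e-6
grad_fd = np.array([
    (log_p(x, b + eps * np.eye(2)[i]) - log_p(x, b - eps * np.eye(2)[i])) / (2 * eps)
    for i in range(2)])
assert np.allclose(grad_formula, grad_fd, atol=1e-5)
```

The first (clamped) term is what contrastive-divergence-style training approximates with data, and the second (free) term with model samples; here both are computed exactly because the state space is tiny.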
• When we want to maximize the log-likelihood of the time-series step by step, we get (for a single $$\mathbf{x}$$)

\begin{equation*} \nabla f_t(\theta) = - \mathbb{E}_{\theta} \Big[ \nabla E_{\theta} \Big( \mathbf{x}^{[t]}, \mathbf{H} \mid \mathbf{x}^{[0, t-1]} \Big) \Big] + \mathbb{E}_{\theta} \Big[ \nabla E_{\theta} \Big( \mathbf{X}^{[t]}, \mathbf{H} \mid \mathbf{x}^{[0, t-1]} \Big) \Big] \end{equation*}
• $$D$$ denotes that we're aiming to model a $$D$$-th order Markov model, i.e. one where

\begin{equation*} \mathbb{P}_{\theta} \Big( \mathbf{x}^{[t]} \mid \mathbf{x}^{[0, t - 1]} \Big) = \mathbb{P}_{\theta} \Big( \mathbf{x}^{[t]} \mid \mathbf{x}^{[t - D, t - 1]} \Big) \end{equation*}
• $$\mathbf{H}^{[< t]}(w)$$ denotes the hidden units at some previous time-step weighted by some $$w$$

2 Temporal Restricted Boltzmann Machines (TRBMs)

• A TRBM with parameters $$\theta$$ defines the probability distribution

\begin{equation*} \mathbb{P}_{\theta}(\mathbf{x}) = \prod_{t = 0}^{T} \sum_{\tilde{\mathbf{h}}^{[t]}} \mathbb{P}_{\theta} \Big( \mathbf{x}^{[t]}, \tilde{\mathbf{h}}^{[t]} \mid \mathbf{x}^{[t - D, t - 1]}, \mathbf{r}^{[t - D, t - 1]} \Big) \end{equation*}
• $$\mathbf{r}^{[t]}$$ denotes the expectation of the hidden values given the observations up to time $$t$$

\begin{equation*} \mathbf{r}^{[t]} = \mathbb{E}_{\theta} \Big[ \mathbf{H}^{[t]} \mid \mathbf{x}^{[0, t]} \Big] \end{equation*}
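A hedged sketch of how this recursion could be computed: for binary hidden units, the conditional expectation is a sigmoid of an effective input built from the hidden bias, the last $$D$$ visible patterns, the last $$D$$ hidden expectations $$\mathbf{r}^{[t-d]}$$, and the current visible pattern. All weight names and shapes below are assumptions for illustration, not the paper's parameterization.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed sizes and weights (illustrative, not from the paper).
D, n_vis, n_hid, T = 2, 3, 4, 6
rng = np.random.default_rng(2)
W_in = rng.normal(size=(D, n_vis, n_hid)) * 0.1   # visible history -> hidden
W_rr = rng.normal(size=(D, n_hid, n_hid)) * 0.1   # past r -> hidden
W_cur = rng.normal(size=(n_vis, n_hid)) * 0.1     # current visible -> hidden
c = np.zeros(n_hid)                               # hidden bias

x = rng.integers(0, 2, size=(T, n_vis))           # observed binary series
r = np.zeros((T, n_hid))                          # r^[t] = E[H^[t] | x^[0, t]]

for t in range(T):
    z = c.copy()
    for d in range(1, D + 1):
        if t - d >= 0:
            z += x[t - d] @ W_in[d - 1]           # conditioning on x^[t-d]
            z += r[t - d] @ W_rr[d - 1]           # conditioning on r^[t-d]
    z += x[t] @ W_cur                             # current observation x^[t]
    r[t] = sigmoid(z)                             # mean-field hidden expectation
```

Because each $$\mathbf{r}^{[t]}$$ depends only on $$\mathbf{x}^{[0,t]}$$ and previously computed expectations, the whole sequence is computed in a single forward pass, which is what makes conditioning on $$\mathbf{r}^{[t-D, t-1]}$$ in the TRBM distribution tractable.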