Here’s a short post on probabilistic inference in Gaussian Bayesian Belief Networks. The math is elegant!
The junction tree algorithm (JTA) is a widely used algorithm for exact inference in Bayesian Belief Networks (BBNs). A great paper for learning the mechanics of the JTA is the one by Huang and Darwiche; however, it focuses only on discrete variables and leaves a lot to be desired if one has continuous variables.
The JTA may also be applied to continuous variables, and in particular, to a set of Gaussian variables (multivariate Gaussian). The key idea in the JTA is to transform a directed acyclic graph (DAG) into a junction tree (JT). Once a JT is created, probabilistic inference to extract marginal probabilities (with or without evidence) is possible. Just like a BBN, which is composed of a graph (structure), $G$, and a joint probability (parameters), $P$, a JT also has this type of duality. The graph of a JT is composed of nodes (*cliques* and *separation sets*) and edges (structure), and associated with each node is a *potential* (parameters).
Potentials are the key to inference in the JTA. Inference is conducted by passing *messages* (potentials) around. The representation of potentials is one of the two key differences between using the JTA for discrete versus continuous variables (the other being how evidence is injected into the JT). For the JTA with discrete variables, a potential is essentially a table. Below, we have a clique associated with the binary variables $A$ and $B$. The potential associated with this clique is the cross-product of the domains, with a value (typically in $[0, 1]$) associated with each combination of instantiations.
| A   | B   | value |
|-----|-----|-------|
| on  | on  | 0.2   |
| on  | off | 0.3   |
| off | on  | 0.4   |
| off | off | 0.8   |
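In code, such a table potential is little more than a lookup from instantiations to values. A minimal Python sketch (my own, not part of the original post), using the numbers from the table above:

```python
# a discrete potential over the binary variables A and B,
# keyed by (A, B) instantiations
potential_AB = {
    ("on",  "on"):  0.2,
    ("on",  "off"): 0.3,
    ("off", "on"):  0.4,
    ("off", "off"): 0.8,
}

print(potential_AB[("on", "off")])  # 0.3
```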
The representation of potentials in a JT for Gaussian variables is entirely different, but it is still based on the multivariate Gaussian density function. Let's look at that density function.
$f_X(x_1, x_2, \ldots, x_n)=\cfrac{\exp\left(-\frac{1}{2}(\mathbf{x}-\mathbf{\mu})^T\Sigma^{-1}(\mathbf{x}-\mathbf{\mu})\right)}{(2\pi)^{\frac{n}{2}}|\Sigma|^{\frac{1}{2}}}$
Where $\mathbf{x} = (x_1, x_2, \ldots, x_n)^T$ is the vector of variables, $\mathbf{\mu}$ is the mean vector, $\Sigma$ is the covariance matrix (with determinant $|\Sigma|$), and $n$ is the number of variables.
This form of the multivariate Gaussian is called the *covariance form*. In expanded form, the density function may be rewritten as follows.
$\begin{equation} \begin{split} f_X(x_1, x_2, \ldots, x_n) & = \cfrac{\exp\left(-\frac{1}{2}(\mathbf{x}-\mathbf{\mu})^T\Sigma^{-1}(\mathbf{x}-\mathbf{\mu})\right)}{(2\pi)^{\frac{n}{2}}|\Sigma|^{\frac{1}{2}}} \\ & = \exp\left(-\frac{1}{2}\mathbf{x}^T\Sigma^{-1}\mathbf{x} + \mathbf{\mu}^T\Sigma^{-1}\mathbf{x} - \frac{1}{2}\mathbf{\mu}^T\Sigma^{-1}\mathbf{\mu} - \log\left((2\pi)^{\frac{n}{2}}|\Sigma|^{\frac{1}{2}}\right)\right) \end{split} \end{equation}$
Let's denote the following.

$K = \Sigma^{-1}$

$\mathbf{h} = \Sigma^{-1}\mathbf{\mu}$

$g = -\frac{1}{2}\mathbf{\mu}^T\Sigma^{-1}\mathbf{\mu} - \log\left((2\pi)^{\frac{n}{2}}|\Sigma|^{\frac{1}{2}}\right)$
$K$ is called the *precision matrix* or *information matrix*, and it is the inverse of the covariance matrix. $\mathbf{h}$ has a special name and, depending on the context, is called the *information vector*, *observation vector*, or even *shift vector*. Note that $\mathbf{h}$ is an $n$-dimensional vector and $g$ evaluates to a scalar (a single number). Substituting $K$, $\mathbf{h}$, and $g$ into the *covariance form*, we can rewrite the multivariate density function as follows.
$\begin{equation} \begin{split} f_X(x_1, x_2, \ldots, x_n) & = \cfrac{\exp\left(-\frac{1}{2}(\mathbf{x}-\mathbf{\mu})^T\Sigma^{-1}(\mathbf{x}-\mathbf{\mu})\right)}{(2\pi)^{\frac{n}{2}}|\Sigma|^{\frac{1}{2}}} \\ & = \exp\left(-\frac{1}{2}\mathbf{x}^T\Sigma^{-1}\mathbf{x} + \mathbf{\mu}^T\Sigma^{-1}\mathbf{x} - \frac{1}{2}\mathbf{\mu}^T\Sigma^{-1}\mathbf{\mu} - \log\left((2\pi)^{\frac{n}{2}}|\Sigma|^{\frac{1}{2}}\right)\right) \\ & = \exp\left(-\frac{1}{2}\mathbf{x}^TK\mathbf{x} + \mathbf{h}^T\mathbf{x} + g\right) \end{split} \end{equation}$
This last form of the multivariate Gaussian is the potential, $\phi$, of a node in a JT of all Gaussian variables. This form is also called the *canonical form* or *information form* of the multivariate Gaussian.
$\phi(\mathbf{x}; K, \mathbf{h}, g) = \exp\left(-\frac{1}{2}\mathbf{x}^TK\mathbf{x} + \mathbf{h}^T\mathbf{x} + g\right)$
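As a concrete check, here is a minimal numpy sketch (my own, not from the original post) that converts a mean vector and covariance matrix into $(K, \mathbf{h}, g)$ and verifies that the canonical-form potential evaluates to the same number as the covariance-form density from scipy:

```python
import numpy as np
from scipy.stats import multivariate_normal

def to_canonical(mu, Sigma):
    """Convert the covariance form (mu, Sigma) to the canonical form (K, h, g)."""
    K = np.linalg.inv(Sigma)
    h = K @ mu
    n = len(mu)
    g = -0.5 * mu @ K @ mu - np.log((2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma)))
    return K, h, g

def phi(x, K, h, g):
    """Evaluate the canonical-form potential at x."""
    return np.exp(-0.5 * x @ K @ x + h @ x + g)

mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
K, h, g = to_canonical(mu, Sigma)

x = np.array([0.5, 1.5])
print(phi(x, K, h, g))                        # canonical form
print(multivariate_normal(mu, Sigma).pdf(x))  # covariance form (same value)
```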
When marginalizing over a multivariate Gaussian distribution, it is easier to do so using the *covariance form*. When conditioning a multivariate Gaussian distribution on evidence, it is easier to do so using the *information form*.
The three main operations we would like to do against $\phi$ are multiplication, division, and marginalization.
We may multiply two potentials as follows (omitting the reference to $\mathbf{x}$).
$\phi(K_1,\mathbf{h}_1, g_1) \cdot \phi(K_2,\mathbf{h}_2, g_2) = \phi(K_1 + K_2, \mathbf{h}_1 + \mathbf{h}_2, g_1 + g_2)$
We may divide two potentials as follows.
$\cfrac{\phi(K_1,\mathbf{h}_1, g_1)}{\phi(K_2,\mathbf{h}_2, g_2)} = \phi(K_1 - K_2, \mathbf{h}_1 - \mathbf{h}_2, g_1 - g_2)$
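Under the assumption that both potentials have already been extended to the same set of variables in the same order (in a full implementation, $K$ and $\mathbf{h}$ would first be padded with zeros to a common domain), multiplication and division reduce to element-wise addition and subtraction. A minimal sketch:

```python
import numpy as np

def multiply(p1, p2):
    """Multiply two canonical-form potentials (K, h, g) defined over the same variables."""
    K1, h1, g1 = p1
    K2, h2, g2 = p2
    return K1 + K2, h1 + h2, g1 + g2

def divide(p1, p2):
    """Divide one canonical-form potential (K, h, g) by another over the same variables."""
    K1, h1, g1 = p1
    K2, h2, g2 = p2
    return K1 - K2, h1 - h2, g1 - g2

# dividing a product by one of its factors recovers the other factor
p1 = (np.eye(2) * 2.0, np.array([1.0, 0.0]), -1.0)
p2 = (np.eye(2) * 3.0, np.array([0.0, 2.0]), -2.0)
print(divide(multiply(p1, p2), p2))  # equals p1
```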
*Marginalization* over $\phi$ is integration and is used in message passing. Assume two disjoint sets of variables $X$ and $Y$; then the potential is written as $\phi(X,Y; K, \mathbf{h}, g)$. We may integrate $\phi(X,Y; K, \mathbf{h}, g)$ over $Y$ to give us $\phi(X; K', \mathbf{h}', g')$.
$\begin{equation} \begin{split} \int \phi(X,Y; K, \mathbf{h}, g) dY & = \phi(X; K', \mathbf{h}', g') \end{split} \end{equation}$
Where, after partitioning $K$ and $\mathbf{h}$ according to $X$ and $Y$ as $K = \begin{pmatrix} K_{XX} & K_{XY} \\ K_{YX} & K_{YY} \end{pmatrix}$ and $\mathbf{h} = \begin{pmatrix} \mathbf{h}_X \\ \mathbf{h}_Y \end{pmatrix}$,

$K' = K_{XX} - K_{XY}K_{YY}^{-1}K_{YX}$

$\mathbf{h}' = \mathbf{h}_X - K_{XY}K_{YY}^{-1}\mathbf{h}_Y$

$g' = g + \frac{1}{2}\left(\log\left|2\pi K_{YY}^{-1}\right| + \mathbf{h}_Y^T K_{YY}^{-1}\mathbf{h}_Y\right)$
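A minimal numpy sketch of this marginalization (my own, assuming the variables to keep and to integrate out are given as index lists into $K$ and $\mathbf{h}$):

```python
import numpy as np

def marginalize(K, h, g, keep, drop):
    """Integrate a canonical-form potential over the variables indexed by `drop`,
    returning the potential over the variables indexed by `keep`."""
    Kxx = K[np.ix_(keep, keep)]
    Kxy = K[np.ix_(keep, drop)]
    Kyx = K[np.ix_(drop, keep)]
    Kyy = K[np.ix_(drop, drop)]
    hx, hy = h[keep], h[drop]

    Kyy_inv = np.linalg.inv(Kyy)
    K_new = Kxx - Kxy @ Kyy_inv @ Kyx
    h_new = hx - Kxy @ Kyy_inv @ hy
    g_new = g + 0.5 * (np.log(np.linalg.det(2 * np.pi * Kyy_inv)) + hy @ Kyy_inv @ hy)
    return K_new, h_new, g_new
```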
When there is *evidence*, say the variables in $Y$ are observed to take the values $\mathbf{y}$, $\phi(X,Y; K, \mathbf{h}, g)$ may be reduced to $\phi(X; K', \mathbf{h}', g')$, where (using the same partitioning as above)

$K' = K_{XX}$

$\mathbf{h}' = \mathbf{h}_X - K_{XY}\mathbf{y}$

$g' = g + \mathbf{h}_Y^T\mathbf{y} - \frac{1}{2}\mathbf{y}^T K_{YY}\mathbf{y}$
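And a matching numpy sketch for entering evidence (again my own, with the observed variables given as an index list and their values as a vector):

```python
import numpy as np

def reduce_evidence(K, h, g, keep, observed, y):
    """Reduce a canonical-form potential given observed values y for the
    variables indexed by `observed`, returning a potential over `keep`."""
    Kxx = K[np.ix_(keep, keep)]
    Kxy = K[np.ix_(keep, observed)]
    Kyy = K[np.ix_(observed, observed)]
    hx, hy = h[keep], h[observed]

    K_new = Kxx
    h_new = hx - Kxy @ y
    g_new = g + hy @ y - 0.5 * y @ Kyy @ y
    return K_new, h_new, g_new
```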
For two cliques $X$ and $Y$ with a separation set $R$ between them, message passing from $X$ to $Y$ through $R$ consists of *projection* and *absorption*. The cliques and separation set have the potentials $\phi_X$, $\phi_Y$, and $\phi_R$, respectively.
Projection involves saving $\phi_R$ (call the saved copy $\phi_R^0$) and replacing it with the marginalization of $\phi_X$ onto the variables in $R$.

Absorption involves multiplication and division of potentials: $\phi_Y$ absorbs the change by being multiplied by the ratio of the new and old separator potentials, as written out below.
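Written with the canonical-form operations above (this mirrors the projection and absorption equations in Huang and Darwiche, with table marginalization replaced by integration), a single message pass from $X$ to $Y$ through $R$ is:

$\phi_R^0 = \phi_R, \qquad \phi_R \leftarrow \int_{X \setminus R} \phi_X, \qquad \phi_Y \leftarrow \phi_Y \cdot \cfrac{\phi_R}{\phi_R^0}$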
In the JTA, *global propagation* uses message passing in two phases, *collect-evidence* and *distribute-evidence*, to make the potentials locally consistent. In collect-evidence, we start by picking an initial clique (not a separation set) in the JT and recursively travel to the terminal cliques (marking each clique along the way). Once a terminal clique is reached, messages are passed back toward the clique that came before it. In distribute-evidence, we start with the same initial clique, pass messages to its neighboring cliques, and then recursively call distribute-evidence on those neighboring cliques.
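The recursion itself is compact. Here is a sketch (my own, against a hypothetical junction-tree object that exposes `neighbors(clique)` and `pass_message(source, target)` built from the operations above):

```python
def collect_evidence(jt, clique, marked):
    """Recurse toward terminal cliques, then pass messages back toward the initial clique."""
    marked.add(clique)
    for neighbor in jt.neighbors(clique):
        if neighbor not in marked:
            collect_evidence(jt, neighbor, marked)
            jt.pass_message(source=neighbor, target=clique)

def distribute_evidence(jt, clique, marked):
    """Pass messages outward from the initial clique, then recurse on the neighbors."""
    marked.add(clique)
    for neighbor in jt.neighbors(clique):
        if neighbor not in marked:
            jt.pass_message(source=clique, target=neighbor)
            distribute_evidence(jt, neighbor, marked)

def global_propagation(jt, initial_clique):
    """Two-phase global propagation that makes the potentials locally consistent."""
    collect_evidence(jt, initial_clique, marked=set())
    distribute_evidence(jt, initial_clique, marked=set())
```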
After global propagation, for any variable whose density function we are interested in, we pick a clique containing that variable and marginalize over the other variables in the clique.
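Once a potential has been marginalized down to the variable (or set of variables) of interest, its mean and covariance follow directly from the definitions of $K$ and $\mathbf{h}$ above: $\Sigma = K^{-1}$ and $\mathbf{\mu} = K^{-1}\mathbf{h}$. A short sketch:

```python
import numpy as np

def to_moments(K, h):
    """Recover the covariance form (mu, Sigma) from a canonical-form potential."""
    Sigma = np.linalg.inv(np.atleast_2d(K))
    mu = Sigma @ np.atleast_1d(h)
    return mu, Sigma
```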