I first learned this solution from Botond Cseke. I'm not sure where it originates; it is essentially Laplace's method for approximating integrals with a Gaussian distribution, where the parameters of that Gaussian may come from any of several approximate inference procedures.
If I have a Bayesian statistical model with hyperparameters $\Theta$, but no closed-form posterior, how can I optimize $\Theta$?
Consider a Bayesian statistical model with observed data $\mathbf y\in\mathcal Y$ and hidden (latent) variables $\mathbf z\in\mathcal Z$, which we wish to infer. We have a prior on $\mathbf z$, $\Pr(\mathbf z;\Theta)$, and a model for the probability of $\mathbf y$ given $\mathbf z$ (the likelihood), $\Pr(\mathbf y|\mathbf z; \Theta)$. The prior and likelihood are controlled by "hyperparameters" $\Theta$, which we would like to estimate. Recall that Bayes' theorem states:
\begin{equation} \begin{aligned} \Pr(\mathbf z|\mathbf y;\Theta) = \Pr(\mathbf y|\mathbf z; \Theta) \frac {\Pr(\mathbf z;\Theta)} {\Pr(\mathbf y;\Theta)} \end{aligned} \end{equation}It is common for the posterior $\Pr(\mathbf z|\mathbf y;\Theta)$ to lack a closed-form solution. In this case, one typically approximates the posterior with a more tractable distribution $Q(\mathbf z)\approx\Pr(\mathbf z|\mathbf y;\Theta)$. Common ways of estimating $Q(\mathbf z)$ include the Laplace approximation, variational Bayes, expectation propagation, and expectation-maximization algorithms. The only approximating distribution in common use for high-dimensional $\mathbf z$ is the multivariate Gaussian (or some nonlinear transformation thereof), which succinctly captures joint statistics with limited computational overhead. Assume we have an inference procedure that returns the approximate posterior $Q(\mathbf z)=\mathcal N(\boldsymbol\mu_q,\boldsymbol\Sigma_q)$.
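For concreteness, here is a minimal sketch of one way to obtain such a Gaussian approximation, using the Laplace approximation: locate the mode of the log joint and take the inverse of its negative Hessian there as the covariance. The toy model (a Gaussian prior on $\mathbf z$ with Poisson observations) and all names here are assumptions for illustration only, not part of the method being described.

```python
import numpy as np
from scipy.optimize import minimize

# Toy model (illustrative assumption): Gaussian prior on z, Poisson counts y with rate exp(z)
rng = np.random.default_rng(0)
D = 5                                          # dimension of the latent vector z
mu_z    = np.zeros(D)                          # prior mean (in general a function of Theta)
Sigma_z = np.eye(D)                            # prior covariance (in general a function of Theta)
y = rng.poisson(np.exp(rng.normal(size=D)))    # simulated observations

def log_joint(z):
    """ln Pr(y|z) + ln Pr(z; Theta), up to additive constants in z."""
    log_lik   = np.sum(y * z - np.exp(z))      # Poisson likelihood with log-rate z
    resid     = z - mu_z
    log_prior = -0.5 * resid @ np.linalg.solve(Sigma_z, resid)
    return log_lik + log_prior

# Laplace approximation: find the posterior mode, then use the curvature there
opt  = minimize(lambda z: -log_joint(z), x0=np.zeros(D))
mu_q = opt.x
H = np.diag(np.exp(mu_q)) + np.linalg.inv(Sigma_z)   # negative Hessian of log_joint at mu_q
Sigma_q = np.linalg.inv(H)                            # Q(z) = N(mu_q, Sigma_q)
```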
We optimize the hyperparameters $\Theta$ of the prior to maximize the marginal likelihood of the observations $\mathbf y$:
\begin{equation} \begin{aligned} \Theta &\gets\underset{\Theta}{\operatorname{argmax}} \Pr(\mathbf y; \Theta) \\ \Pr(\mathbf y; \Theta) &= \int_{\mathcal Z} \Pr(\mathbf y, \mathbf z; \Theta) \, d\mathbf z = \int_{\mathcal Z} \Pr(\mathbf y | \mathbf z; \Theta) \Pr(\mathbf z; \Theta) \, d\mathbf z \end{aligned} \end{equation}Except in rare special cases, this integral has no closed form. However, we have already obtained a Gaussian approximation to the posterior distribution, $Q(\mathbf z)\approx \Pr(\mathbf z|\mathbf y;\Theta)$. If we replace $\Pr(\mathbf z|\mathbf y;\Theta)$ with our approximation $Q(\mathbf z)$ in Bayes' theorem above, we can solve for (an approximation of) $\Pr(\mathbf y;\Theta)$:
\begin{equation} \begin{aligned} Q(\mathbf z) \approx \Pr(\mathbf y|\mathbf z;\Theta) \frac {\Pr(\mathbf z;\Theta)} {\Pr(\mathbf y;\Theta)} & \Rightarrow \Pr(\mathbf y;\Theta) \approx \Pr(\mathbf y|\mathbf z;\Theta) \frac {\Pr(\mathbf z;\Theta)} {Q(\mathbf z)} \end{aligned} \end{equation}Working in log-probability, assuming a Gaussian prior $\Pr(\mathbf z;\Theta)=\mathcal N(\boldsymbol\mu_z,\boldsymbol\Sigma_z)$ whose parameters depend on $\Theta$, and evaluating the expression at the (approximate) posterior mean $\mathbf z = \boldsymbol\mu_q$, we get
\begin{equation} \begin{aligned} \ln\Pr(\mathbf z {=} \boldsymbol\mu_q;\Theta) &= -\tfrac 1 2 \left\{ \ln|2\pi\boldsymbol\Sigma_z| + (\boldsymbol\mu_q - \boldsymbol\mu_z)^\top \boldsymbol\Sigma_z^{-1} (\boldsymbol\mu_q - \boldsymbol\mu_z) \right\} \\ \ln Q(\mathbf z {=} \boldsymbol\mu_q) &= -\tfrac 1 2 \left\{ \ln|2\pi\boldsymbol\Sigma_q| + (\boldsymbol\mu_q - \boldsymbol\mu_q)^\top \boldsymbol\Sigma_q^{-1} (\boldsymbol\mu_q - \boldsymbol\mu_q) \right\} = -\tfrac 1 2 \ln|2\pi\boldsymbol\Sigma_q| \\~ \\~ \ln\Pr(\mathbf y ; \Theta) & \approx \ln\Pr(\mathbf y | \mathbf z {=} \boldsymbol\mu_q;\Theta) + \ln\Pr(\mathbf z {=} \boldsymbol\mu_q;\Theta) - \ln Q(\mathbf z {=} \boldsymbol\mu_q) \\ &= \ln\Pr(\mathbf y | \mathbf z {=} \boldsymbol\mu_q;\Theta) -\tfrac 1 2 \left\{ \ln|\boldsymbol\Sigma_q^{-1}\boldsymbol\Sigma_z| + (\boldsymbol\mu_q - \boldsymbol\mu_z)^\top \boldsymbol\Sigma_z^{-1} (\boldsymbol\mu_q - \boldsymbol\mu_z) \right\}. \end{aligned} \end{equation}This is quite tractable to compute.
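A minimal sketch of evaluating this expression, assuming NumPy/SciPy (the function name is my own, not any library's API): the helper computes the approximation above, and the check below uses a 1-D linear-Gaussian model, where $Q(\mathbf z)$ equals the exact posterior, so the approximation recovers the exact log evidence.

```python
import numpy as np
from numpy.linalg import slogdet, solve
from scipy.stats import norm

def approx_log_evidence(mu_z, Sigma_z, mu_q, Sigma_q, log_lik_at_mode):
    """ln Pr(y; Theta) ~= ln Pr(y | z=mu_q; Theta)
       - 0.5 * ( ln|Sigma_q^{-1} Sigma_z| + (mu_q - mu_z)' Sigma_z^{-1} (mu_q - mu_z) )"""
    resid = mu_q - mu_z
    _, logdet_z = slogdet(Sigma_z)      # ln|Sigma_q^{-1} Sigma_z| = ln|Sigma_z| - ln|Sigma_q|
    _, logdet_q = slogdet(Sigma_q)
    quad = resid @ solve(Sigma_z, resid)
    return log_lik_at_mode - 0.5 * (logdet_z - logdet_q + quad)

# Sanity check on a 1-D linear-Gaussian model, where Q(z) is the exact posterior
# and the approximation is therefore exact.
mu_z, s2_z, s2_y, y = 0.3, 2.0, 0.5, 1.7        # arbitrary illustrative numbers
s2_q = 1.0 / (1.0 / s2_z + 1.0 / s2_y)          # exact posterior variance
mu_q = s2_q * (mu_z / s2_z + y / s2_y)          # exact posterior mean
log_lik = norm.logpdf(y, loc=mu_q, scale=np.sqrt(s2_y))       # ln Pr(y | z = mu_q)

approx = approx_log_evidence(np.array([mu_z]), np.array([[s2_z]]),
                             np.array([mu_q]), np.array([[s2_q]]), log_lik)
exact  = norm.logpdf(y, loc=mu_z, scale=np.sqrt(s2_z + s2_y)) # y ~ N(mu_z, s2_z + s2_y)
print(approx, exact)                            # agree up to floating-point error
```

In practice one would wrap this in an outer loop over $\Theta$: for each candidate, rerun the approximate inference to obtain $(\boldsymbol\mu_q,\boldsymbol\Sigma_q)$, evaluate the approximate log evidence, and hand its negative to an off-the-shelf optimizer such as `scipy.optimize.minimize`.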