Epilogue: These notes were part of a project to infer unobserved ("latent") states in a neural field model, as well as the parameters of that model, from spiking observations. It has since been published. Ultimately, for speed we ended up selecting the Laplace approximation for the measurement update, solved via Newton-Raphson.
We are interested in approximations to the measurement update for a spatially-extended latent-variable point-process, where the latent variables are also spatial fields that undergo dynamics similar to chemical reaction-diffusion systems.
The latent variables or fields are concentrations, activations, or some similar physical quantity. They are therefore constrained to be non-negative, and typically must also obey conservation laws. The observed point-process intensity field must likewise be non-negative.
Such systems arise in chemical reaction-diffusion models, epidemiological models, and neural field models, in which the measurement is a point process coupled indirectly to the latent spatiotemporal system.
Problem statement
We will use a multivariate Gaussian approximation to model the joint distribution of latent variables. In the continuous field case, this is a Gaussian process. In the numerical implementation, we project this process onto a finite basis to give a finite-dimensional multivariate Gaussian distribution. The filtering update therefore requires finding a multivariate Gaussian approximation to the non-conjugate update of a Poisson observation and a multivariate Gaussian prior.
Estimating this posterior involves an integral that becomes intractable in high dimensions. There are numerous approximation methods to handle this, including moment-matching, variational Bayes, expected log-likelihoods, and the Laplace approximation. Often, the choice of link function further constrains which methods are applicable, as efficient algorithms may only be available for certain classes of link function. The update must also be constrained to prevent negative activity in the estimated latent states.
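To make the structure of this update concrete, the following is a minimal numpy sketch of the unnormalized log-posterior that all of the methods below approximate. The names, the basis-to-region matrix B, and the generic link f are illustrative assumptions rather than the exact parameterization developed later.

```python
import numpy as np

def neg_log_posterior(z, y, B, m0, P0, f):
    """Unnormalized negative log-posterior -log p(z | y) + const.

    z  : latent basis weights               (n_basis,)
    y  : observed counts per region         (n_regions,)
    B  : basis-to-region projection matrix  (n_regions, n_basis)
    m0 : prior mean;  P0 : prior precision  (Gaussian prior on z)
    f  : link function mapping activation to point-process intensity
    """
    lam = f(B @ z)                         # per-region intensity
    if np.any(lam <= 0):                   # Poisson likelihood undefined here
        return np.inf
    nll = np.sum(lam - y * np.log(lam))    # Poisson term (dropping log y!)
    dz = z - m0
    return nll + 0.5 * dz @ P0 @ dz        # Gaussian prior term
```

Every approximation discussed below amounts to replacing the exponential of the negative of this quantity with a Gaussian whose normalizing constant can actually be computed.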
Multivariate Gaussian model
Let
Latent field definition and discretization
This activation field is mapped to point-process intensities via a link function
In practice, we project this continuous, infinite-dimensional process onto a finite set of discrete basis elements
where
where
where
We introduce the gain and bias parameters to decouple inhomogeneity in the measurements from the underlying latent spatiotemporal process. For example, in the case of retinal waves, different regions have differing densities of retinal ganglion cells, which amounts to a spatially inhomogeneous gain. Additionally, the amount of spontaneous background activity varies from region to region. In order to build up mathematical solutions that are immediately useful for numerical implementation, we carry through these bias and gain parameters in the derivations that follow.
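As a purely illustrative example of how such parameters might enter (the specific form used in the equations of these notes may differ), a per-region gain can scale the linked activation while a bias adds a spatially varying background rate:

```python
import numpy as np

def regional_intensity(a, gain, bias, f=np.exp):
    """Illustrative parameterization (an assumption, not the exact form used
    in these notes): gain scales the linked activation per region, and bias
    adds a spatially inhomogeneous background rate.
    a, gain, bias : arrays of shape (n_regions,)"""
    return gain * f(a) + bias
```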
Count observations and the measurement posterior
Given the Gaussian prior, the posterior estimate of the latent activations
We observe event counts over a finite number of spatial regions, and these observed counts are independent conditioned on the per-region intensity, so we can write:
The dependence of the counts on the latent activation can be expanded to include the regional intensity
Since the dependence of
We observe regional counts
The log-posterior
Consider the logarithmic form of the measurement update:
The prior
The conditional log-likelihood is given by the Poisson observation model with regional intensity
The marginal log-likelihood of the count observations
Approximation methods
In this section, we explore various approaches to obtaining a Gaussian approximation to the posterior
Laplace approximation
For the Laplace approximation, we find the mode of the posterior and interpret the curvature at this mode as the inverse of the covariance matrix. Since we are dealing with spatiotemporal processes driven by physical agents (e.g. molecules, neurons, humans), we constrain the posterior mode to be non-negative. This departs slightly from the traditional Laplace approximation, in which the posterior mode is a non-extremal local maximum with zero slope. For this reason, the interpretation of the curvature at the mode as the inverse of the covariance must be treated with caution.
This can be solved with gradient descent or the Newton-Raphson method, which requires the gradient and Hessian of the log-posterior with respect to
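A minimal sketch of such a Newton-Raphson mode search, with the non-negativity constraint enforced by projecting each iterate onto the non-negative orthant (the function names, damping term, and stopping rule are assumptions for illustration):

```python
import numpy as np

def laplace_mode(z0, grad, hess, n_iter=50, tol=1e-8, damping=1e-6):
    """Find the posterior mode by Newton-Raphson on the negative log-posterior.

    grad, hess : callables returning the gradient and Hessian of the
                 negative log-posterior at a given point.
    The iterate is clipped to remain non-negative, departing from the
    unconstrained Laplace step as discussed above."""
    z = np.maximum(z0, 0.0)
    for _ in range(n_iter):
        g = grad(z)
        H = hess(z) + damping * np.eye(len(z))   # regularize for numerical safety
        z_new = np.maximum(z - np.linalg.solve(H, g), 0.0)
        if np.max(np.abs(z_new - z)) < tol:
            return z_new
        z = z_new
    return z
```

The Laplace covariance is then the inverse Hessian at the returned mode, e.g. np.linalg.inv(hess(z_mode)); as noted above, this interpretation should be treated with caution when the mode lies on the boundary.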
We introduce some abbreviations to simplify the notation in the derivations that follow. Denote log-probabilities "
Note that
where
In the case that
The exponential link function
In the case that
For more flexibility, one might add another gain parameter inside the exponentiation, i.e.
The quadratic link function
Let
A more flexible parameterization is
Logistic link 1
We might want to consider the logistic link function, which maps the range
Logistic link 2
If the activation is bounded on
where
Variational approximation
In the variational approximation, we find a Gaussian distribution
To obtain a tractable form of the above, first expand
This gives convenient simplifications, as the prior
Dropping the constant
The objective function for variational Bayes amounts to maximizing the data likelihood
In this application both the prior and approximating posterior are Gaussian, and the KL divergence term
where
Challenges for the variational approximation in this application
The variational approximation integrates over the domain for
One may relax the constraint that
(An implementation of variational optimization using the rectifying quadratic link function may be more numerically stable, and remains to be explored.)
Moment-matching
Moment matching calculates or approximates the mean and covariance of the true posterior, and uses these moments to form a multivariate Gaussian approximation. When applied as a message-passing algorithm in a graphical model, moment matching is an important step of the expectation-propagation algorithm. Moment matching can be performed explicitly by computing the posterior moments via numerical integration, but in high dimensions there is no computationally tractable way to evaluate such integrals. Since spatial correlations are essential in spatiotemporal phenomena, we cannot discard this high-dimensional structure.
Another approach to moment-matching worth noting is that the Gaussian distribution
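For intuition, in one dimension the posterior moments can be matched by brute-force quadrature, as in the sketch below (illustrative names; an exponential link is assumed so the intensity stays positive). The same construction in d dimensions requires a grid that grows exponentially with d, which is what rules out explicit moment matching here.

```python
import numpy as np

def moment_match_1d(y, m0, v0, f=np.exp, n_grid=2001, width=8.0):
    """Mean and variance of p(a | y) ∝ N(a; m0, v0) · Poisson(y; f(a))."""
    s0 = np.sqrt(v0)
    a = np.linspace(m0 - width * s0, m0 + width * s0, n_grid)
    log_w = (-0.5 * (a - m0) ** 2 / v0        # Gaussian prior (unnormalized)
             + y * np.log(f(a)) - f(a))       # Poisson likelihood (dropping log y!)
    w = np.exp(log_w - log_w.max())
    w /= np.trapz(w, a)
    mean = np.trapz(a * w, a)
    var = np.trapz((a - mean) ** 2 * w, a)
    return mean, var
```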
Note that the first term is the entropy of the (true) posterior distribution. It is constant for a given update, and therefore does not affect our optimization. We can focus on the second term, and optimize:
The log-probability of a Gaussian approximation
We cannot calculate the true posterior
This integral, however, remains essentially intractable, due to the product of Gaussian and Poisson-like terms. In high dimensions there is no (to our knowledge) computationally efficient way to estimate this integral or its derivatives.
Expected log-likelihoods and variational Bayes
So far, we have explored three approaches to finding a Gaussian approximation to the measurement posterior: the Laplace approximation, variational Bayes, and moment matching. Moment matching is unsuitable because, to the best of our knowledge, there is no computationally tractable way to estimate the relevant moments in high-dimensions. The Laplace approximation and variational Bayes remain computationally tractable, with limitations.
The Laplace approximation suffers from errors arising from the non-negativity constraint on activity levels, and from the high skewness of the distributions, which places the mode far from the mean. Errors in estimating the covariance are especially severe, as the covariance controls the trade-off between propagating past information and incorporating new measurements during the filtering procedure.
Variational Bayes also has a number of challenges. First, an efficient way of evaluating the relevant integrals and their derivatives is needed. In practice, this is simplest when we can tolerate the approximation of integrating over the full domain of the prior, including the unphysical negative activations. We must also use a rectifying link function, because if the predicted point-process intensity is negative, then the Poisson likelihood for our observations is not defined.
To our knowledge, the only rectifying link function that has been explored to date is the exponential link function, which suffers from unacceptable numerical instability in our application. There are a few other link functions that we might explore, but in this section we will address another approximate solution based on Laplace-approximated expected log-likelihoods.
In order to minimize
We now derive a general second-order approximation for the expected log-likelihood for a Gaussian latent-variable process with Poisson observations, where the latent variable may be linked to the Poisson intensity via an arbitrary link function. (See Zhou and Park, plus the expected log-likelihood papers, for more detail.)
In the case of intractable
Second-order approximations to the expected log-likelihood
To briefly review the notation, let
We need to compute expected log-likelihoods under the approximating posterior distribution
Where
In certain cases, the expectations
Gradient-based methods for optimizing the expected log-likelihood require derivatives of these approximated expectations. In general, derivatives with respect to the mean are:
A Newton-Raphson solver for optimizing the mean
Optimizing the likelihood may also involve optimizing the variance
For the case that
Interpreting the distribution of latent activations
In this second-order approximation, we circumvent this issue by considering a locally-quadratic approximation of the likelihood function that continues the Poisson likelihood to negative intensities. Provided the variance is small and the posterior mean is constrained to be positive, this approximation can give an accurate estimate of the expected log-likelihood.
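In the scalar case this locally-quadratic treatment reduces to the familiar second-order rule E_q[ℓ(a)] ≈ ℓ(μ) + ½ ℓ''(μ) σ², where ℓ(a) = y log f(a) − f(a) is the Poisson log-likelihood (up to the log y! constant) and q = N(μ, σ²). A sketch for a generic link f with first and second derivatives df and d2f (illustrative names):

```python
import numpy as np

def expected_loglik_2nd_order(y, mu, var, f, df, d2f):
    """Second-order approximation to E_q[y*log f(a) - f(a)] for a ~ N(mu, var)."""
    lam, dlam, d2lam = f(mu), df(mu), d2f(mu)
    ell_at_mean = y * np.log(lam) - lam
    # curvature d^2/da^2 [ y*log f(a) - f(a) ] evaluated at a = mu
    curvature = y * (d2lam / lam - (dlam / lam) ** 2) - d2lam
    return ell_at_mean + 0.5 * curvature * var
```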
If
As
For the case that
A closed-form solution for the expected log-likelihood exists under this link function, and closed-form expressions for the moments of log-Gaussian random variables are known (see Zhou and Park for application to log-Gaussian point processes). However, in this application the amplification of positive tails of the distribution by the exponential link function is numerically unstable and unphysical, indicating that the log-Gaussian model is inappropriate. In Rule et al. 2018, we noted that a second-order approximation to the expected likelihood was more stable and more accurate.
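A scalar illustration of that instability: for a ~ N(μ, σ²) the exact expectation of the exponential link is E[exp(a)] = exp(μ + σ²/2), whereas the second-order expansion gives exp(μ)(1 + σ²/2). The exact expression grows explosively with the posterior variance, reflecting the amplification of the positive tail.

```python
import numpy as np

mu, var = 1.0, np.array([0.1, 1.0, 10.0, 50.0])
exact = np.exp(mu + var / 2)              # closed-form log-Gaussian moment
second_order = np.exp(mu) * (1 + var / 2)
print(np.c_[var, exact, second_order])
```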
For the case that
A quadratic link function is rectifying, so the Poisson likelihood remains well-defined even if the domain of
Incorporating the prior
So far we have focused on the expected log-likelihood contribution to the variational posterior objective function. We also need to derive the gradients of the KL-divergence term. For the most part, this is identical to the derivation in Zhou and Park, so I present only some quick notes here. We are interested in the gradients (and Hessians) for the KL divergence between two
Note that I have chosen to denote this in terms of the precision matrix
The derivative with respect to
The derivative with respect to
The derivative with respect to
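For reference, a direct implementation of this KL divergence and its derivatives with respect to the approximating mean and covariance, writing the prior in precision form (illustrative names):

```python
import numpy as np

def gaussian_kl(mq, Sq, mp, Pp):
    """KL( N(mq, Sq) || N(mp, Sp) ), with the prior supplied as Pp = inv(Sp)."""
    k = len(mq)
    d = mq - mp
    _, logdet_Sq = np.linalg.slogdet(Sq)
    _, logdet_Pp = np.linalg.slogdet(Pp)
    return 0.5 * (np.trace(Pp @ Sq) + d @ Pp @ d - k - logdet_Pp - logdet_Sq)

def dkl_dmean(mq, mp, Pp):
    """Gradient of the KL divergence with respect to the approximating mean."""
    return Pp @ (mq - mp)

def dkl_dcov(Sq, Pp):
    """Gradient of the KL divergence with respect to the approximating covariance."""
    return 0.5 * (Pp - np.linalg.inv(Sq))
```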
For the variational interpretation we approximate the posterior
Optimizing the (approximate) variational approximation
We have derived approximations for the expected log-likelihood contribution to the variational Bayesian objective function, which must be optimized jointly over the posterior mean
(In numerical experiments, I extended this approach by interleaving one step of the Newton-Raphson optimization for
The covariance update
To complete the variational approximation, we also need to optimize the posterior covariance. This involves the derivative of the expected log-likelihood with respect to
The derivative of the above with respect to
along the diagonal, and 0 elsewhere. The total gradient for
Setting this gradient to zero and solving for
Which amounts to adding the curvature of the log likelihood
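In code, this precision-space update is a one-liner (a sketch, assuming the likelihood-curvature matrix J has already been evaluated at the current posterior mean; how J is computed depends on the link function):

```python
import numpy as np

def covariance_update(S_prior, J):
    """Posterior covariance obtained by adding the log-likelihood curvature J
    to the prior precision."""
    P_post = np.linalg.inv(S_prior) + J
    return np.linalg.inv(P_post)
```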
Computing the model likelihood
Once the Gaussian approximation is computed, how should we estimate the likelihood of the data given the model parameters
Integration via Laplace approximation
This likelihood is the integral of the prior
Given a Gaussian approximation
Under this approximation, we evaluate
We use the following logarithmic relationship derived from Bayes' rule
Substituting the forms for the above
Evaluating the above at the posterior mean
Cleaning things up, and denoting
On the similarity between the Laplace-approximated likelihood and
Note the similarity to KL divergence
Which gives the relation
Which allows the likelihood to be written as
We can play with this further, bringing in the second-order expected log-likelihood:
Which, from the second-order approximation, is:
So:
Which is to say that a point estimate of the log-likelihood is very similar to the (negative) KL-divergence penalty, differing only by the trace term and the curvature correction (and the dimensionality constant
Empirically, these terms are small and do not dominate the likelihood. As we shall see in the following sections, this similarity is not a coincidence: the Laplace approximation is connected to the Evidence Lower Bound (ELBO) in the variational Bayesian approach, especially when a second-order approximation is used to evaluate the expected log-likelihood.
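As a compact summary of the Laplace-approximated likelihood discussed above, the sketch below evaluates the Bayes'-rule identity log p(y) = log p(y|z) + log p(z) − log p(z|y) at the posterior mean, with the true posterior replaced by the Gaussian approximation q = N(m_post, S_post). The names log_lik and log_prior are assumed callables returning log densities.

```python
import numpy as np

def laplace_log_evidence(m_post, S_post, log_lik, log_prior):
    """Point estimate of log p(y) using the Gaussian posterior approximation."""
    k = len(m_post)
    _, logdet = np.linalg.slogdet(S_post)
    log_q_at_mean = -0.5 * (k * np.log(2 * np.pi) + logdet)  # log N(m_post; m_post, S_post)
    return log_lik(m_post) + log_prior(m_post) - log_q_at_mean
```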
The expected log-likelihood
One can consider the expected value of the log-likelihood relative to the prior distribution
For the true posterior
For the expected log-likelihood approach here, we take the expectation with respect to the prior distribution
We cannot compute the above exactly, because we do not have access to the true posterior
To estimate the expected log-likelihood term
ELBO and variational Bayes
When deriving the variational update, we omitted the (constant) data likelihood term. The full form for
which implies the following identity for the log-likelihood:
That is: the data likelihood is the expected log-likelihood under the variational posterior, minus the KL divergence of the posterior from the prior, plus the KL divergence of the variational posterior from the true posterior. Since this last term is always non-negative, the following bound holds:
This inequality is sometimes called the Evidence Lower Bound (ELBO). The more accurately we can approximate the posterior as a Gaussian, the smaller
Going forward
In practice, we use the Laplace approximation for obtaining the approximate posterior