Form (a)
In most textbooks or tutorials you'll see the posterior mean written as:

$$\mu_\text{post} = \mu_* + \Sigma_{*o}\left(\Sigma_{oo} + \Sigma_\epsilon\right)^{-1}\left(y - \mu_o\right),$$

where $\Sigma_{oo}$ and $\Sigma_{**}$ are the prior covariances of the observation and output points, respectively; $\Sigma_{*o}$ is the cross-covariance of the observation and output points; $\Sigma_\epsilon$ is the measurement noise covariance; $\mu_o$ and $\mu_*$ are the prior means at the observation and output points, respectively; and $y$ are the observations.
When observations are sparse, this form is computationally efficient, because $\Sigma_{*o}$ and $\Sigma_{oo} + \Sigma_\epsilon$ are small, and the expression is low-rank. This form is especially convenient when updating a GP model with a single observation, in which case it reduces to a rank-1 update (see the sketch below).
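For concreteness, here is a minimal numpy sketch of form (a), assuming a squared-exponential kernel and 1D points; the kernel, locations, and variable names are illustrative stand-ins, not from any particular library.

```python
import numpy as np

def rbf(a, b, ell=1.0):
    """Squared-exponential kernel matrix between 1D point sets a and b."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

x_obs = np.array([0.5, 1.5, 2.0])        # sparse observation locations
x_out = np.linspace(0.0, 3.0, 200)       # dense output grid
y     = np.sin(3.0 * x_obs)              # stand-in observations
mu_o  = np.zeros(x_obs.size)             # prior mean at observation points
mu_s  = np.zeros(x_out.size)             # prior mean at output points
S_oo  = rbf(x_obs, x_obs)                # prior covariance at observations
S_so  = rbf(x_out, x_obs)                # cross-covariance (outputs x observations)
S_eps = 0.1 * np.eye(x_obs.size)         # measurement noise covariance

# Posterior mean: mu_* + S_*o (S_oo + S_eps)^{-1} (y - mu_o).
# The solve involves only the small observation set, so the cost is
# dominated by the number of observations, not the output grid.
mu_post = mu_s + S_so @ np.linalg.solve(S_oo + S_eps, y - mu_o)

# With a single observation the solve collapses to a scalar division,
# i.e. the rank-1 update: mu_* + S_*o * (y - mu_o) / (S_oo + sigma**2)
```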
However, in our application we have an extended time series in which a rat visits each location many times. There are many more observations than output points, and $\Sigma_{oo}$ and $\Sigma_{*o}$ are large matrices. Binning together nearby measurements reduces the complexity, but coverage of the space is quite dense: $\Sigma_{oo}$ and $\Sigma_{*o}$ are almost as large as $\Sigma_{**}$, even excluding bins with no observations. We therefore assume that the observations and output points are evaluated on a common grid, in which case the equations take the form:

$$\mu_\text{post} = \mu_0 + \Sigma\left(\Sigma + \Sigma_\epsilon\right)^{-1}\left(y - \mu_0\right),$$

where $\Sigma$ and $\mu_0$ are the prior covariance and mean, $\Sigma_\epsilon$ is the measurement error covariance, and $y$ are the measurements.
This form has the following useful properties:
- $\Sigma$ and $\Sigma_\epsilon$ can be singular, provided $\Sigma + \Sigma_\epsilon$ is not. In GP regression, this allows measurements with zero error, and also allows priors with zero eigenvalues. For example, one might set a prior that has zeros for high-frequency components, to encode a strong assumption that the posterior function should be smooth.
- The update is low-rank and fast when observations are limited.
- If $\Sigma + \Sigma_\epsilon$ is nonsingular, then it is positive definite. This allows one to compute $(\Sigma + \Sigma_\epsilon)^{-1}(y - \mu_0)$ quickly via Cholesky factorization. (However, it is even faster to use a Krylov subspace solver, if the prior is in a form that supports fast matrix-vector products via the FFT.)
- The final matrix multiplications by $\Sigma$ can be computed via FFT. (The data need to be zero-padded to do this, but this padding can be stripped away after performing the convolution, so there is no added complexity cost. Contrast this with form (b), in which the zero-padding must be retained before solving a linear system, which increases the complexity.) A sketch combining the Cholesky solve with the FFT-based multiplication follows this list.
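A minimal sketch of the common-grid form, assuming a 1D periodic grid and a stationary prior, so that $\Sigma$ is circulant and the final multiplication by $\Sigma$ can use the FFT; the kernel width, grid size, and noise level are illustrative.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

n = 128
d = np.minimum(np.arange(n), n - np.arange(n))       # circular distances
k = np.exp(-0.5 * (d / 5.0) ** 2)                    # stationary prior kernel
Sigma = np.array([np.roll(k, i) for i in range(n)])  # circulant prior covariance
Sigma_eps = 0.1 * np.eye(n)                          # measurement error covariance
mu0 = np.zeros(n)                                    # prior mean
y = np.random.randn(n)                               # stand-in measurements

# (Sigma + Sigma_eps) is symmetric positive definite here, so Cholesky applies.
c = cho_factor(Sigma + Sigma_eps)
z = cho_solve(c, y - mu0)

# The final multiplication by Sigma is a circular convolution with k,
# computed via the FFT rather than a dense matrix product.
mu_post = mu0 + np.real(np.fft.ifft(np.fft.fft(k) * np.fft.fft(z)))
```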
Problem: This form is unsuitable if $\Sigma_\epsilon^{-1}$ is singular. In order to leverage the FFT for matrix calculations, we need to perform the GP regression over a regular and periodic grid. To avoid opposite ends of the space interacting, we need to add as many zeros to the edge of our data as our kernel is wide. This means that, inherently, some bins will be missing data. Missing observations can be represented as zero precision (inverse variance), so $\Sigma_\epsilon^{-1}$ can be constructed, but it will not be invertible, so $\Sigma_\epsilon$ does not exist.
Form (b)
It's also common to encounter the following form, which arises when deriving the posterior mean for the product of two multivariate Gaussian distributions:

$$\mu_\text{post} = \left(\Sigma^{-1} + \Sigma_\epsilon^{-1}\right)^{-1}\left(\Sigma^{-1}\mu_0 + \Sigma_\epsilon^{-1}\,y\right).$$
This form has the following useful properties:
- $\Sigma_\epsilon^{-1}$ can be singular, so we can include "null" observations when calculating the regression over a regular grid.
- $\Sigma_\epsilon$ is diagonal, so $\Sigma_\epsilon^{-1}$ is also diagonal, which is trivial to compute.
- If $\Sigma$ is nonsingular, then $\Sigma^{-1}$ exists and can be computed quickly via the FFT.
- $\Sigma^{-1} + \Sigma_\epsilon^{-1}$ is positive definite. The linear system can therefore be solved efficiently using Cholesky factorization (see the sketch after this list).
- Priors that assume correlations arise from nearest-neighbor interactions can be represented as a prior precision $\Sigma^{-1}$ that is nonzero only at entries corresponding to pairs of adjacent regions. This sparsity, combined with Krylov subspace algorithms for solving the linear system, allows for fast solutions on arbitrary topologies, for which spectral methods might not be possible.
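A minimal sketch of form (b), assuming a 1D nearest-neighbor (chain) prior precision as in the last property above, and a diagonal measurement precision; the ridge term, precisions, and sizes are illustrative.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

n = 64
# Nearest-neighbor prior precision: a tridiagonal (sparse) second-difference
# matrix, plus a small ridge so that Sigma^{-1} is nonsingular.
P = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1) + 0.01 * np.eye(n)
lam = np.full(n, 10.0)        # diagonal measurement precision Sigma_eps^{-1}
lam[n // 2:] = 0.0            # zero precision encodes "null" (missing) observations
mu0 = np.zeros(n)
y = np.random.randn(n)        # stand-in measurements

# (Sigma^{-1} + Sigma_eps^{-1}) is positive definite, so Cholesky applies.
c = cho_factor(P + np.diag(lam))
mu_post = cho_solve(c, P @ mu0 + lam * y)
```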
Problem: This form is unsuitable if $\Sigma$ is singular, and is numerically unstable if $\Sigma$ is ill-conditioned. It therefore requires regularization to ensure that no eigenvalue of $\Sigma$ is too small. For larger problems, it may require so much regularization that the prior is altered substantially, affecting the accuracy of the inferred posterior mean. Additionally, computing $\Sigma^{-1}$ via FFT requires adding zero padding to handle the circular boundary conditions correctly. This padding cannot be removed before solving the subsequent linear system without introducing boundary artifacts, so this leaves a slightly larger linear system to solve. Since the time complexity of a direct solve is $\mathcal{O}(L^6)$ in terms of the linear dimension $L$ of an $L \times L$ grid, this can lead to a significant slowdown.
Form (c)
We can also pull $\Sigma^{-1}$ out of form (b), and solve for the posterior mean $\mu_\text{post}$ as the solution of a linear system. Writing $\Lambda = \Sigma_\epsilon^{-1}$ for the measurement precision, note that $\Sigma^{-1} + \Lambda = \Sigma^{-1}(I + \Sigma\Lambda)$; multiplying both sides of $(\Sigma^{-1} + \Lambda)\,\mu_\text{post} = \Sigma^{-1}\mu_0 + \Lambda y$ by $\Sigma$ gives:

$$\left(I + \Sigma\Lambda\right)\mu_\text{post} = \mu_0 + \Sigma\Lambda\,y.$$
This form has the following useful properties:
- Both $\Sigma$ and $\Lambda$ can be singular.
- Constructing $\Lambda$ is trivial, and multiplication by $\Sigma$ can be calculated via FFT.
- The product $\Sigma\Lambda$ is trivial to form, since $\Lambda$ is diagonal.
- $I + \Sigma\Lambda$ is well-conditioned, so this calculation is fairly numerically stable, requiring less regularization.
- If $\Sigma$ contains many zero eigenvalues, then it is low-rank. In the special case that $\Sigma$ is circulant, the FFT offers a fast conversion to/from the eigenspace of $\Sigma$, and it is possible to use the nonzero Fourier components of $\Sigma$ as a low-rank basis for calculations (a sketch follows this list).
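A sketch of the low-rank idea in 1D, assuming a circulant $\Sigma$ whose first row is a wide, smooth kernel, so that most Fourier eigenvalues are numerically zero; the kernel width and tolerance are illustrative.

```python
import numpy as np

n = 256
d = np.minimum(np.arange(n), n - np.arange(n))
k = np.exp(-0.5 * (d / 25.0) ** 2)       # wide kernel: most eigenvalues ~ 0
khat = np.fft.fft(k).real                # eigenvalues of the circulant Sigma
keep = khat > 1e-9 * khat.max()          # nonzero Fourier modes (low-rank basis)
# keep.sum() is much smaller than n here, so calculations can be
# restricted to this subspace.

def apply_sigma(v):
    """Multiply by Sigma restricted to its nonzero eigenspace, via FFT."""
    vhat = np.fft.fft(v)
    return np.real(np.fft.ifft(np.where(keep, khat * vhat, 0.0)))

smoothed = apply_sigma(np.random.randn(n))   # example usage
```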
Problem: The matrix $I + \Sigma\Lambda$ will not be symmetric in general, so we cannot use Cholesky factorization to solve the linear system here. This can lead to significant slow-downs for larger systems. However! Once $\Sigma\Lambda y$ is computed via FFT, one can remove the zero-padding before solving the linear system, which reduces the problem size. Since computing circulant matrix operations using the FFT requires as much zero-padding as our prior kernel is wide, this reduction in problem size can be significant. Since solving the linear system scales as $\mathcal{O}(n^3)$ in the number of grid points $n$, a small reduction in $n$ is much more important than the constant-factor improvement provided by Cholesky decomposition.

Additionally, for translationally-invariant kernels on a regular grid, the matrix-vector product $v \mapsto (I + \Sigma\Lambda)v$ can be calculated quickly via FFT. This, used with a Krylov-subspace-based solver like MINRES, can be orders of magnitude faster than other approaches.
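A sketch of form (c) with an FFT matrix-vector product and a matrix-free Krylov solver. Because $I + \Sigma\Lambda$ is not symmetric, this example uses GMRES rather than MINRES (MINRES assumes a symmetric operator, so using it would require symmetrizing the system first); the grid, kernel, and precisions are illustrative.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, gmres

n = 256
d = np.minimum(np.arange(n), n - np.arange(n))
khat = np.fft.fft(np.exp(-0.5 * (d / 10.0) ** 2))  # eigenvalues of circulant Sigma
lam = np.where(np.arange(n) < n // 2, 5.0, 0.0)    # Lambda: zero = missing data
mu0 = np.zeros(n)
y = np.random.randn(n)                             # stand-in measurements

def sigma_mv(v):
    """Multiply by the circulant Sigma in O(n log n) via the FFT."""
    return np.real(np.fft.ifft(khat * np.fft.fft(v)))

# A v = (I + Sigma Lambda) v, applied matrix-free.
A = LinearOperator((n, n), matvec=lambda v: v + sigma_mv(lam * v))
b = mu0 + sigma_mv(lam * y)

mu_post, info = gmres(A, b, atol=1e-10)
assert info == 0   # 0 indicates successful convergence
```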
Form (d)
In the special case that the measurement error covariance is constant, $\Sigma_\epsilon = \sigma_\epsilon^2 I$, GP regression reduces to a convolution, and can be evaluated using the Fourier transform thanks to the convolution theorem:

$$\mu_\text{post} = \mu_0 + \mathcal{F}^{-1}\!\left[\frac{\mathcal{F}[k]}{\mathcal{F}[k] + \sigma_\epsilon^2}\,\mathcal{F}\!\left[y - \mu_0\right]\right],$$

where $\mathcal{F}$ is the Fourier transform and $k$ is the prior covariance kernel.
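A minimal sketch of form (d) in 1D, assuming a stationary kernel, constant noise variance, and zero padding as wide as the kernel; all sizes are illustrative. The reflected-boundary alternative discussed below appears as a comment.

```python
import numpy as np

n, pad, sigma2 = 256, 32, 0.1
m = n + 2 * pad                                         # padded grid size
d = np.minimum(np.arange(m), m - np.arange(m))
khat = np.fft.fft(np.exp(-0.5 * (d / 8.0) ** 2)).real   # kernel spectrum F[k]
y = np.random.randn(n)                                  # stand-in data (mu0 = 0)

resid = np.pad(y, pad)                                  # zero-padded residual y - mu0
# Alternative with less edge tapering: np.pad(y, pad, mode='reflect')

# Posterior mean via the convolution theorem, then strip the padding.
mu_post = np.real(np.fft.ifft(khat / (khat + sigma2) * np.fft.fft(resid)))[pad:-pad]
```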
Problem: The regression must be on a periodic domain in order to compute quickly using the FFT. This can cause measurements at opposite sides of the spatial grid to influence each other. This can be solved by zero-padding. However, the form above assumes the same error $\sigma_\epsilon^2$ for all observations, so this zero padding will also be treated as an observation of zero with error $\sigma_\epsilon^2$. This will lead to some tapering toward zero at the boundaries. Alternatively, one may use reflected boundary conditions, copying a mirrored version of $y - \mu_0$ into the padded space. This reflected boundary condition reduces artifacts, but is not identical to solving the original, un-padded GP regression problem.
Note: if $\Sigma_\epsilon$ is not constant, but changes slowly relative to the scale of the prior kernel, then one may also decompose a larger problem into smaller ones that tile the space.
When to use which?
Use form (a) when
- Measurement variance and prior covariance are well defined.
- Observations are sparse compared to the number of output points.
- This is as fast or faster than (b) or (c) for any problem size, but the extra matrix multiplications can be expensive for large systems or when observations are dense.
Use form (b) when
- The measurements and outputs are evaluated at the same set of points.
- $\Sigma_\epsilon^{-1}$ is well defined, but $\Sigma_\epsilon$ is not.
- $\Sigma^{-1}$ exists and is well-conditioned.
- For medium-sized systems, the linear system can be solved via Cholesky factorization.
- For large systems, use a Krylov subspace solver and leverage special properties of the prior:
  - $\Sigma$ is circulant, so we can use the FFT, or
  - $\Sigma^{-1}$ is sparse, arising from a nearest-neighbor model.
Use form (c) when
- The measurements and outputs are taken at the same set of points.
- $\Sigma_\epsilon^{-1}$ is well defined, but $\Sigma_\epsilon$ is not.
- $\Sigma$ is low-rank, so $\Sigma^{-1}$ does not exist.
- For large systems, use a Krylov subspace solver and leverage special properties of $\Sigma$:
  - if $\Sigma$ is circulant, use the FFT, or
  - use a low-rank approximation.
Use form (d) when
- The measurements and outputs are evaluated on a regular grid.
- The problem is too large to calculate using ordinary matrix operations.
- Measurement error can be approximated as constant, $\Sigma_\epsilon \approx \sigma_\epsilon^2 I$.
- Artifacts from zero or mirrored boundary conditions are acceptable, or the kernel is local and it is acceptable to discard the boundary regions.