Form (a)
In most textbooks or tutorials you'll see the posterior mean written as:

$$\mu_\text{post} = \mu_* + \Sigma_{*o}\left(\Sigma_{oo} + \Sigma_\epsilon\right)^{-1}\left(y - \mu_o\right),$$

where $\Sigma_{oo}$ and $\Sigma_{**}$ are the prior covariances of the observation and output points, respectively; $\Sigma_{*o}$ is the cross-covariance of the observation and output points; $\Sigma_\epsilon$ is the measurement noise covariance; $\mu_o$ and $\mu_*$ are the prior means at the observation and output points, respectively; and $y$ are the observations.
When observations are sparse, this form is computationally efficient, because $\Sigma_{*o}$ and $\Sigma_{oo} + \Sigma_\epsilon$ are small, and the expression is low-rank. This form is especially convenient when updating a GP model with a single observation, in which case it reduces to a rank-1 update (see the sketch below).
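For concreteness, here is a minimal numpy sketch of form (a), assuming a squared-exponential kernel and 1D points; the kernel, locations, and variable names are illustrative stand-ins, not from any particular library.

```python
import numpy as np

def rbf(a, b, ell=1.0):
    """Squared-exponential kernel matrix between 1D point sets a and b."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

x_obs = np.array([0.5, 1.5, 2.0])        # sparse observation locations
x_out = np.linspace(0.0, 3.0, 200)       # dense output grid
y     = np.sin(3.0 * x_obs)              # stand-in observations
mu_o  = np.zeros(x_obs.size)             # prior mean at observation points
mu_s  = np.zeros(x_out.size)             # prior mean at output points
S_oo  = rbf(x_obs, x_obs)                # prior covariance at observations
S_so  = rbf(x_out, x_obs)                # cross-covariance (outputs x observations)
S_eps = 0.1 * np.eye(x_obs.size)         # measurement noise covariance

# Posterior mean: mu_* + S_*o (S_oo + S_eps)^{-1} (y - mu_o).
# The solve involves only the small observation set, so the cost is
# dominated by the number of observations, not the output grid.
mu_post = mu_s + S_so @ np.linalg.solve(S_oo + S_eps, y - mu_o)

# With a single observation the solve collapses to a scalar division,
# i.e. the rank-1 update: mu_* + S_*o * (y - mu_o) / (S_oo + sigma**2)
```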
However, in our application we have an extended time series in which a rat visits each location many times. There are many more observations than output points, and $\Sigma_{oo}$ and $\Sigma_{*o}$ are large matrices. Binning together nearby measurements reduces the complexity, but coverage of the space is quite dense: $\Sigma_{oo}$ and $\Sigma_{*o}$ are almost as large as $\Sigma_{**}$, even excluding bins with no observations. We therefore assume that the observations and output points are evaluated on a common grid, in which case the equations take the form:

$$\mu_\text{post} = \mu_0 + \Sigma\left(\Sigma + \Sigma_\epsilon\right)^{-1}\left(y - \mu_0\right),$$

where $\Sigma$ and $\mu_0$ are the prior covariance and mean, $\Sigma_\epsilon$ is the measurement error covariance, and $y$ are the measurements.
This form has the following useful properties:
- $\Sigma$ and $\Sigma_\epsilon$ can be singular, provided $\Sigma + \Sigma_\epsilon$ is not. In GP regression, this allows measurements with zero error, and also allows priors with zero eigenvalues. For example, one might set a prior that has zeros for high-frequency components, to encode a strong assumption that the posterior function should be smooth.
- The update is low-rank and fast when observations are limited.
- If $\Sigma + \Sigma_\epsilon$ is nonsingular, then it is positive definite. This allows one to compute $(\Sigma + \Sigma_\epsilon)^{-1}(y - \mu_0)$ quickly via Cholesky factorization. (However, it is even faster to use a Krylov subspace solver, if the prior is in a form that supports fast matrix-vector products via the FFT.)
- The final matrix multiplications by $\Sigma$ can be computed via FFT. (The data need to be zero-padded to do this, but this padding can be stripped away after performing the convolution, so there is no added complexity cost. Contrast this with form (b), in which the zero-padding must be retained before solving a linear system, which increases the complexity.) A sketch combining the Cholesky solve with the FFT-based multiplication follows this list.
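A minimal sketch of the common-grid form, assuming a 1D periodic grid and a stationary prior, so that $\Sigma$ is circulant and the final multiplication by $\Sigma$ can use the FFT; the kernel width, grid size, and noise level are illustrative.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

n = 128
d = np.minimum(np.arange(n), n - np.arange(n))       # circular distances
k = np.exp(-0.5 * (d / 5.0) ** 2)                    # stationary prior kernel
Sigma = np.array([np.roll(k, i) for i in range(n)])  # circulant prior covariance
Sigma_eps = 0.1 * np.eye(n)                          # measurement error covariance
mu0 = np.zeros(n)                                    # prior mean
y = np.random.randn(n)                               # stand-in measurements

# (Sigma + Sigma_eps) is symmetric positive definite here, so Cholesky applies.
c = cho_factor(Sigma + Sigma_eps)
z = cho_solve(c, y - mu0)

# The final multiplication by Sigma is a circular convolution with k,
# computed via the FFT rather than a dense matrix product.
mu_post = mu0 + np.real(np.fft.ifft(np.fft.fft(k) * np.fft.fft(z)))
```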
Problem: This form is unsuitable if $\Sigma_\epsilon^{-1}$ is singular. In order to leverage the FFT for matrix calculations, we need to perform the GP regression over a regular and periodic grid. To avoid opposite ends of the space interacting, we need to add as many zeros to the edge of our data as our kernel is wide. This means that, inherently, some bins will be missing data. Missing observations can be represented as zero precision (inverse variance), so $\Sigma_\epsilon^{-1}$ can be constructed, but it will not be invertible, so $\Sigma_\epsilon$ does not exist.
Form (b)
It's also common to encounter the following form, which arises when deriving the posterior mean for the product of two multivariate Gaussian distributions:

$$\mu_\text{post} = \left(\Sigma^{-1} + \Sigma_\epsilon^{-1}\right)^{-1}\left(\Sigma^{-1}\mu_0 + \Sigma_\epsilon^{-1}\,y\right).$$
This form has the following useful properties:
- $\Sigma_\epsilon^{-1}$ can be singular, so we can include "null" observations when calculating the regression over a regular grid.
- $\Sigma_\epsilon$ is diagonal, so $\Sigma_\epsilon^{-1}$ is also diagonal, which is trivial to compute.
- If $\Sigma$ is nonsingular, then $\Sigma^{-1}$ exists and can be computed quickly via the FFT.
- $\Sigma^{-1} + \Sigma_\epsilon^{-1}$ is positive definite. The linear system can therefore be solved efficiently using Cholesky factorization (see the sketch after this list).
- Priors that assume correlations arise from nearest-neighbor interactions can be represented as a prior precision $\Sigma^{-1}$ that is nonzero only at entries corresponding to pairs of adjacent regions. This sparsity, combined with Krylov subspace algorithms for solving the linear system, allows for fast solutions on arbitrary topologies, for which spectral methods might not be possible.
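A minimal sketch of form (b), assuming a 1D nearest-neighbor (chain) prior precision as in the last property above, and a diagonal measurement precision; the ridge term, precisions, and sizes are illustrative.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

n = 64
# Nearest-neighbor prior precision: a tridiagonal (sparse) second-difference
# matrix, plus a small ridge so that Sigma^{-1} is nonsingular.
P = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1) + 0.01 * np.eye(n)
lam = np.full(n, 10.0)        # diagonal measurement precision Sigma_eps^{-1}
lam[n // 2:] = 0.0            # zero precision encodes "null" (missing) observations
mu0 = np.zeros(n)
y = np.random.randn(n)        # stand-in measurements

# (Sigma^{-1} + Sigma_eps^{-1}) is positive definite, so Cholesky applies.
c = cho_factor(P + np.diag(lam))
mu_post = cho_solve(c, P @ mu0 + lam * y)
```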
Problem: This form is unsuitable if $\Sigma$ is singular, and is numerically unstable if $\Sigma$ is ill-conditioned. It therefore requires regularization to ensure that no eigenvalue of $\Sigma$ is too small. For larger problems, it may require so much regularization that the prior is altered substantially, affecting the accuracy of the inferred posterior mean. Additionally, computing $\Sigma^{-1}$ via FFT requires adding zero padding to handle the circular boundary conditions correctly. This padding cannot be removed before solving the subsequent linear system without introducing boundary artifacts, so this leaves a slightly larger linear system to solve. Since the time complexity of a direct solve is $\mathcal{O}(L^6)$ in terms of the linear dimension $L$ of an $L \times L$ grid, this can lead to a significant slowdown.
Form (c)
We can also pull $\Sigma^{-1}$ out of form (b), and solve for the posterior mean $\mu_\text{post}$ as the solution of a linear system. Writing $\Lambda = \Sigma_\epsilon^{-1}$ for the measurement precision, note that $\Sigma^{-1} + \Lambda = \Sigma^{-1}(I + \Sigma\Lambda)$; multiplying both sides of $(\Sigma^{-1} + \Lambda)\,\mu_\text{post} = \Sigma^{-1}\mu_0 + \Lambda y$ by $\Sigma$ gives:

$$\left(I + \Sigma\Lambda\right)\mu_\text{post} = \mu_0 + \Sigma\Lambda\,y.$$
This form has the following useful properties:
- Both $\Sigma$ and $\Lambda$ can be singular.
- Constructing $\Lambda$ is trivial, and multiplication by $\Sigma$ can be calculated via FFT.
- The product $\Sigma\Lambda$ is trivial to form, since $\Lambda$ is diagonal.
- $I + \Sigma\Lambda$ is well-conditioned, so this calculation is fairly numerically stable, requiring less regularization.
- If $\Sigma$ contains many zero eigenvalues, then it is low-rank. In the special case that $\Sigma$ is circulant, the FFT offers a fast conversion to/from the eigenspace of $\Sigma$, and it is possible to use the nonzero Fourier components of $\Sigma$ as a low-rank basis for calculations (a sketch follows this list).
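A sketch of the low-rank idea in 1D, assuming a circulant $\Sigma$ whose first row is a wide, smooth kernel, so that most Fourier eigenvalues are numerically zero; the kernel width and tolerance are illustrative.

```python
import numpy as np

n = 256
d = np.minimum(np.arange(n), n - np.arange(n))
k = np.exp(-0.5 * (d / 25.0) ** 2)       # wide kernel: most eigenvalues ~ 0
khat = np.fft.fft(k).real                # eigenvalues of the circulant Sigma
keep = khat > 1e-9 * khat.max()          # nonzero Fourier modes (low-rank basis)
# keep.sum() is much smaller than n here, so calculations can be
# restricted to this subspace.

def apply_sigma(v):
    """Multiply by Sigma restricted to its nonzero eigenspace, via FFT."""
    vhat = np.fft.fft(v)
    return np.real(np.fft.ifft(np.where(keep, khat * vhat, 0.0)))

smoothed = apply_sigma(np.random.randn(n))   # example usage
```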
Problem: The matrix $I + \Sigma\Lambda$ will not be symmetric in general, so we cannot use Cholesky factorization to solve the linear system here. This can lead to significant slow-downs for larger systems. However! Once $\Sigma\Lambda y$ is computed via FFT, one can remove the zero-padding before solving the linear system, which reduces the problem size. Since computing circulant matrix operations using the FFT requires as much zero-padding as our prior kernel is wide, this reduction in problem size can be significant. Since solving the linear system scales as $\mathcal{O}(n^3)$ in the number of grid points $n$, a small reduction in $n$ is much more important than the constant-factor improvement provided by Cholesky decomposition.

Additionally, for translationally-invariant kernels on a regular grid, the matrix-vector product $v \mapsto (I + \Sigma\Lambda)v$ can be calculated quickly via FFT. This, used with a Krylov-subspace-based solver like MINRES, can be orders of magnitude faster than other approaches.
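A sketch of form (c) with an FFT matrix-vector product and a matrix-free Krylov solver. Because $I + \Sigma\Lambda$ is not symmetric, this example uses GMRES rather than MINRES (MINRES assumes a symmetric operator, so using it would require symmetrizing the system first); the grid, kernel, and precisions are illustrative.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, gmres

n = 256
d = np.minimum(np.arange(n), n - np.arange(n))
khat = np.fft.fft(np.exp(-0.5 * (d / 10.0) ** 2))  # eigenvalues of circulant Sigma
lam = np.where(np.arange(n) < n // 2, 5.0, 0.0)    # Lambda: zero = missing data
mu0 = np.zeros(n)
y = np.random.randn(n)                             # stand-in measurements

def sigma_mv(v):
    """Multiply by the circulant Sigma in O(n log n) via the FFT."""
    return np.real(np.fft.ifft(khat * np.fft.fft(v)))

# A v = (I + Sigma Lambda) v, applied matrix-free.
A = LinearOperator((n, n), matvec=lambda v: v + sigma_mv(lam * v))
b = mu0 + sigma_mv(lam * y)

mu_post, info = gmres(A, b, atol=1e-10)
assert info == 0   # 0 indicates successful convergence
```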
Form (d)
In the special case that the measurement error covariance is constant, $\Sigma_\epsilon = \sigma_\epsilon^2 I$, GP regression reduces to a convolution, and can be evaluated using the Fourier transform thanks to the convolution theorem:

$$\mu_\text{post} = \mu_0 + \mathcal{F}^{-1}\!\left[\frac{\mathcal{F}[k]}{\mathcal{F}[k] + \sigma_\epsilon^2}\,\mathcal{F}\!\left[y - \mu_0\right]\right],$$

where $\mathcal{F}$ is the Fourier transform and $k$ is the prior covariance kernel.
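A minimal sketch of form (d) in 1D, assuming a stationary kernel, constant noise variance, and zero padding as wide as the kernel; all sizes are illustrative. The reflected-boundary alternative discussed below appears as a comment.

```python
import numpy as np

n, pad, sigma2 = 256, 32, 0.1
m = n + 2 * pad                                         # padded grid size
d = np.minimum(np.arange(m), m - np.arange(m))
khat = np.fft.fft(np.exp(-0.5 * (d / 8.0) ** 2)).real   # kernel spectrum F[k]
y = np.random.randn(n)                                  # stand-in data (mu0 = 0)

resid = np.pad(y, pad)                                  # zero-padded residual y - mu0
# Alternative with less edge tapering: np.pad(y, pad, mode='reflect')

# Posterior mean via the convolution theorem, then strip the padding.
mu_post = np.real(np.fft.ifft(khat / (khat + sigma2) * np.fft.fft(resid)))[pad:-pad]
```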
Problem: The regression must be on a periodic domain in order to compute quickly using the FFT. This can cause measurements at opposite sides of the spatial grid to influence each other. This can be solved by zero-padding. However, the form above assumes the same error $\sigma_\epsilon^2$ for all observations, so this zero padding will also be treated as an observation of zero with error $\sigma_\epsilon^2$. This will lead to some tapering toward zero at the boundaries. Alternatively, one may use reflected boundary conditions, copying a mirrored version of $y - \mu_0$ into the padded space. This reflected boundary condition reduces artifacts, but is not identical to solving the original, un-padded GP regression problem.
Note: if $\Sigma_\epsilon$ is not constant, but changes slowly relative to the scale of the prior kernel, then one may also decompose a larger problem into smaller ones that tile the space.
When to use which?
Use form (a) when
- Measurement variance and prior covariance are well defined.
- Observations are sparse compared to the number of output points.
- This is as fast or faster than (b) or (c) for any problem size, but the extra matrix multiplications can be expensive for large systems or when observations are dense.
Use form (b) when
- The measurements and outputs are evaluated at the same set of points.
- $\Sigma_\epsilon^{-1}$ is well defined, but $\Sigma_\epsilon$ is not.
- $\Sigma^{-1}$ exists and is well-conditioned.
- For medium-sized systems, the linear system can be solved via Cholesky factorization.
- For large systems, use a Krylov subspace solver and leverage special properties of the prior:
  - $\Sigma$ is circulant, so we can use the FFT, or
  - $\Sigma^{-1}$ is sparse, arising from a nearest-neighbor model.
Use form (c) when
- The measurements and outputs are taken at the same set of points.
- $\Sigma_\epsilon^{-1}$ is well defined, but $\Sigma_\epsilon$ is not.
- $\Sigma$ is low-rank, so $\Sigma^{-1}$ does not exist.
- For large systems, use a Krylov subspace solver and leverage special properties of $\Sigma$:
  - if $\Sigma$ is circulant, use the FFT, or
  - use a low-rank approximation.
Use form (d) when
- The measurements and outputs are evaluated on a regular grid.
- The problem is too large to calculate using ordinary matrix operations.
- Measurement error can be approximated as constant, $\Sigma_\epsilon \approx \sigma_\epsilon^2 I$.
- Artifacts from zero or mirrored boundary conditions are acceptable, or the kernel is local and it is acceptable to discard the boundary regions.