Wednesday, March 25, 2020

Note: Derivatives of Gaussian KL-Divergence for some parameterizations of the posterior covariance for variational Gaussian-process inference

These notes provide the derivatives of the KL divergence $D_{\mathrm{KL}}[Q(z)\,\|\,P(z)]$ between two multivariate Gaussian distributions $Q(z)$ and $P(z)$ with respect to a few parameterizations $\theta$ of the covariance matrix $\Sigma(\theta)$ of $Q$. This is useful for variational Gaussian-process inference, where clever parameterizations of the posterior covariance are required to make the problem tractable. Tables for differentiating matrix-valued functions can be found in The Matrix Cookbook.

Consider two multivariate Gaussian distributions $Q(z)=\mathcal N(\mu_q,\Sigma(\theta))$ and $P(z)=\mathcal N(\mu_0,\Sigma_0=\Lambda^{-1})$ of dimension $L$. The KL divergence $D_{\mathrm{KL}}[Q(z)\,\|\,P(z)]$ has the closed form

$$D := D_{\mathrm{KL}}[Q(z)\,\|\,P(z)] = \tfrac{1}{2}\left\{(\mu_0-\mu_q)^\top\Lambda(\mu_0-\mu_q) + \operatorname{tr}(\Lambda\Sigma) - \ln|\Sigma| - \ln|\Lambda|\right\} + \text{constant}. \tag{1}$$

In variational Bayesian inference, we minimize $D$ while maximizing the expected log-probability of some observations under $Q(z)$. Closed-form derivatives of $D$ in the parameters of $Q$ are useful for writing hand-optimized code for larger problems. The derivatives of $D$ in $\mu_q$ are straightforward: $\nabla_{\mu_q} D = \Lambda(\mu_q-\mu_0)$ and $\mathbf{H}_{\mu_q} D = \Lambda$. In these notes, we explore derivatives of $D$ with respect to a few different parameterizations ($\theta$) of $\Sigma(\theta)$.
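
As a quick numerical sanity check on (1) and the mean gradient, here is a minimal NumPy sketch; function and variable names such as `kl_gauss` are illustrative, not from any particular library.

```python
# Minimal sketch: evaluate Eq. (1) (with the constant -L/2 included) and check
# the mean gradient Lambda (mu_q - mu_0) by finite differences.
import numpy as np

def kl_gauss(mu_q, Sigma, mu_0, Lam):
    """D_KL[ N(mu_q, Sigma) || N(mu_0, Lam^{-1}) ], Eq. (1)."""
    L = len(mu_q)
    d = mu_0 - mu_q
    return 0.5 * (d @ Lam @ d + np.trace(Lam @ Sigma)
                  - np.linalg.slogdet(Sigma)[1] - np.linalg.slogdet(Lam)[1] - L)

rng = np.random.default_rng(0)
L = 5
A = rng.standard_normal((L, L)); Sigma = A @ A.T + L * np.eye(L)
B = rng.standard_normal((L, L)); Lam   = B @ B.T + L * np.eye(L)
mu_q, mu_0 = rng.standard_normal(L), rng.standard_normal(L)

grad_mu = Lam @ (mu_q - mu_0)            # analytic gradient in mu_q
eps = 1e-6
grad_fd = np.array([
    (kl_gauss(mu_q + eps * e, Sigma, mu_0, Lam) -
     kl_gauss(mu_q - eps * e, Sigma, mu_0, Lam)) / (2 * eps)
    for e in np.eye(L)])
print(np.allclose(grad_mu, grad_fd, atol=1e-5))   # expect True
```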

We evaluate the following parameterizations for Σ:

  1. Optimizing the full $\Sigma$ directly
  2. $\Sigma \approx XX^\top$
  3. $\Sigma \approx A\,\mathrm{diag}[v]\,A^\top$
  4. $\Sigma \approx [\Lambda + \mathrm{diag}[p]]^{-1}$
  5. $\Sigma \approx F^\top Q Q^\top F$, where $Q \in \mathbb{R}^{K\times K}$ with $K<L$, and $F \in \mathbb{R}^{K\times L}$ with $FF^\top = I$


Optimizing Σ directly

We first obtain gradients of D in Σ (assuming Σ is full-rank). These can be used to derive gradients in θ for some parameterizations Σ(θ) using the chain rule. The gradient of D in Σ can be obtained using identities (57) and (100) in The Matrix Cookbook:

$$\nabla_\Sigma D = \nabla_\Sigma \tfrac{1}{2}\left\{\operatorname{tr}(\Lambda\Sigma) - \ln|\Sigma|\right\} = \tfrac{1}{2}\left(\Lambda - \Sigma^{-1}\right). \tag{2}$$

The Hessian in Σ is a fourth-order tensor. It's simpler to express the Hessian in terms of a Hessian-vector product, which can be used with Krylov subspace solvers to efficiently compute the update in Newton's method. Considering an L×L matrix M, the Hessian-vector product is given by

$$[\mathbf{H}_\Sigma D]\,M = \nabla_\Sigma\left\langle\nabla_\Sigma D,\,M\right\rangle = \nabla_\Sigma\operatorname{tr}\!\left[(\nabla_\Sigma D)^\top M\right], \tag{3}$$

where $\langle\cdot,\cdot\rangle$ denotes the scalar (Frobenius) product. This is given by identity (124) in The Matrix Cookbook:

$$\nabla_\Sigma\operatorname{tr}\!\left[\tfrac{1}{2}(\Lambda - \Sigma^{-1})M\right] = -\tfrac{1}{2}\nabla_\Sigma\operatorname{tr}\!\left[\Sigma^{-1}M\right] = \tfrac{1}{2}\,\Sigma^{-1}M\,\Sigma^{-1}. \tag{4}$$
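
A similar hedged sketch checks (2) and (4) by comparing the analytic Hessian-vector product against a finite-difference directional derivative of the gradient; again, all names are illustrative.

```python
# Sketch: gradient (2) and Hessian-vector product (4) for the full covariance,
# checked by a finite-difference directional derivative of the gradient.
import numpy as np

rng = np.random.default_rng(1)
L = 6
S = rng.standard_normal((L, L)); Sigma = S @ S.T + L * np.eye(L)
P = rng.standard_normal((L, L)); Lam   = P @ P.T + L * np.eye(L)

def grad_Sigma(Sigma):
    # Eq. (2): the mean term of D does not depend on Sigma
    return 0.5 * (Lam - np.linalg.inv(Sigma))

def hvp_Sigma(Sigma, M):
    # Eq. (4): [H_Sigma D] M = 1/2 Sigma^{-1} M Sigma^{-1}
    Si = np.linalg.inv(Sigma)
    return 0.5 * Si @ M @ Si

M = rng.standard_normal((L, L)); M = 0.5 * (M + M.T)   # symmetric direction
eps = 1e-6
hvp_fd = (grad_Sigma(Sigma + eps * M) - grad_Sigma(Sigma - eps * M)) / (2 * eps)
print(np.allclose(hvp_Sigma(Sigma, M), hvp_fd, atol=1e-4))   # expect True
```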
 


Optimizing $\Sigma \approx XX^\top$

We consider an approximate posterior covariance of the form

$$\Sigma \approx XX^\top,\qquad X \in \mathbb{R}^{L\times K}, \tag{5}$$

where $X$ is a rank-$K$ matrix ($K<L$) with $L$ rows and $K$ columns.

Since $X$ is not full rank, the log-determinant $\ln|\Sigma| = \ln|XX^\top|$ in (1) diverges, due to the zero eigenvalues in the null space of $X^\top$. However, since this null space is not being optimized, it does not affect our gradient. It is sufficient to replace the log-determinant with that of the reduced-rank representation, $\ln|X^\top X|$. Identity (55) in The Matrix Cookbook provides its derivative, $\nabla_X \ln|X^\top X| = 2X^{+\top}$, where $(\cdot)^+$ denotes the pseudoinverse. Combined with identity (112), this gives the following gradient of $D(X)$:

$$\nabla_X D = \nabla_X \tfrac{1}{2}\left\{\operatorname{tr}[\Lambda XX^\top] - \ln|X^\top X|\right\} = \Lambda X - X^{+\top}. \tag{6}$$

The Hessian-vector product requires the derivative $\nabla_X \operatorname{tr}[X^+ M]$:

$$\nabla_X\langle\nabla_X D,\,M\rangle = \nabla_X\operatorname{tr}\!\left[(\Lambda X - X^{+\top})^\top M\right] = \nabla_X\operatorname{tr}[X^\top\Lambda M] - \nabla_X\operatorname{tr}[X^+M]. \tag{7}$$

Golub and Pereyra (1972), Eq. (4.12), give the derivative of a fixed-rank pseudoinverse:

$$\partial X^+ = -X^+(\partial X)X^+ + X^+X^{+\top}(\partial X)^\top(I - XX^+) + (I - X^+X)(\partial X)^\top X^{+\top}X^+ \tag{8}$$

Since $X$ is $L\times K$ with rank $K$, $X^\top X$ is full rank. Therefore $X^+X = I_K$ and the final term in (8) vanishes. The derivative of the pseudoinverse can now be written as:

$$\partial X^+ = -X^+(\partial X)X^+ + X^+X^{+\top}(\partial X)^\top(I_L - XX^+) \tag{9}$$

Since the gradient of the trace of a matrix-valued function is just the transpose of the corresponding scalar derivative, substituting (9) gives

$$\nabla_X\operatorname{tr}[X^+M] = \left\{-X^+MX^+ + X^+X^{+\top}M^\top(I_L - XX^+)\right\}^\top = -X^{+\top}M^\top X^{+\top} + (I_L - XX^+)\,M\,X^+X^{+\top}. \tag{10}$$

Overall, we obtain the following Hessian-vector product:

$$\nabla_X\langle\nabla_X D,\,M\rangle = \Lambda M + X^{+\top}M^\top X^{+\top} - (I_L - XX^+)\,M\,X^+X^{+\top} \tag{11}$$
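
The reduced-rank gradient (6) and Hessian-vector product (11) can be checked the same way; in this sketch `np.linalg.pinv` stands in for the pseudoinverse, the objective keeps only the $\Sigma$-dependent terms of (1), and all names are illustrative.

```python
# Sketch: reduced-rank objective, gradient (6), and Hessian-vector product (11),
# checked against finite differences.
import numpy as np

rng = np.random.default_rng(2)
L, K = 8, 3
X = rng.standard_normal((L, K))
P = rng.standard_normal((L, L)); Lam = P @ P.T + L * np.eye(L)

def D_of_X(X):
    # 1/2 { tr(Lam X X^T) - ln|X^T X| }, the reduced-rank form of (1)
    return 0.5 * (np.trace(Lam @ X @ X.T) - np.linalg.slogdet(X.T @ X)[1])

def grad_X(X):
    # Eq. (6): Lam X - X^{+T}
    return Lam @ X - np.linalg.pinv(X).T

def hvp_X(X, M):
    # Eq. (11): Lam M + X^{+T} M^T X^{+T} - (I - X X^+) M X^+ X^{+T}
    Xp = np.linalg.pinv(X)
    return Lam @ M + Xp.T @ M.T @ Xp.T - (np.eye(L) - X @ Xp) @ M @ Xp @ Xp.T

M = rng.standard_normal((L, K))
eps = 1e-6
dD_fd = (D_of_X(X + eps * M) - D_of_X(X - eps * M)) / (2 * eps)
print(np.allclose(np.sum(grad_X(X) * M), dD_fd, atol=1e-5))    # checks (6)
hvp_fd = (grad_X(X + eps * M) - grad_X(X - eps * M)) / (2 * eps)
print(np.allclose(hvp_X(X, M), hvp_fd, atol=1e-4))             # checks (11)
```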


Optimizing $\Sigma = XX^\top$ when $X$ is full rank

Equations (6) and (11) are also valid if $X$ is a rank-$L$ triangular (Cholesky) factor of $\Sigma$. In this case the pseudoinverse can be replaced by the ordinary inverse, and several terms simplify:

$$\nabla_X D = \Lambda X - X^{-\top}, \qquad \nabla_X\langle\nabla_X D,\,M\rangle = \Lambda M + X^{-\top}M^\top X^{-\top} \tag{12}$$


Optimizing $\Sigma = A\,\mathrm{diag}[v]\,A^\top$

Let $\Sigma = A\,\mathrm{diag}[v]\,A^\top$, where $A$ is fixed (assumed square and invertible here, so the log-determinant is finite) and $v\in\mathbb{R}^L$ is a vector of free parameters. Define $\mathrm{diag}[\cdot]$ as an operator that constructs a diagonal matrix from a vector, or extracts the main diagonal if its argument is a matrix. The gradient of $D$ in $v$ is:

$$\nabla_v D = \nabla_v \tfrac{1}{2}\left\{\operatorname{tr}\!\left[\Lambda A\,\mathrm{diag}[v]\,A^\top\right] - \ln\left|A\,\mathrm{diag}[v]\,A^\top\right|\right\} = \tfrac{1}{2}\left\{\mathrm{diag}\!\left[A^\top\Lambda A\right] - \tfrac{1}{v}\right\} \tag{13}$$

where $1/v$ and $1/v^2$ denote element-wise reciprocals. The Hessian in $v$ is an ordinary matrix in this case:

$$\mathbf{H}_v D = \tfrac{1}{2}\,\mathrm{diag}\!\left[\tfrac{1}{v^2}\right]. \tag{14}$$

This parameterization is useful for spatiotemporal inference problems, where the matrix A represents a fixed convolution which can be evaluated using the Fast Fourier Transform (FFT).
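
A small sketch checking (13) and (14), assuming (as above) that $A$ is square and invertible; the variable names are illustrative.

```python
# Sketch: gradient (13) and Hessian (14) for Sigma = A diag(v) A^T,
# checked by finite differences.
import numpy as np

rng = np.random.default_rng(3)
L = 6
A = rng.standard_normal((L, L))
P = rng.standard_normal((L, L)); Lam = P @ P.T + L * np.eye(L)
v = rng.uniform(0.5, 2.0, size=L)

def D_of_v(v):
    Sigma = A @ np.diag(v) @ A.T
    return 0.5 * (np.trace(Lam @ Sigma) - np.linalg.slogdet(Sigma)[1])

def grad_of_v(v):
    return 0.5 * (np.diag(A.T @ Lam @ A) - 1.0 / v)   # Eq. (13)

hess_v = 0.5 * np.diag(1.0 / v**2)                    # Eq. (14)

eps = 1e-6
grad_fd = np.array([(D_of_v(v + eps * e) - D_of_v(v - eps * e)) / (2 * eps)
                    for e in np.eye(L)])
hess_fd = np.array([(grad_of_v(v + eps * e) - grad_of_v(v - eps * e)) / (2 * eps)
                    for e in np.eye(L)])
print(np.allclose(grad_of_v(v), grad_fd, atol=1e-5),
      np.allclose(hess_v, hess_fd, atol=1e-5))        # expect True True
```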


Inverse-diagonal approximation

Let $\Sigma^{-1} = \Lambda + \mathrm{diag}[p]$. To obtain the gradient in $p$, combine the derivatives $\nabla_\Sigma D$ (Eq. (2)) and $\partial_p\Sigma$ using the chain rule. If $f(\Sigma)$ is a function of $\Sigma$, and $\Sigma(\theta_i)$ is a function of a parameter $\theta_i$, then the chain rule is (The Matrix Cookbook, Eq. 136):

$$\partial_{\theta_i} f = \left\langle\nabla_\Sigma f,\ \partial_{\theta_i}\Sigma\right\rangle = \sum_{kl}\left(\partial_{\Sigma_{kl}} f\right)\left(\partial_{\theta_i}\Sigma_{kl}\right) \tag{15}$$

From (2) we have $\nabla_\Sigma D = \tfrac{1}{2}(\Lambda - \Sigma^{-1})$; since $\Sigma^{-1} = \Lambda + \mathrm{diag}[p]$, this simplifies to:

$$\nabla_\Sigma D = \tfrac{1}{2}\left(\Lambda - \Sigma^{-1}\right) = \tfrac{1}{2}\left(\Lambda - \Lambda - \mathrm{diag}[p]\right) = -\tfrac{1}{2}\,\mathrm{diag}[p] \tag{16}$$

We also need $\partial_{p_i}\Sigma$. Let $Y = \Sigma^{-1}$. The derivative of $Y^{-1}$ is given by identity (59) in The Matrix Cookbook: $\partial Y^{-1} = -Y^{-1}(\partial Y)\,Y^{-1}$. Using this, we can obtain $\partial_{p_i}\Sigma$:

$$\partial_{p_i}\Sigma = \partial_{p_i}Y^{-1} = -Y^{-1}(\partial_{p_i}Y)Y^{-1} = -\Sigma(\partial_{p_i}\Sigma^{-1})\Sigma = -\Sigma\,\partial_{p_i}\!\left[\Lambda + \mathrm{diag}[p]\right]\Sigma = -\Sigma J^{ii}\Sigma = -\sigma_i\sigma_i^\top \tag{17}$$

where $\sigma_i$ is the $i$-th row (equivalently, column) of the symmetric matrix $\Sigma$, and $J^{ii}$ is a matrix that is zero everywhere except at index $(i,i)$, where it is 1.

Applying (15) to (16) and (17) for a particular element $p_i$ gives:

$$\partial_{p_i} D = \sum_{kl}\left[\nabla_\Sigma D\right]_{kl}\left[\partial_{p_i}\Sigma\right]_{kl} = \sum_{kl}\left\{-\tfrac{1}{2}\mathrm{diag}[p]\right\}_{kl}\left\{-\sigma_i\sigma_i^\top\right\}_{kl} = \tfrac{1}{2}\sum_{kl}\delta_{kl}\,p_k\,\sigma_{ik}\sigma_{il} = \tfrac{1}{2}\sum_{k}p_k\,\sigma_{ik}^2 = \tfrac{1}{2}\,p^\top\sigma_i^2 \tag{18}$$

where $(\cdot)^2$ denotes the element-wise square of a vector or matrix. In matrix notation, this is:

$$\nabla_p D = \tfrac{1}{2}\,\Sigma^2 p = \tfrac{1}{2}\,\mathrm{diag}\!\left[\Sigma\,\mathrm{diag}[p]\,\Sigma\right], \tag{19}$$

The Hessian-vector product is cumbersome here, since every term in the expression $\Sigma\,\mathrm{diag}[p]\,\Sigma$ depends on $p$. In the case of the log-linear Poisson GLM, this gradient simplifies further and optimization becomes tractable. We will explore this in later notes.

This parameterization resembles the closed-form covariance update for a linear-Gaussian model, in which $1/p$ is a vector of measurement-noise variances. It is also a useful parameterization for variational Bayesian inference in non-conjugate Generalized Linear Models (GLMs), where $p$ becomes a free parameter to be estimated.
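
Equation (19) is also easy to verify numerically; the sketch below checks both forms of (19) against finite differences (names are illustrative).

```python
# Sketch: gradient (19) for Sigma = (Lam + diag(p))^{-1}, in both forms,
# checked by finite differences.
import numpy as np

rng = np.random.default_rng(4)
L = 6
P = rng.standard_normal((L, L)); Lam = P @ P.T + L * np.eye(L)
p = rng.uniform(0.1, 1.0, size=L)

def D_of_p(p):
    Sigma = np.linalg.inv(Lam + np.diag(p))
    return 0.5 * (np.trace(Lam @ Sigma) - np.linalg.slogdet(Sigma)[1])

Sigma    = np.linalg.inv(Lam + np.diag(p))
grad_p   = 0.5 * (Sigma**2) @ p                         # Eq. (19), element-wise square
grad_alt = 0.5 * np.diag(Sigma @ np.diag(p) @ Sigma)    # Eq. (19), second form

eps = 1e-6
grad_fd = np.array([(D_of_p(p + eps * e) - D_of_p(p - eps * e)) / (2 * eps)
                    for e in np.eye(L)])
print(np.allclose(grad_p, grad_alt), np.allclose(grad_p, grad_fd, atol=1e-5))
```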


Optimizing $\Sigma = F^\top QQ^\top F$

Let $\Sigma = F^\top Q Q^\top F$, where $Q\in\mathbb{R}^{K\times K}$ (with $K<L$) is the free parameter and $F\in\mathbb{R}^{K\times L}$ is a fixed transformation. If $Q$ is a lower-triangular matrix, then this approximation involves optimizing $K(K+1)/2$ parameters.

Since the trace is invariant under cyclic permutation, $\operatorname{tr}[\Lambda F^\top QQ^\top F] = \operatorname{tr}[F\Lambda F^\top QQ^\top]$. The derivatives therefore have the same form as (12), with $\tilde\Lambda = F\Lambda F^\top$:

$$\nabla_Q D = \tilde\Lambda Q - Q^{-\top} = F\Lambda F^\top Q - Q^{-\top}, \qquad \nabla_Q\langle\nabla_Q D,\,M\rangle = \tilde\Lambda M + Q^{-\top}M^\top Q^{-\top} = F\Lambda F^\top M + Q^{-\top}M^\top Q^{-\top} \tag{20}$$

This form is convenient for spatiotemporal inference problems that are sparse in frequency space. In this application, $F$ corresponds to a (unitary) Fourier transform with all but $K$ of the resulting frequency components discarded. The product of $F$ with a vector $v$ can be computed in $O[L\log(L)]$ time using the Fast Fourier Transform (FFT). Alternatively, if $K\in O(\log(L))$, it is faster to simply compute $Fv$ directly. Furthermore, if $F$ is semi-orthogonal ($FF^\top = I$), then the calculation of $F^\top Q$ can be re-used (for example, $\mathrm{diag}[\Sigma] = [(F^\top Q)^2]\,\mathbf{1}$, where the square is element-wise and $\mathbf{1}$ is a vector of ones).
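
A hedged sketch of the gradient in (20): to keep it self-contained, $F$ is built here from $K$ orthonormal rows of a QR factor rather than a subsampled Fourier transform, and all names are illustrative.

```python
# Sketch: gradient (20) for Sigma = F^T Q Q^T F, with F semi-orthogonal
# (F F^T = I_K), checked by finite differences.
import numpy as np

rng = np.random.default_rng(5)
L, K = 16, 4
F = np.linalg.qr(rng.standard_normal((L, L)))[0][:, :K].T   # K x L, F F^T = I
P = rng.standard_normal((L, L)); Lam = P @ P.T + L * np.eye(L)
Q = np.tril(rng.standard_normal((K, K))) + K * np.eye(K)    # invertible, lower-triangular

def D_of_Q(Q):
    X = F.T @ Q                 # Sigma = X X^T; log-det taken in the K-dim subspace
    return 0.5 * (np.trace(Lam @ X @ X.T) - np.linalg.slogdet(Q @ Q.T)[1])

Lam_t  = F @ Lam @ F.T                                # "effective" prior precision
grad_Q = Lam_t @ Q - np.linalg.inv(Q).T               # Eq. (20)

eps = 1e-6
M = rng.standard_normal((K, K))
dD_fd = (D_of_Q(Q + eps * M) - D_of_Q(Q - eps * M)) / (2 * eps)
print(np.allclose(np.sum(grad_Q * M), dD_fd, atol=1e-5))    # expect True
```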


Conclusion

These notes provide the gradients and Hessian-vector products for four simplified parameterizations of the posterior covariance matrix for variational Gaussian process inference. If combined with the gradients and Hessian-vector products for the expected log-likelihood, these expressions can be used with Krylov-subspace solvers to compute the Newton-Raphson update to optimize Σ.
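
As an illustrative sketch of that usage (assuming SciPy is available, and omitting the likelihood terms), the Hessian-vector product (4) can be wrapped in a matrix-free linear operator and handed to a conjugate-gradient solver to obtain the Newton direction for the full-covariance case; all names below are illustrative.

```python
# Sketch: use the Hessian-vector product (4) inside a matrix-free conjugate-
# gradient solve to obtain a Newton direction for the full-covariance case.
# In a real application the expected log-likelihood's gradient and HVP would
# be added to `grad` and `hvp` below.
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(6)
L = 8
S = rng.standard_normal((L, L)); Sigma = S @ S.T + L * np.eye(L)
P = rng.standard_normal((L, L)); Lam   = P @ P.T + L * np.eye(L)
Si = np.linalg.inv(Sigma)

grad = 0.5 * (Lam - Si)                       # Eq. (2)

def hvp(m_flat):
    # Eq. (4), acting on vec(M): 1/2 Sigma^{-1} M Sigma^{-1}
    M = m_flat.reshape(L, L)
    return (0.5 * Si @ M @ Si).ravel()

H = LinearOperator((L * L, L * L), matvec=hvp)
step, info = cg(H, -grad.ravel())             # Krylov solve for the Newton direction
dSigma = step.reshape(L, L)

# For this KL-only objective the solve has the closed form -2 Sigma grad Sigma,
# which we can use to confirm the Krylov solution (up to the solver tolerance).
print(info == 0, np.allclose(dSigma, -2 * Sigma @ grad @ Sigma, atol=1e-3))
```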

We evaluated the following parameterizations for Σ:

  • $\Sigma$ (optimized directly):
$$\nabla_\Sigma D = \tfrac{1}{2}(\Lambda - \Sigma^{-1}), \qquad \nabla_\Sigma\langle\nabla_\Sigma D,\,M\rangle = \tfrac{1}{2}\,\Sigma^{-1}M\,\Sigma^{-1} \tag{21}$$
  • $\Sigma \approx XX^\top$:
$$\nabla_X D = \Lambda X - X^{+\top}, \qquad \nabla_X\langle\nabla_X D,\,M\rangle = \Lambda M + X^{+\top}M^\top X^{+\top} - (I_L - XX^+)\,M\,X^+X^{+\top} \tag{22}$$
  • $\Sigma \approx A\,\mathrm{diag}[v]\,A^\top$:
$$\nabla_v D = \tfrac{1}{2}\left\{\mathrm{diag}[A^\top\Lambda A] - \tfrac{1}{v}\right\}, \qquad [\mathbf{H}_v D]\,u = \tfrac{1}{2}\,\mathrm{diag}\!\left[\tfrac{1}{v^2}\right]u \tag{23}$$
  • $\Sigma \approx [\Lambda + \mathrm{diag}[p]]^{-1}$:
$$\nabla_p D = \tfrac{1}{2}\,\Sigma^2 p = \tfrac{1}{2}\,\mathrm{diag}\!\left[\Sigma\,\mathrm{diag}[p]\,\Sigma\right] \tag{24}$$
  • $\Sigma \approx F^\top QQ^\top F$:
$$\nabla_Q D = F\Lambda F^\top Q - Q^{-\top}, \qquad \nabla_Q\langle\nabla_Q D,\,M\rangle = F\Lambda F^\top M + Q^{-\top}M^\top Q^{-\top} \tag{25}$$

In future notes, we will consider the full derivatives required for variational latent Gaussian-process inference for the Poisson and probit generalized linear models.
