Wednesday, March 25, 2020

Note: Derivatives of Gaussian KL-Divergence for some parameterizations of the posterior covariance for variational Gaussian-process inference

These notes provide the derivatives of the KL divergence $D_{\mathrm{KL}}[Q(z)\,\|\,P(z)]$ between two multivariate Gaussian distributions $Q(z)$ and $P(z)$ with respect to a few parameterizations $\theta$ of the covariance matrix $\Sigma(\theta)$ of $Q$. This is useful for variational Gaussian-process inference, where clever parameterizations of the posterior covariance are required to make the problem tractable. Tables for differentiating matrix-valued functions can be found in The Matrix Cookbook.

Consider two multivariate Gaussian distributions $Q(z)=\mathcal N(\mu_q,\Sigma(\theta))$ and $P(z)=\mathcal N(\mu_0,\Sigma_0=\Lambda^{-1})$ of dimension $L$. The KL divergence $D_{\mathrm{KL}}[Q(z)\,\|\,P(z)]$ has the closed form

$$D := D_{\mathrm{KL}}[Q(z)\,\|\,P(z)] = \tfrac{1}{2}\left\{(\mu_0-\mu_q)^\top\Lambda(\mu_0-\mu_q) + \operatorname{tr}(\Lambda\Sigma) - \ln|\Sigma| - \ln|\Lambda|\right\} + \text{constant}. \tag{1}$$

In variational Bayesian inference, we minimize $D$ while maximizing the expected log-probability of some observations under $Q(z)$. Closed-form derivatives of $D$ in the parameters of $Q$ are useful for writing hand-optimized code for larger problems. The derivatives of $D$ in $\mu_q$ are straightforward: $\nabla_{\mu_q} D = \Lambda(\mu_q-\mu_0)$ and $\mathbf{H}_{\mu_q} D = \Lambda$. In these notes, we explore derivatives of $D$ with respect to a few different parameterizations ($\theta$) of $\Sigma(\theta)$.
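
As a quick numerical sanity check on (1) and the mean gradient, here is a minimal NumPy sketch; function and variable names such as `kl_gauss` are illustrative, not from any particular library.

```python
# Minimal sketch: evaluate Eq. (1) (with the constant -L/2 included) and check
# the mean gradient Lambda (mu_q - mu_0) by finite differences.
import numpy as np

def kl_gauss(mu_q, Sigma, mu_0, Lam):
    """D_KL[ N(mu_q, Sigma) || N(mu_0, Lam^{-1}) ], Eq. (1)."""
    L = len(mu_q)
    d = mu_0 - mu_q
    return 0.5 * (d @ Lam @ d + np.trace(Lam @ Sigma)
                  - np.linalg.slogdet(Sigma)[1] - np.linalg.slogdet(Lam)[1] - L)

rng = np.random.default_rng(0)
L = 5
A = rng.standard_normal((L, L)); Sigma = A @ A.T + L * np.eye(L)
B = rng.standard_normal((L, L)); Lam   = B @ B.T + L * np.eye(L)
mu_q, mu_0 = rng.standard_normal(L), rng.standard_normal(L)

grad_mu = Lam @ (mu_q - mu_0)            # analytic gradient in mu_q
eps = 1e-6
grad_fd = np.array([
    (kl_gauss(mu_q + eps * e, Sigma, mu_0, Lam) -
     kl_gauss(mu_q - eps * e, Sigma, mu_0, Lam)) / (2 * eps)
    for e in np.eye(L)])
print(np.allclose(grad_mu, grad_fd, atol=1e-5))   # expect True
```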

We evaluate the following parameterizations for Σ:

  1. Optimizing the full $\Sigma$ directly
  2. $\Sigma \approx XX^\top$
  3. $\Sigma \approx A\,\mathrm{diag}[v]\,A^\top$
  4. $\Sigma \approx [\Lambda + \mathrm{diag}[p]]^{-1}$
  5. $\Sigma \approx F^\top Q Q^\top F$, where $Q \in \mathbb{R}^{K\times K}$ with $K<L$, and $F \in \mathbb{R}^{K\times L}$ with $FF^\top = I$


Optimizing Σ directly

We first obtain gradients of D in Σ (assuming Σ is full-rank). These can be used to derive gradients in θ for some parameterizations Σ(θ) using the chain rule. The gradient of D in Σ can be obtained using identities (57) and (100) in The Matrix Cookbook:

$$\nabla_\Sigma D = \nabla_\Sigma \tfrac{1}{2}\left\{\operatorname{tr}(\Lambda\Sigma) - \ln|\Sigma|\right\} = \tfrac{1}{2}\left(\Lambda - \Sigma^{-1}\right). \tag{2}$$

The Hessian in Σ is a fourth-order tensor. It's simpler to express the Hessian in terms of a Hessian-vector product, which can be used with Krylov subspace solvers to efficiently compute the update in Newton's method. Considering an L×L matrix M, the Hessian-vector product is given by

$$[\mathbf{H}_\Sigma D]\,M = \nabla_\Sigma\left\langle\nabla_\Sigma D,\,M\right\rangle = \nabla_\Sigma\operatorname{tr}\!\left[(\nabla_\Sigma D)^\top M\right], \tag{3}$$

where $\langle\cdot,\cdot\rangle$ denotes the scalar (Frobenius) product. This is given by identity (124) in The Matrix Cookbook:

$$\nabla_\Sigma\operatorname{tr}\!\left[\tfrac{1}{2}(\Lambda - \Sigma^{-1})M\right] = -\tfrac{1}{2}\nabla_\Sigma\operatorname{tr}\!\left[\Sigma^{-1}M\right] = \tfrac{1}{2}\,\Sigma^{-1}M\,\Sigma^{-1}. \tag{4}$$
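
A similar hedged sketch checks (2) and (4) by comparing the analytic Hessian-vector product against a finite-difference directional derivative of the gradient; again, all names are illustrative.

```python
# Sketch: gradient (2) and Hessian-vector product (4) for the full covariance,
# checked by a finite-difference directional derivative of the gradient.
import numpy as np

rng = np.random.default_rng(1)
L = 6
S = rng.standard_normal((L, L)); Sigma = S @ S.T + L * np.eye(L)
P = rng.standard_normal((L, L)); Lam   = P @ P.T + L * np.eye(L)

def grad_Sigma(Sigma):
    # Eq. (2): the mean term of D does not depend on Sigma
    return 0.5 * (Lam - np.linalg.inv(Sigma))

def hvp_Sigma(Sigma, M):
    # Eq. (4): [H_Sigma D] M = 1/2 Sigma^{-1} M Sigma^{-1}
    Si = np.linalg.inv(Sigma)
    return 0.5 * Si @ M @ Si

M = rng.standard_normal((L, L)); M = 0.5 * (M + M.T)   # symmetric direction
eps = 1e-6
hvp_fd = (grad_Sigma(Sigma + eps * M) - grad_Sigma(Sigma - eps * M)) / (2 * eps)
print(np.allclose(hvp_Sigma(Sigma, M), hvp_fd, atol=1e-4))   # expect True
```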
 


Optimizing $\Sigma \approx XX^\top$

We consider an approximate posterior covariance of the form

$$\Sigma \approx XX^\top,\qquad X \in \mathbb{R}^{L\times K}, \tag{5}$$

where $X$ is a rank-$K$ matrix ($K<L$) with $L$ rows and $K$ columns.

Since $X$ is not full rank, the log-determinant $\ln|\Sigma| = \ln|XX^\top|$ in (1) diverges, due to the zero eigenvalues in the null space of $X^\top$. However, since this null space is not being optimized, it does not affect our gradient. It is sufficient to replace the log-determinant with that of the reduced-rank representation, $\ln|X^\top X|$. Identity (55) in The Matrix Cookbook provides its derivative, $\nabla_X \ln|X^\top X| = 2X^{+\top}$, where $(\cdot)^+$ denotes the pseudoinverse. Combined with identity (112), this gives the following gradient of $D(X)$:

$$\nabla_X D = \nabla_X \tfrac{1}{2}\left\{\operatorname{tr}[\Lambda XX^\top] - \ln|X^\top X|\right\} = \Lambda X - X^{+\top}. \tag{6}$$

The Hessian-vector product requires the derivative $\nabla_X \operatorname{tr}[X^+ M]$:

$$\nabla_X\langle\nabla_X D,\,M\rangle = \nabla_X\operatorname{tr}\!\left[(\Lambda X - X^{+\top})^\top M\right] = \nabla_X\operatorname{tr}[X^\top\Lambda M] - \nabla_X\operatorname{tr}[X^+M]. \tag{7}$$

Golub and Pereyra (1972), Eq. (4.12), give the derivative of a fixed-rank pseudoinverse:

$$\partial X^+ = -X^+(\partial X)X^+ + X^+X^{+\top}(\partial X)^\top(I - XX^+) + (I - X^+X)(\partial X)^\top X^{+\top}X^+ \tag{8}$$

Since $X$ is $L\times K$ with rank $K$, $X^\top X$ is full rank. Therefore $X^+X = I_K$ and the final term in (8) vanishes. The derivative of the pseudoinverse can now be written as:

$$\partial X^+ = -X^+(\partial X)X^+ + X^+X^{+\top}(\partial X)^\top(I_L - XX^+) \tag{9}$$

Since the gradient of the trace of a matrix-valued function is just the transpose of the corresponding scalar derivative, substituting (9) gives

$$\nabla_X\operatorname{tr}[X^+M] = \left\{-X^+MX^+ + X^+X^{+\top}M^\top(I_L - XX^+)\right\}^\top = -X^{+\top}M^\top X^{+\top} + (I_L - XX^+)\,M\,X^+X^{+\top}. \tag{10}$$

Overall, we obtain the following Hessian-vector product:

$$\nabla_X\langle\nabla_X D,\,M\rangle = \Lambda M + X^{+\top}M^\top X^{+\top} - (I_L - XX^+)\,M\,X^+X^{+\top} \tag{11}$$
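
The reduced-rank gradient (6) and Hessian-vector product (11) can be checked the same way; in this sketch `np.linalg.pinv` stands in for the pseudoinverse, the objective keeps only the $\Sigma$-dependent terms of (1), and all names are illustrative.

```python
# Sketch: reduced-rank objective, gradient (6), and Hessian-vector product (11),
# checked against finite differences.
import numpy as np

rng = np.random.default_rng(2)
L, K = 8, 3
X = rng.standard_normal((L, K))
P = rng.standard_normal((L, L)); Lam = P @ P.T + L * np.eye(L)

def D_of_X(X):
    # 1/2 { tr(Lam X X^T) - ln|X^T X| }, the reduced-rank form of (1)
    return 0.5 * (np.trace(Lam @ X @ X.T) - np.linalg.slogdet(X.T @ X)[1])

def grad_X(X):
    # Eq. (6): Lam X - X^{+T}
    return Lam @ X - np.linalg.pinv(X).T

def hvp_X(X, M):
    # Eq. (11): Lam M + X^{+T} M^T X^{+T} - (I - X X^+) M X^+ X^{+T}
    Xp = np.linalg.pinv(X)
    return Lam @ M + Xp.T @ M.T @ Xp.T - (np.eye(L) - X @ Xp) @ M @ Xp @ Xp.T

M = rng.standard_normal((L, K))
eps = 1e-6
dD_fd = (D_of_X(X + eps * M) - D_of_X(X - eps * M)) / (2 * eps)
print(np.allclose(np.sum(grad_X(X) * M), dD_fd, atol=1e-5))    # checks (6)
hvp_fd = (grad_X(X + eps * M) - grad_X(X - eps * M)) / (2 * eps)
print(np.allclose(hvp_X(X, M), hvp_fd, atol=1e-4))             # checks (11)
```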


Optimizing $\Sigma = XX^\top$ when $X$ is full rank

Equations (6) and (11) are also valid if $X$ is a rank-$L$ triangular (Cholesky) factor of $\Sigma$. In this case the pseudoinverse can be replaced by the ordinary inverse, and several terms simplify:

$$\nabla_X D = \Lambda X - X^{-\top}, \qquad \nabla_X\langle\nabla_X D,\,M\rangle = \Lambda M + X^{-\top}M^\top X^{-\top} \tag{12}$$


Optimizing $\Sigma = A\,\mathrm{diag}[v]\,A^\top$

Let $\Sigma = A\,\mathrm{diag}[v]\,A^\top$, where $A$ is fixed (assumed square and invertible here, so the log-determinant is finite) and $v\in\mathbb{R}^L$ is a vector of free parameters. Define $\mathrm{diag}[\cdot]$ as an operator that constructs a diagonal matrix from a vector, or extracts the main diagonal if its argument is a matrix. The gradient of $D$ in $v$ is:

$$\nabla_v D = \nabla_v \tfrac{1}{2}\left\{\operatorname{tr}\!\left[\Lambda A\,\mathrm{diag}[v]\,A^\top\right] - \ln\left|A\,\mathrm{diag}[v]\,A^\top\right|\right\} = \tfrac{1}{2}\left\{\mathrm{diag}\!\left[A^\top\Lambda A\right] - \tfrac{1}{v}\right\} \tag{13}$$

where $1/v$ and $1/v^2$ denote element-wise reciprocals. The Hessian in $v$ is an ordinary matrix in this case:

$$\mathbf{H}_v D = \tfrac{1}{2}\,\mathrm{diag}\!\left[\tfrac{1}{v^2}\right]. \tag{14}$$

This parameterization is useful for spatiotemporal inference problems, where the matrix A represents a fixed convolution which can be evaluated using the Fast Fourier Transform (FFT).
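
A small sketch checking (13) and (14), assuming (as above) that $A$ is square and invertible; the variable names are illustrative.

```python
# Sketch: gradient (13) and Hessian (14) for Sigma = A diag(v) A^T,
# checked by finite differences.
import numpy as np

rng = np.random.default_rng(3)
L = 6
A = rng.standard_normal((L, L))
P = rng.standard_normal((L, L)); Lam = P @ P.T + L * np.eye(L)
v = rng.uniform(0.5, 2.0, size=L)

def D_of_v(v):
    Sigma = A @ np.diag(v) @ A.T
    return 0.5 * (np.trace(Lam @ Sigma) - np.linalg.slogdet(Sigma)[1])

def grad_of_v(v):
    return 0.5 * (np.diag(A.T @ Lam @ A) - 1.0 / v)   # Eq. (13)

hess_v = 0.5 * np.diag(1.0 / v**2)                    # Eq. (14)

eps = 1e-6
grad_fd = np.array([(D_of_v(v + eps * e) - D_of_v(v - eps * e)) / (2 * eps)
                    for e in np.eye(L)])
hess_fd = np.array([(grad_of_v(v + eps * e) - grad_of_v(v - eps * e)) / (2 * eps)
                    for e in np.eye(L)])
print(np.allclose(grad_of_v(v), grad_fd, atol=1e-5),
      np.allclose(hess_v, hess_fd, atol=1e-5))        # expect True True
```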


Inverse-diagonal approximation

Let $\Sigma^{-1} = \Lambda + \mathrm{diag}[p]$. To obtain the gradient in $p$, combine the derivatives $\nabla_\Sigma D$ (Eq. (2)) and $\partial_p\Sigma$ using the chain rule. If $f(\Sigma)$ is a function of $\Sigma$, and $\Sigma(\theta_i)$ is a function of a parameter $\theta_i$, then the chain rule is (The Matrix Cookbook, Eq. 136):

$$\partial_{\theta_i} f = \left\langle\nabla_\Sigma f,\ \partial_{\theta_i}\Sigma\right\rangle = \sum_{kl}\left(\partial_{\Sigma_{kl}} f\right)\left(\partial_{\theta_i}\Sigma_{kl}\right) \tag{15}$$

From (2) we have $\nabla_\Sigma D = \tfrac{1}{2}(\Lambda - \Sigma^{-1})$; since $\Sigma^{-1} = \Lambda + \mathrm{diag}[p]$, this simplifies to:

$$\nabla_\Sigma D = \tfrac{1}{2}\left(\Lambda - \Sigma^{-1}\right) = \tfrac{1}{2}\left(\Lambda - \Lambda - \mathrm{diag}[p]\right) = -\tfrac{1}{2}\,\mathrm{diag}[p] \tag{16}$$

We also need $\partial_{p_i}\Sigma$. Let $Y = \Sigma^{-1}$. The derivative of $Y^{-1}$ is given by identity (59) in The Matrix Cookbook: $\partial Y^{-1} = -Y^{-1}(\partial Y)\,Y^{-1}$. Using this, we can obtain $\partial_{p_i}\Sigma$:

$$\partial_{p_i}\Sigma = \partial_{p_i}Y^{-1} = -Y^{-1}(\partial_{p_i}Y)Y^{-1} = -\Sigma(\partial_{p_i}\Sigma^{-1})\Sigma = -\Sigma\,\partial_{p_i}\!\left[\Lambda + \mathrm{diag}[p]\right]\Sigma = -\Sigma J^{ii}\Sigma = -\sigma_i\sigma_i^\top \tag{17}$$

where $\sigma_i$ is the $i$-th row (equivalently, column) of the symmetric matrix $\Sigma$, and $J^{ii}$ is a matrix that is zero everywhere except at index $(i,i)$, where it is 1.

Applying (15) to (16) and (17) for a particular element $p_i$ gives:

$$\partial_{p_i} D = \sum_{kl}\left[\nabla_\Sigma D\right]_{kl}\left[\partial_{p_i}\Sigma\right]_{kl} = \sum_{kl}\left\{-\tfrac{1}{2}\mathrm{diag}[p]\right\}_{kl}\left\{-\sigma_i\sigma_i^\top\right\}_{kl} = \tfrac{1}{2}\sum_{kl}\delta_{kl}\,p_k\,\sigma_{ik}\sigma_{il} = \tfrac{1}{2}\sum_{k}p_k\,\sigma_{ik}^2 = \tfrac{1}{2}\,p^\top\sigma_i^2 \tag{18}$$

where $(\cdot)^2$ denotes the element-wise square of a vector or matrix. In matrix notation, this is:

$$\nabla_p D = \tfrac{1}{2}\,\Sigma^2 p = \tfrac{1}{2}\,\mathrm{diag}\!\left[\Sigma\,\mathrm{diag}[p]\,\Sigma\right], \tag{19}$$

The Hessian-vector product is cumbersome here, since every term in the expression $\Sigma\,\mathrm{diag}[p]\,\Sigma$ depends on $p$. In the case of the log-linear Poisson GLM, this gradient simplifies further and optimization becomes tractable. We will explore this in later notes.

This parameterization resembles the closed-form covariance update for a linear-Gaussian model, in which $1/p$ is a vector of measurement-noise variances. It is also a useful parameterization for variational Bayesian inference in non-conjugate Generalized Linear Models (GLMs), where $p$ becomes a free parameter to be estimated.
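
Equation (19) is also easy to verify numerically; the sketch below checks both forms of (19) against finite differences (names are illustrative).

```python
# Sketch: gradient (19) for Sigma = (Lam + diag(p))^{-1}, in both forms,
# checked by finite differences.
import numpy as np

rng = np.random.default_rng(4)
L = 6
P = rng.standard_normal((L, L)); Lam = P @ P.T + L * np.eye(L)
p = rng.uniform(0.1, 1.0, size=L)

def D_of_p(p):
    Sigma = np.linalg.inv(Lam + np.diag(p))
    return 0.5 * (np.trace(Lam @ Sigma) - np.linalg.slogdet(Sigma)[1])

Sigma    = np.linalg.inv(Lam + np.diag(p))
grad_p   = 0.5 * (Sigma**2) @ p                         # Eq. (19), element-wise square
grad_alt = 0.5 * np.diag(Sigma @ np.diag(p) @ Sigma)    # Eq. (19), second form

eps = 1e-6
grad_fd = np.array([(D_of_p(p + eps * e) - D_of_p(p - eps * e)) / (2 * eps)
                    for e in np.eye(L)])
print(np.allclose(grad_p, grad_alt), np.allclose(grad_p, grad_fd, atol=1e-5))
```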


Optimizing $\Sigma = F^\top QQ^\top F$

Let $\Sigma = F^\top Q Q^\top F$, where $Q\in\mathbb{R}^{K\times K}$ (with $K<L$) is the free parameter and $F\in\mathbb{R}^{K\times L}$ is a fixed transformation. If $Q$ is a lower-triangular matrix, then this approximation involves optimizing $K(K+1)/2$ parameters.

Since the trace is invariant under cyclic permutation, $\operatorname{tr}[\Lambda F^\top QQ^\top F] = \operatorname{tr}[F\Lambda F^\top QQ^\top]$. The derivatives therefore have the same form as (12), with $\tilde\Lambda = F\Lambda F^\top$:

$$\nabla_Q D = \tilde\Lambda Q - Q^{-\top} = F\Lambda F^\top Q - Q^{-\top}, \qquad \nabla_Q\langle\nabla_Q D,\,M\rangle = \tilde\Lambda M + Q^{-\top}M^\top Q^{-\top} = F\Lambda F^\top M + Q^{-\top}M^\top Q^{-\top} \tag{20}$$

This form is convenient for spatiotemporal inference problems that are sparse in frequency space. In this application, $F$ corresponds to a (unitary) Fourier transform with all but $K$ of the resulting frequency components discarded. The product of $F$ with a vector $v$ can be computed in $O[L\log(L)]$ time using the Fast Fourier Transform (FFT). Alternatively, if $K\in O(\log(L))$, it is faster to simply compute $Fv$ directly. Furthermore, if $F$ is semi-orthogonal ($FF^\top = I$), then the calculation of $F^\top Q$ can be re-used (for example, $\mathrm{diag}[\Sigma] = [(F^\top Q)^2]\,\mathbf{1}$, where the square is element-wise and $\mathbf{1}$ is a vector of ones).
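
A hedged sketch of the gradient in (20): to keep it self-contained, $F$ is built here from $K$ orthonormal rows of a QR factor rather than a subsampled Fourier transform, and all names are illustrative.

```python
# Sketch: gradient (20) for Sigma = F^T Q Q^T F, with F semi-orthogonal
# (F F^T = I_K), checked by finite differences.
import numpy as np

rng = np.random.default_rng(5)
L, K = 16, 4
F = np.linalg.qr(rng.standard_normal((L, L)))[0][:, :K].T   # K x L, F F^T = I
P = rng.standard_normal((L, L)); Lam = P @ P.T + L * np.eye(L)
Q = np.tril(rng.standard_normal((K, K))) + K * np.eye(K)    # invertible, lower-triangular

def D_of_Q(Q):
    X = F.T @ Q                 # Sigma = X X^T; log-det taken in the K-dim subspace
    return 0.5 * (np.trace(Lam @ X @ X.T) - np.linalg.slogdet(Q @ Q.T)[1])

Lam_t  = F @ Lam @ F.T                                # "effective" prior precision
grad_Q = Lam_t @ Q - np.linalg.inv(Q).T               # Eq. (20)

eps = 1e-6
M = rng.standard_normal((K, K))
dD_fd = (D_of_Q(Q + eps * M) - D_of_Q(Q - eps * M)) / (2 * eps)
print(np.allclose(np.sum(grad_Q * M), dD_fd, atol=1e-5))    # expect True
```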


Conclusion

These notes provide the gradients and Hessian-vector products for four simplified parameterizations of the posterior covariance matrix for variational Gaussian process inference. If combined with the gradients and Hessian-vector products for the expected log-likelihood, these expressions can be used with Krylov-subspace solvers to compute the Newton-Raphson update to optimize Σ.
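
As an illustrative sketch of that usage (assuming SciPy is available, and omitting the likelihood terms), the Hessian-vector product (4) can be wrapped in a matrix-free linear operator and handed to a conjugate-gradient solver to obtain the Newton direction for the full-covariance case; all names below are illustrative.

```python
# Sketch: use the Hessian-vector product (4) inside a matrix-free conjugate-
# gradient solve to obtain a Newton direction for the full-covariance case.
# In a real application the expected log-likelihood's gradient and HVP would
# be added to `grad` and `hvp` below.
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(6)
L = 8
S = rng.standard_normal((L, L)); Sigma = S @ S.T + L * np.eye(L)
P = rng.standard_normal((L, L)); Lam   = P @ P.T + L * np.eye(L)
Si = np.linalg.inv(Sigma)

grad = 0.5 * (Lam - Si)                       # Eq. (2)

def hvp(m_flat):
    # Eq. (4), acting on vec(M): 1/2 Sigma^{-1} M Sigma^{-1}
    M = m_flat.reshape(L, L)
    return (0.5 * Si @ M @ Si).ravel()

H = LinearOperator((L * L, L * L), matvec=hvp)
step, info = cg(H, -grad.ravel())             # Krylov solve for the Newton direction
dSigma = step.reshape(L, L)

# For this KL-only objective the solve has the closed form -2 Sigma grad Sigma,
# which we can use to confirm the Krylov solution (up to the solver tolerance).
print(info == 0, np.allclose(dSigma, -2 * Sigma @ grad @ Sigma, atol=1e-3))
```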

We evaluated the following parameterizations for Σ:

  • $\Sigma$ (optimized directly):
$$\nabla_\Sigma D = \tfrac{1}{2}(\Lambda - \Sigma^{-1}), \qquad \nabla_\Sigma\langle\nabla_\Sigma D,\,M\rangle = \tfrac{1}{2}\,\Sigma^{-1}M\,\Sigma^{-1} \tag{21}$$
  • $\Sigma \approx XX^\top$:
$$\nabla_X D = \Lambda X - X^{+\top}, \qquad \nabla_X\langle\nabla_X D,\,M\rangle = \Lambda M + X^{+\top}M^\top X^{+\top} - (I_L - XX^+)\,M\,X^+X^{+\top} \tag{22}$$
  • $\Sigma \approx A\,\mathrm{diag}[v]\,A^\top$:
$$\nabla_v D = \tfrac{1}{2}\left\{\mathrm{diag}[A^\top\Lambda A] - \tfrac{1}{v}\right\}, \qquad [\mathbf{H}_v D]\,u = \tfrac{1}{2}\,\mathrm{diag}\!\left[\tfrac{1}{v^2}\right]u \tag{23}$$
  • $\Sigma \approx [\Lambda + \mathrm{diag}[p]]^{-1}$:
$$\nabla_p D = \tfrac{1}{2}\,\Sigma^2 p = \tfrac{1}{2}\,\mathrm{diag}\!\left[\Sigma\,\mathrm{diag}[p]\,\Sigma\right] \tag{24}$$
  • $\Sigma \approx F^\top QQ^\top F$:
$$\nabla_Q D = F\Lambda F^\top Q - Q^{-\top}, \qquad \nabla_Q\langle\nabla_Q D,\,M\rangle = F\Lambda F^\top M + Q^{-\top}M^\top Q^{-\top} \tag{25}$$

In future notes, we will consider the full derivatives required for variational latent Gaussian-process inference for the Poisson and probit generalized linear models.
