Monday, March 30, 2020

Gradient and Hessian for variational inference in Poisson and probit Generalized Linear Models

These notes contain some derivations for variational inference in Poisson and probit Generalized Linear Models (GLMs) with a Gaussian prior and an approximating Gaussian posterior. (See also here.)

Problem statement

Consider a population of neurons with firing intensities $\lambda=\rho(\theta)$, where $\rho(\cdot)$ is a firing-rate nonlinearity and $\theta$ is a vector of synaptic activations (the amount of input drive to each neuron). For stochastic models of spiking $\Pr(y|\theta)$ in the canonical exponential family, the probability of observing spikes $y$ given $\theta$ can be written as

$$\ln\Pr(y|\theta)=y^\top\theta-1^\top A(\theta)+\text{constant},\tag{1}$$

where $A(\cdot)$ is a known function whose derivative equals the firing-rate nonlinearity, i.e. $A'(\cdot)=\rho(\cdot)$.
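For example, for a single neuron with $\rho=\exp$ we have $A(\theta)=e^{\theta}$, and (1) is just the familiar Poisson log-likelihood,

$$\ln\Pr(y|\theta)=y\,\theta-e^{\theta}-\ln y!\;=\;y\,\theta-A(\theta)+\text{constant},\qquad A'(\theta)=e^{\theta}=\rho(\theta).$$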

Assume that the synaptic activations $\theta$ are driven by shared latent variables $z$ with a Gaussian prior $z\sim\mathcal{N}(\mu_z,\Sigma_z)$. Let $\theta=Bz$, where $B$ is a matrix of coupling coefficients which determines how the latent factors $z$ drive each neuron.

We want to infer the distribution of $z$ from observed spikes $y$. The posterior is given by Bayes' rule, $\Pr(z|y)=\Pr(y|z)\Pr(z)/\Pr(y)$. However, this posterior does not admit a closed form if $A(\cdot)$ is nonlinear. Instead, one can use a variational Bayesian approach to obtain an approximate posterior.

Variational Bayes

In variational Bayes, the posterior on $z$ is approximated as Gaussian, i.e. $\Pr(z|y)\approx Q(z)$, where $Q(z)=\mathcal{N}(\mu_q,\Sigma_q)$. We then optimize $\mu_q$ and $\Sigma_q$ to minimize the Kullback-Leibler (KL) divergence from the true posterior $\Pr(z|y)$ to $Q(z)$. This is equivalent to minimizing the KL divergence $D_{\mathrm{KL}}[Q(z)\,\|\Pr(z)]$ from the prior to the posterior, while maximizing the expected log-likelihood $\langle\ln\Pr(y|z)\rangle$:

$$D_{\mathrm{KL}}[Q(z)\,\|\Pr(z|y)]=D_{\mathrm{KL}}[Q(z)\,\|\Pr(z)]-\langle\ln\Pr(y|z)\rangle+\text{constant}.\tag{2}$$

(In these notes, all expectations are taken with respect to the approximating posterior distribution.)
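For completeness, (2) follows by expanding the KL divergence and substituting Bayes' rule $\Pr(z|y)=\Pr(y|z)\Pr(z)/\Pr(y)$:

$$D_{\mathrm{KL}}[Q(z)\,\|\Pr(z|y)]=\Big\langle\ln\frac{Q(z)}{\Pr(z|y)}\Big\rangle=\Big\langle\ln\frac{Q(z)}{\Pr(z)}\Big\rangle-\langle\ln\Pr(y|z)\rangle+\ln\Pr(y),$$

where the evidence term $\ln\Pr(y)$ does not depend on $(\mu_q,\Sigma_q)$ and is absorbed into the constant.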

Since both $Q(z)$ and $\Pr(z)$ are multivariate Gaussian, the KL divergence $D_{\mathrm{KL}}[Q(z)\,\|\Pr(z)]$ has the closed form:

$$D_{\mathrm{KL}}[Q(z)\,\|\Pr(z)]=\tfrac{1}{2}\left\{(\mu_z-\mu_q)^\top\Sigma_z^{-1}(\mu_z-\mu_q)+\operatorname{tr}(\Sigma_z^{-1}\Sigma_q)-\ln|\Sigma_z^{-1}\Sigma_q|\right\}+\text{constant}.\tag{3}$$

For our choice of the canonically-parameterized natural exponential family, the expected negative log-likelihood can be written as:

$$-\langle\ln\Pr(y|z)\rangle=1^\top\langle A(\theta)\rangle-y^\top B\mu_q+\text{constant}.\tag{4}$$

Neglecting constants and terms that do not depend on $(\mu_q,\Sigma_q)$, the overall loss function to be minimized is:

$$\mathcal{L}(\mu_q,\Sigma_q)=\tfrac{1}{2}\left\{(\mu_z-\mu_q)^\top\Sigma_z^{-1}(\mu_z-\mu_q)+\operatorname{tr}(\Sigma_z^{-1}\Sigma_q)-\ln|\Sigma_z^{-1}\Sigma_q|\right\}+1^\top\langle A(\theta)\rangle-y^\top B\mu_q.\tag{5}$$
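For concreteness, here is a minimal NumPy sketch of (5) for the exponential nonlinearity, using the Gaussian moments of $\theta$ and the closed form $\langle A(\theta)\rangle=\exp(\mu_\theta+\sigma_\theta^2/2)$ derived in the next sections. The function name and argument layout are mine, not from the original note.

```python
import numpy as np

def elbo_loss_poisson(mu_q, Sigma_q, mu_z, Sigma_z, B, y):
    """Loss (5) for rho = exp, where <A(theta)> = exp(mu_theta + sigma2_theta / 2)."""
    # Moments of theta = B z under the approximating posterior, as in (6)
    mu_theta = B @ mu_q
    sigma2_theta = np.diag(B @ Sigma_q @ B.T)
    # Gaussian KL terms (constants dropped), as in (3)
    d = mu_z - mu_q
    Sz_inv = np.linalg.inv(Sigma_z)
    kl = 0.5 * (d @ Sz_inv @ d
                + np.trace(Sz_inv @ Sigma_q)
                - np.linalg.slogdet(Sz_inv @ Sigma_q)[1])
    # Expected negative log-likelihood terms, as in (4)
    nll = np.sum(np.exp(mu_theta + 0.5 * sigma2_theta)) - y @ mu_theta
    return kl + nll
```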

Closed-form expectations

To optimize (5), we need to differentiate it with respect to $\mu_q$ and $\Sigma_q$. These derivatives are mostly straightforward, but the expectation $\langle A(\theta)\rangle$ poses difficulties when $A(\cdot)$ is nonlinear. We'll consider some choices of firing-rate nonlinearity for which the derivatives of $\langle A(\theta)\rangle$ have closed-form expressions when $\theta$ is Gaussian.

Because we've assumed a Gaussian posterior on our latent state $z$, and since $\theta=Bz$, the synaptic activations $\theta$ are also Gaussian. The vectors $\mu_\theta$ and $\sigma_\theta^2$ for the mean and variance of $\theta$, respectively, are:

$$\mu_\theta=B\mu_q,\qquad \sigma_\theta^2=\operatorname{diag}[B\Sigma_q B^\top].\tag{6}$$

Consider a single, scalar $\theta\sim\mathcal{N}(\mu,\sigma^2)$. Using the chain rule and linearity of expectation, one can show that the partial derivatives $\partial_\mu\langle A(\theta)\rangle$ and $\partial_{\sigma^2}\langle A(\theta)\rangle$, with respect to $\mu$ and $\sigma^2$ respectively, are:

$$\begin{aligned}
\partial_\mu\langle A(\theta)\rangle&=\langle A'(\theta)\rangle=\langle\rho(\theta)\rangle\\
\partial_{\sigma^2}\langle A(\theta)\rangle&=\tfrac{1}{2\sigma^2}\langle(\theta-\mu)A'(\theta)\rangle=\tfrac{1}{2\sigma^2}\langle(\theta-\mu)\rho(\theta)\rangle.
\end{aligned}\tag{7}$$

For more compact notation, denote the expected firing rate as $\bar\lambda=\langle\rho(\theta)\rangle$, and denote the expected derivative of the firing rate in $\theta$ as $\bar\lambda'=\langle\rho'(\theta)\rangle$. Note that $\bar\lambda=\partial_\mu\langle A(\theta)\rangle$ and $\tfrac{1}{2}\bar\lambda'=\partial_{\sigma^2}\langle A(\theta)\rangle$.

Closed-form expressions for $\bar\lambda$ and $\bar\lambda'$ exist only in some special cases, for example if the firing-rate function $\rho(\cdot)$ is a (rectified) polynomial. We consider two choices of firing-rate nonlinearity which admit closed-form expressions, "exponential" and "probit".

- Choosing $\rho=\exp$ corresponds to a Poisson GLM. In this case, $\bar\lambda=\bar\lambda'=\exp(\mu+\sigma^2/2)$.
- Let $\phi(\cdot)$ and $\Phi(\cdot)$ denote the probability density and cumulative distribution function, respectively, of a standard normal distribution. Choosing $\rho=\Phi$ corresponds to a probit GLM. In this case, $\bar\lambda=\Phi(\gamma\mu)$ and $\bar\lambda'=\gamma\,\phi(\gamma\mu)$, where $\gamma=(1+\sigma^2)^{-1/2}$.
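As a quick numerical sanity check of these closed forms (my own check, not part of the original derivation), compare against Monte Carlo estimates:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu, sigma2 = 0.3, 0.7
theta = rng.normal(mu, np.sqrt(sigma2), size=1_000_000)

# Exponential (Poisson) case: lambda_bar = lambda_bar' = exp(mu + sigma2/2)
print(np.exp(theta).mean(), np.exp(mu + sigma2 / 2))

# Probit case: lambda_bar = Phi(gamma * mu), lambda_bar' = gamma * phi(gamma * mu),
# with gamma = (1 + sigma2) ** -0.5
gamma = (1 + sigma2) ** -0.5
print(norm.cdf(theta).mean(), norm.cdf(gamma * mu))          # lambda_bar
print(norm.pdf(theta).mean(), gamma * norm.pdf(gamma * mu))  # lambda_bar'
```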

For the probit firing-rate nonlinearity, we will also need to know $\partial_{\sigma^2}\langle\rho'(\theta)\rangle$ to calculate the Hessian-vector product. In this case $\rho'=\phi$. We have from (7) (which holds with any smooth function in place of $A$) that $\partial_{\sigma^2}\langle\phi(\theta)\rangle=\tfrac{1}{2\sigma^2}\langle(\theta-\mu)\phi'(\theta)\rangle=\tfrac{1}{2\sigma^2}\langle\theta(\mu-\theta)\phi(\theta)\rangle$, since $\phi'(\theta)=-\theta\,\phi(\theta)$. This can be solved by writing the expectation as an integral and completing the square in the resulting Gaussian integral, yielding:

$$\partial_{\sigma^2}\langle\phi(\theta)\rangle=\frac{(u-1)\,e^{-u/2}}{\sqrt{8\pi}\,(1+\sigma^2)^{3/2}},\qquad\text{where}\quad u=\frac{\mu^2}{\sigma^2+1}.\tag{8}$$
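Equation (8) is easy to verify numerically. Here is a small finite-difference check (my own, with hypothetical helper names), using the closed form $\langle\phi(\theta)\rangle=\gamma\,\phi(\gamma\mu)$ from above:

```python
import numpy as np
from scipy.stats import norm

def expected_phi(mu, sigma2):
    # <phi(theta)> for theta ~ N(mu, sigma2), i.e. lambda_bar' for the probit model
    gamma = (1 + sigma2) ** -0.5
    return gamma * norm.pdf(gamma * mu)

def d_expected_phi_dsigma2(mu, sigma2):
    # Equation (8)
    u = mu**2 / (sigma2 + 1)
    return (u - 1) * np.exp(-u / 2) / (np.sqrt(8 * np.pi) * (1 + sigma2) ** 1.5)

mu, sigma2, h = 0.4, 0.9, 1e-6
fd = (expected_phi(mu, sigma2 + h) - expected_phi(mu, sigma2 - h)) / (2 * h)
print(fd, d_expected_phi_dsigma2(mu, sigma2))  # should agree to several decimal places
```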

Derivatives of the loss function

With these preliminaries out of the way, we can now consider the derivatives of (5) with respect to $\mu_q$ and $\Sigma_q$.

Derivatives in $\mu_q$

The gradient and Hessian of $\mathcal{L}$ with respect to $\mu_q$ are:

$$\begin{aligned}
\nabla_{\mu_q}\mathcal{L}&=\Sigma_z^{-1}(\mu_q-\mu_z)+B^\top(\bar\lambda-y)\\
\mathbf{H}_{\mu_q}\mathcal{L}&=\Sigma_z^{-1}+B^\top\operatorname{diag}[\bar\lambda']B
\end{aligned}\tag{9}$$
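A minimal NumPy sketch of (9), specialized to the exponential nonlinearity where $\bar\lambda=\bar\lambda'=\exp(\mu_\theta+\sigma_\theta^2/2)$ (the function name is mine, not from the original note):

```python
import numpy as np

def grad_hess_mu_q(mu_q, Sigma_q, mu_z, Sigma_z, B, y):
    """Gradient and Hessian (9) of the loss in mu_q, for rho = exp."""
    mu_theta = B @ mu_q
    sigma2_theta = np.diag(B @ Sigma_q @ B.T)
    lam_bar = np.exp(mu_theta + 0.5 * sigma2_theta)   # equals lam_bar' for rho = exp
    Sz_inv = np.linalg.inv(Sigma_z)
    grad = Sz_inv @ (mu_q - mu_z) + B.T @ (lam_bar - y)
    hess = Sz_inv + B.T @ (lam_bar[:, None] * B)      # B' diag(lam_bar') B
    return grad, hess
```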

Gradient in $\Sigma_q$

The gradient of (5) in $\Sigma_q$ is more involved. The derivative of the term $\tfrac{1}{2}\{\operatorname{tr}(\Sigma_z^{-1}\Sigma_q)-\ln|\Sigma_z^{-1}\Sigma_q|\}$ can be obtained using identities provided in The Matrix Cookbook. The derivative of $1^\top\langle A(\theta)\rangle$ can be obtained by considering derivatives with respect to individual elements of $\Sigma_q$, and is $\tfrac{1}{2}B^\top\operatorname{diag}[\bar\lambda']B$. Overall, we find that:

$$\partial_{\Sigma_q}\mathcal{L}=\tfrac{1}{2}\left\{\Sigma_z^{-1}-\Sigma_q^{-1}+B^\top\operatorname{diag}[\bar\lambda']B\right\}.\tag{10}$$
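The corresponding sketch for (10), again for the exponential nonlinearity (function name mine); a finite-difference check of $\mathcal{L}$ against this expression is a useful test:

```python
import numpy as np

def grad_Sigma_q(mu_q, Sigma_q, mu_z, Sigma_z, B):
    """Gradient (10) of the loss in Sigma_q, for rho = exp (so lam_bar' = lam_bar)."""
    mu_theta = B @ mu_q
    sigma2_theta = np.diag(B @ Sigma_q @ B.T)
    lam_bar_prime = np.exp(mu_theta + 0.5 * sigma2_theta)
    return 0.5 * (np.linalg.inv(Sigma_z) - np.linalg.inv(Sigma_q)
                  + B.T @ (lam_bar_prime[:, None] * B))
```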

Hessian-vector product in $\Sigma_q$

Since $\Sigma_q$ is a matrix, the Hessian of (5) in $\Sigma_q$ is a fourth-order tensor. It is simpler to work with the Hessian-vector product. Here, the "vector" is a covariance matrix $M$ to be optimized. Writing $J_{\Sigma_q}=\partial_{\Sigma_q}\mathcal{L}$ for the gradient (10), the Hessian-vector product is given by the following identity:

$$\langle\mathbf{H}_{\Sigma_q},M\rangle=\partial_{\Sigma_q}\langle J_{\Sigma_q},M\rangle=\partial_{\Sigma_q}\operatorname{tr}[J_{\Sigma_q}M],\tag{11}$$

where $\langle\cdot,\cdot\rangle$ denotes the scalar (Frobenius) product. The Hessian-vector product for the terms $\Sigma_z^{-1}-\Sigma_q^{-1}$ in (10) can be obtained using identities provided in The Matrix Cookbook:

$$\partial_{\Sigma_q}\operatorname{tr}\!\big[\{\Sigma_z^{-1}-\Sigma_q^{-1}\}M\big]=\Sigma_q^{-1}M\Sigma_q^{-1}.\tag{12}$$

The Hessian-vector product for the term $B^\top\operatorname{diag}[\bar\lambda']B$ in (10) is more complicated. We can write

$$\partial_{\Sigma_q}\operatorname{tr}\!\big[\{B^\top\operatorname{diag}[\bar\lambda']B\}M\big]=\partial_{\Sigma_q}\operatorname{tr}\!\big[BMB^\top\operatorname{diag}[\bar\lambda']\big]=B^\top\operatorname{diag}[BMB^\top]\operatorname{diag}\!\big[\partial_{\sigma_\theta^2}\langle\rho'(\theta)\rangle\big]B.\tag{13}$$

The first step in (13) uses the fact that the trace is invariant under cyclic permutations. The second step follows from Lemma 1 (Appendix, below), with $C=BMB^\top$ and using the fact that $\bar\lambda'=\langle\rho'(\theta)\rangle$. In general, the Hessian-vector product in $\Sigma_q$ is

$$\langle\mathbf{H}_{\Sigma_q},M\rangle=\tfrac{1}{2}\big\{\Sigma_q^{-1}M\Sigma_q^{-1}+B^\top\operatorname{diag}[BMB^\top]\operatorname{diag}\!\big[\partial_{\sigma_\theta^2}\langle\rho'(\theta)\rangle\big]B\big\}.\tag{14}$$

For the exponential firing-rate nonlinearity, $\partial_{\sigma_\theta^2}\langle\rho'(\theta)\rangle=\tfrac{1}{2}\bar\lambda$. The solution for the probit firing-rate nonlinearity is given in (8).
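Here is a sketch of the Hessian-vector product (14) for the exponential case, where $\partial_{\sigma_\theta^2}\langle\rho'(\theta)\rangle=\tfrac{1}{2}\bar\lambda$ (function name mine; a finite-difference check against the gradient (10) is a good way to test it):

```python
import numpy as np

def hvp_Sigma_q(M, mu_q, Sigma_q, B):
    """Hessian-vector product (14) in Sigma_q, for rho = exp."""
    mu_theta = B @ mu_q
    sigma2_theta = np.diag(B @ Sigma_q @ B.T)
    lam_bar = np.exp(mu_theta + 0.5 * sigma2_theta)
    Sq_inv = np.linalg.inv(Sigma_q)
    bmb_diag = np.diag(B @ M @ B.T)                   # diag[B M B']
    d = bmb_diag * 0.5 * lam_bar                      # diag[B M B'] diag[d<rho'>/d sigma2]
    return 0.5 * (Sq_inv @ M @ Sq_inv + B.T @ (d[:, None] * B))
```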

Conclude

That's all for now! I'll need to integrate these with the various other derivations (e.g. see also here).

Appendix

Lemma 1

(We use Einstein summation to simplify the notation; the chain-rule step uses $\partial_{\Sigma_{q,ij}}\sigma^2_{\theta,k}=B_{ki}B_{kj}$, which follows from $\sigma^2_{\theta,k}=(B\Sigma_q B^\top)_{kk}$.)

$$\begin{aligned}
\partial_{\Sigma_{q,ij}}\operatorname{tr}\!\big[C\operatorname{diag}[\langle f(\theta)\rangle]\big]
&=\partial_{\Sigma_{q,ij}}\big[C\operatorname{diag}[\langle f(\theta)\rangle]\big]_{kk}\\
&=\partial_{\Sigma_{q,ij}}\big[C_{km}\operatorname{diag}[\langle f(\theta)\rangle]_{mk}\big]\\
&=\partial_{\Sigma_{q,ij}}\big[C_{kk}\langle f(\theta_k)\rangle\big]\\
&=C_{kk}\,\partial_{\Sigma_{q,ij}}\langle f(\theta_k)\rangle\\
&=C_{kk}\,B_{ki}\,\partial_{\sigma^2_{\theta,k}}\langle f(\theta_k)\rangle\,B_{kj}\\
&=B_{ki}\,C_{kk}\,\partial_{\sigma^2_{\theta,k}}\langle f(\theta_k)\rangle\,B_{kj}\\
&=\big\{B^\top\operatorname{diag}[C]\operatorname{diag}\!\big[\partial_{\sigma_\theta^2}\langle f(\theta)\rangle\big]B\big\}_{ij}\\[4pt]
\therefore\quad \partial_{\Sigma_q}\operatorname{tr}\!\big[C\operatorname{diag}[\langle f(\theta)\rangle]\big]
&=B^\top\operatorname{diag}[C]\operatorname{diag}\!\big[\partial_{\sigma_\theta^2}\langle f(\theta)\rangle\big]B
\end{aligned}\tag{15}$$
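Lemma 1 can also be checked numerically by finite differences; here is a sketch of such a check (my own, with $f=\exp$ so that $\partial_{\sigma^2}\langle f(\theta)\rangle=\tfrac{1}{2}\langle f(\theta)\rangle$):

```python
import numpy as np

rng = np.random.default_rng(1)
n_neurons, n_latent = 5, 3
B = rng.normal(size=(n_neurons, n_latent))
mu_q = rng.normal(size=n_latent)
A = rng.normal(size=(n_latent, n_latent))
Sigma_q = A @ A.T + np.eye(n_latent)
C = rng.normal(size=(n_neurons, n_neurons))  # only its diagonal enters the trace

def expected_f(Sigma):
    # <f(theta)> elementwise, with f = exp and theta ~ N(B mu_q, diag[B Sigma B'])
    return np.exp(B @ mu_q + 0.5 * np.diag(B @ Sigma @ B.T))

def objective(Sigma):
    return np.trace(C @ np.diag(expected_f(Sigma)))

# Lemma 1: gradient = B' diag[C] diag[d<f>/d sigma2] B, with d<f>/d sigma2 = <f>/2 here
lemma = B.T @ ((np.diag(C) * 0.5 * expected_f(Sigma_q))[:, None] * B)

# Elementwise finite differences over Sigma_q
h = 1e-6
fd = np.zeros_like(Sigma_q)
for i in range(n_latent):
    for j in range(n_latent):
        E = np.zeros_like(Sigma_q)
        E[i, j] = h
        fd[i, j] = (objective(Sigma_q + E) - objective(Sigma_q - E)) / (2 * h)
print(np.max(np.abs(fd - lemma)))  # should be small
```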

Wednesday, March 25, 2020

Note: Derivatives of Gaussian KL-Divergence for some parameterizations of the posterior covariance for variational Gaussian-process inference

These notes provide the derivatives of the KL divergence $D_{\mathrm{KL}}[Q(z)\,\|P(z)]$ between two multivariate Gaussian distributions $Q(z)$ and $P(z)$ with respect to a few parameterizations $\theta$ of the covariance matrix $\Sigma(\theta)$ of $Q$. This is useful for variational Gaussian process inference, where clever parameterizations of the posterior covariance are required to make the problem tractable. Tables for differentiating matrix-valued functions can be found in The Matrix Cookbook.