Monday, March 30, 2020

Gradient and Hessian for variational inference in Poisson and probit Generalized Linear Models

These notes contain some derivations for variational inference in Poisson and probit Generalized Linear Models (GLMs) with a Gaussian prior and an approximate Gaussian posterior. (See also here.)

Problem statement

Consider a population of neurons with firing-intensities $\boldsymbol\lambda =\rho(\boldsymbol\theta)$, where $\rho(\cdot)$ is a firing-rate nonlinearity and $\boldsymbol\theta$ is a vector of synaptic activations (amount of input drive to each neuron). For stochastic models of spiking $\Pr(y|\theta)$ in the canonical exponential family, the probability of observing spikes $\mathbf y$ given $\boldsymbol\theta$ can be written as \begin{equation}\begin{aligned}\ln\Pr(\mathbf y|\boldsymbol\theta) &=\mathbf y^\top\boldsymbol\theta-\mathbf 1^\top A(\boldsymbol\theta)+\text{constant},\end{aligned}\end{equation}

where $A(\cdot)$ is a known function whose derivative equals the firing-rate nonlinearity, i.e. $A'(\cdot) =\rho(\cdot)$.

Assume that the synaptic activations $\boldsymbol\theta$ are driven by shared latent variables $\mathbf z$ with a Gaussian prior $\mathbf z\sim\mathcal N(\boldsymbol\mu_z,\boldsymbol\Sigma_z)$. Let $\boldsymbol\theta=\mathbf B\mathbf z$, where "$\mathbf B$" is a matrix of coupling coefficients which determine how the latent factors $\mathbf z$ drive each neuron.
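
As a concrete illustration, here is a minimal simulation of this generative model for the Poisson case with $\rho=\exp$ (derived below). The dimensions, variable names, and random coupling matrix are placeholders, not part of the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
L, N = 3, 20                        # latent dimension, number of neurons (illustrative)

mu_z    = np.zeros(L)               # prior mean of z
Sigma_z = np.eye(L)                 # prior covariance of z
B       = rng.normal(size=(N, L))   # coupling from latent factors to neurons

z     = rng.multivariate_normal(mu_z, Sigma_z)   # latent state
theta = B @ z                                    # synaptic activations
lam   = np.exp(theta)                            # firing intensities rho(theta)
y     = rng.poisson(lam)                         # observed spike counts
```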

We want to infer the distribution of $\mathbf z$ from observed spikes $\mathbf y$. The posterior is given by Bayes rule, $\Pr(\mathbf z |\mathbf y) =\Pr(\mathbf y |\mathbf z)\Pr(\mathbf z)/\Pr(\mathbf y)$. However, this posterior does not admit a closed form if $A(\cdot)$ is nonlinear. Instead, one can use a variational Bayesian approach to obtain an approximate posterior.

Variational Bayes

In variational Bayes, the posterior on $\mathbf z$ is approximated as Gaussian, i.e. $\Pr(\mathbf z |\mathbf y)\approx Q(\mathbf z)$, where $Q(\mathbf z) =\mathcal N(\boldsymbol\mu_q,\boldsymbol\Sigma_q)$. We then optimize $\boldsymbol\mu_q$ and $\boldsymbol\Sigma_q$ to minimize the Kullback-Leibler (KL) divergence $D_{\text{KL}}\left[ Q(\mathbf z)\|\Pr(\mathbf z |\mathbf y)\right]$ of the approximation from the true posterior. This is equivalent to minimizing the KL divergence $D_{\text{KL}}\left[ Q(\mathbf z)\|\Pr(\mathbf z)\right]$ of the approximation from the prior, while maximizing the expected log-likelihood $\left<\ln\Pr(\mathbf y |\mathbf z)\right>$: \begin{equation}\begin{aligned}D_{\text{KL}}\left[ Q(\mathbf z)\|\Pr(\mathbf z |\mathbf y)\right]&=D_{\text{KL}}\left[ Q(\mathbf z)\|\Pr(\mathbf z)\right]-\left<\ln\Pr(\mathbf y |\mathbf z)\right>+\text{constant}.\end{aligned}\end{equation}

(In these notes, all expectations $\langle\cdot\rangle$ are taken with respect to the approximating posterior distribution.)

Since both $Q(\mathbf z)$ and $\Pr(\mathbf z)$ are multivariate Gaussian, the KL divergence $D_{\text{KL}}\left[ Q(\mathbf z)\|\Pr(\mathbf z)\right]$ has the closed form: \begin{equation}\begin{aligned} D_{\text{KL}}\left[ Q(\mathbf z)\|\Pr(\mathbf z)\right] &= \tfrac 1 2\left\{ (\boldsymbol\mu_z-\boldsymbol\mu_q)^\top \boldsymbol\Sigma_z^{-1} (\boldsymbol\mu_z-\boldsymbol\mu_q) + \operatorname{tr}\left( \boldsymbol\Sigma_z^{-1} \boldsymbol\Sigma_q \right) - \ln| \boldsymbol\Sigma_z^{-1} \boldsymbol\Sigma_q | \right\} +\text{constant}. \end{aligned}\end{equation}
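
For reference, a small sketch of this closed form in code (dropping the additive constant); the function name and conventions are illustrative.

```python
import numpy as np

def gaussian_kl(mu_q, Sigma_q, mu_z, Sigma_z):
    """D_KL[Q || P] for Q = N(mu_q, Sigma_q), P = N(mu_z, Sigma_z), up to a constant."""
    d      = mu_z - mu_q
    Piz    = np.linalg.inv(Sigma_z)                  # prior precision
    quad   = d @ Piz @ d                             # (mu_z - mu_q)^T Sigma_z^-1 (mu_z - mu_q)
    tr     = np.trace(Piz @ Sigma_q)                 # tr(Sigma_z^-1 Sigma_q)
    logdet = np.linalg.slogdet(Sigma_q)[1] - np.linalg.slogdet(Sigma_z)[1]
    return 0.5 * (quad + tr - logdet)
```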

For our choice of the canonically-parameterized natural exponential family, the expected negative log-likelihood can be written as: \begin{equation}\begin{aligned}-\langle\ln\Pr(\mathbf y|\mathbf z)\rangle&=\mathbf 1^\top\langle A(\boldsymbol\theta)\rangle-\mathbf y^\top\mathbf B\boldsymbol\mu_q+\text{constant}.\end{aligned}\end{equation}

Neglecting constants and terms that do not depend on $(\boldsymbol\mu_q,\boldsymbol\Sigma_q)$, the overall loss function to be minimized is: \begin{equation}\begin{aligned} \mathcal L(\boldsymbol\mu_q,\boldsymbol\Sigma_q) &= \tfrac 1 2\left\{ (\boldsymbol\mu_z-\boldsymbol\mu_q)^\top \boldsymbol\Sigma_z^{-1} (\boldsymbol\mu_z-\boldsymbol\mu_q) + \operatorname{tr}\left( \boldsymbol\Sigma_z^{-1} \boldsymbol\Sigma_q \right) - \ln|\boldsymbol\Sigma_z^{-1}\boldsymbol\Sigma_q|\right\} +\mathbf 1^\top\langle A(\boldsymbol\theta)\rangle -\mathbf y^\top\mathbf B\boldsymbol\mu_q \end{aligned} \label{loss}. \end{equation}
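
This loss can be evaluated numerically as a sanity check. The sketch below reuses `gaussian_kl` from the snippet above and estimates $\mathbf 1^\top\langle A(\boldsymbol\theta)\rangle$ by Monte Carlo, which works for any nonlinearity; the closed forms derived below replace the sampling step. The defaults (the Poisson case $A=\exp$, the sample count) are illustrative.

```python
import numpy as np

def loss(mu_q, Sigma_q, mu_z, Sigma_z, B, y, A=np.exp, n_samples=100_000, seed=0):
    rng    = np.random.default_rng(seed)
    mu_th  = B @ mu_q                                   # mean of theta
    var_th = np.einsum('ij,jk,ik->i', B, Sigma_q, B)    # diag(B Sigma_q B^T)
    # Monte-Carlo estimate of <A(theta)>; only the per-neuron marginals matter here
    theta  = mu_th + np.sqrt(var_th) * rng.standard_normal((n_samples, len(mu_th)))
    expA   = A(theta).mean(axis=0)
    return gaussian_kl(mu_q, Sigma_q, mu_z, Sigma_z) + expA.sum() - y @ mu_th
```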

Closed-form expectations

To optimize $\eqref{loss}$, we need to differentiate it in $\boldsymbol\mu_q$ and $\boldsymbol\Sigma_q$. These derivatives are mostly straightforward, but the expectation $\langle A(\boldsymbol\theta)\rangle$ poses difficulties when $A(\cdot)$ is nonlinear. We'll consider some choices of firing-rate nonlinearity for which the derivatives of $\langle A(\boldsymbol\theta)\rangle$ have closed-form expressions when $\boldsymbol\theta$ is Gaussian.

Because we've assumed a Gaussian posterior on our latent state $\mathbf z$, and since $\boldsymbol\theta =\mathbf B\mathbf z$, the synaptic activations $\boldsymbol\theta$ are also Gaussian. The vectors $\boldsymbol\mu_\theta$ and $\boldsymbol\sigma^2_\theta$ for the mean and variance of $\boldsymbol\theta$, respectively, are: \begin{equation}\begin{aligned}\boldsymbol\mu_\theta &=\mathbf B\boldsymbol\mu_q\\\boldsymbol\sigma^2_\theta &=\operatorname{diag}\left[\mathbf B\boldsymbol\Sigma_q\mathbf B^\top\right]\end{aligned}\end{equation}

Consider a single, scalar $\theta\sim\mathcal N(\mu,\sigma^2)$. Using the chain rule and linearity of expectation, one can show that the partial derivatives $\partial_{\mu}\langle A(\theta)\rangle$ and $\partial_{\sigma^2}\langle A(\theta)\rangle$, with respect to $\mu$ and $\sigma^2$ respectively, are: \begin{equation}\begin{aligned}\partial_{\mu}\langle A(\theta)\rangle&=\langle A'(\theta) \rangle= \langle \rho(\theta) \rangle \\\partial_{\sigma^2}\langle A(\theta)\rangle&= \tfrac1{2\sigma^2}\left<(\theta-\mu) A'(\theta)\right>=\tfrac1{2\sigma^2} \left<(\theta-\mu)\rho(\theta)\right>.\end{aligned}\label{dexpect}\end{equation}
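
These identities are easy to verify by sampling. A quick Monte-Carlo check for the exponential nonlinearity ($A=\rho=\exp$, whose closed forms are given below), with arbitrary illustrative values of $\mu$ and $\sigma^2$:

```python
import numpy as np

rng    = np.random.default_rng(1)
mu, s2 = 0.3, 0.5
th     = mu + np.sqrt(s2) * rng.standard_normal(2_000_000)   # samples of theta

d_mu = np.exp(th).mean()                           # <rho(theta)>
d_s2 = ((th - mu) * np.exp(th)).mean() / (2 * s2)  # (1/2 sigma^2) <(theta - mu) rho(theta)>

print(d_mu, np.exp(mu + s2 / 2))        # both should be close to exp(mu + s2/2)
print(d_s2, 0.5 * np.exp(mu + s2 / 2))  # both should be close to exp(mu + s2/2) / 2
```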

For more compact notation, denote the expected firing rate as $\bar\lambda=\langle\rho(\theta)\rangle$, and denote the expected derivative of the firing-rate in $\theta$ as $\bar\lambda'=\langle\rho'(\theta)\rangle$. Note that $\bar\lambda=\partial_{\mu}\langle A(\theta)\rangle$ and $\tfrac 1 2\bar\lambda'=\partial_{\sigma^2}\langle A(\theta)\rangle$.

Closed-form expressions for $\bar\lambda$ and $\bar\lambda'$ exist only in some special cases, for example if the firing-rate function $\rho(\cdot)$ is a (rectified) polynomial. We consider two choices of firing-rate nonlinearity which admit closed-form expressions, "exponential" and "probit".

- Choosing $\rho =\exp$ corresponds to a Poisson GLM. In this case, $\bar\lambda =\bar\lambda' =\exp(\mu +\sigma^2/2)$.
- Let $\phi(\cdot)$ and $\Phi(\cdot)$ denote the probability density and cumulative distribution functions, respectively, of the standard normal distribution. Choosing $\rho =\Phi$ corresponds to a probit GLM. In this case, $\bar\lambda =\Phi(\gamma\mu)$ and $\bar\lambda' =\gamma\phi(\gamma\mu)$, where $\gamma = (1+\sigma^2)^{-1/2}$.
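
A sketch of both closed forms in code, vectorized over the per-neuron means and variances of $\boldsymbol\theta$ (SciPy is assumed for the normal pdf/cdf; function names are illustrative):

```python
import numpy as np
from scipy.stats import norm

def rate_moments_exp(mu, s2):
    """Poisson GLM, rho = exp: lambda_bar = lambda_bar' = exp(mu + s2/2)."""
    lam_bar = np.exp(mu + 0.5 * s2)
    return lam_bar, lam_bar

def rate_moments_probit(mu, s2):
    """Probit GLM, rho = Phi: lambda_bar = Phi(g*mu), lambda_bar' = g*phi(g*mu),
    with g = (1 + s2)**-0.5."""
    g = 1.0 / np.sqrt(1.0 + s2)
    return norm.cdf(g * mu), g * norm.pdf(g * mu)
```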

For the probit firing-rate nonlinearity, we will also need to know $\partial_{\sigma^2}\langle\rho'(\boldsymbol\theta)\rangle$ to calculate the Hessian-vector product. In this case, $\rho'=\phi$. From $\eqref{dexpect}$ (applied with $\phi$ in place of $A$) and the identity $\phi'(\theta)=-\theta\,\phi(\theta)$, we have $\partial_{\sigma^2}\langle\phi(\theta)\rangle=\tfrac1{2\sigma^2}\left<\theta (\mu-\theta)\phi(\theta)\right>$. This can be solved by writing the expectation as an integral and completing the square in the resulting Gaussian integral, yielding: \begin{equation}\begin{aligned} \partial_{\sigma^2}\langle\phi(\theta)\rangle &= \frac {u-1} {\sqrt{8\pi e^u(1+\sigma^2)^3}} ,\text{ where }u=\frac{\mu^2}{\sigma^2+1}. \end{aligned} \label{probitrhoprimeexpectgradient} \end{equation}
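
A sketch of $\eqref{probitrhoprimeexpectgradient}$ in code, with a finite-difference check against $\langle\phi(\theta)\rangle=\gamma\phi(\gamma\mu)$; the test values are arbitrary.

```python
import numpy as np
from scipy.stats import norm

def d_expect_phi_d_s2(mu, s2):
    """d<phi(theta)>/d sigma^2 for theta ~ N(mu, s2)."""
    u = mu**2 / (s2 + 1.0)
    return (u - 1.0) / np.sqrt(8.0 * np.pi * np.exp(u) * (1.0 + s2)**3)

def expect_phi(mu, s2):
    """<phi(theta)> = gamma * phi(gamma * mu), gamma = (1 + s2)**-0.5."""
    g = 1.0 / np.sqrt(1.0 + s2)
    return g * norm.pdf(g * mu)

mu, s2, h = 0.7, 0.4, 1e-6
print(d_expect_phi_d_s2(mu, s2))
print((expect_phi(mu, s2 + h) - expect_phi(mu, s2 - h)) / (2 * h))  # should agree
```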

Derivatives of the loss function

With these preliminaries out of the way, we can now consider the derivatives of $\eqref{loss}$ in $\boldsymbol\mu_q$ and $\boldsymbol\Sigma_q$.

Derivatives in $\boldsymbol\mu_q$

The gradient and Hessian of $\mathcal L$ with respect to $\boldsymbol\mu_q$ are: \begin{equation}\begin{aligned} \partial_{\boldsymbol\mu_q} \mathcal L &= \boldsymbol\Sigma_z^{-1} (\boldsymbol\mu_q-\boldsymbol\mu_z) + \mathbf B^\top\left( \bar{\boldsymbol\lambda}-\mathbf y\right) \\ \operatorname H_{\boldsymbol\mu_q} \mathcal L &= \boldsymbol\Sigma_z^{-1} + \mathbf B^\top \operatorname{diag}[\bar{\boldsymbol\lambda}'] \mathbf B \end{aligned}\end{equation}
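
A sketch of these expressions in code; `rate_moments` stands for either closed-form helper sketched earlier (e.g. `rate_moments_exp` or `rate_moments_probit`), and all names are illustrative.

```python
import numpy as np

def grad_hess_mu(mu_q, Sigma_q, mu_z, Sigma_z, B, y, rate_moments):
    mu_th  = B @ mu_q                                   # mean of theta
    var_th = np.einsum('ij,jk,ik->i', B, Sigma_q, B)    # variance of theta
    lam_bar, lam_bar_prime = rate_moments(mu_th, var_th)
    Piz  = np.linalg.inv(Sigma_z)                       # prior precision
    grad = Piz @ (mu_q - mu_z) + B.T @ (lam_bar - y)
    hess = Piz + B.T @ (lam_bar_prime[:, None] * B)     # B^T diag(lam_bar') B
    return grad, hess
```

A Newton step for $\boldsymbol\mu_q$ would then be `mu_q - np.linalg.solve(hess, grad)`.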

Gradient in $\boldsymbol\Sigma_q$

The gradient of $\eqref{loss}$ in $\boldsymbol\Sigma_q$ is more involved. The derivative of the term $\tfrac 1 2\{ \operatorname{tr}(\boldsymbol\Sigma_z^{-1}\boldsymbol\Sigma_q)-\ln|\boldsymbol\Sigma_z^{-1}\boldsymbol\Sigma_q|\}$ can be obtained using identities provided in The Matrix Cookbook. The derivative of $\mathbf 1^\top\langle A (\boldsymbol\theta)\rangle$ can be obtained by considering derivatives with respect to individual elements of $\boldsymbol\Sigma_q$, and is $\tfrac12\mathbf B^\top\operatorname{diag}[\bar\lambda']\mathbf B$. Overall, we find that: \begin{equation}\begin{aligned} \partial_{\boldsymbol\Sigma_q}{\mathcal L}= \tfrac 1 2 \left\{ \boldsymbol\Sigma_z^{-1}- \boldsymbol\Sigma_q^{-\top}+ \mathbf B^\top\operatorname{diag}[\bar\lambda']\mathbf B \right\}. \end{aligned} \label{sigmagrad} \end{equation}
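
The corresponding sketch for $\eqref{sigmagrad}$ in code, with the same conventions as above:

```python
import numpy as np

def grad_Sigma(mu_q, Sigma_q, mu_z, Sigma_z, B, rate_moments):
    mu_th  = B @ mu_q
    var_th = np.einsum('ij,jk,ik->i', B, Sigma_q, B)    # diag(B Sigma_q B^T)
    _, lam_bar_prime = rate_moments(mu_th, var_th)
    return 0.5 * (np.linalg.inv(Sigma_z) - np.linalg.inv(Sigma_q)
                  + B.T @ (lam_bar_prime[:, None] * B))
```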

Hessian-vector product in $\boldsymbol\Sigma_q$

Since $\boldsymbol\Sigma_q$ is a matrix, the Hessian of $\eqref{loss}$ in $\boldsymbol\Sigma_q$ is a fourth-order tensor. It is simpler to work with the Hessian-vector product. Here, the "vector" is a matrix $\mathbf M$ with the same dimensions as $\boldsymbol\Sigma_q$ (e.g. a search direction in an iterative optimizer), and $\mathbf J_{\boldsymbol\Sigma_q}$ denotes the gradient $\eqref{sigmagrad}$. The Hessian-vector product is given by the following identity: \begin{equation}\begin{aligned} \langle\mathbf H_{\boldsymbol\Sigma_q},\mathbf M\rangle&= \partial_{\boldsymbol\Sigma_q}\langle \mathbf J_{\boldsymbol\Sigma_q},\mathbf M\rangle= \partial_{\boldsymbol\Sigma_q} \operatorname{tr}\left[ \mathbf J_{\boldsymbol\Sigma_q}^\top\mathbf M \right] \end{aligned}\end{equation}

where $\langle\cdot,\cdot\rangle$ denotes the scalar (Frobenius) product. The Hessian-vector product for the terms $\boldsymbol\Sigma_z^{-1}-\boldsymbol\Sigma_q^{-\top}$ in $\eqref{sigmagrad}$ can be obtained using identities provided in The Matrix Cookbook: \begin{equation}\begin{aligned} \partial_{\boldsymbol\Sigma_q} \operatorname{tr}\left[\left\{\boldsymbol\Sigma_z^{-1}-\boldsymbol\Sigma_q^{-\top}\right\}^\top\mathbf M\right]=\boldsymbol\Sigma_q^{-1} \mathbf M^\top \boldsymbol\Sigma_q^{-1}. \end{aligned}\end{equation}

The Hessian-vector product for the term $\mathbf B^\top\operatorname{diag}[\bar\lambda']\mathbf B$ in $\eqref{sigmagrad}$ is more complicated. We can write \begin{equation}\begin{aligned} \partial_{\boldsymbol\Sigma_q} \operatorname{tr}\left[ \left\{ \mathbf B^\top\operatorname{diag}[\bar\lambda']\mathbf B \right\}^\top\mathbf M \right]&= \partial_{\boldsymbol\Sigma_q} \operatorname{tr}\left[ \mathbf B\mathbf M\mathbf B^\top \operatorname{diag}[\bar\lambda'] \right]\\&= \mathbf B^\top \operatorname{diag}[\mathbf B\mathbf M\mathbf B^\top] \operatorname{diag} \left[ \partial_{\sigma^2_{\boldsymbol\theta}}\langle\rho'(\boldsymbol\theta)\rangle \right] \mathbf B. \end{aligned} \label{hvp2} \end{equation}

The first step in $\eqref{hvp2}$ uses the fact that the trace is invariant under cyclic permutations. The second step follows from Lemma 1 (Appendix, below), with $\mathbf C=\mathbf B\mathbf M\mathbf B^\top$ and using the fact that $\bar\lambda'=\langle\rho'(\theta)\rangle$. In general, the Hessian-vector product in $\boldsymbol\Sigma_q$ is \begin{equation}\begin{aligned}\langle\mathbf H_{\boldsymbol\Sigma_q},\mathbf M\rangle&= \tfrac12\left\{\boldsymbol\Sigma_q^{-1}\mathbf M^\top \boldsymbol\Sigma_q^{-1}+\mathbf B^\top\operatorname{diag}[\mathbf B\mathbf M\mathbf B^\top] \operatorname{diag}\left[\partial_{\sigma^2_{\boldsymbol\theta}}\langle\rho'(\boldsymbol\theta)\rangle \right]\mathbf B\right\}. \end{aligned}\end{equation}

For the exponential firing-rate nonlinearity, $\partial_{\sigma^2_{\boldsymbol\theta}}\langle\rho'(\boldsymbol\theta)\rangle=\tfrac 1 2\bar\lambda$. The solution for the probit firing-rate nonlinearity is given in $\eqref{probitrhoprimeexpectgradient}$.
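
A sketch of the Hessian-vector product in code; `d_rho_prime_d_s2` should return the per-neuron values of $\partial_{\sigma^2_{\boldsymbol\theta}}\langle\rho'(\boldsymbol\theta)\rangle$ (e.g. half the expected rate for the exponential nonlinearity, or `d_expect_phi_d_s2` above for probit). Names are illustrative.

```python
import numpy as np

def hvp_Sigma(mu_q, Sigma_q, B, M, d_rho_prime_d_s2):
    mu_th  = B @ mu_q
    var_th = np.einsum('ij,jk,ik->i', B, Sigma_q, B)
    w      = d_rho_prime_d_s2(mu_th, var_th)       # d<rho'(theta)>/d sigma^2, per neuron
    c      = np.einsum('ij,jk,ik->i', B, M, B)     # diag(B M B^T)
    Piq    = np.linalg.inv(Sigma_q)
    return 0.5 * (Piq @ M.T @ Piq + B.T @ ((c * w)[:, None] * B))
```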

Conclude

That's all for now! I'll need to integrate these with the various other derivations (e.g. see also here).

Appendix

Lemma 1

(We use Einstein summation to simplify the notation. Since $\sigma^2_{\theta_k}=\mathbf B_{kl}\boldsymbol\Sigma_{q,lm}\mathbf B_{km}$, we have $\partial_{\boldsymbol\Sigma_{q,ij}}\sigma^2_{\theta_k}=\mathbf B_{ki}\mathbf B_{kj}$.) \begin{equation}\begin{aligned} \partial_{\boldsymbol\Sigma_{q,ij}} \operatorname{tr} \left[ \mathbf C \operatorname{diag}\left[ \langle f(\boldsymbol\theta)\rangle \right] \right]&= \partial_{\boldsymbol\Sigma_{q,ij}} \left[ \mathbf C \operatorname{diag}\left[ \langle f(\boldsymbol\theta)\rangle \right] \right]_{kk}\\&= \partial_{\boldsymbol\Sigma_{q,ij}} \left[ \mathbf C_{kk} \langle f(\theta_k)\rangle \right]\\&= \mathbf C_{kk}\, \partial_{\boldsymbol\Sigma_{q,ij}} \langle f(\theta_k)\rangle\\&= \mathbf C_{kk}\, \mathbf B_{ki} \left[ \partial_{\sigma^2_{\theta_k}}\langle f(\theta_k)\rangle \right] \mathbf B_{kj}\\&= \left\{ \mathbf B^\top \operatorname{diag}[\mathbf C] \operatorname{diag} \left[ \partial_{\sigma^2_{\boldsymbol\theta}}\langle f(\boldsymbol\theta)\rangle \right] \mathbf B \right\}_{ij} \\ \\ \partial_{\boldsymbol\Sigma_q} \operatorname{tr} \left[ \mathbf C \operatorname{diag}\left[ \langle f(\boldsymbol\theta)\rangle \right] \right]&= \mathbf B^\top \operatorname{diag}[\mathbf C] \operatorname{diag} \left[ \partial_{\sigma^2_{\boldsymbol\theta}}\langle f(\boldsymbol\theta)\rangle \right] \mathbf B \end{aligned}\end{equation}
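
Lemma 1 can also be sanity-checked by finite differences. The sketch below uses the exponential nonlinearity, where $\langle f(\boldsymbol\theta)\rangle=\langle\rho'(\boldsymbol\theta)\rangle=\exp(\boldsymbol\mu_\theta+\boldsymbol\sigma^2_\theta/2)$; the dimensions and random test matrices are arbitrary.

```python
import numpy as np

rng  = np.random.default_rng(2)
N, L = 5, 3
B    = rng.normal(size=(N, L))
C    = rng.normal(size=(N, N))
mu_q = rng.normal(size=L)
S    = rng.normal(size=(L, L)); Sigma_q = S @ S.T + np.eye(L)

def objective(Sigma):
    s2 = np.einsum('ij,jk,ik->i', B, Sigma, B)        # sigma^2_theta
    return np.diag(C) @ np.exp(B @ mu_q + 0.5 * s2)   # tr[ C diag<f(theta)> ]

# Lemma 1 prediction: B^T diag[C] diag[d<f>/d sigma^2] B, with d<f>/d sigma^2 = <f>/2
s2       = np.einsum('ij,jk,ik->i', B, Sigma_q, B)
analytic = B.T @ ((np.diag(C) * 0.5 * np.exp(B @ mu_q + 0.5 * s2))[:, None] * B)

numeric = np.zeros((L, L)); h = 1e-6
for i in range(L):
    for j in range(L):
        E = np.zeros((L, L)); E[i, j] = h
        numeric[i, j] = (objective(Sigma_q + E) - objective(Sigma_q - E)) / (2 * h)

print(np.allclose(numeric, analytic, rtol=1e-4, atol=1e-8))  # expected: True
```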

Wednesday, March 25, 2020

Note: Derivatives of Gaussian KL-Divergence for some parameterizations of the posterior covariance for variational Gaussian-process inference

These notes provide the derivatives of the KL-divergence $D_{\text{KL}}\left[ Q(\mathbf z) \| P(\mathbf z)\right]$ between two multivariate Gaussian distributions $Q(\mathbf z)$ and $P(\mathbf z)$ with respect to a few parameterizations $\theta$ of the covariance matrix $\boldsymbol\Sigma(\theta)$ of $Q$. This is useful for variational Gaussian process inference, where clever parameterizations of the posterior covariance are required to make the problem tractable. Tables for differentiating matrix-valued functions can be found in The Matrix Cookbook .