Saturday, January 14, 2012

Bayesian approach to point-process generalized linear models

I've been learning about generalized linear models and Bayesian approaches for doing statistics on spike train data, in the Truccolo lab. Here are some notes on the subject. 

[get notes as a PDF]

In neuroscience, we are interested in the problem of how neurons encode, process, and communicate information. Neurons communicate over long distances using brief all-or-nothing events called spikes. We are often interested in how the spiking rate of a neuron depends on other variables, such as stimuli, motor output, or other ongoing signals in the brain. 

To model this, we consider spikes as events that occur at a point in time with an underlying variable rate, or conditional intensity, λ. There are many approaches to estimating λ. These notes cover point-process generalized linear models, and Bayesian approaches. These are closely related, and in some cases the same thing.

Point-process generalized linear models

In point-process Generalized Linear Models (GLMs) we estimate ln(λ) (or logit(λ)), which can take on any real value. The GLM estimates ln(λ) directly as a linear combination ln(λ) = Σ_i a_i x_i of some covariates x_i. The coefficients a_i are optimized by maximizing the likelihood.

For computational tractability we work with a discretized version of the point process. A point process can be converted into a series of non-negative integers by counting the number of events y that occur in a fixed time interval Δ. If Δ is small, such that λ is approximately constant within the interval, then the distribution of this new count process can be written as:

Pr(y=k) = (λΔ)^k (1 − λΔ)^(1−k),   k ∈ {0, 1}

If we choose Δ sufficiently small such that Pr(y>1) ≈ 0, we can restrict analysis to the approximation

Pr(y=1) = λΔ,   Pr(y=0) = 1 − λΔ
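As a concrete sketch of this discretization (the spike times, window length, and bin width below are made-up values, not from the notes), a spike train can be binned into a binary sequence:

```python
import numpy as np

def discretize(spike_times, t_end, dt):
    """Bin spike times (seconds) into counts; clip counts to 0/1 under
    the small-dt assumption that Pr(y > 1) is negligible."""
    n_bins = int(round(t_end / dt))
    edges = np.arange(n_bins + 1) * dt
    counts, _ = np.histogram(spike_times, bins=edges)
    return np.clip(counts, 0, 1)

# Example: three spikes in a 100 ms window with dt = 10 ms
y = discretize(np.array([0.0105, 0.032, 0.074]), t_end=0.1, dt=0.01)
# y has a 1 in bins 1, 3, and 7, and 0 elsewhere
```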

For an alternative derivation, consider the Poisson distribution, which defines the probability of observing k events in time Δ where the expected number of events is λΔ:

Pr(y=k) = (λΔ)^k e^(−λΔ) / k!

For Δ sufficiently small, Pr(y>1) becomes negligible, and we consider only

Pr(y=1) = (λΔ) e^(−λΔ)

Since Δ is small, e^(−λΔ) ≈ 1, and we have

Pr(y=1) ≈ λΔ

Since Pr(y=0) can be computed from Pr(y=1) in this case, we focus on estimating Pr(y=1), which we will denote, for convenience, as Pr(1) or simply P. Note that the log-linear model also works in the discrete case, since multiplication of λ by Δ can be absorbed into a constant term in the model.

ln(Pr(1)) ≈ ln(λΔ) = Σ_i a_i x_i

For now, restrict analysis to a single covariate x. In the event that the relationship between ln(λΔ) and x is not entirely captured by a log-linear model, we may want to add nonlinear functions of x to the model. In this way, we can capture a wider range of possible relationships between x and ln(λΔ). However, it is sometimes challenging to select nonlinear terms, and adding irrelevant terms can increase the time required to fit the model, and increase over-fitting. The general form of this case is given by 

ln(Pr(1)) ≈ ln(λΔ) = Σ_i a_i f_i(x)
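As an illustration of how such a model can be fit in practice (the gradient-ascent fitter, learning rate, and simulated data are my own sketch, not part of the original notes), the weights a_i maximize the discretized Poisson log-likelihood:

```python
import numpy as np

def fit_loglinear(F, y, lr=0.1, n_iter=2000):
    """Fit ln(lambda*dt) = F @ a by gradient ascent on the Poisson
    log-likelihood sum_t [ y_t*(F@a)_t - exp((F@a)_t) ]."""
    a = np.zeros(F.shape[1])
    for _ in range(n_iter):
        rate = np.exp(F @ a)                     # predicted lambda*dt per bin
        a += lr * (F.T @ (y - rate)) / len(y)    # gradient of the log-likelihood
    return a

# Simulate counts from a known log-linear model and recover the weights
rng = np.random.default_rng(0)
x = rng.standard_normal(5000)
F = np.column_stack([np.ones_like(x), x])   # features f_0(x)=1, f_1(x)=x
y = rng.poisson(np.exp(F @ np.array([-3.0, 0.5])))
a_hat = fit_loglinear(F, y)                 # close to (-3, 0.5)
```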

Connection between GLM and Bayesian approach

We can also derive a model for conditional intensity using Bayes' rule. We are interested in learning how Pr(1) might depend on another covariate x. That is, we would like to know how the conditional intensity λ_xΔ ≈ Pr(1|x) differs from the baseline Pr(1). We can apply Bayes' rule to solve for this conditional probability in terms of the more easily observed Pr(x|1):

λ_xΔ ≈ Pr(1|x) = Pr(x|1) Pr(1) / Pr(x)

For computational efficiency we work with the natural logarithm of probability. This yields an expression that allows us to directly estimate the conditional intensity in terms of the log probability density functions (PDFs) of x and x|1.

ln(λ_xΔ) ≈ ln(Pr(1|x)) = ln(Pr(x|1)) − ln(Pr(x)) + ln(P)   (1)
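Equation (1) also suggests a simple nonparametric sketch (the histogram binning and the simulated sigmoidal tuning below are illustrative assumptions, not from the notes): estimate ln(Pr(x)) and ln(Pr(x|1)) from histograms and subtract.

```python
import numpy as np

def bayes_log_intensity(x, y, edges):
    """Estimate ln(lambda_x * dt) per bin of x using equation (1):
    ln(Pr(x|1)) - ln(Pr(x)) + ln(Pr(1)), with histogram density estimates."""
    p_x, _ = np.histogram(x, bins=edges, density=True)
    p_x1, _ = np.histogram(x[y == 1], bins=edges, density=True)
    with np.errstate(divide="ignore"):       # empty bins give -inf
        return np.log(p_x1) - np.log(p_x) + np.log(y.mean())

# Simulate a unit whose spike probability rises sigmoidally with x
rng = np.random.default_rng(1)
x = rng.standard_normal(20000)
y = (rng.random(20000) < 1 / (1 + np.exp(-(x - 2.0)))).astype(int)
ln_rate = bayes_log_intensity(x, y, edges=np.linspace(-3, 3, 13))
# ln_rate increases with x, tracking the true log spike probability
```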

The performance of this approach depends on accurate modeling of Pr(x|1) and Pr(x). If these quantities follow distributions whose parameters can be estimated quickly from available data, this approach can be faster than likelihood maximization of the more general log-linear model. If the expression for the log-likelihood ratio can be factored into a linear combination of fixed nonlinear functions of x, then we can directly relate the Bayesian approach to the log-linear model:

ln(Pr(1|x)) = ln(Pr(x|1)) − ln(Pr(x)) + ln(P) = Σ_i a_i f_i(x)

While we can always approximate the above using a series expansion of ln(Pr(1|x)), a closed-form solution of this relationship exists if the log PDFs of Pr(x|1) and Pr(x) can each be factored as Σ_i a_i f_i(x), even if Pr(x|1) and Pr(x) follow different distributions.

We show in the next section that distributions from the exponential family meet these conditions, allowing direct solution of the parameters to a log-linear model from data statistics. Choosing a parametric distribution for Pr(x|1) and Pr(x) also determines the set of nonlinear terms required for a log-linear model to capture the relationship between x and ln(λ), as well as their weights.

Selecting nonlinear features for a GLM point process model

In the previous section we discussed a direct solution for the conditional intensity using Bayes' rule, and how this solution might relate to log-linear models. Specific examples are given in this section. The general steps are as follows:

  • Look at the marginal distribution of the covariate Pr(x), and the distribution conditioned on the presence of a spike Pr(x|1).
  • Pick a parametric probability distribution family that fits the observed distributions well.
  • Look up the log probability-density function.
  • Write down the log-likelihood ratio in terms of the log PDF as in (1).
  • Expand this function until it is of the form Σ_i a_i f_i(x), where a_i is a real-valued parameter and f_i is a function of the covariate.
  • The functions f_i(x) are the nonlinear features of x that you should include in a log-linear model, and the coefficients a_i provide an approximate fit of that model.

Exponential data imply a log-linear model

If a covariate follows an exponential distribution, the Bayesian method provides parameters for a log-linear point process model. The probability density for an exponential distribution has a single rate parameter λ_x: Pr(x;λ_x) = λ_x e^(−λ_x x), which gives the log probability ln(Pr(x;λ_x)) = ln(λ_x) − λ_x x. Substituting this into equation (1) yields

ln(λΔ) ≈ (ln(λ_{x|1}) − λ_{x|1} x) − (ln(λ_x) − λ_x x) + ln(P)

Collecting terms yields the linear and constant terms in a log-linear inhomogeneous Poisson process model:

ln(λΔ) ≈ (λ_x − λ_{x|1}) x + (ln(λ_{x|1}) − ln(λ_x) + ln(P))
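As a sketch of this closed-form fit (the simulation constants and the use of 1/mean as the rate estimator are my own choices, not from the notes), the slope and intercept can be read off the two fitted exponential rates:

```python
import numpy as np

def exp_loglinear_weights(x, y):
    """Plug ML estimates of the exponential rates of Pr(x) and Pr(x|1)
    into the collected-terms expression for the log-linear model."""
    lam_x = 1.0 / x.mean()              # rate of the marginal Pr(x)
    lam_x1 = 1.0 / x[y == 1].mean()     # rate of the conditional Pr(x|1)
    slope = lam_x - lam_x1
    intercept = np.log(lam_x1) - np.log(lam_x) + np.log(y.mean())
    return slope, intercept

# Simulate: x ~ Exp(1) and ln(lambda*dt) = 0.5*x - 3, so Pr(x|1) is Exp(0.5)
rng = np.random.default_rng(2)
x = rng.exponential(1.0, 50000)
y = (rng.random(50000) < np.clip(np.exp(0.5 * x - 3.0), 0, 1)).astype(int)
slope, intercept = exp_loglinear_weights(x, y)   # close to (0.5, -3)
```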

Normally distributed data imply a log-quadratic model

If a covariate follows a Gaussian distribution, the Bayesian classifier provides parameters for a log-quadratic point process model ln(λ) = ax² + bx + c. The probability density for a Gaussian variable is

N(μ,σ)(x) = exp[−(x−μ)²/(2σ²)] / (σ√(2π)),

which gives the log-density

ln(N(μ,σ)(x)) = −(x−μ)²/(2σ²) − ln(σ) − ln(2π)/2.

Substituting this into equation (1) and ignoring the ln(2π)/2 terms, which immediately cancel, gives:

ln(λΔ) ≈ (x−μ_x)²/(2σ_x²) + ln(σ_x) − ((x−μ_{x|1})²/(2σ_{x|1}²) + ln(σ_{x|1})) + ln(P)

Expanding the quadratic expressions and collecting terms yields the constant, linear, and quadratic terms for a log-quadratic inhomogeneous Poisson model:

ln(λΔ) ≈ (1/(2σ_x²) − 1/(2σ_{x|1}²)) x² + (μ_{x|1}/σ_{x|1}² − μ_x/σ_x²) x + μ_x²/(2σ_x²) − μ_{x|1}²/(2σ_{x|1}²) + ln(σ_x) − ln(σ_{x|1}) + ln(P)

If the covariate x has been z-scored such that σ_x = 1 and μ_x = 0, the expression simplifies to:

ln(λΔ) ≈ (1/2 − 1/(2σ_{x|1}²)) x² + (μ_{x|1}/σ_{x|1}²) x + ln(P) − μ_{x|1}²/(2σ_{x|1}²) − ln(σ_{x|1})
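A sketch of the z-scored Gaussian case (the simulation and moment estimators are illustrative assumptions of mine): the spike-triggered mean and standard deviation give the three coefficients directly.

```python
import numpy as np

def gauss_logquad_weights(x, y):
    """Read the log-quadratic model ln(lambda*dt) = a*x^2 + b*x + c off
    the spike-triggered mean and s.d., assuming z-scored x."""
    mu1, s1 = x[y == 1].mean(), x[y == 1].std()
    a = 0.5 - 1.0 / (2 * s1**2)                          # x^2 coefficient
    b = mu1 / s1**2                                      # x coefficient
    c = np.log(y.mean()) - mu1**2 / (2 * s1**2) - np.log(s1)
    return a, b, c

# Simulate: x ~ N(0,1) and ln(lambda*dt) = x - 4, so x|1 ~ N(1,1)
rng = np.random.default_rng(3)
x = rng.standard_normal(100000)
y = (rng.random(100000) < np.clip(np.exp(x - 4.0), 0, 1)).astype(int)
a, b, c = gauss_logquad_weights(x, y)   # close to (0, 1, -4)
```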

Gamma distributed data imply ln(x) nonlinear features

The Gamma PDF, in terms of a shape parameter α and an inverse scale parameter β, is

Pr(x;α,β) = β^α x^(α−1) exp(−βx) / Γ(α),

which gives the log PDF

ln(Pr(x;α,β)) = α ln(β) + (α−1) ln(x) − βx − ln(Γ(α)).

Substituting this into equation (1) yields

ln(λΔ) ≈ (α_{x|1} ln(β_{x|1}) + (α_{x|1}−1) ln(x) − β_{x|1} x − ln(Γ(α_{x|1}))) − (α_x ln(β_x) + (α_x−1) ln(x) − β_x x − ln(Γ(α_x))) + ln(P)

Collecting terms yields a fit for a model that, in addition to linear and constant terms, includes a ln(x) term:

ln(λΔ) ≈ (β_x − β_{x|1}) x + (α_{x|1} − α_x) ln(x) + ln(Γ(α_x)) − ln(Γ(α_{x|1})) + α_{x|1} ln(β_{x|1}) − α_x ln(β_x) + ln(P)

Von Mises distributed data imply sine and cosine nonlinear terms

There is also a GLM formulation for von Mises distributed data. This is left as an exercise to the reader. To derive it, move the preferred-phase parameter out of the cosine using trigonometric identities. The nonlinearities implied for the GLM are sin(θ) and cos(θ).

Generalization to the exponential family

The exponential family of distributions includes all the distributions discussed so far in this section, and follows the canonical form

Pr(x|θ) = h(x) g(θ) e^(θ·T(x))

where x ∈ ℝ^n (n ∈ ℕ) is an n-dimensional real-valued vector, θ ∈ ℝ^m (m ∈ ℕ) is an m-dimensional real-valued parameter, h: ℝ^n→ℝ maps from x to a real number, g: ℝ^m→ℝ maps from θ to a real number, and T: ℝ^n→ℝ^m maps from x into the parameter space. This implies the canonical log PDF

ln(Pr(x|θ)) = ln(h(x)) + ln(g(θ)) + θ·T(x)

This form is already suitable for mapping onto a log-linear point process model. The term ln(g(θ)) is a constant offset implied by the parameters, present only for normalization. The term θ·T(x) is a weighted sum of functions of x, and the term ln(h(x)) is also a function of x, with weight 1. If θ = [θ_1, ..., θ_m], x = [x_1, ..., x_n], and T(x) = [f_1(x), ..., f_m(x)], the log PDF can be written as:

ln(Pr(x)) = Σ_i a_i f_i(x) = ln(g(θ)) + ln(h(x)) + Σ_{j=1}^m θ_j f_j(x)

From this, the conditional intensity in the general case, where x and x|1 may have different distributions from the exponential family, is:

ln(λΔ) = Σ_i a_i f_i(x) = (ln(g_{x|1}(θ_{x|1})) + ln(h_{x|1}(x)) + Σ_{j=1}^{m_{x|1}} θ_{x|1,j} f_{x|1,j}(x)) − (ln(g_x(θ_x)) + ln(h_x(x)) + Σ_{j=1}^{m_x} θ_{x,j} f_{x,j}(x)) + ln(P)

If x and x|1 are both in the exponential family, it is always possible to pick a more general distribution from the exponential family that includes both x and x|1, and so we may consider

ln(λΔ) = Σ_i a_i f_i(x) = (ln(g(θ_{x|1})) + ln(h(x)) + Σ_{j=1}^m θ_{x|1,j} f_j(x)) − (ln(g(θ_x)) + ln(h(x)) + Σ_{j=1}^m θ_{x,j} f_j(x)) + ln(P)

The ln(h(x)) terms cancel, although logarithmic terms may still be introduced via the f_j:

ln(λΔ) = Σ_i a_i f_i(x) = (ln(g(θ_{x|1})) + Σ_{j=1}^m θ_{x|1,j} f_j(x)) − (ln(g(θ_x)) + Σ_{j=1}^m θ_{x,j} f_j(x)) + ln(P)

Collecting terms:

ln(λΔ) = Σ_i a_i f_i(x) = (ln(g(θ_{x|1})/g(θ_x)) + ln(P)) + Σ_{j=1}^m (θ_{x|1,j} − θ_{x,j}) f_j(x)

If x and x|1 can be modeled by any distribution in the exponential family, we can write a log linear model directly from Bayes' rule. If the canonical form of the distribution has a closed form, we can also solve for the model weights. 

Relationships determined by fitting a GLM can be simpler than those implied by a Bayesian approach

Although solving directly for the conditional intensity using Bayes' rule can imply a log-linear model and its coefficients, the converse is not in general true. For example, the quadratic terms from a Gaussian model vanish when σ_x = σ_{x|1}, reducing it to a simple log-linear model.

In general, a purely log-linear GLM will be able to fit the data whenever ln(Pr(x|1)) − ln(Pr(x)) is linear in x, for example when Pr(x) and Pr(x|1) are Gaussians that differ only by a location parameter.

For a particular choice of features f(x), the GLM can be more accurate than a parametric Bayesian approach, because it can flexibly model data from a broader class of distributions. 

Additionally, the GLM directly optimizes parameters that summarize the difference between Pr(x) and Pr(x|1), rather than finding parameters for each distribution separately. This makes the GLM more statistically efficient, as less data is required to constrain a smaller number of parameters. 

Nevertheless, inspecting the distributions Pr(x) and Pr(x|1) provides important clues for which nonlinear features f(x) to include in a GLM regression.

The Kullback-Leibler divergence D_KL(x|1 ∥ x) is an easily computed predictor of model performance

Mutual information is a statistic used to summarize how related two variables are. Consider the problem of measuring the mutual information between a variable x and a point process y.

Estimating the mutual information between a point process and an external variable reduces to estimating the Kullback-Leibler divergence of the conditional Pr(x|1) from the background Pr(x), both of which we have already computed in a Bayesian fit of the point process model. The mutual information can be written in terms of the KL divergence, also known as the relative entropy, as:

I(x,y) = E_y[ D_KL(x|y ∥ x) ]

Spikes are rare in a point process, so Pr(x|0) ≈ Pr(x). Since D_KL is zero when the two distributions are identical, the mutual information (in the sparse limit) reduces to

I(x,y) ≈ Pr(y=1) D_KL(x|1 ∥ x)

The mutual information between x and y is the KL divergence of Pr(x|1) and Pr(x), multiplied by Pr(y=1). Since Pr(y=1) is a background term that depends on the choice of Δ, it is not especially relevant, except when comparing two point processes with different underlying rates. 

Let's explore this in the case that Pr(x|1) and Pr(x) are normally distributed. Substituting Pr(x|1) and Pr(x) into the formula for the KL divergence between two Gaussian distributions yields

D_KL(x|1 ∥ x) = (μ_{x|1} − μ_x)²/(2σ_x²) + ½ (σ_{x|1}²/σ_x² − 1 − ln(σ_{x|1}²/σ_x²))

If x has been z-scored, with unit variance and zero mean,

D_KL(x|1 ∥ x) = ½ (μ_{x|1}² + σ_{x|1}² − 1 − ln(σ_{x|1}²))

If we assume σ_{x|1} ≈ σ_x = 1 (which is the implicit assumption if we were to fit a GLM with only the linear feature x), this simplifies further:

D_KL(x|1 ∥ x) = ½ μ_{x|1}²

From this we see that the fraction of information about y captured by x depends mainly on the squared deviation of the spike-triggered average μ_{x|1} from the baseline μ_x = 0.
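A sketch of this information estimate under the Gaussian assumptions above (the simulated tuning curve and constants are illustrative, not from the notes):

```python
import numpy as np

def gaussian_dkl(x, y):
    """KL divergence D_KL(x|1 || x) for z-scored x under Gaussian fits."""
    mu1, v1 = x[y == 1].mean(), x[y == 1].var()
    return 0.5 * (mu1**2 + v1 - 1.0 - np.log(v1))

def sparse_mi(x, y):
    """Sparse-limit mutual information I(x,y) = Pr(y=1) * D_KL, in nats."""
    return y.mean() * gaussian_dkl(x, y)

# For x ~ N(0,1) and ln(lambda*dt) = x - 4, x|1 ~ N(1,1), so D_KL = 1/2
rng = np.random.default_rng(5)
x = rng.standard_normal(100000)
y = (rng.random(100000) < np.clip(np.exp(x - 4.0), 0, 1)).astype(int)
dkl = gaussian_dkl(x, y)        # close to 0.5
```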

Incidentally, this also suggests that fancy Bayesian and GLM models might not tell you that much more than the Spike Triggered Average (STA), in some cases. 

Another curious observation, which holds empirically at least for the low-information variables examined so far, is the relationship

2 ln(2·AUC − 1) + ½ ≈ ln(D_KL(x|1 ∥ x))

where AUC is the area under the receiver operating characteristic (ROC) curve, used to summarize the accuracy of a point-process decoder. (I'm not sure whether this approximation is really valid or under what conditions it might hold.)

Overall,

Models that fit the log-intensity of an inhomogeneous Poisson point process are related to likelihood-ratio based Bayesian classifiers. Solving for the conditional intensity using Bayes' rule, with distributions from the exponential family, is one way to find parameters for a point process model. The distribution family of x also provides clues as to which nonlinear features to incorporate into a point process GLM. Parametric Bayesian approaches also suggest simple closed-form formulae for measuring the mutual information between a spike train and an external variable.
