import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
def linear_regression_from_scratch():
"""
Implement linear regression using the closed-form solution.
Application: Model a neuron's firing rate as a function of stimulus intensity.
"""
np.random.seed(42)
# Generate synthetic neural data
# Scenario: Neuron responds linearly to stimulus with some noise
n_trials = 200
stimulus_intensity = np.linspace(0, 10, n_trials)
# True relationship: firing_rate = 2 + 3 * stimulus + noise
true_baseline = 2.0
true_gain = 3.0
noise = np.random.randn(n_trials) * 1.5
firing_rate = true_baseline + true_gain * stimulus_intensity + noise
# Prepare design matrix (add intercept)
X = np.column_stack([np.ones(n_trials), stimulus_intensity])
y = firing_rate
# Closed-form solution: β = (X^T X)^{-1} X^T y
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
# Predictions
y_pred = X @ beta_hat
# Calculate R²
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - (ss_res / ss_tot)
# Visualization
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(stimulus_intensity, firing_rate, alpha=0.5,
label='Observed data', color='#cc0000')
ax.plot(stimulus_intensity, y_pred, 'b-', linewidth=2,
label=f'Fitted model: y = {beta_hat[0]:.2f} + {beta_hat[1]:.2f}x')
ax.plot(stimulus_intensity, true_baseline + true_gain * stimulus_intensity,
'g--', linewidth=2, label=f'True model: y = {true_baseline} + {true_gain}x')
ax.set_xlabel('Stimulus Intensity')
ax.set_ylabel('Firing Rate (Hz)')
ax.set_title(f'Linear Regression: Neural Encoding Model (R² = {r_squared:.3f})')
ax.legend()
ax.grid(alpha=0.3)
plt.tight_layout()
print(f"True parameters: baseline={true_baseline}, gain={true_gain}")
print(f"Estimated parameters: baseline={beta_hat[0]:.2f}, gain={beta_hat[1]:.2f}")
print(f"R² = {r_squared:.3f}")
return beta_hat, r_squared
# linear_regression_from_scratch()16 [Model Fitting and Generalized Linear Model
Figure 10.3: Generalized Linear Models extend linear regression to different response types. Each distribution family has an appropriate link function that connects the linear predictor to the mean of the distribution. GLMs unify regression for continuous, binary, count, and positive continuous data.
Generalized Linear Models]{.chapter-title}
By the end of this chapter, you will be able to:
- Understand the philosophy and workflow of model fitting in neuroscience
- Implement linear regression and generalized linear models (GLMs) from scratch
- Apply GLMs to neural data, particularly spike train analysis
- Evaluate model performance using cross-validation and information criteria
- Combat overfitting
Figure 10.1: The fundamental bias-variance tradeoff in statistical modeling. Underfitting (high bias) fails to capture patterns. Overfitting (high variance) captures noise. The goal is to find the sweet spot that generalizes well to new data.
overfitting using regularization techniques (L1, L2) - Connect statistical models to neural encoding and decoding problems
16.1 10.1 The Philosophy of Model Fitting
At its core, model fitting is the process of adjusting a mathematical model’s parameters so that it can best describe a set of data. It’s a fundamental practice in both neuroscience and AI, allowing us to formalize our hypotheses about how a system works and to test those hypotheses against real-world observations. This chapter builds on the data science foundations introduced in Chapter 8.
The general workflow of model fitting involves several key steps:
Choosing a Model: This is the most critical step. The model is a mathematical representation of your hypothesis about the data-generating process. It could be a simple linear equation, a complex dynamical system, or a deep neural network. The choice of model should be guided by your scientific question and your prior knowledge of the system.
Choosing a Cost Function: The cost function (also known as a loss function or objective function) is a measure of how well the model fits the data. It quantifies the discrepancy between the model’s predictions and the actual data. The goal of the fitting process is to find the model parameters that minimize this cost function.
Fitting the Model: This is the optimization step, where we use an algorithm to find the parameters that minimize the cost function. For simple models, this can sometimes be done analytically, but for more complex models, we typically rely on numerical optimization methods.
Evaluating the Model: Once the model is fit, we need to evaluate its performance. This involves not only looking at how well it fits the training data but also, crucially, how well it generalizes to new, unseen data. This is where techniques like cross-validation become essential.
The Bias-Variance Tradeoff
A fundamental concept in model fitting is the bias-variance tradeoff:
- Bias: Error from overly simplistic assumptions. High bias → underfitting
- Variance: Error from sensitivity to fluctuations in training data. High variance → overfitting
- Goal: Find the sweet spot that minimizes total error
16.2 10.2 Linear Regression: A Simple Starting Point
The simplest and most common type of model is the linear regression model. In this model, we assume that the output variable is a linear function of the input variables, plus some noise.
Mathematical Formulation:
\[ y = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(0, \sigma^2) \]
Where: - \(y\) is the response variable (e.g., neural firing rate) - \(\mathbf{X}\) is the design matrix (input features)
- \(\boldsymbol{\beta}\) are the coefficients we want to estimate - \(\boldsymbol{\epsilon}\) is Gaussian noise
Code Lab: Linear Regression from Scratch
While simple, linear regression is a powerful tool, and it forms the foundation for many more complex models. It’s also a great way to introduce the core concepts of model fitting, such as the design matrix, the cost function (in this case, the mean squared error), and the process of finding the optimal parameters.
16.3 10.3 Generalized Linear Models (GLMs)
Linear regression is a great tool, but it has its limitations. For one, it assumes that the noise in the data is Gaussian. This might be a reasonable assumption for some types of data, like fMRI BOLD signals, but it’s not a good assumption for other types of data, like spike counts, which are always non-negative integers.
This is where Generalized Linear Models (GLMs) come in. GLMs are a flexible framework that extends linear regression to handle a wider range of data types and noise distributions.
The GLM Framework
The GLM framework consists of three components:
- A linear predictor: \(\eta = \mathbf{X}\boldsymbol{\beta}\)
- A link function: \(g(\mu) = \eta\), which relates the mean response to the linear predictor
- A noise model (distribution family): e.g., Gaussian, Poisson, Bernoulli
Common GLM Specifications:
| Application | Noise Distribution | Link Function | Example |
|---|---|---|---|
| Continuous data | Gaussian | Identity: \(\mu = \eta\) | fMRI BOLD |
| Count data (spikes) | Poisson | Log: \(\log(\mu) = \eta\) | Neural spikes |
| Binary outcomes | Bernoulli | Logit: \(\log(\frac{\mu}{1-\mu}) = \eta\) | Choice data |
Code Lab: Poisson GLM for Spike Trains
The most common application of GLMs in neuroscience is modeling spike counts using a Poisson distribution.
def poisson_glm_spike_trains():
"""
Fit a Poisson GLM to model neural spike counts.
Scenario: Model how a neuron's spike count depends on stimulus features.
"""
np.random.seed(42)
# Generate synthetic spike data
n_trials = 300
# Two stimulus features: orientation (0-180°) and contrast (0-1)
orientation = np.random.rand(n_trials) * 180
contrast = np.random.rand(n_trials)
# True relationship (log-linear)
# log(λ) = β0 + β1 * orientation + β2 * contrast
beta_true = np.array([0.5, 0.01, 2.0])
X = np.column_stack([np.ones(n_trials), orientation, contrast])
log_rate = X @ beta_true
rate = np.exp(log_rate) # Inverse link (exponential)
# Generate Poisson spike counts
spike_counts = np.random.poisson(rate)
# Fit Poisson GLM using iteratively reweighted least squares (IRLS)
# For simplicity, we'll use scikit-learn
from sklearn.linear_model import PoissonRegressor
model = PoissonRegressor(alpha=0, max_iter=1000)
X_features = X[:, 1:] # Exclude intercept (model adds it automatically)
model.fit(X_features, spike_counts)
# Extract parameters
beta_hat = np.array([model.intercept_, *model.coef_])
# Predictions
rate_pred = np.exp(X @ beta_hat)
# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Plot 1: Observed vs Predicted spike counts
axes[0].scatter(rate, spike_counts, alpha=0.4, c='#cc0000', label='Observed')
max_rate = max(rate.max(), rate_pred.max())
axes[0].plot([0, max_rate], [0, max_rate], 'k--', label='Perfect prediction')
axes[0].set_xlabel('True Rate (λ)')
axes[0].set_ylabel('Observed Spike Count')
axes[0].set_title('Poisson GLM: Observed vs True Rate')
axes[0].legend()
axes[0].grid(alpha=0.3)
# Plot 2: Effect of contrast on firing rate
contrast_range = np.linspace(0, 1, 100)
orientation_fixed = 90 # Fix orientation at 90°
X_pred = np.column_stack([
np.ones(100),
np.full(100, orientation_fixed),
contrast_range
])
rate_pred_contrast = np.exp(X_pred @ beta_hat)
axes[1].plot(contrast_range, rate_pred_contrast, 'b-', linewidth=2,
label='Predicted')
axes[1].set_xlabel('Stimulus Contrast')
axes[1].set_ylabel('Predicted Firing Rate (Hz)')
axes[1].set_title(f'Effect of Contrast (orientation={orientation_fixed}°)')
axes[1].grid(alpha=0.3)
axes[1].legend()
plt.tight_layout()
print("True parameters:", beta_true)
print("Estimated parameters:", beta_hat)
print(f"Parameter recovery error: {np.linalg.norm(beta_true - beta_hat):.3f}")
return beta_hat
# poisson_glm_spike_trains()Neuroscience Insight: Poisson GLMs are the workhorse model for analyzing spike trains. They allow us to: - Model how neurons encode stimulus features (encoding model) - Decode stimulus information from neural activity (decoding) - Quantify tuning curves (e.g., orientation tuning in V1)
Maximum Likelihood Estimation
GLMs are typically fit using maximum likelihood estimation (MLE). The idea is to find parameters that maximize the probability of observing the data.
Log-likelihood for Poisson GLM:
\[ \ell(\boldsymbol{\beta}) = \sum_{i=1}^{n} \left[ y_i \log(\lambda_i) - \lambda_i - \log(y_i!) \right] \]
where \(\lambda_i = \exp(\mathbf{x}_i^T \boldsymbol{\beta})\) is the predicted rate.
16.4 10.4 Overfitting and Regularization
A major challenge in model fitting is overfitting. This occurs when a model learns the training data too well, capturing not only the underlying signal but also the random noise. An overfit model will perform very well on the data it was trained on, but it will fail to generalize to new, unseen data.
Cross-Validation
The best way to detect overfitting is through cross-validation. Instead of training the model on all of the available data, we split the data into a training set and a testing set. We fit the model on the training set and then evaluate its performance on the testing set. If the model performs much better on the training set than on the testing set, it is likely overfit.
def demonstrate_cross_validation():
"""
Demonstrate k-fold cross-validation for model evaluation.
"""
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
np.random.seed(42)
# Generate data
n = 100
X = np.linspace(0, 10, n).reshape(-1, 1)
y = 2 + 3 * X.ravel() + 0.5 * X.ravel()**2 + np.random.randn(n) * 5
# Try different polynomial degrees
degrees = [1, 2, 3, 5, 10, 15]
train_scores = []
cv_scores = []
for degree in degrees:
# Create polynomial features
poly = PolynomialFeatures(degree=degree, include_bias=False)
X_poly = poly.fit_transform(X)
# Fit model
model = Ridge(alpha=0.1)
model.fit(X_poly, y)
# Training score
train_score = model.score(X_poly, y)
train_scores.append(train_score)
# 5-fold cross-validation score
cv_score = cross_val_score(model, X_poly, y, cv=5,
scoring='r2').mean()
cv_scores.append(cv_score)
# Plot
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(degrees, train_scores, 'o-', label='Training R²',
color='#0066cc', linewidth=2)
ax.plot(degrees, cv_scores, 's-', label='Cross-Validation R²',
color='#cc0000', linewidth=2)
ax.axhline(y=max(cv_scores), color='gray', linestyle='--', alpha=0.5)
ax.set_xlabel('Polynomial Degree')
ax.set_ylabel('R² Score')
ax.set_title('Overfitting Detection with Cross-Validation')
ax.legend()
ax.grid(alpha=0.3)
optimal_degree = degrees[np.argmax(cv_scores)]
print(f"Optimal polynomial degree: {optimal_degree}")
print(f"Best CV R²: {max(cv_scores):.3f}")
plt.tight_layout()
return degrees, train_scores, cv_scores
# demonstrate_cross_validation()Regularization: L1 and L2
One of the most effective ways to combat overfitting is regularization. Regularization is a technique that adds a penalty term to the cost function, discouraging the model from learning overly complex or extreme parameter values.
L2 Regularization (Ridge Regression):
\[ \text{Cost} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \]
This method adds a penalty proportional to the square of the magnitude of the model’s weights. It encourages the model to use all of the input features but to keep the weights small.
L1 Regularization (Lasso Regression):
\[ \text{Cost} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \]
This method adds a penalty proportional to the absolute value of the weights. This has the effect of forcing some of the weights to be exactly zero, effectively performing a kind of automatic feature selection.
def compare_regularization():
"""
Compare L1 (Lasso) and L2 (Ridge) regularization.
"""
from sklearn.linear_model import Ridge, Lasso
np.random.seed(42)
# Generate data with many irrelevant features
n_samples = 100
n_features = 50
n_informative = 5 # Only 5 features are actually relevant
X = np.random.randn(n_samples, n_features)
# True coefficients (only first 5 are non-zero)
beta_true = np.zeros(n_features)
beta_true[:n_informative] = [3, -2, 4, -1, 2.5]
y = X @ beta_true + np.random.randn(n_samples) * 0.5
# Fit models
ridge = Ridge(alpha=1.0)
lasso = Lasso(alpha=0.1, max_iter=10000)
ridge.fit(X, y)
lasso.fit(X, y)
# Visualization
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
# True coefficients
axes[0].bar(range(n_features), beta_true, color='gray', alpha=0.7)
axes[0].set_title('True Coefficients - (5 non-zero)')
axes[0].set_xlabel('Feature Index')
axes[0].set_ylabel('Coefficient Value')
# Ridge coefficients
axes[1].bar(range(n_features), ridge.coef_, color='#0066cc', alpha=0.7)
axes[1].set_title('Ridge (L2) Regularization - (All features shrunk)')
axes[1].set_xlabel('Feature Index')
axes[1].set_ylabel('Coefficient Value')
# Lasso coefficients
axes[2].bar(range(n_features), lasso.coef_, color='#cc0000', alpha=0.7)
axes[2].set_title(f'Lasso (L1) Regularization - ({np.sum(lasso.coef_ != 0)} non-zero features)')
axes[2].set_xlabel('Feature Index')
axes[2].set_ylabel('Coefficient Value')
for ax in axes:
ax.axhline(y=0, color='black', linestyle='-', linewidth=0.5)
ax.grid(alpha=0.3, axis='y')
plt.tight_layout()
print(f"Ridge: {np.sum(ridge.coef_ != 0)} non-zero coefficients")
print(f"Lasso: {np.sum(lasso.coef_ != 0)} non-zero coefficients")
print(f"True model: {n_informative} non-zero coefficients")
return ridge.coef_, lasso.coef_
# compare_regularization()Key Differences: - Ridge (L2): Shrinks all coefficients, but none to exactly zero. Good when all features contribute somewhat. - Lasso (L1): Drives some coefficients to exactly zero. Good for feature selection when many features are irrelevant.
Bayesian Perspective on Regularization
From a Bayesian perspective, regularization can be seen as imposing a prior on the model’s parameters:
- L2 regularization ≡ Gaussian prior: \(\beta_j \sim \mathcal{N}(0, 1/\lambda)\)
- L1 regularization ≡ Laplace prior: \(p(\beta_j) \propto \exp(-\lambda|\beta_j|)\)
This provides an elegant connection between frequentist regularization and Bayesian inference.
16.5 10.5 Model Selection: AIC
Figure 10.2: Model selection using information criteria. Training error always decreases with complexity, but test error has a minimum. AIC and BIC balance model fit with complexity penalties, with BIC penalizing complexity more heavily. The optimal model generalizes best to new data.
AIC and BIC
How do we choose between models of different complexity? Information criteria provide a principled way to balance fit quality with model complexity.
Akaike Information Criterion (AIC):
\[ \text{AIC} = 2k - 2\ln(\hat{L}) \]
Bayesian Information Criterion (BIC):
\[ \text{BIC} = k\ln(n) - 2\ln(\hat{L}) \]
Where: - \(k\) = number of parameters - \(n\) = number of observations - \(\hat{L}\) = maximum likelihood
Lower is better. BIC penalizes complexity more heavily than AIC.
def model_selection_demo():
"""
Demonstrate model selection using AIC and BIC.
"""
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
np.random.seed(42)
# Generate data (true model is quadratic)
n = 80
X = np.linspace(0, 10, n).reshape(-1, 1)
y = 2 + 3*X.ravel() - 0.5*X.ravel()**2 + np.random.randn(n) * 5
# Try different polynomial degrees
degrees = range(1, 8)
aic_scores = []
bic_scores = []
for degree in degrees:
# Fit model
poly = PolynomialFeatures(degree=degree, include_bias=True)
X_poly = poly.fit_transform(X)
model = LinearRegression()
model.fit(X_poly, y)
# Calculate residuals
y_pred = model.predict(X_poly)
residuals = y - y_pred
rss = np.sum(residuals**2)
# Log-likelihood (assuming Gaussian noise)
n = len(y)
sigma_sq = rss / n
log_likelihood = -n/2 * np.log(2 * np.pi * sigma_sq) - rss / (2 * sigma_sq)
# Number of parameters (including intercept)
k = degree + 1
# AIC and BIC
aic = 2*k - 2*log_likelihood
bic = k*np.log(n) - 2*log_likelihood
aic_scores.append(aic)
bic_scores.append(bic)
# Plot
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(list(degrees), aic_scores, 'o-', label='AIC',
color='#0066cc', linewidth=2, markersize=8)
ax.plot(list(degrees), bic_scores, 's-', label='BIC',
color='#cc0000', linewidth=2, markersize=8)
best_aic = degrees[np.argmin(aic_scores)]
best_bic = degrees[np.argmin(bic_scores)]
ax.axvline(x=best_aic, color='#0066cc', linestyle='--', alpha=0.5,
label=f'Best AIC (degree={best_aic})')
ax.axvline(x=best_bic, color='#cc0000', linestyle='--', alpha=0.5,
label=f'Best BIC (degree={best_bic})')
ax.set_xlabel('Polynomial Degree')
ax.set_ylabel('Information Criterion (lower is better)')
ax.set_title('Model Selection: AIC vs BIC')
ax.legend()
ax.grid(alpha=0.3)
plt.tight_layout()
print(f"Best model by AIC: degree {best_aic}")
print(f"Best model by BIC: degree {best_bic}")
print(f"True model: quadratic (degree 2)")
return aic_scores, bic_scores
# model_selection_demo()16.6 Exercises
Conceptual Questions
Linear Regression Assumptions: What are the key assumptions of linear regression? For which types of neuroscience data would these assumptions be violated?
GLM vs Linear Regression: Explain why a Poisson GLM is more appropriate than linear regression for modeling spike counts. What would go wrong if you used linear regression?
Regularization Intuition: In your own words, explain how L1 and L2 regularization prevent overfitting. When would you prefer one over the other?
Computational Problems
Implement GLM from Scratch: Implement a Poisson GLM using gradient descent (not using sklearn). Test it on synthetic spike train data.
Cross-Validation: Generate synthetic data where the true model is a 3rd-degree polynomial. Use k-fold cross-validation to determine the optimal polynomial degree. Plot train vs validation error.
Feature Selection: Generate data with 20 features, where only 3 are truly informative. Use Lasso regression to perform automatic feature selection. Compare your results to using all features with Ridge regression.
Neural Encoding Model: Load a real neural dataset (e.g., from CRCNS.org) and fit a GLM to predict spike counts from stimulus features. Evaluate your model using cross-validation.
Discussion Questions
Model Complexity: Discuss the trade-off between model interpretability and predictive power. When might you prefer a simple linear model over a more complex nonlinear model in neuroscience?
Bayesian vs Frequentist: Compare the frequentist approach (point estimates, confidence intervals) to the Bayesian approach (posterior distributions) for model fitting. What are the advantages of each?
Real-World Application: Design a GLM-based experiment to test whether a neuron in motor cortex encodes movement direction, speed, or both. What would your design matrix look like? How would you test your hypothesis?
16.7 References
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). Springer.
McCullagh, P., & Nelder, J. A. (1989). Generalized Linear Models (2nd ed.). Chapman and Hall/CRC.
Dobson, A. J., & Barnett, A. G. (2018). An Introduction to Generalized Linear Models (4th ed.). CRC Press.
Truccolo, W., Eden, U. T., Fellows, M. R., Donoghue, J. P., & Brown, E. N. (2005). A point process framework for relating neural spiking activity to spiking history, neural ensemble, and extrinsic covariate effects. Journal of Neurophysiology, 93(2), 1074-1089.
Paninski, L. (2004). Maximum likelihood estimation of cascade point-process neural encoding models. Network: Computation in Neural Systems, 15(4), 243-262.
Pillow, J. W., Shlens, J., Paninski, L., et al. (2008). Spatio-temporal correlations and visual signalling in a complete neuronal population. Nature, 454(7207), 995-999.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
Gelman, A., Carlin, J. B., Stern, H. S., et al. (2013). Bayesian Data Analysis (3rd ed.). CRC Press.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning. Springer.