MCMC Methods and the Metropolis–Hastings Algorithm
“Bayesian estimation takes the model seriously as a data-generating process, while acknowledging that we are uncertain about its parameters.” — Frank Smets and Raf Wouters
Cross-reference: Principles Appendix B (Bayesian estimation, Metropolis–Hastings); Ch. 27 (Smets–Wouters calibration and estimation) [P:AppB, P:Ch.27]
30.1 Why Bayesian Estimation?¶
Chapter 21 developed GMM and MLE for structural parameter estimation. For DSGE models, Bayesian estimation — combining the likelihood with prior distributions over parameters — has become the dominant approach for three reasons:
1. Identification. DSGE models typically have many parameters but limited observables. Many parameter combinations are nearly observationally equivalent — different parameter values produce similar time series. The prior provides regularization, guiding the posterior toward economically plausible regions of the parameter space.
2. Incorporating prior information. Microeconomic studies provide independent evidence about preference and technology parameters (e.g., labor supply elasticities from household data, price rigidity from firm surveys). The Bayesian prior formalizes this information.
3. Model comparison. The Bayesian framework provides a principled method for comparing non-nested models via the Bayes factor — the ratio of marginal likelihoods. This is particularly useful when comparing different DSGE specifications.
The price: Bayesian inference requires specifying priors and running MCMC chains that can take hours to converge. For large models (Smets–Wouters with 41 estimated parameters), each likelihood evaluation takes seconds, and the chain needs $10^5$–$10^6$ draws. In practice, this is done using Dynare’s optimized implementation.
30.2 Bayesian Inference: The Posterior Distribution¶
Bayes’ theorem for parameters:

$$p(\theta \mid Y) = \frac{p(Y \mid \theta)\, p(\theta)}{p(Y)}$$

where:
$p(\theta \mid Y)$ — posterior: our beliefs about $\theta$ after seeing the data.
$p(Y \mid \theta)$ — likelihood: probability of the data given parameters (from Kalman filter, Chapter 20).
$p(\theta)$ — prior: beliefs before seeing the data.
$p(Y) = \int p(Y \mid \theta)\, p(\theta)\, d\theta$ — marginal likelihood: normalizing constant.
The posterior combines the prior with the likelihood. When the likelihood is very informative, the posterior is concentrated near the MLE regardless of the prior. When the data are sparse, the posterior is closer to the prior.
Definition 30.1 (Posterior Mode and Posterior Mean). The posterior mode (maximum a posteriori) maximizes $p(\theta \mid Y) \propto p(Y \mid \theta)\, p(\theta)$. The posterior mean $\mathbb{E}[\theta \mid Y] = \int \theta\, p(\theta \mid Y)\, d\theta$ is the Bayes estimator under squared error loss.
Finding the posterior mode is an optimization problem (Chapter 24); computing the posterior mean requires integrating over the posterior distribution — which is analytically intractable for DSGE models. MCMC methods approximate the posterior by sampling.
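For a scalar parameter the integral can still be brute-forced on a grid, which makes the objects above concrete before MCMC enters. A minimal sketch, assuming a toy model with Gaussian likelihood $N(\theta \mid 1.5, 1)$ and Gaussian prior $N(\theta \mid 1, 1)$, so the exact posterior is $N(1.25, 0.5)$:

```python
import numpy as np

# Toy scalar model: likelihood N(theta|1.5, 1), prior N(theta|1.0, 1).
# The product of two unit-variance Gaussians gives posterior N(1.25, 0.5).
grid = np.linspace(-4.0, 6.0, 20001)
log_lik = -0.5 * (grid - 1.5)**2        # Gaussian log-likelihood (up to a constant)
log_prior = -0.5 * (grid - 1.0)**2      # Gaussian log-prior (up to a constant)
log_post = log_lik + log_prior

w = np.exp(log_post - log_post.max())   # unnormalized posterior, overflow-safe
w /= w.sum()                            # normalize on the grid
post_mean = np.sum(w * grid)            # approximates E[theta|Y] = 1.25
post_var = np.sum(w * (grid - post_mean)**2)  # approximates Var[theta|Y] = 0.5
print(post_mean, post_var)
```

With more than a handful of parameters the grid explodes exponentially, which is exactly why sampling methods are needed.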
30.3 Prior Distributions for DSGE Parameters¶
The choice of prior reflects both economic beliefs and practical considerations.
Standard choices:
| Parameter | Typical Prior | Support | Motivation |
|---|---|---|---|
| Calvo probability | Beta | $(0,1)$ | A fraction; must lie in $(0,1)$ |
| CRRA | Gamma | $(0,\infty)$ | Must be positive |
| AR persistence | Beta | $(0,1)$ | Must lie in $(0,1)$ for stationarity |
| Taylor rule $\phi_\pi$ | Normal (truncated) | $(1,\infty)$ | Determinacy requires $\phi_\pi > 1$ |
| Shock std dev | Inverse-Gamma | $(0,\infty)$ | Scale parameter; must be positive |
| Investment adj. cost | Normal (positive) | $(0,\infty)$ | Must be positive |
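These choices map directly onto scipy.stats distributions. The hyperparameters below are illustrative placeholders, not values from any particular study; note that scipy's truncnorm takes its bounds in standard-deviation units:

```python
import numpy as np
from scipy import stats

# Illustrative prior hyperparameters -- placeholders, not estimates
priors = {
    "calvo": stats.beta(3, 2),               # support (0,1): a fraction
    "crra": stats.gamma(2, scale=1.0),       # support (0,inf): must be positive
    "ar_rho": stats.beta(5, 2),              # support (0,1): stationarity
    # Taylor-rule coefficient: Normal(1.5, 0.25) truncated below at 1 for
    # determinacy; truncnorm bounds are standardized: (1 - 1.5)/0.25 = -2
    "phi_pi": stats.truncnorm((1.0 - 1.5) / 0.25, np.inf, loc=1.5, scale=0.25),
    "sigma_u": stats.invgamma(3, scale=0.02),  # support (0,inf): shock std dev
}

def log_prior(theta):
    """Sum of independent log prior densities; -inf outside any support."""
    return sum(priors[k].logpdf(v) for k, v in theta.items())

ok = {"calvo": 0.66, "crra": 1.5, "ar_rho": 0.7, "phi_pi": 1.5, "sigma_u": 0.01}
bad = dict(ok, phi_pi=0.9)   # violates the determinacy truncation
print(log_prior(ok))         # finite
print(log_prior(bad))        # -inf: the prior enforces the support restriction
```

The prior thus doubles as a hard constraint: parameter draws outside the economically admissible region get zero prior density, so the MH step below rejects them automatically.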
Definition 30.2 (Conjugate Prior). A prior is conjugate to the likelihood if the posterior is in the same distributional family as the prior. Conjugate priors allow analytical posterior updates.
For DSGE models, the full likelihood is nonlinear in parameters (through the Kalman filter), so no conjugate priors exist. MCMC is necessary.
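To see what conjugacy buys when it is available, consider the textbook Normal–Normal case: Gaussian data with known variance and a Gaussian prior on the mean. The posterior stays Gaussian and updates in closed form, with no sampling required (a minimal sketch; the numbers are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 1.0                       # known data variance
mu0, tau2 = 0.0, 4.0               # Normal prior on the mean: N(mu0, tau2)
y = rng.normal(2.0, np.sqrt(sigma2), size=50)   # data with true mean 2.0

# Conjugate update: the posterior is N(mu_n, tau2_n) -- same family as the prior
n = len(y)
tau2_n = 1.0 / (1.0/tau2 + n/sigma2)            # posterior variance
mu_n = tau2_n * (mu0/tau2 + y.sum()/sigma2)     # precision-weighted posterior mean
print(mu_n, tau2_n)   # mean pulled toward the sample mean as n grows
```

The DSGE likelihood destroys this structure: $\theta$ enters through the model solution and the Kalman filter, so the posterior has no closed form and must be sampled.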
30.4 The Metropolis–Hastings Algorithm¶
The Metropolis–Hastings (MH) algorithm generates samples from the posterior distribution without computing the normalizing constant $p(Y)$.
Algorithm 30.1 (Metropolis–Hastings).
Initialize: $\theta^{(0)}$ (e.g., the posterior mode from optimization). Scale matrix $\Sigma$ (proposal covariance, typically tuned to give a 20–30% acceptance rate).
For $s = 1, \dots, N$:
Propose: Draw $\theta^{*} = \theta^{(s-1)} + \varepsilon_s$, $\varepsilon_s \sim N(0, c^2 \Sigma)$ (symmetric random-walk proposal).
Compute acceptance ratio:

$$\alpha = \min\left\{ 1,\ \frac{p(Y \mid \theta^{*})\, p(\theta^{*})}{p\!\left(Y \mid \theta^{(s-1)}\right) p\!\left(\theta^{(s-1)}\right)} \right\}$$

(the unknown normalizing constant $p(Y)$ cancels in the ratio).
Accept/Reject: Draw $u \sim U(0,1)$.
If $u \le \alpha$: $\theta^{(s)} = \theta^{*}$ (accept).
If $u > \alpha$: $\theta^{(s)} = \theta^{(s-1)}$ (reject, stay).
Theorem 30.1 (MH Invariance). The MH chain with target $p(\theta \mid Y)$ has $p(\theta \mid Y)$ as its stationary distribution. Starting from any initial point $\theta^{(0)}$, the chain converges in distribution to $p(\theta \mid Y)$ under mild regularity conditions (irreducibility, aperiodicity).
Proof sketch. The detailed balance condition ensures $p(\theta \mid Y)$ is stationary. For any two states $\theta, \theta'$:

$$p(\theta \mid Y)\, q(\theta' \mid \theta)\, \alpha(\theta, \theta') = p(\theta' \mid Y)\, q(\theta \mid \theta')\, \alpha(\theta', \theta)$$

This is verified by considering the two cases $\alpha(\theta, \theta') = 1$ and $\alpha(\theta', \theta) = 1$ (for the MH ratio at least one of the two always equals 1). Detailed balance $\Rightarrow$ stationarity. Irreducibility and aperiodicity $\Rightarrow$ convergence.
In APL, the chain is built by iterating a single MH transition step:
⍝ APL — Metropolis-Hastings chain
⎕IO←0 ⋄ ⎕ML←1
⍝ Scalar log-posterior (placeholder: in production, ⎕PY calls the Kalman filter)
⍝ Posterior ∝ L(theta) × prior = N(theta|1.5,1) × N(theta|1.0,1) → N(1.25, 0.5)
log_post ← {
ll ← -0.5×(⍵-1.5)*2 ⍝ Gaussian log-likelihood
lp ← -0.5×(⍵-1.0)*2 ⍝ Normal(1, 1) log-prior
ll + lp}
⍝ Box-Muller standard normal: sine branch (1○ is sine; 2○ would give the cosine branch)
rnorm ← {(1○2×○1×?0) × (-2×⍟?0)*0.5} ⍝ scalar standard normal: sin(2πU₁)×√(−2 ln U₂)
⍝ One MH step: ⍵ is a 2-element vector (theta, n_accepted)
⍝ Returns updated (theta_new, n_accepted_new)
mh_step ← {
(theta n_acc) ← ⍵
proposal ← theta + scale × rnorm ⍬ ⍝ random walk proposal
log_ratio ← (log_post proposal) - log_post theta
accept ← log_ratio > ⍟?0 ⍝ log(U[0,1]) < log_ratio
new_theta ← accept⌷theta proposal ⍝ pick proposal if accepted, else theta
new_theta (n_acc+accept)} ⍝ accumulate acceptance count
⍝ Run chain: N_draws iterations, recording every draw
N_draws ← 10000
scale ← 0.5 ⍝ proposal standard deviation (tune for 20–30% acceptance)
theta0 ← 0 ⍝ starting value
⍝ Collect the chain by explicit accumulation: preallocate, then repeatedly
⍝ advance the (theta, n_accepted) state with mh_step, recording each theta.
⍝ Indexed assignment inside the dfn updates the global chain and state.
chain ← (N_draws+1) ⍴ theta0 ⍝ pre-allocated chain vector; chain[0] is theta0
state ← theta0 0 ⍝ state: (theta, n_accepted)
{
state[0 1] ← mh_step state ⍝ advance one MH step
chain[⍵] ← ⊃state ⍝ store the new theta
} ¨ 1+⍳N_draws ⍝ draw indices 1..N_draws (⎕IO←0)
⍝ Final acceptance rate from accumulated count in state
n_accepted ← state[1]
⍝ Posterior statistics (post-burnin)
burnin ← 2000
samples ← burnin↓chain
post_mean ← (+/samples) ÷ ≢samples
post_std ← ((+/(samples-post_mean)*2) ÷ ≢samples)*0.5
acc_rate ← n_accepted ÷ N_draws ⍝ overall acceptance rate
⍝ Acceptance count among consecutive distinct draws (alternative rate measure)
transitions ← (1↓chain) ≠ ¯1↓chain ⍝ 1 where chain moved; APL drop, not Python slice
acc_rate_alt ← (+/transitions) ÷ N_draws
post_mean post_std acc_rate ⍝ target: mean≈1.25, std≈0.71, rate≈20–40%

30.5 MCMC Diagnostics¶
Running an MCMC chain is only half the task — the other half is verifying that the chain has converged to the target distribution and that the draws provide reliable posterior estimates.
30.5.1 Trace Plots and Mixing¶
The trace plot shows the parameter value at each iteration. A well-mixing chain moves freely around the parameter space, not getting stuck in one region. Poor mixing (high autocorrelation in the chain) suggests the proposal variance is too small.
Definition 30.3 (Effective Sample Size). The effective sample size accounts for autocorrelation in the chain:

$$\mathrm{ESS} = \frac{N}{1 + 2 \sum_{k=1}^{\infty} \hat{\rho}_k}$$

where $\hat{\rho}_k$ is the sample autocorrelation at lag $k$. If the chain has high autocorrelation, $\mathrm{ESS} \ll N$ — 10,000 correlated draws may be worth only 100 independent draws.
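A practical ESS computation has to truncate the infinite sum; a common heuristic stops at the first non-positive sample autocorrelation. A sketch (the AR(1) series below stands in for correlated MCMC output):

```python
import numpy as np

def effective_sample_size(chain):
    """ESS = N / (1 + 2*sum(rho_k)), truncating at the first non-positive rho_k."""
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    n = len(x)
    # Sample autocorrelations: lag-k entry of the full autocorrelation, normalized
    rho = np.correlate(x, x, mode="full")[n-1:] / (n * x.var())
    s = 0.0
    for k in range(1, n):
        if rho[k] <= 0:            # truncation heuristic
            break
        s += rho[k]
    return n / (1.0 + 2.0 * s)

rng = np.random.default_rng(1)
iid = rng.standard_normal(10000)   # ideal chain: independent draws
ar = np.empty(10000); ar[0] = 0.0
for t in range(1, 10000):
    ar[t] = 0.9 * ar[t-1] + rng.standard_normal()   # sticky, MH-like chain

print(effective_sample_size(iid))  # close to 10000
print(effective_sample_size(ar))   # far smaller: correlated draws carry less information
```

For an AR(1) chain with coefficient 0.9 the theoretical efficiency factor is $(1-0.9)/(1+0.9) \approx 0.05$, so roughly 19 correlated draws are worth one independent draw.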
30.5.2 The Gelman–Rubin Statistic¶
Definition 30.4 (Gelman–Rubin Statistic). Run $M$ independent chains from different starting points, each of length $N$. The statistic compares within-chain variance to between-chain variance:

$$\hat{R} = \sqrt{\frac{\hat{V}}{W}}$$

where $W = \frac{1}{M} \sum_{j=1}^{M} s_j^2$ (average within-chain variance), $B = \frac{N}{M-1} \sum_{j=1}^{M} (\bar{\theta}_j - \bar{\theta})^2$ (between-chain variance), and $\hat{V} = \frac{N-1}{N} W + \frac{1}{N} B$ (pooled variance estimate).
$\hat{R} \approx 1$ indicates convergence (within- and between-chain variances agree). The standard threshold: convergence when $\hat{R} < 1.1$ for all parameters.
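Definition 30.4 translates into a few lines of NumPy. A sketch with synthetic chains, one of them deliberately stuck in the wrong region:

```python
import numpy as np

def gelman_rubin(chains):
    """R-hat for an array of M chains of equal length N (Definition 30.4)."""
    x = np.asarray(chains, dtype=float)      # shape (M, N)
    M, N = x.shape
    chain_means = x.mean(axis=1)
    W = x.var(axis=1, ddof=1).mean()         # average within-chain variance
    B = N * chain_means.var(ddof=1)          # between-chain variance
    V_hat = (N - 1) / N * W + B / N          # pooled variance estimate
    return np.sqrt(V_hat / W)

rng = np.random.default_rng(2)
converged = rng.standard_normal((4, 5000))              # 4 chains, same target
stuck = converged + np.array([[0.], [0.], [0.], [5.]])  # one chain offset by 5

print(gelman_rubin(converged))  # approx 1.0: chains agree
print(gelman_rubin(stuck))      # well above 1.1: chains disagree
```

Because the offset chain never mixes with the others, the between-chain variance $B$ dominates and $\hat{R}$ flags the failure even though each chain looks fine in isolation.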
30.6 The DSGE Likelihood via the Kalman Filter¶
The log-likelihood for the DSGE model with parameters $\theta$ is evaluated using the Kalman filter (Chapter 20):

$$\log L(Y \mid \theta) = -\frac{1}{2} \sum_{t=1}^{T} \left[ p \log 2\pi + \log \lvert F_t \rvert + v_t' F_t^{-1} v_t \right]$$

where $v_t$ is the innovation vector, $F_t$ its covariance, and $p$ the number of observables. The evaluation proceeds:
From $\theta$: construct the DSGE model matrices $\Gamma_0(\theta)$, $\Gamma_1(\theta)$, $\Psi(\theta)$.
Solve using gensys (Chapter 28): get the decision rules $\Theta_1(\theta)$ (transition) and $\Theta_0(\theta)$ (shock impact).
Map to state-space form: $\alpha_t = F \alpha_{t-1} + R \varepsilon_t$, $y_t = H \alpha_t + e_t$.
Run Kalman filter: get innovations $v_t$ and innovation variances $F_t$.
Sum the log-likelihood contributions.
Each MH iteration requires one full Kalman filter evaluation — the bottleneck of the estimation.
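The recursion is easiest to see in the scalar case. This sketch evaluates steps 3–5 for a hypothetical AR(1) state observed with noise; the NK example in Section 30.8 is the same loop with matrices:

```python
import numpy as np

def kalman_loglik(y, rho, sigma_eps, sigma_me):
    """Log-likelihood of y_t = alpha_t + e_t, alpha_t = rho*alpha_{t-1} + eps_t."""
    a, P = 0.0, sigma_eps**2 / (1.0 - rho**2)   # stationary initialization
    ll = 0.0
    for obs in y:
        a_pred = rho * a                         # predict state
        P_pred = rho**2 * P + sigma_eps**2       # predict state variance
        v = obs - a_pred                         # innovation
        F = P_pred + sigma_me**2                 # innovation variance
        ll += -0.5 * (np.log(2*np.pi) + np.log(F) + v**2 / F)
        K = P_pred / F                           # Kalman gain
        a = a_pred + K * v                       # update state
        P = (1.0 - K) * P_pred                   # update state variance
    return ll

# Simulate from the model with true rho = 0.8
rng = np.random.default_rng(3)
alpha, y = 0.0, []
for _ in range(500):
    alpha = 0.8 * alpha + 0.1 * rng.standard_normal()   # latent AR(1) state
    y.append(alpha + 0.05 * rng.standard_normal())      # noisy observation
y = np.array(y)
print(kalman_loglik(y, 0.8, 0.1, 0.05))  # should exceed the value at a wrong rho
```

Each candidate $\theta$ in the MH loop triggers one such pass through the data, which is why the filter dominates the estimation runtime.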
30.7 Model Comparison: Marginal Likelihood and Bayes Factors¶
Definition 30.5 (Marginal Likelihood). The marginal likelihood of model $\mathcal{M}_i$ is:

$$p(Y \mid \mathcal{M}_i) = \int p(Y \mid \theta_i, \mathcal{M}_i)\, p(\theta_i \mid \mathcal{M}_i)\, d\theta_i$$

The Bayes factor for model $\mathcal{M}_i$ vs. $\mathcal{M}_j$:

$$BF_{ij} = \frac{p(Y \mid \mathcal{M}_i)}{p(Y \mid \mathcal{M}_j)}$$

If $BF_{ij} > 1$: $\mathcal{M}_i$ is supported by the data; $BF_{ij} > 10$ is conventionally read as “strong” evidence.
The Laplace approximation for the marginal likelihood:

$$\log p(Y \mid \mathcal{M}) \approx \log p(Y \mid \hat{\theta}) + \log p(\hat{\theta}) + \frac{k}{2} \log 2\pi - \frac{1}{2} \log \lvert H(\hat{\theta}) \rvert$$

where $H(\hat{\theta})$ is the negative Hessian of the log-posterior at the mode and $k$ is the number of parameters. The Laplace approximation is used in Dynare as the default marginal likelihood estimate.
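The formula is easy to sanity-check: for an exactly Gaussian posterior the Laplace approximation is exact. A Python sketch (Exercise 30.3 carries out the same check in Julia):

```python
import numpy as np

# Unnormalized Gaussian log-posterior in 2 parameters, mode at (1.5, 0.15)
def log_post(theta):
    return -0.5 * ((theta[0] - 1.5)**2 / 0.25 + (theta[1] - 0.15)**2 / 0.01)

theta_mode = np.array([1.5, 0.15])
H = np.diag([1/0.25, 1/0.01])   # negative Hessian at the mode (analytical here)
k = len(theta_mode)

# Laplace approximation to log integral of exp(log_post)
lml = log_post(theta_mode) + 0.5*k*np.log(2*np.pi) - 0.5*np.linalg.slogdet(H)[1]
# Exact Gaussian integral: (2*pi)^{k/2} * sqrt(det(Sigma)), Sigma = diag(0.25, 0.01)
exact = np.log(2*np.pi) + 0.5*np.log(0.25 * 0.01)
print(lml, exact)   # identical for a Gaussian target
```

For a genuine DSGE posterior the two differ, and the Laplace estimate is only as good as the Gaussian approximation around the mode.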
30.8 Worked Example: MH for a 3-Parameter NK Model¶
Python¶
import numpy as np
from scipy.stats import norm, gamma, beta as beta_dist
from scipy.linalg import solve_discrete_lyapunov
# Simplified 3-parameter NK model
# Parameters: theta = [kappa, phi_pi, sigma_u]
# Prior: kappa ~ Beta(2,10), phi_pi ~ N(1.5, 0.5^2) truncated to (1, inf), sigma_u ~ IG(0.1, 2)
def log_prior(theta):
kappa, phi_pi, sigma_u = theta
if kappa <= 0 or kappa >= 1 or phi_pi <= 1 or sigma_u <= 0:
return -np.inf
lp = beta_dist.logpdf(kappa, 2, 10)
lp += norm.logpdf(phi_pi, 1.5, 0.5)
lp += -2*np.log(sigma_u) - 0.1/sigma_u**2 # IG(0.1, 2) log-density
return lp
def dsge_kalman_loglik(theta, Y_obs):
"""Evaluate NK model log-likelihood via Kalman filter."""
kappa, phi_pi, sigma_u = theta
beta_NK, sigma_IS = 0.99, 1.0
phi_y = 0.5
# Build model matrices
G0 = np.array([[1.0, -kappa], [sigma_IS*phi_pi, 1+sigma_IS*phi_y]])
G1 = np.array([[beta_NK, 0.0], [-sigma_IS, 1.0]])
A = np.linalg.solve(G0, G1)
eigs = np.linalg.eigvals(A)
if not np.all(np.abs(eigs) < 1.0):
return -np.inf # no unique bounded solution: eigenvalues of A must lie inside the unit circle
# MSV solution: [pi; x] = Omega * u_t, u_t = rho_u * u_{t-1} + eps_t
rho_u = 0.7; Psi = np.linalg.solve(G0, np.array([1.0, 0.0]))
Omega = np.linalg.solve(np.eye(2) - rho_u*A, Psi)
# State-space: alpha_t = [pi_t, x_t, u_t]; the solution sets [pi; x]_t = Omega*u_t,
# so the whole state loads on u_t: alpha_t = R_sel*u_t with R_sel = [Omega; 1]
R_sel = np.array([Omega[0], Omega[1], 1.0])
F = np.zeros((3,3)); F[:, 2] = rho_u * R_sel # alpha_t = rho_u*R_sel*u_{t-1} + R_sel*eps_t
H_obs = np.array([[1.0, 0.0, 0.0]]) # observe pi only; x and u latent
Q = sigma_u**2 * np.outer(R_sel, R_sel) # state innovation covariance
R = np.array([[0.003**2]]) # measurement-error variance (matches the simulated noise std 0.003)
# Kalman filter
T = len(Y_obs); m = F.shape[0]; p = H_obs.shape[0]
a = np.zeros(m); P = np.eye(m) * 10.0 # diffuse initialization
log_lik = 0.0
for t in range(T):
a_pred = F @ a
P_pred = F @ P @ F.T + Q
v = Y_obs[t:t+1] - H_obs @ a_pred
Fv = H_obs @ P_pred @ H_obs.T + R
K = P_pred @ H_obs.T @ np.linalg.inv(Fv)
sign, ld = np.linalg.slogdet(Fv)
log_lik -= 0.5*(ld + v @ np.linalg.solve(Fv, v) + p*np.log(2*np.pi))
a = a_pred + K @ v
P = (np.eye(m) - K @ H_obs) @ P_pred
return float(log_lik)
def log_posterior(theta, Y_obs):
lp = log_prior(theta)
if not np.isfinite(lp): return -np.inf
ll = dsge_kalman_loglik(theta, Y_obs)
return lp + ll if np.isfinite(ll) else -np.inf
# Generate synthetic data
np.random.seed(42)
T_data = 200; kappa_t, phi_pi_t, sigma_u_t = 0.15, 1.5, 0.01
u = np.zeros(T_data)
for t in range(1,T_data): u[t] = 0.7*u[t-1] + sigma_u_t*np.random.randn()
# Simulate pi: approximate first-order solution
beta_NK, sigma_IS, phi_y = 0.99, 1.0, 0.5
G0_t = np.array([[1,-kappa_t],[sigma_IS*phi_pi_t,1+sigma_IS*phi_y]])
G1_t = np.array([[beta_NK,0],[-sigma_IS,1]]); A_t = np.linalg.inv(G0_t)@G1_t
Omega_t = np.linalg.solve(np.eye(2)-0.7*A_t, np.linalg.solve(G0_t,np.array([1,0])))
pi_data = Omega_t[0]*u + 0.003*np.random.randn(T_data)
# MH sampling
N_draws = 5000; burnin = 1000
theta_chain = np.zeros((N_draws, 3))
theta_chain[0] = [0.12, 1.6, 0.012] # starting point
Sigma_prop = np.diag([0.01**2, 0.05**2, 0.002**2]) # proposal covariance
L_prop = np.linalg.cholesky(Sigma_prop)
lp_curr = log_posterior(theta_chain[0], pi_data)
n_accept = 0
for s in range(1, N_draws):
proposal = theta_chain[s-1] + L_prop @ np.random.randn(3)
lp_prop = log_posterior(proposal, pi_data)
log_alpha = lp_prop - lp_curr
if np.log(np.random.rand()) < log_alpha:
theta_chain[s] = proposal
lp_curr = lp_prop
n_accept += 1
else:
theta_chain[s] = theta_chain[s-1]
accept_rate = n_accept / N_draws
posterior = theta_chain[burnin:]
print(f"Acceptance rate: {accept_rate*100:.1f}% (target: 20-30%)")
print(f"\nPosterior estimates (true values in brackets):")
params = ['kappa', 'phi_pi', 'sigma_u']
true_vals = [kappa_t, phi_pi_t, sigma_u_t]
for i, (p, tv) in enumerate(zip(params, true_vals)):
pm = np.mean(posterior[:,i]); ps = np.std(posterior[:,i])
ci = np.percentile(posterior[:,i], [5, 95])
print(f" {p}: mean={pm:.4f}, std={ps:.4f}, 90%CI=[{ci[0]:.4f},{ci[1]:.4f}] (true={tv})")

30.9 Programming Exercises¶
Exercise 30.1 (APL — Gelman–Rubin Statistic)¶
Run two independent MH chains from different, widely dispersed starting points for the simple Gaussian posterior. Compute the Gelman–Rubin statistic: (a) within-chain variance $W$ using (+/S_j_sq)÷m; (b) between-chain variance $B$; (c) $\hat{R} = \sqrt{\hat{V}/W}$ with $\hat{V} = \frac{N-1}{N} W + \frac{1}{N} B$. Verify $\hat{R} \to 1$ as the chain length increases and the two chains converge.
Exercise 30.2 (Python — Adaptive MH)¶
Implement adaptive MH (Haario, Saksman, Tamminen, 2001): after an initial warmup period, update the proposal covariance as $\Sigma_s = \frac{2.38^2}{d} \widehat{\mathrm{Cov}}\!\left(\theta^{(1)}, \dots, \theta^{(s)}\right) + \varepsilon I_d$, where $d$ is the dimension. (a) Show that the $2.38^2/d$ scaling targets the optimal proposal covariance for Gaussian posteriors. (b) Compare acceptance rates between fixed and adaptive proposals for the 3-parameter NK model. (c) Verify the adaptive chain converges to the same posterior as the fixed chain.
Exercise 30.3 (Julia — Marginal Likelihood via Laplace)¶
using LinearAlgebra, ForwardDiff, Optim
function laplace_marginal_lik(log_posterior, theta_mode)
# Compute negative Hessian at mode
H = ForwardDiff.hessian(theta -> -log_posterior(theta), theta_mode)
k = length(theta_mode)
# Laplace approximation
lml = log_posterior(theta_mode) + (k/2)*log(2π) - 0.5*log(det(H))
return lml
end
# Example: simple 2-parameter model
log_post(θ) = -0.5*((θ[1]-1.5)^2/0.25 + (θ[2]-0.15)^2/0.01) # Gaussian posterior
θ_mode = [1.5, 0.15]
lml = laplace_marginal_lik(log_post, θ_mode)
println("Log marginal likelihood (Laplace): ", round(lml, digits=4))
println("Exact (Gaussian): ", log(2π) + 0.5*log(0.25*0.01))

Exercise 30.4 — Prior Sensitivity ($\kappa$)¶
For the 3-parameter NK model, investigate prior sensitivity. (a) Run the MH estimator with three different priors for $\kappa$: informative Beta(5,20) (mean 0.20), weakly informative Beta(2,10) (mean 0.17), and very flat Beta(1,1). (b) Report the posterior means and standard deviations for all three priors. (c) Compute the Bayes factor between the informative and flat prior specifications. (d) Interpret: is the data informative enough to overcome the prior, or does the posterior depend strongly on the prior specification?
30.10 Chapter Summary¶
Key results:
The Bayesian posterior combines the Kalman filter likelihood with prior distributions over model parameters.
Standard priors for DSGE: Beta for fractions (Calvo probability, AR persistence); Gamma/Inverse-Gamma for positive parameters (CRRA, investment adjustment cost, shock std devs); Normal (truncated) for Taylor rule coefficients.
The Metropolis–Hastings algorithm generates posterior draws: propose $\theta^{*} \sim N(\theta^{(s-1)}, c^2 \Sigma)$; accept with probability $\alpha = \min\{1,\ p(\theta^{*} \mid Y) / p(\theta^{(s-1)} \mid Y)\}$. Invariance to the target distribution proved via detailed balance (Theorem 30.1).
Convergence diagnostics: trace plots, effective sample size $\mathrm{ESS} = N / (1 + 2 \sum_k \hat{\rho}_k)$, and Gelman–Rubin $\hat{R} < 1.1$.
The Laplace approximation to the marginal likelihood: $\log p(Y) \approx \log p(Y \mid \hat{\theta}) + \log p(\hat{\theta}) + \frac{k}{2} \log 2\pi - \frac{1}{2} \log \lvert H(\hat{\theta}) \rvert$.
In APL: the MH chain is built by repeatedly applying the one-step transition mh_step to a (theta, n_accepted) state, storing each draw in a preallocated vector; acceptance check via log_ratio > ⍟?0 (accept when $\log u < \log \alpha$).
Next: Chapter 31 — Implementing DSGE Models: Workflows in Dynare, APL, Python, Julia, and R