
Finding Optimal Policy Rules

“Every optimizer is Newton’s method in disguise. The question is which Hessian approximation you’re using.”

Cross-reference: Principles Ch. 23 (optimal monetary policy, loss function, commitment vs. discretion); Ch. 28 (optimal fiscal policy) [P:Ch.23, P:Ch.28]


24.1 The Optimization Problem in Macroeconomic Policy

Chapter 1 derived the first-order conditions for optimization problems with analytical gradients. In practice, the objective function is often the result of a simulation — the welfare loss generated by a particular Taylor rule coefficient, computed by running the DSGE model forward. These simulation-based objectives have no closed-form gradient: they must be optimized numerically.

Three distinct applications motivate the algorithms of this chapter:

Optimal Taylor rule: Given the NK welfare loss function $\mathcal{L} = \mathbb{E}\sum_{t=0}^\infty\beta^t[\pi_t^2 + \lambda x_t^2]$, find the Taylor rule coefficients $(\phi_\pi^*, \phi_y^*)$ that minimize $\mathcal{L}$ subject to the DSGE equilibrium conditions. The objective is a two-variable function that can be evaluated by solving the linearized model (Chapter 28) and computing the theoretical second moments.

DSGE parameter estimation: Find $\hat{\boldsymbol\theta}_{MLE}$ that maximizes the log-likelihood $\mathcal{L}(\boldsymbol\theta) = -\frac{1}{2}\sum_t[\ln|F_t| + \mathbf{v}_t'F_t^{-1}\mathbf{v}_t]$ (Chapter 20). This is a $k$-dimensional optimization where $k$ can be 30–50 for a medium-scale DSGE.

Calibration moment matching: Find parameters $\hat{\boldsymbol\theta}$ that minimize a distance between model-implied and data moments: $\min_{\boldsymbol\theta}\sum_j w_j\,(m_j^{model}(\boldsymbol\theta) - m_j^{data})^2$.
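As a minimal sketch of the moment-matching problem (hypothetical targets; a two-parameter AR(1) stands in for the DSGE model, so the model moments have closed forms), the weighted distance can be minimized with a derivative-free optimizer:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical calibration targets: variance and persistence of an observed series
m_data = np.array([0.0004, 0.85])    # [Var(y), Corr(y_t, y_{t-1})]
w = np.array([1.0, 1.0])             # moment weights

def model_moments(theta):
    """AR(1) model y_t = rho*y_{t-1} + eps_t: closed-form moments."""
    rho, sigma = theta
    return np.array([sigma**2 / (1 - rho**2), rho])

def distance(theta):
    rho, sigma = theta
    if not (0 < rho < 1) or sigma <= 0:
        return 1e10                  # penalize infeasible parameters
    diff = model_moments(theta) - m_data
    return np.sum(w * diff**2)

res = minimize(distance, x0=[0.5, 0.05], method='Nelder-Mead',
               options={'xatol': 1e-12, 'fatol': 1e-18, 'maxiter': 5000})
rho_hat, sigma_hat = res.x
```

In a real DSGE calibration, `model_moments` would solve the linearized model and compute theoretical second moments, exactly as in Section 24.5.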


24.2 Unconstrained Optimization Algorithms

24.2.1 Gradient Descent

The simplest first-order method: $\mathbf{x}_{n+1} = \mathbf{x}_n - \alpha_n\nabla f(\mathbf{x}_n)$, where $\alpha_n > 0$ is the step size (learning rate).

Theorem 24.1 (Convergence of Gradient Descent). If $f$ is $\mu$-strongly convex ($f(\mathbf{y}) \geq f(\mathbf{x}) + \nabla f(\mathbf{x})'(\mathbf{y}-\mathbf{x}) + \frac{\mu}{2}\|\mathbf{y}-\mathbf{x}\|^2$) and $L$-smooth ($\|\nabla f(\mathbf{x}) - \nabla f(\mathbf{y})\| \leq L\|\mathbf{x}-\mathbf{y}\|$), then with step size $\alpha = 1/L$:

$$f(\mathbf{x}_n) - f(\mathbf{x}^*) \leq \left(1 - \frac{\mu}{L}\right)^n \left[f(\mathbf{x}_0) - f(\mathbf{x}^*)\right].$$

Convergence is linear with rate $1 - \mu/L$. The ratio $\kappa = L/\mu$ (condition number) governs convergence: poorly conditioned problems ($\kappa \gg 1$) converge slowly.

Practical limitation for macroeconomics: Gradient descent requires many iterations (linear convergence) and is rarely used directly in DSGE work. It appears in machine learning applications (neural network policy function approximation).
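The linear rate of Theorem 24.1 is easy to see numerically. A minimal Python sketch (illustrative, not from a macro model) on the ill-conditioned quadratic $f(\mathbf{x}) = \frac{1}{2}(\mu x_1^2 + L x_2^2)$ with $\kappa = 100$:

```python
import numpy as np

# Ill-conditioned quadratic: f(x) = 0.5*(mu*x1^2 + L*x2^2), condition number kappa = L/mu
mu, L = 1.0, 100.0
grad = lambda x: np.array([mu, L]) * x

x = np.array([1.0, 1.0])
alpha = 1.0 / L                      # step size from Theorem 24.1
errs = []
for _ in range(200):
    x = x - alpha * grad(x)
    errs.append(np.linalg.norm(x))   # distance to the minimizer x* = 0

rate = errs[-1] / errs[-2]           # empirical per-iteration contraction
```

The well-conditioned coordinate is eliminated in one step, but the slow coordinate contracts by exactly $1 - \mu/L = 0.99$ per iteration, so 200 iterations still leave a visible error.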

24.2.2 Newton’s Method for Optimization

Newton’s method replaces the gradient step with a Newton step:

$$\mathbf{x}_{n+1} = \mathbf{x}_n - [H_f(\mathbf{x}_n)]^{-1}\nabla f(\mathbf{x}_n),$$

where $H_f$ is the Hessian matrix. This is exactly root-finding applied to the first-order condition $\nabla f(\mathbf{x}) = \mathbf{0}$, using Newton's method from Chapter 22.

Theorem 24.2 (Quadratic Convergence of Newton's Optimization Method). Near a local minimum $\mathbf{x}^*$ with $H_f(\mathbf{x}^*)$ positive definite and $H_f$ Lipschitz:

$$\|\mathbf{x}_{n+1} - \mathbf{x}^*\| \leq C\|\mathbf{x}_n - \mathbf{x}^*\|^2.$$

Quadratic convergence — same as root-finding Newton. But computing and inverting the $k\times k$ Hessian costs $O(k^3)$ per iteration — prohibitive for large $k$ (a DSGE with 40+ parameters).
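A quick numerical illustration (a standard test function, not a DSGE example): pure Newton iteration on the Rosenbrock function with analytic gradient and Hessian, started near the minimum $(1,1)$, drives the error to machine precision in a handful of steps:

```python
import numpy as np

# Rosenbrock: f(x,y) = (1-x)^2 + 100*(y-x^2)^2, unique minimum at (1,1)
def grad(z):
    x, y = z
    return np.array([-2*(1 - x) - 400*x*(y - x**2), 200*(y - x**2)])

def hess(z):
    x, y = z
    return np.array([[2 - 400*(y - x**2) + 800*x**2, -400*x],
                     [-400*x, 200.0]])

z = np.array([1.05, 1.10])                       # start near the minimum
errors = []
for _ in range(10):
    z = z - np.linalg.solve(hess(z), grad(z))    # Newton step on grad f = 0
    errors.append(np.linalg.norm(z - np.array([1.0, 1.0])))
```

Printing `errors` shows the number of correct digits roughly doubling per iteration, the signature of quadratic convergence.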

24.2.3 The BFGS Algorithm

BFGS (Broyden–Fletcher–Goldfarb–Shanno, 1970) maintains an approximation $H_n \approx H_f(\mathbf{x}_n)^{-1}$ and updates it cheaply using two gradient evaluations.

Algorithm 24.1 (BFGS).

Initialize: $H_0 = I$ (or any positive definite matrix), $\mathbf{x}_0$.

For $n = 0, 1, 2, \ldots$:

  1. Compute gradient $\mathbf{g}_n = \nabla f(\mathbf{x}_n)$.

  2. Compute search direction: $\mathbf{d}_n = -H_n\mathbf{g}_n$.

  3. Line search: find step size $\alpha_n > 0$ satisfying the Wolfe conditions.

  4. Update: $\mathbf{x}_{n+1} = \mathbf{x}_n + \alpha_n\mathbf{d}_n$.

  5. Compute $\mathbf{s}_n = \mathbf{x}_{n+1} - \mathbf{x}_n$ and $\mathbf{y}_n = \mathbf{g}_{n+1} - \mathbf{g}_n$.

  6. BFGS Hessian inverse update:

$$H_{n+1} = \left(I - \rho_n\mathbf{s}_n\mathbf{y}_n'\right)H_n\left(I - \rho_n\mathbf{y}_n\mathbf{s}_n'\right) + \rho_n\mathbf{s}_n\mathbf{s}_n',$$

    where $\rho_n = 1/(\mathbf{y}_n'\mathbf{s}_n)$.

Theorem 24.3 (BFGS Update Derivation). The BFGS update is the solution to:

$$H_{n+1} = \arg\min_{H}\|H - H_n\|_W \quad \text{s.t.} \quad H\mathbf{y}_n = \mathbf{s}_n, \quad H = H',$$

where $\|\cdot\|_W$ is a weighted Frobenius norm. The secant condition $H_{n+1}\mathbf{y}_n = \mathbf{s}_n$ encodes the quasi-Newton requirement that the inverse Hessian approximation satisfies $H\nabla^2 f \approx I$ along the last step direction.

Proof. The secant condition: $\mathbf{s}_n = \mathbf{x}_{n+1}-\mathbf{x}_n$ and $\mathbf{y}_n = \nabla f(\mathbf{x}_{n+1})-\nabla f(\mathbf{x}_n) \approx H_f(\mathbf{x}_n)\mathbf{s}_n$ (Taylor). So $H_{n+1}\mathbf{y}_n = \mathbf{s}_n$ requires $H_{n+1}H_f\mathbf{s}_n \approx \mathbf{s}_n$, i.e., $H \approx H_f^{-1}$ along the step direction. Minimizing the symmetric rank-2 change to $H_n$ subject to this condition gives the BFGS formula. $\square$

In APL, the BFGS update is a concise expression:

⍝ APL — BFGS Hessian inverse update
⎕IO←0 ⋄ ⎕ML←1

bfgs_update ← {(H s y)←⍵
    rho ← ÷ y+.×s                          ⍝ ρ = 1/(y's)
    n   ← ≢s
    I_n ← ∘.=⍨⍳n                           ⍝ n×n identity
    A   ← I_n - rho × s ∘.× y              ⍝ I - ρ s y'
    B   ← I_n - rho × y ∘.× s              ⍝ I - ρ y s'
    (A +.× H +.× B) + rho × s ∘.× s}       ⍝ AHB + ρ ss'

⍝ One BFGS iteration as a dyadic operator: (f bfgs_step grad) (H x)
bfgs_step ← {f←⍺⍺ ⋄ grad←⍵⍵ ⋄ (H x)←⍵
    g ← grad x                             ⍝ gradient at x
    d ← -(H +.× g)                         ⍝ search direction
    ⍝ Backtracking line search: halve α until the Armijo condition holds
    armijo ← {(f x+⍵×d) ≤ (f x)+1e¯4×⍵×g+.×d: ⍵ ⋄ ∇ ⍵×0.5}
    alpha ← armijo 1
    x_new ← x + alpha × d
    s ← x_new - x
    y ← (grad x_new) - g
    x_new (bfgs_update H s y)}
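The update formula is easy to cross-check numerically. A Python mirror of the update (assuming a random step pair with positive curvature $\mathbf{y}'\mathbf{s} > 0$, as guaranteed by a Wolfe line search) confirms the secant condition, symmetry, and positive definiteness:

```python
import numpy as np

def bfgs_update(H, s, y):
    """Rank-2 inverse-Hessian update, mirroring the APL dfn above."""
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    return ((I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s))
            + rho * np.outer(s, s))

rng = np.random.default_rng(1)
n = 5
H = np.eye(n)                # any positive definite start
s = rng.standard_normal(n)
y = rng.standard_normal(n)
if y @ s < 0:
    y = -y                   # enforce the curvature condition y's > 0
H_new = bfgs_update(H, s, y)
```

By construction `H_new @ y` equals `s` exactly, and the update preserves symmetry and positive definiteness whenever $\mathbf{y}'\mathbf{s} > 0$.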

L-BFGS (Limited-memory BFGS): For large problems ($k > 100$), storing the full $k\times k$ matrix $H_n$ is expensive. L-BFGS stores only the last $m = 5$–$20$ pairs $(\mathbf{s}_i, \mathbf{y}_i)$ and reconstructs the search direction $-H_n\mathbf{g}_n$ via a two-loop recursion in $O(mk)$ time. L-BFGS is the standard algorithm for DSGE MLE and Bayesian posterior mode finding.
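A sketch of the two-loop recursion in Python (following the standard formulation; `pairs` holds the stored $(\mathbf{s}_i, \mathbf{y}_i)$, oldest first, and `gamma` scales the initial inverse-Hessian guess $H_0 = \gamma I$):

```python
import numpy as np

def lbfgs_direction(g, pairs, gamma):
    """Two-loop recursion: compute the search direction -H_n g.

    pairs: list of (s, y) tuples, oldest first; gamma sets H_0 = gamma*I.
    Cost is O(m*k) with m stored pairs, never forming H_n explicitly.
    """
    q = g.astype(float).copy()
    rhos = [1.0 / (y @ s) for s, y in pairs]
    alphas = []
    for (s, y), rho in zip(reversed(pairs), reversed(rhos)):
        a = rho * (s @ q)
        alphas.append(a)          # stored newest pair first
        q = q - a * y
    r = gamma * q                 # apply the initial inverse Hessian H_0
    for (s, y), rho, a in zip(pairs, rhos, reversed(alphas)):
        b = rho * (y @ r)
        r = r + (a - b) * s
    return -r                     # quasi-Newton search direction
```

With a single stored pair and $\gamma = \mathbf{s}'\mathbf{y}/\mathbf{y}'\mathbf{y}$, the recursion reproduces the full BFGS step applied to $H_0 = \gamma I$ exactly, which gives a convenient unit test.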


24.3 Constrained Optimization

24.3.1 Interior Point Methods

For constrained problems $\min f(\mathbf{x})$ s.t. $g_j(\mathbf{x}) \leq 0$, the interior point (barrier) method solves a sequence of unconstrained problems:

$$\min_{\mathbf{x}}\; f(\mathbf{x}) - \mu\sum_j\ln(-g_j(\mathbf{x})),$$

for decreasing $\mu \to 0$. The log-barrier term $-\mu\ln(-g_j)$ diverges to $+\infty$ as $g_j \to 0^-$, keeping iterates in the interior of the feasible region.

Application: In DSGE calibration, parameters must lie in $[0,1]$ (probabilities) or be positive (variances). Barrier methods keep iterates strictly inside these bounds; a common lightweight alternative is to reparameterize via logarithms or logistic transforms and optimize unconstrained.
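A toy barrier iteration in Python (illustrative problem, not a DSGE calibration): minimize $(x-2)^2$ subject to $x \leq 1$, whose constrained optimum sits on the boundary at $x = 1$. Each barrier subproblem is solved with a bounded scalar minimizer:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Toy problem: min (x-2)^2  s.t.  x <= 1; constrained optimum on the boundary x = 1
f = lambda x: (x - 2.0)**2

def barrier_min(mu):
    """Minimize f(x) - mu*ln(1-x) over the interior x < 1."""
    phi = lambda x: f(x) - mu * np.log(1.0 - x)
    res = minimize_scalar(phi, bounds=(-10.0, 1.0 - 1e-12), method='bounded',
                          options={'xatol': 1e-12})
    return res.x

# Decreasing mu: the unconstrained minimizers approach the boundary from inside
path = [barrier_min(mu) for mu in (1.0, 1e-2, 1e-4, 1e-6)]
```

Every iterate stays strictly feasible, and the path converges to the constrained optimum as $\mu \to 0$.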

24.3.2 Sequential Quadratic Programming (SQP)

SQP solves constrained problems by iteratively solving quadratic programming (QP) subproblems. At each iteration, the objective and constraints are approximated by quadratics and linearizations respectively; the QP gives the next search direction. SQP is the state-of-the-art method for smooth constrained optimization.
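SciPy's `SLSQP` method is an SQP-type implementation; a minimal constrained example (toy problem with known optimum $(0.5, 0.5)$, not from the chapter's models) shows the interface:

```python
import numpy as np
from scipy.optimize import minimize

# Toy constrained problem: min x^2 + y^2  s.t.  x + y >= 1; optimum at (0.5, 0.5)
obj = lambda z: z[0]**2 + z[1]**2
con = {'type': 'ineq', 'fun': lambda z: z[0] + z[1] - 1.0}  # SciPy uses g(z) >= 0

res = minimize(obj, x0=[2.0, 0.0], method='SLSQP', constraints=[con])
```

At the solution the constraint binds ($x + y = 1$) and the gradient condition $\nabla f = \lambda\,(1,1)'$ holds with multiplier $\lambda = 1$.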


24.4 Global Optimization: Avoiding Local Minima

All gradient-based methods find local optima. For non-convex objectives — common in DSGE estimation where the likelihood may be multimodal — local methods fail to find the global optimum.

Simulated annealing: A stochastic search that accepts uphill moves with probability $\exp(-\Delta f/T)$, where $T$ (temperature) decreases over time. As $T \to 0$, the algorithm concentrates on the current best region.

Differential evolution: A population-based method maintaining $N$ candidate solutions, evolving them via mutation, crossover, and selection. Robust for multimodal landscapes.

Strategy for DSGE estimation: Run L-BFGS from $K = 50$–$200$ random starting points; take the best result. Alternatively, use simulated annealing for global search, then L-BFGS for local refinement. Both approaches are implemented in Dynare's estimation command.
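The multistart strategy can be sketched in a few lines (toy multimodal objective with a local minimum near $x = +1$ and the global one near $x = -1$; 50 starts stand in for the $K$ random draws):

```python
import numpy as np
from scipy.optimize import minimize

# Toy multimodal objective: local minimum near x = +1, global minimum near x = -1
f = lambda x: (x[0]**2 - 1.0)**2 + 0.3 * x[0]

rng = np.random.default_rng(0)
starts = rng.uniform(-2.0, 2.0, size=50)           # random starting points
results = [minimize(f, [x0], method='L-BFGS-B') for x0 in starts]
best = min(results, key=lambda r: r.fun)           # keep the best local solution
```

Starts in the right-hand basin converge to the inferior local minimum near $x = +1$; only the multistart sweep reliably recovers the global minimum.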


24.5 Optimal Monetary Policy: Solving for Optimal Taylor Rule Coefficients

Cross-reference: Principles Ch. 23.2 (optimal monetary policy under commitment and discretion) [P:Ch.23.2]

The central bank minimizes the quadratic loss function:

$$\mathcal{L}(\phi_\pi, \phi_y) = \text{Var}(\hat\pi) + \lambda_x\text{Var}(\hat x) + \lambda_i\text{Var}(\hat i),$$

where the variances are the model-implied theoretical second moments under the Taylor rule $\hat i_t = \phi_\pi\hat\pi_t + \phi_y\hat x_t$.

Computing theoretical second moments from the linearized model: Given the MSV solution $(\hat\pi_t, \hat x_t)' = \Omega z_t$ (from Chapter 18) and $z_t = \Phi z_{t-1} + \varepsilon_t$, the theoretical variance–covariance matrix satisfies the discrete Lyapunov equation:

$$\Sigma_y = \Omega\Sigma_z\Omega', \qquad \Sigma_z = \Phi\Sigma_z\Phi' + Q,$$

where $Q = \text{Cov}(\varepsilon_t)$. The Lyapunov equation for $\Sigma_z$ has the unique solution:

$$\text{vec}(\Sigma_z) = (I - \Phi\otimes\Phi)^{-1}\text{vec}(Q).$$

In APL: `vec_Sigma ← vec_Q ⌹ I - Phi kron Phi`, where `kron` is the Kronecker product.
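The vec formula can be checked against a library Lyapunov solver; a Python sketch with an arbitrary stable $\Phi$ (the column-major `order='F'` reshapes match the column-stacking vec convention):

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

# Illustrative stable transition matrix and innovation covariance
Phi = np.array([[0.9, 0.1],
                [0.0, 0.5]])
Q = np.array([[0.02, 0.00],
              [0.00, 0.01]])

n = Phi.shape[0]
# vec(Sigma) = (I - Phi (x) Phi)^{-1} vec(Q), with column-stacking vec
vec_Sigma = np.linalg.solve(np.eye(n*n) - np.kron(Phi, Phi),
                            Q.flatten(order='F'))
Sigma_z = vec_Sigma.reshape(n, n, order='F')
```

`solve_discrete_lyapunov(Phi, Q)` solves the same equation $\Sigma = \Phi\Sigma\Phi' + Q$ directly and agrees with the Kronecker formula; for large state vectors the direct solver avoids the $O(n^6)$ cost of the $n^2 \times n^2$ linear system.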

Python

import numpy as np
from scipy.optimize import minimize

# Optimal Taylor rule coefficients: minimize welfare loss
beta, kappa, sigma, lambda_x, lambda_i = 0.99, 0.15, 1.0, 0.1, 0.05

def solve_nk_moments(phi_pi, phi_y, rho_u=0.5, sigma_u=0.01):
    """Compute variance of pi, x, i for given Taylor rule."""
    G0 = np.array([[1, -kappa], [sigma*phi_pi, 1+sigma*phi_y]])
    G1 = np.array([[beta, 0], [-sigma, 1]])
    try:
        A = np.linalg.inv(G0) @ G1
    except np.linalg.LinAlgError:
        return 1e10, 1e10, 1e10
    # Determinacy: all eigenvalues of A must lie inside the unit circle
    eigs = np.linalg.eigvals(A)
    if not np.all(np.abs(eigs) < 1):
        return 1e10, 1e10, 1e10
    
    # MSV solution: [pi_t, x_t]' = Omega * u_t
    Psi = np.linalg.inv(G0) @ np.array([[1], [0]])  # u shock loading
    I2 = np.eye(2)
    vec_Sigma_z = sigma_u**2 / (1 - rho_u**2)  # AR(1) variance
    
    Omega = np.linalg.solve(I2 - rho_u*A, Psi)  # Omega from Sylvester: scalar case
    Sigma_y = Omega @ np.array([[vec_Sigma_z]]) @ Omega.T
    
    var_pi = Sigma_y[0,0]; var_x = Sigma_y[1,1]
    # Interest rate variance
    i_coefs = phi_pi*Omega[0,0] + phi_y*Omega[1,0]
    var_i = i_coefs**2 * vec_Sigma_z
    return var_pi, var_x, var_i

def welfare_loss(params):
    phi_pi, phi_y = params
    if phi_pi < 0.01 or phi_y < 0: return 1e10
    var_pi, var_x, var_i = solve_nk_moments(phi_pi, phi_y)
    return var_pi + lambda_x*var_x + lambda_i*var_i

# Optimize over (phi_pi, phi_y)
result = minimize(welfare_loss, [1.5, 0.5], method='Nelder-Mead',
                  options={'xatol': 1e-8, 'fatol': 1e-8})
phi_pi_opt, phi_y_opt = result.x
print(f"Optimal Taylor rule: φ_π = {phi_pi_opt:.3f}, φ_y = {phi_y_opt:.3f}")

# Compare to standard Taylor rule (1.5, 0.5)
loss_std = welfare_loss([1.5, 0.5])
loss_opt = welfare_loss([phi_pi_opt, phi_y_opt])
print(f"Welfare loss: standard={loss_std:.6f}, optimal={loss_opt:.6f}")
print(f"Welfare gain from optimization: {100*(loss_std-loss_opt)/loss_std:.2f}%")

# Sensitivity: plot loss surface over (phi_pi, phi_y) grid
phi_pi_grid = np.linspace(1.01, 5.0, 30)
phi_y_grid  = np.linspace(0.0, 2.0, 30)
PP, PY = np.meshgrid(phi_pi_grid, phi_y_grid)
loss_grid = np.array([welfare_loss([pp, py])
                      for pp, py in zip(PP.ravel(), PY.ravel())]).reshape(PP.shape)

import matplotlib.pyplot as plt
plt.figure(figsize=(8,6))
plt.contourf(PP, PY, np.log(loss_grid+1e-10), levels=20, cmap='RdYlGn_r')
plt.colorbar(label='log(welfare loss)')
plt.plot(phi_pi_opt, phi_y_opt, 'w*', markersize=15, label=f'Optimal ({phi_pi_opt:.2f},{phi_y_opt:.2f})')
plt.plot(1.5, 0.5, 'wx', markersize=12, label='Standard (1.5, 0.5)')
plt.xlabel('φ_π'); plt.ylabel('φ_y')
plt.title('Welfare loss as function of Taylor rule coefficients')
plt.legend(); plt.tight_layout(); plt.show()
Julia

using Optim, LinearAlgebra

beta, kappa, sigma = 0.99, 0.15, 1.0
lambda_x, lambda_i = 0.1, 0.05

function welfare_loss(params; rho_u=0.5, sig_u=0.01)
    phi_pi, phi_y = params
    phi_pi < 1.0 && return 1e10  # enforce Taylor principle
    G0 = [1 -kappa; sigma*phi_pi 1+sigma*phi_y]
    G1 = [beta 0; -sigma 1]
    A = inv(G0)*G1
    all(abs.(eigvals(A)) .< 1) || return 1e10  # determinacy: eigenvalues inside unit circle
    # MSV solution: scalar u shock
    Psi = inv(G0)*[1.0; 0.0]
    Omega = (I(2) - rho_u*A) \ Psi  # 2×1
    sig_z2 = sig_u^2 / (1 - rho_u^2)
    var_pi = Omega[1]^2 * sig_z2
    var_x  = Omega[2]^2 * sig_z2
    var_i  = (phi_pi*Omega[1] + phi_y*Omega[2])^2 * sig_z2
    return var_pi + lambda_x*var_x + lambda_i*var_i
end

res = optimize(p -> welfare_loss(p), [1.5, 0.5], NelderMead(),
               Optim.Options(x_tol=1e-8, f_tol=1e-10))
phi_opt = Optim.minimizer(res)
println("Optimal Taylor rule: φ_π=$(round(phi_opt[1],digits=3)), φ_y=$(round(phi_opt[2],digits=3))")
println("Welfare gain vs standard: $(round(100*(welfare_loss([1.5,0.5])-Optim.minimum(res))/welfare_loss([1.5,0.5]),digits=2))%")
R

library(nloptr)

beta<-0.99; kappa<-0.15; sigma<-1.0; lambda_x<-0.1; lambda_i<-0.05

welfare_loss <- function(params) {
  phi_pi<-params[1]; phi_y<-params[2]
  if(phi_pi<1.0) return(1e10)
  G0<-matrix(c(1,sigma*phi_pi,-kappa,1+sigma*phi_y),2,2)
  G1<-matrix(c(beta,-sigma,0,1),2,2); rho_u<-0.5; sig_u<-0.01
  A<-solve(G0)%*%G1
  if(!all(Mod(eigen(A)$values)<1)) return(1e10)
  Psi<-solve(G0)%*%c(1,0)
  Omega<-solve(diag(2)-rho_u*A)%*%Psi
  sig_z2<-sig_u^2/(1-rho_u^2)
  var_pi<-Omega[1]^2*sig_z2; var_x<-Omega[2]^2*sig_z2
  var_i<-(phi_pi*Omega[1]+phi_y*Omega[2])^2*sig_z2
  var_pi + lambda_x*var_x + lambda_i*var_i
}

res <- nloptr(c(1.5,0.5), eval_f=welfare_loss,
              lb=c(1.01,0), ub=c(10,5),
              opts=list(algorithm="NLOPT_LN_NELDERMEAD",xtol_rel=1e-10,maxeval=5000))
cat(sprintf("Optimal: φ_π=%.3f, φ_y=%.3f\n", res$solution[1], res$solution[2]))

24.6 Programming Exercises

Exercise 24.1 (APL — BFGS from Scratch)

Implement the full BFGS algorithm in APL: (a) backtracking line search satisfying the Armijo condition; (b) BFGS update using the dfn from Section 24.2.3; (c) convergence check when `⌈/|grad x` falls below `1e¯6`. Test on the 2D Rosenbrock function $f(x,y) = (1-x)^2 + 100(y-x^2)^2$ starting from $(0, 0)$.

Exercise 24.2 (Python — Multiple Starting Points)

The NK welfare loss function can be multimodal in $(\phi_\pi, \phi_y)$ space when the model is near the boundary of determinacy. (a) Grid-search over $(\phi_\pi, \phi_y) \in [1.01, 6]\times[0, 3]$ with $20\times 20$ points to map the loss surface. (b) Run BFGS from 50 random starting points inside the determinate region. (c) Verify that all starting points converge to the same minimum (if the loss is unimodal) or identify multiple local minima.

Exercise 24.3 (Julia — L-BFGS for DSGE Estimation)

using Optim, ForwardDiff

# Minimize NK log-likelihood (from Chapter 20 Kalman filter)
# Here: simplified version with analytical gradients via AD
function neg_log_lik_nk(params)
    # params = [phi_pi, phi_y, rho_u, sigma_u, sigma_obs]
    phi_pi, phi_y, rho_u, sigma_u, sigma_obs = exp.(params)  # log-parameterized
    # ... (Kalman filter evaluation from Ch.20)
    # Placeholder: return quadratic bowl for illustration
    return sum((params .- [log(1.5), log(0.5), log(0.7), log(0.01), log(0.005)]).^2)
end

# L-BFGS via Optim.jl
x0 = log.([1.5, 0.5, 0.7, 0.01, 0.005])
res = optimize(neg_log_lik_nk, x0, LBFGS(),
               Optim.Options(g_tol=1e-8, iterations=1000, show_trace=false))
println("L-BFGS converged: $(Optim.converged(res)), iterations: $(Optim.iterations(res))")
println("Parameters: $(round.(exp.(Optim.minimizer(res)), digits=4))")

Exercise 24.4 — Optimal Commitment Policy (★)

Under commitment, the central bank minimizes $\mathcal{L} = \mathbb{E}\sum_{t=0}^\infty\beta^t(\pi_t^2 + \lambda_x x_t^2)$ choosing the path $\{i_t\}$ subject to the NKPC and DIS. (a) Derive the first-order conditions (targeting rules) from the Lagrangian with multipliers on the NKPC and DIS constraints. (b) Show the commitment optimum satisfies $\pi_t = -(\lambda_x/\kappa)(x_t - x_{t-1})$ — the price-level targeting result. (c) Numerically compare the welfare loss under commitment, discretion (time-consistent, ignoring the Euler equation constraint in each period), and the optimal Taylor rule.


24.7 Chapter Summary

Key results:

  • Gradient descent converges linearly at rate $1-\mu/L$ (the condition number $\kappa=L/\mu$ governs convergence speed).

  • Newton's method for optimization ($\mathbf{x}_{n+1} = \mathbf{x}_n - H_f^{-1}\nabla f$) achieves quadratic convergence (Theorem 24.2) but costs $O(k^3)$ per iteration.

  • BFGS maintains a rank-2 update of the inverse Hessian approximation: $H_{n+1} = (I-\rho\mathbf{s}\mathbf{y}')H_n(I-\rho\mathbf{y}\mathbf{s}') + \rho\mathbf{s}\mathbf{s}'$ (derived as a minimum-change update, Theorem 24.3). Superlinear convergence; $O(k^2)$ per iteration.

  • L-BFGS stores only the last $m$ pairs $(\mathbf{s}_i,\mathbf{y}_i)$; $O(mk)$ per iteration; standard for large DSGE problems.

  • Global optimization strategies (multiple starts, simulated annealing) are essential for multimodal DSGE likelihoods.

  • The optimal Taylor rule is found by minimizing the Lyapunov-equation welfare loss over $(\phi_\pi,\phi_y)$; the Lyapunov equation for the variances is `vec_Sigma ← vec_Q ⌹ I - Phi kron Phi` in APL.

Next: Chapter 25 — Solving Systems of Equations: From Linear Systems to Sparse Matrices