
Finding Optimal Policy Rules

“Every optimizer is Newton’s method in disguise. The question is which Hessian approximation you’re using.”

Cross-reference: Principles Ch. 23 (optimal monetary policy, loss function, commitment vs. discretion); Ch. 28 (optimal fiscal policy) [P:Ch.23, P:Ch.28]


24.1 The Optimization Problem in Macroeconomic Policy

Chapter 1 derived the first-order conditions for optimization problems with analytical gradients. In practice, the objective function is often the result of a simulation — the welfare loss generated by a particular Taylor rule coefficient, computed by running the DSGE model forward. These simulation-based objectives have no closed-form gradient: they must be optimized numerically.

Three distinct applications motivate the algorithms of this chapter:

Optimal Taylor rule: Given the NK welfare loss function $\mathcal{L} = \mathbb{E}\sum_{t=0}^\infty\beta^t[\pi_t^2 + \lambda x_t^2]$, find the Taylor rule coefficients $(\phi_\pi^*, \phi_y^*)$ that minimize $\mathcal{L}$ subject to the DSGE equilibrium conditions. The objective is a two-variable function that can be evaluated by solving the linearized model (Chapter 28) and computing the theoretical second moments.

DSGE parameter estimation: Find $\hat{\boldsymbol\theta}_{MLE}$ that maximizes the log-likelihood $\mathcal{L}(\boldsymbol\theta) = -\frac{1}{2}\sum_t[\ln|F_t| + \mathbf{v}_t'F_t^{-1}\mathbf{v}_t]$ (Chapter 20). This is a $k$-dimensional optimization where $k$ can be 30–50 for a medium-scale DSGE.

Calibration moment matching: Find parameters $\hat{\boldsymbol\theta}$ that minimize a distance between model-implied and data moments: $\min_{\boldsymbol\theta}\sum_j w_j\,(m_j^{model}(\boldsymbol\theta) - m_j^{data})^2$.
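As a minimal sketch of the moment-matching problem (hypothetical targets; a two-parameter AR(1) stands in for the DSGE model, so the model moments have closed forms), the weighted distance can be minimized with a derivative-free optimizer:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical calibration targets: variance and persistence of an observed series
m_data = np.array([0.0004, 0.85])    # [Var(y), Corr(y_t, y_{t-1})]
w = np.array([1.0, 1.0])             # moment weights

def model_moments(theta):
    """AR(1) model y_t = rho*y_{t-1} + eps_t: closed-form moments."""
    rho, sigma = theta
    return np.array([sigma**2 / (1 - rho**2), rho])

def distance(theta):
    rho, sigma = theta
    if not (0 < rho < 1) or sigma <= 0:
        return 1e10                  # penalize infeasible parameters
    diff = model_moments(theta) - m_data
    return np.sum(w * diff**2)

res = minimize(distance, x0=[0.5, 0.05], method='Nelder-Mead',
               options={'xatol': 1e-12, 'fatol': 1e-18, 'maxiter': 5000})
rho_hat, sigma_hat = res.x
```

In a real DSGE calibration, `model_moments` would solve the linearized model and compute theoretical second moments, exactly as in Section 24.5.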


24.2 Unconstrained Optimization Algorithms

24.2.1 Gradient Descent

The simplest first-order method: $\mathbf{x}_{n+1} = \mathbf{x}_n - \alpha_n\nabla f(\mathbf{x}_n)$, where $\alpha_n > 0$ is the step size (learning rate).

Theorem 24.1 (Convergence of Gradient Descent). If $f$ is $\mu$-strongly convex ($f(\mathbf{y}) \geq f(\mathbf{x}) + \nabla f(\mathbf{x})'(\mathbf{y}-\mathbf{x}) + \frac{\mu}{2}\|\mathbf{y}-\mathbf{x}\|^2$) and $L$-smooth ($\|\nabla f(\mathbf{x}) - \nabla f(\mathbf{y})\| \leq L\|\mathbf{x}-\mathbf{y}\|$), then with step size $\alpha = 1/L$:

$$f(\mathbf{x}_n) - f(\mathbf{x}^*) \leq \left(1 - \frac{\mu}{L}\right)^n \left[f(\mathbf{x}_0) - f(\mathbf{x}^*)\right].$$

Convergence is linear with rate $1 - \mu/L$. The ratio $\kappa = L/\mu$ (condition number) governs convergence: poorly conditioned problems ($\kappa \gg 1$) converge slowly.

Practical limitation for macroeconomics: Gradient descent requires many iterations (linear convergence) and is rarely used directly in DSGE work. It appears in machine learning applications (neural network policy function approximation).
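The linear rate of Theorem 24.1 is easy to see numerically. A minimal Python sketch (illustrative, not from a macro model) on the ill-conditioned quadratic $f(\mathbf{x}) = \frac{1}{2}(\mu x_1^2 + L x_2^2)$ with $\kappa = 100$:

```python
import numpy as np

# Ill-conditioned quadratic: f(x) = 0.5*(mu*x1^2 + L*x2^2), condition number kappa = L/mu
mu, L = 1.0, 100.0
grad = lambda x: np.array([mu, L]) * x

x = np.array([1.0, 1.0])
alpha = 1.0 / L                      # step size from Theorem 24.1
errs = []
for _ in range(200):
    x = x - alpha * grad(x)
    errs.append(np.linalg.norm(x))   # distance to the minimizer x* = 0

rate = errs[-1] / errs[-2]           # empirical per-iteration contraction
```

The well-conditioned coordinate is eliminated in one step, but the slow coordinate contracts by exactly $1 - \mu/L = 0.99$ per iteration, so 200 iterations still leave a visible error.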

24.2.2 Newton’s Method for Optimization

Newton’s method replaces the gradient step with a Newton step:

$$\mathbf{x}_{n+1} = \mathbf{x}_n - [H_f(\mathbf{x}_n)]^{-1}\nabla f(\mathbf{x}_n),$$

where $H_f$ is the Hessian matrix. This is exactly root-finding applied to the first-order condition $\nabla f(\mathbf{x}) = \mathbf{0}$, using Newton's method from Chapter 22.

Theorem 24.2 (Quadratic Convergence of Newton's Optimization Method). Near a local minimum $\mathbf{x}^*$ with $H_f(\mathbf{x}^*)$ positive definite and $H_f$ Lipschitz:

$$\|\mathbf{x}_{n+1} - \mathbf{x}^*\| \leq C\|\mathbf{x}_n - \mathbf{x}^*\|^2.$$

Quadratic convergence — same as root-finding Newton. But computing and inverting the $k\times k$ Hessian costs $O(k^3)$ per iteration — prohibitive for large $k$ (a DSGE with 40+ parameters).
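A quick numerical illustration (a standard test function, not a DSGE example): pure Newton iteration on the Rosenbrock function with analytic gradient and Hessian, started near the minimum $(1,1)$, drives the error to machine precision in a handful of steps:

```python
import numpy as np

# Rosenbrock: f(x,y) = (1-x)^2 + 100*(y-x^2)^2, unique minimum at (1,1)
def grad(z):
    x, y = z
    return np.array([-2*(1 - x) - 400*x*(y - x**2), 200*(y - x**2)])

def hess(z):
    x, y = z
    return np.array([[2 - 400*(y - x**2) + 800*x**2, -400*x],
                     [-400*x, 200.0]])

z = np.array([1.05, 1.10])                       # start near the minimum
errors = []
for _ in range(10):
    z = z - np.linalg.solve(hess(z), grad(z))    # Newton step on grad f = 0
    errors.append(np.linalg.norm(z - np.array([1.0, 1.0])))
```

Printing `errors` shows the number of correct digits roughly doubling per iteration, the signature of quadratic convergence.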

24.2.3 The BFGS Algorithm

BFGS (Broyden–Fletcher–Goldfarb–Shanno, 1970) maintains an approximation $H_n \approx H_f(\mathbf{x}_n)^{-1}$ and updates it cheaply using two gradient evaluations.

Algorithm 24.1 (BFGS).

Initialize: $H_0 = I$ (or any positive definite matrix), $\mathbf{x}_0$.

For $n = 0, 1, 2, \ldots$:

  1. Compute gradient $\mathbf{g}_n = \nabla f(\mathbf{x}_n)$.

  2. Compute search direction: $\mathbf{d}_n = -H_n\mathbf{g}_n$.

  3. Line search: find step size $\alpha_n > 0$ satisfying the Wolfe conditions.

  4. Update: $\mathbf{x}_{n+1} = \mathbf{x}_n + \alpha_n\mathbf{d}_n$.

  5. Compute $\mathbf{s}_n = \mathbf{x}_{n+1} - \mathbf{x}_n$ and $\mathbf{y}_n = \mathbf{g}_{n+1} - \mathbf{g}_n$.

  6. BFGS Hessian inverse update:

$$H_{n+1} = \left(I - \rho_n\mathbf{s}_n\mathbf{y}_n'\right)H_n\left(I - \rho_n\mathbf{y}_n\mathbf{s}_n'\right) + \rho_n\mathbf{s}_n\mathbf{s}_n',$$

    where $\rho_n = 1/(\mathbf{y}_n'\mathbf{s}_n)$.

Theorem 24.3 (BFGS Update Derivation). The BFGS update is the solution to:

$$H_{n+1} = \arg\min_{H}\|H - H_n\|_W \quad \text{s.t.} \quad H\mathbf{y}_n = \mathbf{s}_n, \quad H = H',$$

where $\|\cdot\|_W$ is a weighted Frobenius norm. The secant condition $H_{n+1}\mathbf{y}_n = \mathbf{s}_n$ encodes the quasi-Newton requirement that the inverse Hessian approximation satisfies $H\nabla^2 f \approx I$ along the last step direction.

Proof. The secant condition: $\mathbf{s}_n = \mathbf{x}_{n+1}-\mathbf{x}_n$ and $\mathbf{y}_n = \nabla f(\mathbf{x}_{n+1})-\nabla f(\mathbf{x}_n) \approx H_f(\mathbf{x}_n)\mathbf{s}_n$ (Taylor). So $H_{n+1}\mathbf{y}_n = \mathbf{s}_n$ requires $H_{n+1}H_f\mathbf{s}_n \approx \mathbf{s}_n$, i.e., $H \approx H_f^{-1}$ along the step direction. Minimizing the symmetric rank-2 change to $H_n$ subject to this condition gives the BFGS formula. $\square$

In APL, the BFGS update is a concise expression:

⍝ APL — BFGS Hessian inverse update
⎕IO←0 ⋄ ⎕ML←1

bfgs_update ← {(H s y)←⍵
    rho ← ÷ y+.×s                          ⍝ ρ = 1/(y's)
    n   ← ≢s
    I_n ← ∘.=⍨⍳n                           ⍝ n×n identity
    A   ← I_n - rho × s ∘.× y              ⍝ I - ρ s y'
    B   ← I_n - rho × y ∘.× s              ⍝ I - ρ y s'
    (A +.× H +.× B) + rho × s ∘.× s}       ⍝ AHB + ρ ss'

⍝ One BFGS iteration as a dyadic operator: (f bfgs_step grad) (H x)
bfgs_step ← {f←⍺⍺ ⋄ grad←⍵⍵ ⋄ (H x)←⍵
    g ← grad x                             ⍝ gradient at x
    d ← -(H +.× g)                         ⍝ search direction
    ⍝ Backtracking line search: halve α until the Armijo condition holds
    armijo ← {(f x+⍵×d) ≤ (f x)+1e¯4×⍵×g+.×d: ⍵ ⋄ ∇ ⍵×0.5}
    alpha ← armijo 1
    x_new ← x + alpha × d
    s ← x_new - x
    y ← (grad x_new) - g
    x_new (bfgs_update H s y)}
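The update formula is easy to cross-check numerically. A Python mirror of the update (assuming a random step pair with positive curvature $\mathbf{y}'\mathbf{s} > 0$, as guaranteed by a Wolfe line search) confirms the secant condition, symmetry, and positive definiteness:

```python
import numpy as np

def bfgs_update(H, s, y):
    """Rank-2 inverse-Hessian update, mirroring the APL dfn above."""
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    return ((I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s))
            + rho * np.outer(s, s))

rng = np.random.default_rng(1)
n = 5
H = np.eye(n)                # any positive definite start
s = rng.standard_normal(n)
y = rng.standard_normal(n)
if y @ s < 0:
    y = -y                   # enforce the curvature condition y's > 0
H_new = bfgs_update(H, s, y)
```

By construction `H_new @ y` equals `s` exactly, and the update preserves symmetry and positive definiteness whenever $\mathbf{y}'\mathbf{s} > 0$.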

L-BFGS (Limited-memory BFGS): For large problems ($k > 100$), storing the full $k\times k$ matrix $H_n$ is expensive. L-BFGS stores only the last $m = 5$–$20$ pairs $(\mathbf{s}_i, \mathbf{y}_i)$ and reconstructs the search direction $-H_n\mathbf{g}_n$ via a two-loop recursion in $O(mk)$ time. L-BFGS is the standard algorithm for DSGE MLE and Bayesian posterior mode finding.
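A sketch of the two-loop recursion in Python (following the standard formulation; `pairs` holds the stored $(\mathbf{s}_i, \mathbf{y}_i)$, oldest first, and `gamma` scales the initial inverse-Hessian guess $H_0 = \gamma I$):

```python
import numpy as np

def lbfgs_direction(g, pairs, gamma):
    """Two-loop recursion: compute the search direction -H_n g.

    pairs: list of (s, y) tuples, oldest first; gamma sets H_0 = gamma*I.
    Cost is O(m*k) with m stored pairs, never forming H_n explicitly.
    """
    q = g.astype(float).copy()
    rhos = [1.0 / (y @ s) for s, y in pairs]
    alphas = []
    for (s, y), rho in zip(reversed(pairs), reversed(rhos)):
        a = rho * (s @ q)
        alphas.append(a)          # stored newest pair first
        q = q - a * y
    r = gamma * q                 # apply the initial inverse Hessian H_0
    for (s, y), rho, a in zip(pairs, rhos, reversed(alphas)):
        b = rho * (y @ r)
        r = r + (a - b) * s
    return -r                     # quasi-Newton search direction
```

With a single stored pair and $\gamma = \mathbf{s}'\mathbf{y}/\mathbf{y}'\mathbf{y}$, the recursion reproduces the full BFGS step applied to $H_0 = \gamma I$ exactly, which gives a convenient unit test.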


24.3 Constrained Optimization

24.3.1 Interior Point Methods

For constrained problems $\min f(\mathbf{x})$ s.t. $g_j(\mathbf{x}) \leq 0$, the interior point (barrier) method solves a sequence of unconstrained problems:

$$\min_{\mathbf{x}}\; f(\mathbf{x}) - \mu\sum_j\ln(-g_j(\mathbf{x})),$$

for decreasing $\mu \to 0$. The log-barrier term $-\mu\ln(-g_j)$ diverges to $+\infty$ as $g_j \to 0^-$, keeping iterates in the interior of the feasible region.

Application: In DSGE calibration, parameters must lie in $[0,1]$ (probabilities) or be positive (variances). Barrier methods keep iterates strictly inside these bounds; a common lightweight alternative is to reparameterize via logarithms or logistic transforms and optimize unconstrained.
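A toy barrier iteration in Python (illustrative problem, not a DSGE calibration): minimize $(x-2)^2$ subject to $x \leq 1$, whose constrained optimum sits on the boundary at $x = 1$. Each barrier subproblem is solved with a bounded scalar minimizer:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Toy problem: min (x-2)^2  s.t.  x <= 1; constrained optimum on the boundary x = 1
f = lambda x: (x - 2.0)**2

def barrier_min(mu):
    """Minimize f(x) - mu*ln(1-x) over the interior x < 1."""
    phi = lambda x: f(x) - mu * np.log(1.0 - x)
    res = minimize_scalar(phi, bounds=(-10.0, 1.0 - 1e-12), method='bounded',
                          options={'xatol': 1e-12})
    return res.x

# Decreasing mu: the unconstrained minimizers approach the boundary from inside
path = [barrier_min(mu) for mu in (1.0, 1e-2, 1e-4, 1e-6)]
```

Every iterate stays strictly feasible, and the path converges to the constrained optimum as $\mu \to 0$.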

24.3.2 Sequential Quadratic Programming (SQP)

SQP solves constrained problems by iteratively solving quadratic programming (QP) subproblems. At each iteration, the objective and constraints are approximated by quadratics and linearizations respectively; the QP gives the next search direction. SQP is the state-of-the-art method for smooth constrained optimization.
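SciPy's `SLSQP` method is an SQP-type implementation; a minimal constrained example (toy problem with known optimum $(0.5, 0.5)$, not from the chapter's models) shows the interface:

```python
import numpy as np
from scipy.optimize import minimize

# Toy constrained problem: min x^2 + y^2  s.t.  x + y >= 1; optimum at (0.5, 0.5)
obj = lambda z: z[0]**2 + z[1]**2
con = {'type': 'ineq', 'fun': lambda z: z[0] + z[1] - 1.0}  # SciPy uses g(z) >= 0

res = minimize(obj, x0=[2.0, 0.0], method='SLSQP', constraints=[con])
```

At the solution the constraint binds ($x + y = 1$) and the gradient condition $\nabla f = \lambda\,(1,1)'$ holds with multiplier $\lambda = 1$.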


24.4 Global Optimization: Avoiding Local Minima

All gradient-based methods find local optima. For non-convex objectives — common in DSGE estimation where the likelihood may be multimodal — local methods fail to find the global optimum.

Simulated annealing: A stochastic search that accepts uphill moves with probability $\exp(-\Delta f/T)$, where $T$ (temperature) decreases over time. As $T \to 0$, the algorithm concentrates on the current best region.

Differential evolution: A population-based method maintaining $N$ candidate solutions, evolving them via mutation, crossover, and selection. Robust for multimodal landscapes.

Strategy for DSGE estimation: Run L-BFGS from $K = 50$–$200$ random starting points; take the best result. Alternatively, use simulated annealing for global search, then L-BFGS for local refinement. Both approaches are implemented in Dynare's estimation command.
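The multistart strategy can be sketched in a few lines (toy multimodal objective with a local minimum near $x = +1$ and the global one near $x = -1$; 50 starts stand in for the $K$ random draws):

```python
import numpy as np
from scipy.optimize import minimize

# Toy multimodal objective: local minimum near x = +1, global minimum near x = -1
f = lambda x: (x[0]**2 - 1.0)**2 + 0.3 * x[0]

rng = np.random.default_rng(0)
starts = rng.uniform(-2.0, 2.0, size=50)           # random starting points
results = [minimize(f, [x0], method='L-BFGS-B') for x0 in starts]
best = min(results, key=lambda r: r.fun)           # keep the best local solution
```

Starts in the right-hand basin converge to the inferior local minimum near $x = +1$; only the multistart sweep reliably recovers the global minimum.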


24.5 Optimal Monetary Policy: Solving for Optimal Taylor Rule Coefficients

Cross-reference: Principles Ch. 23.2 (optimal monetary policy under commitment and discretion) [P:Ch.23.2]

The central bank minimizes the quadratic loss function:

$$\mathcal{L}(\phi_\pi, \phi_y) = \text{Var}(\hat\pi) + \lambda_x\text{Var}(\hat x) + \lambda_i\text{Var}(\hat i),$$

where the variances are the model-implied theoretical second moments under the Taylor rule $\hat i_t = \phi_\pi\hat\pi_t + \phi_y\hat x_t$.

Computing theoretical second moments from the linearized model: Given the MSV solution $(\hat\pi_t, \hat x_t)' = \Omega z_t$ (from Chapter 18) and $z_t = \Phi z_{t-1} + \varepsilon_t$, the theoretical variance–covariance matrix satisfies the discrete Lyapunov equation:

$$\Sigma_y = \Omega\Sigma_z\Omega', \qquad \Sigma_z = \Phi\Sigma_z\Phi' + Q,$$

where $Q = \text{Cov}(\varepsilon_t)$. The Lyapunov equation for $\Sigma_z$ has the unique solution:

$$\text{vec}(\Sigma_z) = (I - \Phi\otimes\Phi)^{-1}\text{vec}(Q).$$

In APL: `vec_Sigma ← vec_Q ⌹ I - Phi kron Phi`, where `kron` is the Kronecker product.
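The vec formula can be checked against a library Lyapunov solver; a Python sketch with an arbitrary stable $\Phi$ (the column-major `order='F'` reshapes match the column-stacking vec convention):

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

# Illustrative stable transition matrix and innovation covariance
Phi = np.array([[0.9, 0.1],
                [0.0, 0.5]])
Q = np.array([[0.02, 0.00],
              [0.00, 0.01]])

n = Phi.shape[0]
# vec(Sigma) = (I - Phi (x) Phi)^{-1} vec(Q), with column-stacking vec
vec_Sigma = np.linalg.solve(np.eye(n*n) - np.kron(Phi, Phi),
                            Q.flatten(order='F'))
Sigma_z = vec_Sigma.reshape(n, n, order='F')
```

`solve_discrete_lyapunov(Phi, Q)` solves the same equation $\Sigma = \Phi\Sigma\Phi' + Q$ directly and agrees with the Kronecker formula; for large state vectors the direct solver avoids the $O(n^6)$ cost of the $n^2 \times n^2$ linear system.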

Python

import numpy as np
from scipy.optimize import minimize

# Optimal Taylor rule coefficients: minimize welfare loss
beta, kappa, sigma, lambda_x, lambda_i = 0.99, 0.15, 1.0, 0.1, 0.05

def solve_nk_moments(phi_pi, phi_y, rho_u=0.5, sigma_u=0.01):
    """Compute variance of pi, x, i for given Taylor rule."""
    G0 = np.array([[1, -kappa], [sigma*phi_pi, 1+sigma*phi_y]])
    G1 = np.array([[beta, 0], [-sigma, 1]])
    try:
        A = np.linalg.inv(G0) @ G1
    except np.linalg.LinAlgError:
        return 1e10, 1e10, 1e10
    # Determinacy: all eigenvalues of A must lie inside the unit circle
    eigs = np.linalg.eigvals(A)
    if not np.all(np.abs(eigs) < 1):
        return 1e10, 1e10, 1e10
    
    # MSV solution: [pi_t, x_t]' = Omega * u_t
    Psi = np.linalg.inv(G0) @ np.array([[1], [0]])  # u shock loading
    I2 = np.eye(2)
    vec_Sigma_z = sigma_u**2 / (1 - rho_u**2)  # AR(1) variance
    
    Omega = np.linalg.solve(I2 - rho_u*A, Psi)  # Omega from Sylvester: scalar case
    Sigma_y = Omega @ np.array([[vec_Sigma_z]]) @ Omega.T
    
    var_pi = Sigma_y[0,0]; var_x = Sigma_y[1,1]
    # Interest rate variance
    i_coefs = phi_pi*Omega[0,0] + phi_y*Omega[1,0]
    var_i = i_coefs**2 * vec_Sigma_z
    return var_pi, var_x, var_i

def welfare_loss(params):
    phi_pi, phi_y = params
    if phi_pi < 0.01 or phi_y < 0: return 1e10
    var_pi, var_x, var_i = solve_nk_moments(phi_pi, phi_y)
    return var_pi + lambda_x*var_x + lambda_i*var_i

# Optimize over (phi_pi, phi_y)
result = minimize(welfare_loss, [1.5, 0.5], method='Nelder-Mead',
                  options={'xatol': 1e-8, 'fatol': 1e-8})
phi_pi_opt, phi_y_opt = result.x
print(f"Optimal Taylor rule: φ_π = {phi_pi_opt:.3f}, φ_y = {phi_y_opt:.3f}")

# Compare to standard Taylor rule (1.5, 0.5)
loss_std = welfare_loss([1.5, 0.5])
loss_opt = welfare_loss([phi_pi_opt, phi_y_opt])
print(f"Welfare loss: standard={loss_std:.6f}, optimal={loss_opt:.6f}")
print(f"Welfare gain from optimization: {100*(loss_std-loss_opt)/loss_std:.2f}%")

# Sensitivity: plot loss surface over (phi_pi, phi_y) grid
phi_pi_grid = np.linspace(1.01, 5.0, 30)
phi_y_grid  = np.linspace(0.0, 2.0, 30)
PP, PY = np.meshgrid(phi_pi_grid, phi_y_grid)
loss_grid = np.array([welfare_loss([pp, py])
                      for pp, py in zip(PP.ravel(), PY.ravel())]).reshape(PP.shape)

import matplotlib.pyplot as plt
plt.figure(figsize=(8,6))
plt.contourf(PP, PY, np.log(loss_grid+1e-10), levels=20, cmap='RdYlGn_r')
plt.colorbar(label='log(welfare loss)')
plt.plot(phi_pi_opt, phi_y_opt, 'w*', markersize=15, label=f'Optimal ({phi_pi_opt:.2f},{phi_y_opt:.2f})')
plt.plot(1.5, 0.5, 'wx', markersize=12, label='Standard (1.5, 0.5)')
plt.xlabel('φ_π'); plt.ylabel('φ_y')
plt.title('Welfare loss as function of Taylor rule coefficients')
plt.legend(); plt.tight_layout(); plt.show()
Julia

using Optim, LinearAlgebra

beta, kappa, sigma = 0.99, 0.15, 1.0
lambda_x, lambda_i = 0.1, 0.05

function welfare_loss(params; rho_u=0.5, sig_u=0.01)
    phi_pi, phi_y = params
    phi_pi < 1.0 && return 1e10  # enforce Taylor principle
    G0 = [1 -kappa; sigma*phi_pi 1+sigma*phi_y]
    G1 = [beta 0; -sigma 1]
    A = inv(G0)*G1
    all(abs.(eigvals(A)) .< 1) || return 1e10  # determinacy: eigenvalues inside unit circle
    # MSV solution: scalar u shock
    Psi = inv(G0)*[1.0; 0.0]
    Omega = (I(2) - rho_u*A) \ Psi  # 2×1
    sig_z2 = sig_u^2 / (1 - rho_u^2)
    var_pi = Omega[1]^2 * sig_z2
    var_x  = Omega[2]^2 * sig_z2
    var_i  = (phi_pi*Omega[1] + phi_y*Omega[2])^2 * sig_z2
    return var_pi + lambda_x*var_x + lambda_i*var_i
end

res = optimize(p -> welfare_loss(p), [1.5, 0.5], NelderMead(),
               Optim.Options(x_tol=1e-8, f_tol=1e-10))
phi_opt = Optim.minimizer(res)
println("Optimal Taylor rule: φ_π=$(round(phi_opt[1],digits=3)), φ_y=$(round(phi_opt[2],digits=3))")
println("Welfare gain vs standard: $(round(100*(welfare_loss([1.5,0.5])-Optim.minimum(res))/welfare_loss([1.5,0.5]),digits=2))%")
R

library(nloptr)

beta<-0.99; kappa<-0.15; sigma<-1.0; lambda_x<-0.1; lambda_i<-0.05

welfare_loss <- function(params) {
  phi_pi<-params[1]; phi_y<-params[2]
  if(phi_pi<1.0) return(1e10)
  G0<-matrix(c(1,sigma*phi_pi,-kappa,1+sigma*phi_y),2,2)
  G1<-matrix(c(beta,-sigma,0,1),2,2); rho_u<-0.5; sig_u<-0.01
  A<-solve(G0)%*%G1
  if(!all(Mod(eigen(A)$values)<1)) return(1e10)
  Psi<-solve(G0)%*%c(1,0)
  Omega<-solve(diag(2)-rho_u*A)%*%Psi
  sig_z2<-sig_u^2/(1-rho_u^2)
  var_pi<-Omega[1]^2*sig_z2; var_x<-Omega[2]^2*sig_z2
  var_i<-(phi_pi*Omega[1]+phi_y*Omega[2])^2*sig_z2
  var_pi + lambda_x*var_x + lambda_i*var_i
}

res <- nloptr(c(1.5,0.5), eval_f=welfare_loss,
              lb=c(1.01,0), ub=c(10,5),
              opts=list(algorithm="NLOPT_LN_NELDERMEAD",xtol_rel=1e-10,maxeval=5000))
cat(sprintf("Optimal: φ_π=%.3f, φ_y=%.3f\n", res$solution[1], res$solution[2]))

24.6 Programming Exercises

Exercise 24.1 (APL — BFGS from Scratch)

Implement the full BFGS algorithm in APL: (a) backtracking line search satisfying the Armijo condition; (b) BFGS update using the dfn from Section 24.2.3; (c) convergence check when `⌈/|grad x` falls below `1e¯6`. Test on the 2D Rosenbrock function $f(x,y) = (1-x)^2 + 100(y-x^2)^2$ starting from $(0, 0)$.

Exercise 24.2 (Python — Multiple Starting Points)

The NK welfare loss function can be multimodal in $(\phi_\pi, \phi_y)$ space when the model is near the boundary of determinacy. (a) Grid-search over $(\phi_\pi, \phi_y) \in [1.01, 6]\times[0, 3]$ with $20\times 20$ points to map the loss surface. (b) Run BFGS from 50 random starting points inside the determinate region. (c) Verify that all starting points converge to the same minimum (if the loss is unimodal) or identify multiple local minima.

Exercise 24.3 (Julia — L-BFGS for DSGE Estimation)

using Optim, ForwardDiff

# Minimize NK log-likelihood (from Chapter 20 Kalman filter)
# Here: simplified version with analytical gradients via AD
function neg_log_lik_nk(params)
    # params = [phi_pi, phi_y, rho_u, sigma_u, sigma_obs]
    phi_pi, phi_y, rho_u, sigma_u, sigma_obs = exp.(params)  # log-parameterized
    # ... (Kalman filter evaluation from Ch.20)
    # Placeholder: return quadratic bowl for illustration
    return sum((params .- [log(1.5), log(0.5), log(0.7), log(0.01), log(0.005)]).^2)
end

# L-BFGS via Optim.jl
x0 = log.([1.5, 0.5, 0.7, 0.01, 0.005])
res = optimize(neg_log_lik_nk, x0, LBFGS(),
               Optim.Options(g_tol=1e-8, iterations=1000, show_trace=false))
println("L-BFGS converged: $(Optim.converged(res)), iterations: $(Optim.iterations(res))")
println("Parameters: $(round.(exp.(Optim.minimizer(res)), digits=4))")

Exercise 24.4 — Optimal Commitment Policy (★)

Under commitment, the central bank minimizes $\mathcal{L} = \mathbb{E}\sum_{t=0}^\infty\beta^t(\pi_t^2 + \lambda_x x_t^2)$ choosing the path $\{i_t\}$ subject to the NKPC and DIS. (a) Derive the first-order conditions (targeting rules) from the Lagrangian with multipliers on the NKPC and DIS constraints. (b) Show the commitment optimum satisfies $\pi_t = -(\lambda_x/\kappa)(x_t - x_{t-1})$ — the price-level targeting result. (c) Numerically compare the welfare loss under commitment, discretion (time-consistent, ignoring the Euler equation constraint in each period), and the optimal Taylor rule.


24.7 Chapter Summary

Key results:

  • Gradient descent converges linearly at rate $1-\mu/L$ (the condition number $\kappa=L/\mu$ governs convergence speed).

  • Newton's method for optimization ($\mathbf{x}_{n+1} = \mathbf{x}_n - H_f^{-1}\nabla f$) achieves quadratic convergence (Theorem 24.2) but costs $O(k^3)$ per iteration.

  • BFGS maintains a rank-2 update of the inverse Hessian approximation: $H_{n+1} = (I-\rho\mathbf{s}\mathbf{y}')H_n(I-\rho\mathbf{y}\mathbf{s}') + \rho\mathbf{s}\mathbf{s}'$ (derived as a minimum-change update, Theorem 24.3). Superlinear convergence; $O(k^2)$ per iteration.

  • L-BFGS stores only the last $m$ pairs $(\mathbf{s}_i,\mathbf{y}_i)$; $O(mk)$ per iteration; standard for large DSGE problems.

  • Global optimization strategies (multiple starts, simulated annealing) are essential for multimodal DSGE likelihoods.

  • The optimal Taylor rule is found by minimizing the Lyapunov-equation welfare loss over $(\phi_\pi,\phi_y)$; the Lyapunov equation for the variances is `vec_Sigma ← vec_Q ⌹ I - Phi kron Phi` in APL.

Next: Chapter 25 — Solving Systems of Equations: From Linear Systems to Sparse Matrices