grilly.optim.hypergradient

Hypergradient Descent Optimizers

Implements online learning rate adaptation via hypergradient descent.

HypergradientAdamW: Basic hypergradient (Baydin et al. 2018).

Uses a fixed hypergradient learning rate, beta_hyper. Simple, but beta_hyper must be tuned by hand.

AutoHypergradientAdamW: OSGM-style auto adjustment (arXiv:2502.11229).

Self-tuning via AdaGrad-stabilized hypergradients with gradient-norm normalization, so the hypergradient learning rate needs no manual tuning.

Optional surprise signal: the gradient prediction error is exposed as current_surprise for input-level gain modulation. The model scales its inputs by (1 + gain * surprise), amplifying signals when the optimization landscape shifts (e.g., SNN phase transitions).

The core idea: the learning rate is treated as a learnable parameter. At each step, the hypergradient h = -g_k . d_{k-1} / ||g_{k-1}||^2 tells us whether to increase or decrease the learning rate based on gradient agreement with the previous update direction.
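The sign convention can be checked on a two-element example. This is a rough sketch rather than grilly's code: the helper name `hypergradient` and the `eps` guard are assumptions made for illustration.

```python
import numpy as np

def hypergradient(g_k, d_prev, g_prev_sq_norm, eps=1e-8):
    """Normalized hypergradient h = -g_k . d_{k-1} / ||g_{k-1}||^2.

    eps is an assumed guard against division by zero.
    """
    return -float(np.dot(g_k.ravel(), d_prev.ravel())) / (g_prev_sq_norm + eps)

g_prev = np.array([1.0, -2.0])
d_prev = g_prev.copy()  # pretend the previous update direction equals the previous gradient
norm_sq = float(np.dot(g_prev, g_prev))

# New gradient still points the same way as the previous direction.
h_agree = hypergradient(np.array([0.5, -1.0]), d_prev, norm_sq)
# New gradient has reversed, which suggests the previous step overshot.
h_flip = hypergradient(np.array([-0.5, 1.0]), d_prev, norm_sq)
```

A negative h (agreement) means a descent step on the learning rate raises it, while a positive h (disagreement) lowers it.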

References

[1] Baydin et al. “Online Learning Rate Adaptation with Hypergradient Descent” (ICLR 2018)

[2] “Provable and Practical Online Learning Rate Adaptation with Hypergradient Descent” (arXiv:2502.11229)

[3] “Gradient Methods with Online Scaling” (arXiv:2505.23081, 2509.11007)

Uses: adamw-update.glsl (via AdamW base class)

Classes

AdamW(params[, lr, betas, eps, ...])

AdamW optimizer with decoupled weight decay.

AutoHypergradientAdamW(params[, lr, betas, ...])

AdamW with OSGM-style auto hypergradient adjustment.

HypergradientAdamW(params[, lr, betas, eps, ...])

AdamW with hypergradient-based online learning rate adaptation.

Iterator()

grilly.optim.hypergradient._collect_grads(param_groups, gradients=None)[source]

Collect gradients from param groups into a dict keyed by param id.
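A minimal sketch of what such a helper might look like follows. It assumes PyTorch-style param groups (`{"params": [...]}`), uses `id(param)` as the key, and falls back to a `grad` attribute when no dict is supplied. The name `collect_grads_sketch` is hypothetical and the real helper may differ.

```python
import numpy as np

def collect_grads_sketch(param_groups, gradients=None):
    """Return {id(param): grad} for every parameter that has a gradient."""
    collected = {}
    for group in param_groups:
        for p in group["params"]:  # assumed group layout
            if gradients is not None:
                g = gradients.get(id(p))
            else:
                g = getattr(p, "grad", None)  # fallback, if the object carries one
            if g is not None:
                collected[id(p)] = np.asarray(g)
    return collected

w = np.zeros(3, dtype=np.float32)
b = np.zeros(1, dtype=np.float32)
groups = [{"params": [w, b]}]
# Only w has a gradient, so only w appears in the result.
grads = collect_grads_sketch(groups, gradients={id(w): np.ones(3, dtype=np.float32)})
```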

Dependencies: numpy.

Variables: param_groups (Any, required); gradients (Any, optional, default None).

Usage Example

from grilly.optim.hypergradient import _collect_grads

result = _collect_grads(param_groups=None, gradients=None)

grilly.optim.hypergradient._compute_update_directions(param_groups, state, step_count, betas, eps)[source]

Compute Adam update directions d = m_hat / (sqrt(v_hat) + eps).
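For a single parameter array, the formula can be sketched as below. The bias-correction terms are the standard Adam ones. Whether the library helper applies them in exactly this form is an assumption, and `adam_direction` is a hypothetical name.

```python
import numpy as np

def adam_direction(exp_avg, exp_avg_sq, step_count, betas=(0.9, 0.999), eps=1e-8):
    """d = m_hat / (sqrt(v_hat) + eps) with standard Adam bias correction."""
    beta1, beta2 = betas
    m_hat = exp_avg / (1.0 - beta1 ** step_count)
    v_hat = exp_avg_sq / (1.0 - beta2 ** step_count)
    return m_hat / (np.sqrt(v_hat) + eps)

g = np.array([0.3, -4.0, 0.0])
m = (1 - 0.9) * g          # first-moment EMA after one step from zero
v = (1 - 0.999) * g ** 2   # second-moment EMA after one step from zero
d = adam_direction(m, v, step_count=1)
```

After the first step the direction is close to the elementwise sign of the gradient, which is why the update magnitude is largely independent of the gradient scale.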

Dependencies: numpy.

Variables: param_groups (Any, required); state (Any, required); step_count (Any, required); betas (Any, required); eps (Any, required).

Usage Example

from grilly.optim.hypergradient import _compute_update_directions

result = _compute_update_directions(param_groups=None, state=None, step_count=None, betas=None, eps=None)

class grilly.optim.hypergradient.HypergradientAdamW(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, beta_hyper=1e-07, lr_min=1e-06, lr_max=1.0, log_scale=False, use_gpu=True)[source]

Bases: AdamW

AdamW with hypergradient-based online learning rate adaptation.

Basic version from Baydin et al. (2018). Uses a fixed hypergradient learning rate beta_hyper. Simple but requires manual tuning of beta_hyper. For a self-tuning version, use AutoHypergradientAdamW.

Update rule:

alpha_{t+1} = alpha_t + beta_hyper * sum(g_t * d_{t-1})

Parameters
  • params (Iterator[numpy.ndarray]) – Iterator of parameter arrays to optimize

  • lr (float) – Initial learning rate (default: 1e-3)

  • betas (tuple) – Coefficients for running averages (default: (0.9, 0.999))

  • eps (float) – Numerical stability term (default: 1e-8)

  • weight_decay (float) – Decoupled weight decay (default: 0.01)

  • beta_hyper (float) – Hypergradient learning rate (default: 1e-7)

  • lr_min (float) – Minimum learning rate clamp (default: 1e-6)

  • lr_max (float) – Maximum learning rate clamp (default: 1.0)

  • log_scale (bool) – If True, adapt log(lr) instead of lr (default: False)

  • use_gpu (bool) – Whether to use GPU acceleration (default: True)
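The update rule together with the lr_min/lr_max clamp can be sketched as follows. This covers only the log_scale=False case. Summing over all parameter arrays and clamping after the additive update are assumptions, and `hypergradient_lr_update` is a hypothetical name, not part of the library.

```python
import numpy as np

def hypergradient_lr_update(lr, grads, prev_dirs, beta_hyper=1e-7, lr_min=1e-6, lr_max=1.0):
    """alpha_{t+1} = clip(alpha_t + beta_hyper * sum(g_t * d_{t-1}), lr_min, lr_max)."""
    # Accumulate the dot product over every parameter array.
    h = sum(float(np.sum(g * d)) for g, d in zip(grads, prev_dirs))
    return float(np.clip(lr + beta_hyper * h, lr_min, lr_max))

grads = [np.array([1.0, 2.0]), np.array([[0.5]])]
prev_dirs = [np.array([1.0, 1.0]), np.array([[-1.0]])]

# sum(g * d) = 3.0 - 0.5 = 2.5, so the learning rate rises by beta_hyper * 2.5.
new_lr = hypergradient_lr_update(1e-3, grads, prev_dirs, beta_hyper=1e-4)
# A large beta_hyper pushes the result past lr_max, where it is clamped.
clamped_lr = hypergradient_lr_update(0.9, grads, prev_dirs, beta_hyper=1.0)
```

The raw dot product is not normalized here, which is one reason beta_hyper has to be matched to the gradient scale of the problem.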

__init__(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, beta_hyper=1e-07, lr_min=1e-06, lr_max=1.0, log_scale=False, use_gpu=True)[source]

Initialize the HypergradientAdamW optimizer.

Parameters
  • params (Iterator[numpy.ndarray]) – Iterator of parameter arrays to optimize

  • lr (float) – Initial learning rate (default: 1e-3)

  • betas (tuple) – Coefficients for computing running averages (default: (0.9, 0.999))

  • eps (float) – Term added to denominator for numerical stability (default: 1e-8)

  • weight_decay (float) – Decoupled weight decay coefficient (default: 0.01)

  • amsgrad – Whether to use AMSGrad variant (default: False). Documented on the AdamW base class; it does not appear in this constructor's signature.

  • use_gpu (bool) – Whether to use GPU acceleration (default: True)

  • beta_hyper (float) – Hypergradient learning rate (default: 1e-7)

  • lr_min (float) – Minimum learning rate clamp (default: 1e-6)

  • lr_max (float) – Maximum learning rate clamp (default: 1.0)

  • log_scale (bool) – If True, adapt log(lr) instead of lr (default: False)

Dependencies: None detected from callable globals.

Variables: params (collections.abc.Iterator[numpy.ndarray], required); lr (float, optional, default 0.001); betas (tuple, optional, default (0.9, 0.999)); eps (float, optional, default 1e-08); weight_decay (float, optional, default 0.01); beta_hyper (float, optional, default 1e-07); lr_min (float, optional, default 1e-06); lr_max (float, optional, default 1.0); log_scale (bool, optional, default False); use_gpu (bool, optional, default True).

Usage Example

import numpy as np
from grilly.optim.hypergradient import HypergradientAdamW

# Illustrative values only; they mirror the documented defaults.
params = iter([np.zeros(1, dtype=np.float32)])
optimizer = HypergradientAdamW(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, beta_hyper=1e-07, lr_min=1e-06, lr_max=1.0, log_scale=False, use_gpu=True)

property current_lr

property lr_history

step(closure=None, gradients=None)[source]

Perform a single optimization step.

Parameters
  • closure – Optional closure that reevaluates the model and returns loss

  • gradients – Optional dict mapping parameter IDs to gradients. If None, tries to get gradients from param.grad attribute.

Dependencies: numpy.

Variables: closure (Any, optional, default None); gradients (Any, optional, default None).

Usage Example

from grilly.optim.hypergradient import HypergradientAdamW

instance = HypergradientAdamW(...)
result = instance.step(closure=None, gradients=None)

_adamw_update_gpu(backend, param, grad, exp_avg, exp_avg_sq, lr, beta1, beta2, eps, weight_decay, beta1_t, beta2_t, amsgrad)

GPU-accelerated AdamW update using adamw-update.glsl shader.

Dependencies: numpy.

Variables: backend (Any, required); param (Any, required); grad (Any, required); exp_avg (Any, required); exp_avg_sq (Any, required); lr (Any, required); beta1 (Any, required); beta2 (Any, required); eps (Any, required); weight_decay (Any, required); beta1_t (Any, required); beta2_t (Any, required); amsgrad (Any, required).

Usage Example

from grilly.optim.adamw import AdamW

instance = AdamW(...)
result = instance._adamw_update_gpu(backend=None, param=None, grad=None, exp_avg=None, exp_avg_sq=None, lr=None, beta1=None, beta2=None, eps=None, weight_decay=None, beta1_t=None, beta2_t=None, amsgrad=None)
_get_backend()

Get or create backend instance

Dependencies: None detected from callable globals.

Variables: This callable does not take explicit input variables.

Usage Example

from grilly.optim.adamw import AdamW

instance = AdamW(...)
result = instance._get_backend()
load_state_dict(state_dict)

Load optimizer state from state_dict.

Parameters

state_dict (dict[str, Any]) – Dictionary containing optimizer state

Dependencies: None detected from callable globals.

Variables: state_dict (dict[str, typing.Any], required).

Usage Example

from grilly.optim.base import Optimizer

instance = Optimizer(...)
result = instance.load_state_dict(state_dict=instance.state_dict())

state_dict()

Return the state of the optimizer as a dict.

Returns

Dictionary containing optimizer state

Return type

dict[str, Any]

Dependencies: None detected from callable globals.

Variables: This callable does not take explicit input variables.

Usage Example

from grilly.optim.base import Optimizer

instance = Optimizer(...)
result = instance.state_dict()
zero_grad()

Clear gradients for all parameters.

Note: In this implementation, gradients are expected to be stored in a separate structure (e.g., in the model’s backward pass). This method is provided for API compatibility.

Dependencies: None detected from callable globals.

Variables: This callable does not take explicit input variables.

Usage Example

from grilly.optim.base import Optimizer

instance = Optimizer(...)
result = instance.zero_grad()

class grilly.optim.hypergradient.AutoHypergradientAdamW(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, hyper_lr=0.01, hyper_lr_beta=1.0, lr_min=1e-06, lr_max=1.0, adapt_momentum=False, track_surprise=False, surprise_gamma=0.9, surprise_alpha=0.1, trauma_threshold=0.5, beta_min=0.5, beta_max=0.9995, warmup_steps=10, use_gpu=True)[source]

Bases: AdamW

AdamW with OSGM-style auto hypergradient adjustment.

Self-tuning optimizer that automatically adapts the learning rate (and optionally momentum beta1) using online hypergradient descent with AdaGrad-stabilized updates. No manual hypergradient LR tuning needed — the AdaGrad accumulator self-adjusts the meta-learning rate.

Based on the OSGM/HDM algorithm:

Step size hypergradient (how lr should change):

h_lr = -g_k . d_{k-1} / (||g_{k-1}||^2 + eps)
G_lr += h_lr^2
lr -= hyper_lr * h_lr / (sqrt(G_lr) + eps)

Momentum hypergradient (how beta1 should change):

h_beta = g_k . m_{k-1} / (||g_{k-1}||^2 + eps)
G_beta += h_beta^2
beta1 -= hyper_lr_beta * h_beta / (sqrt(G_beta) + eps)

The gradient-norm normalization (/ ||g||^2) makes the algorithm scale-invariant, and the AdaGrad accumulator makes the meta-LR self-adjusting — larger past hypergradients automatically slow down future adaptation, preventing oscillation.
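The self-adjusting behavior of the AdaGrad accumulator can be seen in a small scalar sketch of the step-size rule above. `AdaGradHyperLR` is a hypothetical toy class, not the library implementation, and applying the lr_min/lr_max clamp right after the update is an assumption.

```python
import numpy as np

class AdaGradHyperLR:
    """Toy state for the step-size update only; not the library class."""

    def __init__(self, lr=1e-3, hyper_lr=0.01, lr_min=1e-6, lr_max=1.0, eps=1e-8):
        self.lr, self.hyper_lr = lr, hyper_lr
        self.lr_min, self.lr_max, self.eps = lr_min, lr_max, eps
        self.G_lr = 0.0  # AdaGrad accumulator of squared hypergradients

    def update(self, g_k, d_prev, g_prev_sq_norm):
        h_lr = -float(np.dot(g_k, d_prev)) / (g_prev_sq_norm + self.eps)
        self.G_lr += h_lr ** 2
        self.lr -= self.hyper_lr * h_lr / (np.sqrt(self.G_lr) + self.eps)
        self.lr = float(np.clip(self.lr, self.lr_min, self.lr_max))  # assumed clamp placement
        return h_lr

state = AdaGradHyperLR(lr=1e-3)
g = np.array([1.0, -1.0])
lrs = [state.lr]
for _ in range(3):
    # The same agreeing gradient three times in a row.
    state.update(g_k=g, d_prev=g, g_prev_sq_norm=float(np.dot(g, g)))
    lrs.append(state.lr)
increments = np.diff(lrs)
```

With an identical hypergradient at every step, the k-th increment is roughly hyper_lr / sqrt(k), so the learning rate keeps growing but by smaller and smaller amounts.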

Particularly effective for SNN training where surrogate gradients are noisy and the optimal learning rate shifts during training.

Surprise signal (optional, input-level):

Tracks gradient prediction error as a “surprise” signal and exposes it for the model to use as input gain modulation. Unlike backprop-level momentum changes, this acts at the forward-pass level — amplifying input signals when the optimization landscape shifts unexpectedly.

Instant surprise (gradient prediction error):

S_instant = tanh(||g_k - EMA(g)||^2 / (EMA(||g||^2) + eps))

Accumulated surprise (biological momentum / S_bar):

S_bar = alpha * S_instant + (1-alpha) * S_bar_prev

Inverted-U gain (Yerkes-Dodson / trauma protection):

gain = S_bar * exp(-S_bar / trauma_threshold)

The inverted-U curve implements the biological stress response:
  • Low S_bar → low gain (nothing interesting)

  • Moderate S_bar → peak gain (optimal learning zone)

  • High S_bar → gain drops (trauma protection)

This prevents “unerasable events” — if surprise stays high for many consecutive steps (chronic stress), the gain suppresses instead of amplifying, protecting the model from fixating on a single extreme event. Mirrors the HPA axis: acute stress enhances encoding, chronic stress impairs plasticity.
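The three formulas above can be combined into a small tracker, sketched below. The class `SurpriseTracker` is hypothetical. How the EMAs are initialized, the use of one decay (`gamma`) for both EMA(g) and EMA(||g||^2), and updating the baselines after measuring the error are all assumptions.

```python
import numpy as np

class SurpriseTracker:
    """Toy tracker for S_instant, S_bar and the inverted-U gain."""

    def __init__(self, gamma=0.9, alpha=0.1, trauma_threshold=0.5, eps=1e-8):
        self.gamma, self.alpha = gamma, alpha
        self.trauma_threshold, self.eps = trauma_threshold, eps
        self.ema_g = None     # EMA(g)
        self.ema_g_sq = 0.0   # EMA(||g||^2)
        self.s_bar = 0.0      # accumulated surprise

    def update(self, g):
        g = np.asarray(g, dtype=np.float64).ravel()
        if self.ema_g is None:
            # First call: no baseline yet, so initialize and report no surprise.
            self.ema_g = g.copy()
            self.ema_g_sq = float(g @ g)
            return 0.0
        err = float(np.sum((g - self.ema_g) ** 2))
        s_instant = float(np.tanh(err / (self.ema_g_sq + self.eps)))
        self.s_bar = self.alpha * s_instant + (1 - self.alpha) * self.s_bar
        # Refresh the baselines after the prediction error has been measured.
        self.ema_g = self.gamma * self.ema_g + (1 - self.gamma) * g
        self.ema_g_sq = self.gamma * self.ema_g_sq + (1 - self.gamma) * float(g @ g)
        return s_instant

    @property
    def gain(self):
        return self.s_bar * np.exp(-self.s_bar / self.trauma_threshold)

tracker = SurpriseTracker()
steady = np.array([1.0, 0.0])
for _ in range(20):
    tracker.update(steady)          # perfectly predictable gradients
calm_gain = tracker.gain
s_shift = tracker.update(np.array([-1.0, 0.0]))  # the gradient flips sign
shift_gain = tracker.gain
```

A constant gradient produces no surprise and no gain. A single sign flip saturates S_instant near 1, but S_bar, and therefore the gain, rises only by the fraction alpha.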

The model reads current_surprise_gain for input scaling:

x_effective = x * (1 + scale * optimizer.current_surprise_gain)

Parameters
  • params (Iterator[numpy.ndarray]) – Iterator of parameter arrays to optimize

  • lr (float) – Initial learning rate (default: 1e-3)

  • betas (tuple) – Coefficients for running averages (default: (0.9, 0.999))

  • eps (float) – Numerical stability term (default: 1e-8)

  • weight_decay (float) – Decoupled weight decay (default: 0.01)

  • hyper_lr (float) – Meta-learning rate for step size adaptation (default: 0.01). This is automatically modulated by the AdaGrad accumulator, so it’s much less sensitive than HypergradientAdamW’s beta_hyper.

  • hyper_lr_beta (float) – Meta-learning rate for momentum adaptation (default: 1.0). Only used when adapt_momentum=True.

  • lr_min (float) – Minimum learning rate clamp (default: 1e-6)

  • lr_max (float) – Maximum learning rate clamp (default: 1.0)

  • adapt_momentum (bool) – If True, also adapt beta1 via hypergradient (default: False)

  • track_surprise (bool) – If True, compute and expose gradient surprise signal via current_surprise_gain (default: False). The model’s forward pass should read this to modulate input gain.

  • surprise_gamma (float) – EMA decay for gradient tracking (default: 0.9). Higher = smoother baseline, slower to detect change.

  • surprise_alpha (float) – EMA weight on the newest instant surprise when accumulating S_bar (default: 0.1). Controls how fast accumulated surprise builds up and decays. Lower = longer memory of surprise.

  • trauma_threshold (float) – S_bar level where gain peaks before suppression (default: 0.5). The inverted-U gain = S_bar * exp(-S_bar/T) peaks at S_bar = T. Above this, gain decreases (protection).

  • beta_min (float) – Minimum beta1 clamp (default: 0.5)

  • beta_max (float) – Maximum beta1 clamp (default: 0.9995)

  • warmup_steps (int) – Steps before starting adaptation (default: 10). Lets Adam moments initialize before adapting LR.

  • use_gpu (bool) – Whether to use GPU acceleration (default: True)

__init__(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, hyper_lr=0.01, hyper_lr_beta=1.0, lr_min=1e-06, lr_max=1.0, adapt_momentum=False, track_surprise=False, surprise_gamma=0.9, surprise_alpha=0.1, trauma_threshold=0.5, beta_min=0.5, beta_max=0.9995, warmup_steps=10, use_gpu=True)[source]

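Initialize the AutoHypergradientAdamW optimizer.

Parameters
  • params (Iterator[numpy.ndarray]) – Iterator of parameter arrays to optimize

  • lr (float) – Initial learning rate (default: 1e-3)

  • betas (tuple) – Coefficients for computing running averages (default: (0.9, 0.999))

  • eps (float) – Term added to denominator for numerical stability (default: 1e-8)

  • weight_decay (float) – Decoupled weight decay coefficient (default: 0.01)

  • amsgrad – Whether to use AMSGrad variant (default: False). Documented on the AdamW base class; it does not appear in this constructor's signature.

  • use_gpu (bool) – Whether to use GPU acceleration (default: True)

  • hyper_lr (float) – Meta-learning rate for step size adaptation (default: 0.01)

  • hyper_lr_beta (float) – Meta-learning rate for momentum adaptation (default: 1.0). Only used when adapt_momentum=True.

  • lr_min (float) – Minimum learning rate clamp (default: 1e-6)

  • lr_max (float) – Maximum learning rate clamp (default: 1.0)

  • adapt_momentum (bool) – If True, also adapt beta1 via hypergradient (default: False)

  • track_surprise (bool) – If True, compute and expose the gradient surprise signal via current_surprise_gain (default: False)

  • surprise_gamma (float) – EMA decay for gradient tracking (default: 0.9)

  • surprise_alpha (float) – EMA weight for surprise accumulation S_bar (default: 0.1)

  • trauma_threshold (float) – S_bar level where gain peaks before suppression (default: 0.5)

  • beta_min (float) – Minimum beta1 clamp (default: 0.5)

  • beta_max (float) – Maximum beta1 clamp (default: 0.9995)

  • warmup_steps (int) – Steps before starting adaptation (default: 10)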

Dependencies: None detected from callable globals.

Variables: params (collections.abc.Iterator[numpy.ndarray], required); lr (float, optional, default 0.001); betas (tuple, optional, default (0.9, 0.999)); eps (float, optional, default 1e-08); weight_decay (float, optional, default 0.01); hyper_lr (float, optional, default 0.01); hyper_lr_beta (float, optional, default 1.0); lr_min (float, optional, default 1e-06); lr_max (float, optional, default 1.0); adapt_momentum (bool, optional, default False); track_surprise (bool, optional, default False); surprise_gamma (float, optional, default 0.9); surprise_alpha (float, optional, default 0.1); trauma_threshold (float, optional, default 0.5); beta_min (float, optional, default 0.5); beta_max (float, optional, default 0.9995); warmup_steps (int, optional, default 10); use_gpu (bool, optional, default True).

Usage Example

import numpy as np
from grilly.optim.hypergradient import AutoHypergradientAdamW

# Illustrative values only; they mirror the documented defaults.
params = iter([np.zeros(1, dtype=np.float32)])
optimizer = AutoHypergradientAdamW(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, hyper_lr=0.01, hyper_lr_beta=1.0, lr_min=1e-06, lr_max=1.0, adapt_momentum=False, track_surprise=False, surprise_gamma=0.9, surprise_alpha=0.1, trauma_threshold=0.5, beta_min=0.5, beta_max=0.9995, warmup_steps=10, use_gpu=True)

property current_lr

property current_surprise

Instant surprise signal S_instant in the range [0, 1]: the raw gradient prediction error.

property accumulated_surprise

Accumulated surprise S_bar: an exponential moving average of the instant surprise (the “biological momentum” of surprise).

property current_surprise_gain

Inverted-U gain signal for input-level modulation.

Implements the Yerkes-Dodson curve / trauma protection:

gain = S_bar * exp(-S_bar / trauma_threshold)

  • Low S_bar → low gain (nothing interesting happening)

  • Moderate S_bar → peak gain (optimal learning zone)

  • High S_bar → gain drops (trauma protection, don’t fixate)
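The shape of the curve is easy to check numerically. Setting the derivative of S_bar * exp(-S_bar / T) to zero gives a peak at S_bar = T with value T / e. The sketch below confirms this on a grid for the default trauma_threshold of 0.5; the function name `surprise_gain` is hypothetical.

```python
import numpy as np

def surprise_gain(s_bar, trauma_threshold=0.5):
    """Inverted-U gain: S_bar * exp(-S_bar / trauma_threshold)."""
    return s_bar * np.exp(-s_bar / trauma_threshold)

s = np.linspace(0.0, 3.0, 3001)   # grid with spacing 0.001
gains = surprise_gain(s)
s_peak = float(s[np.argmax(gains)])   # location of the maximum on the grid
peak_gain = float(gains.max())        # height of the maximum
```

On this grid the maximum sits at approximately S_bar = 0.5 with a gain of about 0.5 / e, and the gain falls back toward zero on either side.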

Read this after each optimizer step and pass to the model:

x_effective = x * (1 + scale * optimizer.current_surprise_gain)

Returns 0.0 when surprise tracking is off or during warmup.

property lr_history

property beta1_history

property surprise_history

property s_bar_history

step(closure=None, gradients=None)[source]

Perform optimization step with OSGM-style auto LR adaptation.

  1. Collect current gradients g_k

  2. Compute surprise signal (if track_surprise=True)

  3. Compute normalized hypergradients (after warmup):

       h_lr = -g_k . d_{k-1} / ||g_{k-1}||^2
       h_beta = g_k . m_{k-1} / ||g_{k-1}||^2

  4. Update AdaGrad accumulators and adjust lr (and beta1)

  5. Run standard AdamW step with adapted hyperparameters

  6. Store d_k, ||g_k||^2, m_k for next step
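The ordering of these steps can be sketched for a single NumPy array as follows. This is a toy illustration under stated assumptions, not grilly's implementation: `ToyAutoHyperAdamW` is a hypothetical class, it takes the gradient directly as an argument, and it omits the surprise signal (step 2) and momentum adaptation.

```python
import numpy as np

class ToyAutoHyperAdamW:
    """Single-array illustration of the step order; CPU only."""

    def __init__(self, param, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.01,
                 hyper_lr=0.01, lr_min=1e-6, lr_max=1.0, warmup_steps=10):
        self.param = param
        self.lr, self.betas, self.eps = lr, betas, eps
        self.weight_decay, self.hyper_lr = weight_decay, hyper_lr
        self.lr_min, self.lr_max, self.warmup_steps = lr_min, lr_max, warmup_steps
        self.m = np.zeros_like(param)
        self.v = np.zeros_like(param)
        self.t = 0
        self.G_lr = 0.0        # AdaGrad accumulator
        self.d_prev = None     # d_{k-1}
        self.g_prev_sq = None  # ||g_{k-1}||^2
        self.lr_history = []

    def step(self, grad):
        g = np.asarray(grad, dtype=self.param.dtype)   # 1. current gradient g_k
        self.t += 1
        if self.t > self.warmup_steps and self.d_prev is not None:
            # 3. normalized hypergradient
            h_lr = -float(np.sum(g * self.d_prev)) / (self.g_prev_sq + self.eps)
            # 4. AdaGrad accumulator, then clamped learning-rate update
            self.G_lr += h_lr ** 2
            self.lr -= self.hyper_lr * h_lr / (np.sqrt(self.G_lr) + self.eps)
            self.lr = float(np.clip(self.lr, self.lr_min, self.lr_max))
        # 5. standard AdamW update with the adapted learning rate
        b1, b2 = self.betas
        self.m = b1 * self.m + (1 - b1) * g
        self.v = b2 * self.v + (1 - b2) * g * g
        d = (self.m / (1 - b1 ** self.t)) / (np.sqrt(self.v / (1 - b2 ** self.t)) + self.eps)
        self.param *= 1.0 - self.lr * self.weight_decay   # decoupled weight decay
        self.param -= self.lr * d
        # 6. remember d_k and ||g_k||^2 for the next call
        self.d_prev, self.g_prev_sq = d, float(np.sum(g * g))
        self.lr_history.append(self.lr)

w = np.array([5.0, -3.0])
opt = ToyAutoHyperAdamW(w, lr=1e-3, weight_decay=0.0)
for _ in range(200):
    opt.step(2.0 * w)   # gradient of f(w) = ||w||^2
```

During the first warmup_steps calls the learning rate stays at its initial value. After that it rises while successive gradients agree with the previous direction, and each change is bounded by hyper_lr because of the AdaGrad normalization.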

Dependencies: numpy.

Variables: closure (Any, optional, default None); gradients (Any, optional, default None).

Usage Example

from grilly.optim.hypergradient import AutoHypergradientAdamW

instance = AutoHypergradientAdamW(...)
result = instance.step(closure=None, gradients=None)

_adamw_update_gpu(backend, param, grad, exp_avg, exp_avg_sq, lr, beta1, beta2, eps, weight_decay, beta1_t, beta2_t, amsgrad)

GPU-accelerated AdamW update using adamw-update.glsl shader.

Dependencies: numpy.

Variables: backend (Any, required); param (Any, required); grad (Any, required); exp_avg (Any, required); exp_avg_sq (Any, required); lr (Any, required); beta1 (Any, required); beta2 (Any, required); eps (Any, required); weight_decay (Any, required); beta1_t (Any, required); beta2_t (Any, required); amsgrad (Any, required).

Usage Example

from grilly.optim.adamw import AdamW

instance = AdamW(...)
result = instance._adamw_update_gpu(backend=None, param=None, grad=None, exp_avg=None, exp_avg_sq=None, lr=None, beta1=None, beta2=None, eps=None, weight_decay=None, beta1_t=None, beta2_t=None, amsgrad=None)
_get_backend()

Get or create backend instance

Dependencies: None detected from callable globals.

Variables: This callable does not take explicit input variables.

Usage Example

from grilly.optim.adamw import AdamW

instance = AdamW(...)
result = instance._get_backend()
load_state_dict(state_dict)

Load optimizer state from state_dict.

Parameters

state_dict (dict[str, Any]) – Dictionary containing optimizer state

Dependencies: None detected from callable globals.

Variables: state_dict (dict[str, typing.Any], required).

Usage Example

from grilly.optim.base import Optimizer

instance = Optimizer(...)
result = instance.load_state_dict(state_dict=instance.state_dict())

state_dict()

Return the state of the optimizer as a dict.

Returns

Dictionary containing optimizer state

Return type

dict[str, Any]

Dependencies: None detected from callable globals.

Variables: This callable does not take explicit input variables.

Usage Example

from grilly.optim.base import Optimizer

instance = Optimizer(...)
result = instance.state_dict()
zero_grad()

Clear gradients for all parameters.

Note: In this implementation, gradients are expected to be stored in a separate structure (e.g., in the model’s backward pass). This method is provided for API compatibility.

Dependencies: None detected from callable globals.

Variables: This callable does not take explicit input variables.

Usage Example

from grilly.optim.base import Optimizer

instance = Optimizer(...)
result = instance.zero_grad()