grilly.optim.hypergradient
Hypergradient Descent Optimizers
Implements online learning rate adaptation via hypergradient descent.
- HypergradientAdamW: Basic hypergradient (Baydin et al. 2018).
Fixed beta_hyper. Simple but requires tuning beta_hyper.
- AutoHypergradientAdamW: OSGM-style auto adjustment (arXiv:2502.11229).
Self-tuning via AdaGrad-stabilized hypergradients with gradient-norm normalization. No manual hypergradient LR tuning needed. Optional surprise signal: gradient prediction error exposed as current_surprise for input-level gain modulation. The model scales inputs by (1 + gain * surprise), amplifying signals when the optimization landscape shifts (e.g., SNN phase transitions).
The core idea: the learning rate is treated as a learnable parameter. At each step, the hypergradient h = -g_k . d_{k-1} / ||g_{k-1}||^2 tells us whether to increase or decrease the learning rate based on gradient agreement with the previous update direction.
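As a rough illustration of the sign convention above, the following sketch evaluates the normalized hypergradient for a gradient that agrees with the previous direction and for one that opposes it. The function name `hypergradient` and the toy vectors are assumptions made for this example, not part of the grilly API.

```python
def hypergradient(g_k, d_prev, g_prev_sq_norm):
    """Normalized hypergradient h = -(g_k . d_{k-1}) / ||g_{k-1}||^2."""
    dot = sum(g * d for g, d in zip(g_k, d_prev))
    return -dot / g_prev_sq_norm


# Hypothetical previous step: the update direction equals the previous gradient.
g_prev = [1.0, 2.0]
d_prev = [1.0, 2.0]
g_prev_sq = sum(g * g for g in g_prev)  # 5.0

# Current gradient agrees with the previous direction: h is negative.
h_agree = hypergradient([1.0, 2.0], d_prev, g_prev_sq)

# Current gradient opposes the previous direction: h is positive.
h_oppose = hypergradient([-1.0, -2.0], d_prev, g_prev_sq)
```

Because the learning rate is moved against h, a negative h raises the learning rate and a positive h lowers it.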
References
- [1] Baydin et al. “Online Learning Rate Adaptation with Hypergradient Descent” (ICLR 2018)
- [2] “Provable and Practical Online Learning Rate Adaptation with Hypergradient Descent” (arXiv:2502.11229)
- [3] “Gradient Methods with Online Scaling” (arXiv:2505.23081, 2509.11007)
Uses: adamw-update.glsl (via AdamW base class)
Classes
| AdamW | AdamW optimizer with decoupled weight decay. |
| AutoHypergradientAdamW | AdamW with OSGM-style auto hypergradient adjustment. |
| HypergradientAdamW | AdamW with hypergradient-based online learning rate adaptation. |
- grilly.optim.hypergradient._collect_grads(param_groups, gradients=None)[source]
Collect gradients from param groups into a dict keyed by param id.
Dependencies: numpy.
Variables: param_groups (Any, required); gradients (Any, optional, default None).
Usage Example
from grilly.optim.hypergradient import _collect_grads
result = _collect_grads(param_groups=None, gradients=None)
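The helper is private and its internals are not shown here, so the following is only a sketch of what "collect gradients into a dict keyed by param id" could look like. It assumes that param_groups is a list of dicts with a "params" entry, that keys are produced by Python's built-in id(), and that parameters may carry a grad attribute. `Param` and `collect_grads_sketch` are hypothetical names.

```python
class Param:
    """Hypothetical stand-in for a parameter that may carry a .grad attribute."""

    def __init__(self, grad=None):
        self.grad = grad


def collect_grads_sketch(param_groups, gradients=None):
    """Return {id(param): grad}, preferring the explicit gradients dict."""
    collected = {}
    for group in param_groups:
        for p in group["params"]:
            pid = id(p)
            if gradients is not None and pid in gradients:
                collected[pid] = gradients[pid]
            elif getattr(p, "grad", None) is not None:
                collected[pid] = p.grad
            # Parameters without any gradient are skipped.
    return collected


p1, p2, p3 = Param(), Param(grad=[0.5]), Param()
groups = [{"params": [p1, p2, p3]}]
collected = collect_grads_sketch(groups, gradients={id(p1): [1.0]})
```

Here p1 gets its gradient from the explicit dict, p2 falls back to its grad attribute, and p3 is left out.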
- grilly.optim.hypergradient._compute_update_directions(param_groups, state, step_count, betas, eps)[source]
Compute Adam update directions d = m_hat / (sqrt(v_hat) + eps).
Dependencies: numpy.
Variables: param_groups (Any, required); state (Any, required); step_count (Any, required); betas (Any, required); eps (Any, required).
Usage Example
from grilly.optim.hypergradient import _compute_update_directions
result = _compute_update_directions(param_groups=None, state=None, step_count=None, betas=None, eps=None)
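The formula d = m_hat / (sqrt(v_hat) + eps) can be illustrated with a small standalone sketch. It assumes the standard Adam bias corrections m_hat = m / (1 - beta1^t) and v_hat = v / (1 - beta2^t); whether grilly computes them in exactly this form is not confirmed here, and `adam_direction` is a hypothetical name.

```python
import math


def adam_direction(m, v, step_count, betas=(0.9, 0.999), eps=1e-8):
    """Element-wise d = m_hat / (sqrt(v_hat) + eps) with standard bias correction."""
    beta1, beta2 = betas
    bc1 = 1.0 - beta1 ** step_count
    bc2 = 1.0 - beta2 ** step_count
    return [(mi / bc1) / (math.sqrt(vi / bc2) + eps) for mi, vi in zip(m, v)]


# After one step from zero state, m = (1 - beta1) * g and v = (1 - beta2) * g^2,
# so the bias-corrected direction is approximately sign(g).
g = [0.5, -2.0]
m = [(1 - 0.9) * gi for gi in g]
v = [(1 - 0.999) * gi * gi for gi in g]
d = adam_direction(m, v, step_count=1)
```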
- class grilly.optim.hypergradient.HypergradientAdamW(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, beta_hyper=1e-07, lr_min=1e-06, lr_max=1.0, log_scale=False, use_gpu=True)[source]
Bases: AdamW
AdamW with hypergradient-based online learning rate adaptation.
Basic version from Baydin et al. (2018). Uses a fixed hypergradient learning rate beta_hyper. Simple but requires manual tuning of beta_hyper. For a self-tuning version, use AutoHypergradientAdamW.
- Update rule:
alpha_{t+1} = alpha_t + beta_hyper * sum(g_t * d_{t-1})
- Parameters
params (Iterator[numpy.ndarray]) – Iterator of parameter arrays to optimize
lr (float) – Initial learning rate (default: 1e-3)
betas (tuple) – Coefficients for running averages (default: (0.9, 0.999))
eps (float) – Numerical stability term (default: 1e-8)
weight_decay (float) – Decoupled weight decay (default: 0.01)
beta_hyper (float) – Hypergradient learning rate (default: 1e-7)
lr_min (float) – Minimum learning rate clamp (default: 1e-6)
lr_max (float) – Maximum learning rate clamp (default: 1.0)
log_scale (bool) – If True, adapt log(lr) instead of lr (default: False)
use_gpu (bool) – Whether to use GPU acceleration (default: True)
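The update rule and the lr_min / lr_max clamps described above can be sketched as follows. This is a minimal illustration using plain Python lists. It assumes the clamp is applied immediately after the additive update and does not cover the log_scale=True variant. `hypergradient_lr_update` is a hypothetical name, and beta_hyper is set far above its default so the effect is visible.

```python
def hypergradient_lr_update(lr, g_t, d_prev, beta_hyper=1e-7, lr_min=1e-6, lr_max=1.0):
    """alpha_{t+1} = alpha_t + beta_hyper * sum(g_t * d_{t-1}), clamped to [lr_min, lr_max]."""
    dot = sum(g * d for g, d in zip(g_t, d_prev))
    new_lr = lr + beta_hyper * dot
    return min(max(new_lr, lr_min), lr_max)


# Gradient agrees with the previous direction: the learning rate grows.
lr_up = hypergradient_lr_update(1e-3, [1.0, 1.0], [1.0, 1.0], beta_hyper=1e-4)

# Gradient opposes the previous direction: the learning rate shrinks.
lr_down = hypergradient_lr_update(1e-3, [1.0, 1.0], [-1.0, -1.0], beta_hyper=1e-4)

# A very large disagreement would push lr below zero; the clamp holds it at lr_min.
lr_clamped = hypergradient_lr_update(1e-3, [100.0], [-100.0], beta_hyper=1e-4)
```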
- __init__(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, beta_hyper=1e-07, lr_min=1e-06, lr_max=1.0, log_scale=False, use_gpu=True)[source]
Initialize AdamW optimizer.
- Parameters
params (Iterator[numpy.ndarray]) – Iterator of parameter arrays to optimize
lr (float) – Learning rate (default: 1e-3)
betas (tuple) – Coefficients for computing running averages (default: (0.9, 0.999))
eps (float) – Term added to denominator for numerical stability (default: 1e-8)
weight_decay (float) – Decoupled weight decay coefficient (default: 0.01)
amsgrad – Whether to use AMSGrad variant (default: False)
use_gpu (bool) – Whether to use GPU acceleration (default: True)
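beta_hyper (float) – Hypergradient learning rate (default: 1e-7)
lr_min (float) – Minimum learning rate clamp (default: 1e-6)
lr_max (float) – Maximum learning rate clamp (default: 1.0)
log_scale (bool) – If True, adapt log(lr) instead of lr (default: False)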
Dependencies: None detected from callable globals.
Variables: params (collections.abc.Iterator[numpy.ndarray], required); lr (float, optional, default 0.001); betas (tuple, optional, default (0.9, 0.999)); eps (float, optional, default 1e-08); weight_decay (float, optional, default 0.01); beta_hyper (float, optional, default 1e-07); lr_min (float, optional, default 1e-06); lr_max (float, optional, default 1.0); log_scale (bool, optional, default False); use_gpu (bool, optional, default True).
Usage Example
import numpy as np
from grilly.optim.hypergradient import HypergradientAdamW
# params is documented as an iterator of NumPy arrays
params = iter([np.zeros(1, dtype=np.float32)])
# The constructor runs __init__; there is no need to call it explicitly
optimizer = HypergradientAdamW(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, beta_hyper=1e-07, lr_min=1e-06, lr_max=1.0, log_scale=False, use_gpu=True)
- property current_lr
- property lr_history
- step(closure=None, gradients=None)[source]
Perform a single optimization step.
- Parameters
closure – Optional closure that reevaluates the model and returns loss
gradients – Optional dict mapping parameter IDs to gradients. If None, tries to get gradients from param.grad attribute.
Dependencies: numpy.
Variables: closure (Any, optional, default None); gradients (Any, optional, default None).
Usage Example
from grilly.optim.hypergradient import HypergradientAdamW
instance = HypergradientAdamW(...)
result = instance.step(closure=None, gradients=None)
- _adamw_update_gpu(backend, param, grad, exp_avg, exp_avg_sq, lr, beta1, beta2, eps, weight_decay, beta1_t, beta2_t, amsgrad)
GPU-accelerated AdamW update using adamw-update.glsl shader.
Dependencies: numpy.
Variables: backend (Any, required); param (Any, required); grad (Any, required); exp_avg (Any, required); exp_avg_sq (Any, required); lr (Any, required); beta1 (Any, required); beta2 (Any, required); eps (Any, required); weight_decay (Any, required); beta1_t (Any, required); beta2_t (Any, required); amsgrad (Any, required).
Usage Example
from grilly.optim.adamw import AdamW
instance = AdamW(...)
result = instance._adamw_update_gpu(backend=None, param=None, grad=None, exp_avg=None, exp_avg_sq=None, lr=None, beta1=None, beta2=None, eps=None, weight_decay=None, beta1_t=None, beta2_t=None, amsgrad=None)
- _get_backend()
Get or create backend instance
Dependencies: None detected from callable globals.
Variables: This callable does not take explicit input variables.
Usage Example
from grilly.optim.adamw import AdamW
instance = AdamW(...)
result = instance._get_backend()
- load_state_dict(state_dict)
Load optimizer state from state_dict.
- Parameters
state_dict (dict[str, Any]) – Dictionary containing optimizer state
Dependencies: None detected from callable globals.
Variables: state_dict (dict[str, typing.Any], required).
Usage Example
from grilly.optim.base import Optimizer
instance = Optimizer(...)
# state_dict must be a dict, for example one returned by state_dict()
state = instance.state_dict()
instance.load_state_dict(state_dict=state)
- state_dict()
Return the state of the optimizer as a dict.
- Returns
Dictionary containing optimizer state
- Return type
dict[str, Any]
Dependencies: None detected from callable globals.
Variables: This callable does not take explicit input variables.
Usage Example
from grilly.optim.base import Optimizer
instance = Optimizer(...)
result = instance.state_dict()
- zero_grad()
Clear gradients for all parameters.
Note: In this implementation, gradients are expected to be stored in a separate structure (e.g., in the model’s backward pass). This method is provided for API compatibility.
Dependencies: None detected from callable globals.
Variables: This callable does not take explicit input variables.
Usage Example
from grilly.optim.base import Optimizer
instance = Optimizer(...)
result = instance.zero_grad()
- class grilly.optim.hypergradient.AutoHypergradientAdamW(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, hyper_lr=0.01, hyper_lr_beta=1.0, lr_min=1e-06, lr_max=1.0, adapt_momentum=False, track_surprise=False, surprise_gamma=0.9, surprise_alpha=0.1, trauma_threshold=0.5, beta_min=0.5, beta_max=0.9995, warmup_steps=10, use_gpu=True)[source]
Bases: AdamW
AdamW with OSGM-style auto hypergradient adjustment.
Self-tuning optimizer that automatically adapts the learning rate (and optionally momentum beta1) using online hypergradient descent with AdaGrad-stabilized updates. No manual hypergradient LR tuning needed — the AdaGrad accumulator self-adjusts the meta-learning rate.
Based on the OSGM/HDM algorithm:
- Step size hypergradient (how lr should change):
h_lr = -g_k . d_{k-1} / (||g_{k-1}||^2 + eps)
G_lr += h_lr^2
lr -= hyper_lr * h_lr / (sqrt(G_lr) + eps)
- Momentum hypergradient (how beta1 should change):
h_beta = g_k . m_{k-1} / (||g_{k-1}||^2 + eps)
G_beta += h_beta^2
beta1 -= hyper_lr_beta * h_beta / (sqrt(G_beta) + eps)
The gradient-norm normalization (/ ||g||^2) makes the algorithm scale-invariant, and the AdaGrad accumulator makes the meta-LR self-adjusting — larger past hypergradients automatically slow down future adaptation, preventing oscillation.
Particularly effective for SNN training where surrogate gradients are noisy and the optimal learning rate shifts during training.
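The step-size rule above can be sketched in a few lines of plain Python to show how the AdaGrad accumulator damps repeated adjustments. The sketch assumes the lr_min / lr_max clamp is applied right after the update. `osgm_lr_step` and the toy vectors are hypothetical and are not taken from the grilly source.

```python
import math


def osgm_lr_step(lr, G_lr, g_k, d_prev, g_prev_sq,
                 hyper_lr=0.01, eps=1e-8, lr_min=1e-6, lr_max=1.0):
    """One AdaGrad-stabilized learning-rate update; returns (lr, G_lr)."""
    dot = sum(g * d for g, d in zip(g_k, d_prev))
    h_lr = -dot / (g_prev_sq + eps)           # normalized hypergradient
    G_lr += h_lr ** 2                         # AdaGrad accumulator
    lr -= hyper_lr * h_lr / (math.sqrt(G_lr) + eps)
    lr = min(max(lr, lr_min), lr_max)         # assumed clamp placement
    return lr, G_lr


# Feed the same "agreeing" signal three times: lr keeps rising,
# but each increment is smaller than the last because G_lr grows.
lr, G_lr = 1e-3, 0.0
history = []
for _ in range(3):
    lr, G_lr = osgm_lr_step(lr, G_lr, [1.0, 0.0], [1.0, 0.0], 1.0)
    history.append(lr)
```

The first increment is close to hyper_lr itself, and later ones shrink roughly like 1 / sqrt(k). This is the self-adjusting behaviour the text attributes to the accumulator.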
- Surprise signal (optional, input-level):
Tracks gradient prediction error as a “surprise” signal and exposes it for the model to use as input gain modulation. Unlike backprop-level momentum changes, this acts at the forward-pass level — amplifying input signals when the optimization landscape shifts unexpectedly.
- Instant surprise (gradient prediction error):
S_instant = tanh(||g_k - EMA(g)||^2 / (EMA(||g||^2) + eps))
- Accumulated surprise (biological momentum / S_bar):
S_bar = alpha * S_instant + (1-alpha) * S_bar_prev
- Inverted-U gain (Yerkes-Dodson / trauma protection):
gain = S_bar * exp(-S_bar / trauma_threshold)
- The inverted-U curve implements the biological stress response:
Low S_bar → low gain (nothing interesting)
Moderate S_bar → peak gain (optimal learning zone)
High S_bar → gain drops (trauma protection)
This prevents “unerasable events” — if surprise stays high for many consecutive steps (chronic stress), the gain suppresses instead of amplifying, protecting the model from fixating on a single extreme event. Mirrors the HPA axis: acute stress enhances encoding, chronic stress impairs plasticity.
- The model reads current_surprise_gain for input scaling:
x_effective = x * (1 + scale * optimizer.current_surprise_gain)
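The instant and accumulated surprise defined above can be sketched with a small tracker. Several details are assumptions made for illustration: the first gradient only seeds the baselines, the EMAs are updated after the surprise is measured, and the norms are taken over a single flat gradient vector. `SurpriseTrackerSketch` is a hypothetical name.

```python
import math


class SurpriseTrackerSketch:
    """Tracks S_instant = tanh(||g - EMA(g)||^2 / (EMA(||g||^2) + eps)) and S_bar."""

    def __init__(self, surprise_gamma=0.9, surprise_alpha=0.1, eps=1e-8):
        self.gamma = surprise_gamma
        self.alpha = surprise_alpha
        self.eps = eps
        self.g_ema = None    # EMA of the gradient vector
        self.sq_ema = 0.0    # EMA of the squared gradient norm
        self.s_bar = 0.0     # accumulated surprise

    def update(self, g):
        sq_norm = sum(x * x for x in g)
        if self.g_ema is None:
            # Assumption: the first gradient seeds the baselines, no surprise yet.
            self.g_ema = list(g)
            self.sq_ema = sq_norm
            return 0.0
        err = sum((x - e) ** 2 for x, e in zip(g, self.g_ema))
        s_instant = math.tanh(err / (self.sq_ema + self.eps))
        self.s_bar = self.alpha * s_instant + (1.0 - self.alpha) * self.s_bar
        # Assumption: baselines are refreshed after the surprise is measured.
        self.g_ema = [self.gamma * e + (1.0 - self.gamma) * x
                      for e, x in zip(self.g_ema, g)]
        self.sq_ema = self.gamma * self.sq_ema + (1.0 - self.gamma) * sq_norm
        return s_instant


tracker = SurpriseTrackerSketch()
steady = [tracker.update([1.0, 0.0]) for _ in range(5)]  # predictable gradients
spike = tracker.update([-1.0, 0.0])                      # sudden sign flip
```

With steady gradients the instant surprise stays near zero. A sign flip drives it close to 1, while S_bar moves only a fraction surprise_alpha of the way there.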
- Parameters
params (Iterator[numpy.ndarray]) – Iterator of parameter arrays to optimize
lr (float) – Initial learning rate (default: 1e-3)
betas (tuple) – Coefficients for running averages (default: (0.9, 0.999))
eps (float) – Numerical stability term (default: 1e-8)
weight_decay (float) – Decoupled weight decay (default: 0.01)
hyper_lr (float) – Meta-learning rate for step size adaptation (default: 0.01). This is automatically modulated by the AdaGrad accumulator, so it’s much less sensitive than HypergradientAdamW’s beta_hyper.
hyper_lr_beta (float) – Meta-learning rate for momentum adaptation (default: 1.0). Only used when adapt_momentum=True.
lr_min (float) – Minimum learning rate clamp (default: 1e-6)
lr_max (float) – Maximum learning rate clamp (default: 1.0)
adapt_momentum (bool) – If True, also adapt beta1 via hypergradient (default: False)
track_surprise (bool) – If True, compute and expose gradient surprise signal via current_surprise_gain (default: False). The model’s forward pass should read this to modulate input gain.
surprise_gamma (float) – EMA decay for gradient tracking (default: 0.9). Higher = smoother baseline, slower to detect change.
surprise_alpha (float) – EMA decay for surprise accumulation S_bar (default: 0.1). Controls how fast accumulated surprise builds up and decays. Lower = longer memory of surprise.
trauma_threshold (float) – S_bar level where gain peaks before suppression (default: 0.5). The inverted-U gain = S_bar * exp(-S_bar/T) peaks at S_bar = T. Above this, gain decreases (protection).
beta_min (float) – Minimum beta1 clamp (default: 0.5)
beta_max (float) – Maximum beta1 clamp (default: 0.9995)
warmup_steps (int) – Steps before starting adaptation (default: 10). Lets Adam moments initialize before adapting LR.
use_gpu (bool) – Whether to use GPU acceleration (default: True)
- __init__(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, hyper_lr=0.01, hyper_lr_beta=1.0, lr_min=1e-06, lr_max=1.0, adapt_momentum=False, track_surprise=False, surprise_gamma=0.9, surprise_alpha=0.1, trauma_threshold=0.5, beta_min=0.5, beta_max=0.9995, warmup_steps=10, use_gpu=True)[source]
Initialize AdamW optimizer.
- Parameters
params (Iterator[numpy.ndarray]) – Iterator of parameter arrays to optimize
lr (float) – Learning rate (default: 1e-3)
betas (tuple) – Coefficients for computing running averages (default: (0.9, 0.999))
eps (float) – Term added to denominator for numerical stability (default: 1e-8)
weight_decay (float) – Decoupled weight decay coefficient (default: 0.01)
amsgrad – Whether to use AMSGrad variant (default: False)
use_gpu (bool) – Whether to use GPU acceleration (default: True)
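hyper_lr (float) – Meta-learning rate for step size adaptation (default: 0.01)
hyper_lr_beta (float) – Meta-learning rate for momentum adaptation (default: 1.0). Only used when adapt_momentum=True.
lr_min (float) – Minimum learning rate clamp (default: 1e-6)
lr_max (float) – Maximum learning rate clamp (default: 1.0)
adapt_momentum (bool) – If True, also adapt beta1 via hypergradient (default: False)
track_surprise (bool) – If True, compute and expose gradient surprise signal via current_surprise_gain (default: False)
surprise_gamma (float) – EMA decay for gradient tracking (default: 0.9)
surprise_alpha (float) – EMA decay for surprise accumulation S_bar (default: 0.1)
trauma_threshold (float) – S_bar level where gain peaks before suppression (default: 0.5)
beta_min (float) – Minimum beta1 clamp (default: 0.5)
beta_max (float) – Maximum beta1 clamp (default: 0.9995)
warmup_steps (int) – Steps before starting adaptation (default: 10)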
Dependencies: None detected from callable globals.
Variables: params (collections.abc.Iterator[numpy.ndarray], required); lr (float, optional, default 0.001); betas (tuple, optional, default (0.9, 0.999)); eps (float, optional, default 1e-08); weight_decay (float, optional, default 0.01); hyper_lr (float, optional, default 0.01); hyper_lr_beta (float, optional, default 1.0); lr_min (float, optional, default 1e-06); lr_max (float, optional, default 1.0); adapt_momentum (bool, optional, default False); track_surprise (bool, optional, default False); surprise_gamma (float, optional, default 0.9); surprise_alpha (float, optional, default 0.1); trauma_threshold (float, optional, default 0.5); beta_min (float, optional, default 0.5); beta_max (float, optional, default 0.9995); warmup_steps (int, optional, default 10); use_gpu (bool, optional, default True).
Usage Example
import numpy as np
from grilly.optim.hypergradient import AutoHypergradientAdamW
# params is documented as an iterator of NumPy arrays
params = iter([np.zeros(1, dtype=np.float32)])
# The constructor runs __init__; there is no need to call it explicitly
optimizer = AutoHypergradientAdamW(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, hyper_lr=0.01, hyper_lr_beta=1.0, lr_min=1e-06, lr_max=1.0, adapt_momentum=False, track_surprise=False, surprise_gamma=0.9, surprise_alpha=0.1, trauma_threshold=0.5, beta_min=0.5, beta_max=0.9995, warmup_steps=10, use_gpu=True)
- property current_lr
- property current_surprise
Instant surprise signal [0, 1]. Raw gradient prediction error.
- property accumulated_surprise
Accumulated surprise S_bar. Biological momentum of surprise.
- property current_surprise_gain
Inverted-U gain signal for input-level modulation.
- Implements the Yerkes-Dodson curve / trauma protection:
gain = S_bar * exp(-S_bar / trauma_threshold)
Low S_bar → low gain (nothing interesting happening)
Moderate S_bar → peak gain (optimal learning zone)
High S_bar → gain drops (trauma protection, don’t fixate)
- Read this after each optimizer step and pass to the model:
x_effective = x * (1 + scale * optimizer.current_surprise_gain)
Returns 0.0 when surprise tracking is off or during warmup.
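To make the inverted-U shape concrete, the sketch below evaluates the gain formula at a low, a moderate, and a high S_bar, then applies the input scaling from the description. `surprise_gain` and `scale_input` are hypothetical helpers written for this example. In real use the gain would be read from optimizer.current_surprise_gain after each step.

```python
import math


def surprise_gain(s_bar, trauma_threshold=0.5):
    """Inverted-U gain: S_bar * exp(-S_bar / trauma_threshold)."""
    return s_bar * math.exp(-s_bar / trauma_threshold)


def scale_input(x, gain, scale=1.0):
    """x_effective = x * (1 + scale * gain), applied element-wise."""
    return [xi * (1.0 + scale * gain) for xi in x]


# Low, threshold-level, and high accumulated surprise.
gains = {s: surprise_gain(s) for s in (0.05, 0.5, 2.0)}

# The gain peaks at S_bar == trauma_threshold, where it equals threshold / e.
x_effective = scale_input([1.0, -2.0], gains[0.5])
```

With the default threshold of 0.5 the peak gain is about 0.18, so inputs are amplified by at most roughly 18 percent when scale is 1.0.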
- property lr_history
- property beta1_history
- property surprise_history
- property s_bar_history
- step(closure=None, gradients=None)[source]
Perform optimization step with OSGM-style auto LR adaptation.
1. Collect current gradients g_k
2. Compute surprise signal (if track_surprise=True)
3. Compute normalized hypergradients (after warmup):
h_lr = -g_k . d_{k-1} / ||g_{k-1}||^2
h_beta = g_k . m_{k-1} / ||g_{k-1}||^2
4. Update AdaGrad accumulators and adjust lr (and beta1)
5. Run standard AdamW step with adapted hyperparameters
6. Store d_k, ||g_k||^2, m_k for next step
Dependencies: numpy.
Variables: closure (Any, optional, default None); gradients (Any, optional, default None).
Usage Example
from grilly.optim.hypergradient import AutoHypergradientAdamW
instance = AutoHypergradientAdamW(...)
result = instance.step(closure=None, gradients=None)
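The order of operations in step() can be sketched as a toy, single-vector optimizer in plain Python. This is not grilly's implementation. It omits the surprise signal and momentum adaptation, assumes adaptation starts once the step counter exceeds warmup_steps, and uses the textbook AdamW update with decoupled weight decay. `AutoLRSketch` is a hypothetical name.

```python
import math


class AutoLRSketch:
    """Toy version of the step sequence for one parameter vector."""

    def __init__(self, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.01,
                 hyper_lr=0.01, lr_min=1e-6, lr_max=1.0, warmup_steps=10):
        self.lr = lr
        self.beta1, self.beta2 = betas
        self.eps = eps
        self.weight_decay = weight_decay
        self.hyper_lr = hyper_lr
        self.lr_min, self.lr_max = lr_min, lr_max
        self.warmup_steps = warmup_steps
        self.t = 0
        self.m = None
        self.v = None
        self.d_prev = None      # d_{k-1}
        self.g_prev_sq = None   # ||g_{k-1}||^2
        self.G_lr = 0.0         # AdaGrad accumulator

    def step(self, w, g):
        # Collect the current gradient g_k (passed in directly here).
        self.t += 1
        if self.m is None:
            self.m = [0.0] * len(w)
            self.v = [0.0] * len(w)
        # Hypergradient and lr adjustment, skipped during warmup (assumed condition).
        if self.t > self.warmup_steps and self.d_prev is not None:
            dot = sum(gi * di for gi, di in zip(g, self.d_prev))
            h_lr = -dot / (self.g_prev_sq + self.eps)
            self.G_lr += h_lr ** 2
            self.lr -= self.hyper_lr * h_lr / (math.sqrt(self.G_lr) + self.eps)
            self.lr = min(max(self.lr, self.lr_min), self.lr_max)
        # AdamW step with the adapted learning rate.
        self.m = [self.beta1 * mi + (1 - self.beta1) * gi for mi, gi in zip(self.m, g)]
        self.v = [self.beta2 * vi + (1 - self.beta2) * gi * gi for vi, gi in zip(self.v, g)]
        bc1 = 1 - self.beta1 ** self.t
        bc2 = 1 - self.beta2 ** self.t
        d = [(mi / bc1) / (math.sqrt(vi / bc2) + self.eps) for mi, vi in zip(self.m, self.v)]
        w = [wi - self.lr * (di + self.weight_decay * wi) for wi, di in zip(w, d)]
        # Store d_k and ||g_k||^2 for the next call.
        self.d_prev = d
        self.g_prev_sq = sum(gi * gi for gi in g)
        return w


# Minimize f(w) = 0.5 * w^2, whose gradient is w itself.
opt = AutoLRSketch(warmup_steps=10)
w = [5.0]
lr_trace = []
for _ in range(30):
    g = list(w)
    w = opt.step(w, g)
    lr_trace.append(opt.lr)
```

On this toy problem the learning rate stays at its initial value through warmup and then rises, because each new gradient keeps agreeing with the previous update direction.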
- _adamw_update_gpu(backend, param, grad, exp_avg, exp_avg_sq, lr, beta1, beta2, eps, weight_decay, beta1_t, beta2_t, amsgrad)
GPU-accelerated AdamW update using adamw-update.glsl shader.
Dependencies: numpy.
Variables: backend (Any, required); param (Any, required); grad (Any, required); exp_avg (Any, required); exp_avg_sq (Any, required); lr (Any, required); beta1 (Any, required); beta2 (Any, required); eps (Any, required); weight_decay (Any, required); beta1_t (Any, required); beta2_t (Any, required); amsgrad (Any, required).
Usage Example
from grilly.optim.adamw import AdamW
instance = AdamW(...)
result = instance._adamw_update_gpu(backend=None, param=None, grad=None, exp_avg=None, exp_avg_sq=None, lr=None, beta1=None, beta2=None, eps=None, weight_decay=None, beta1_t=None, beta2_t=None, amsgrad=None)
- _get_backend()
Get or create backend instance
Dependencies: None detected from callable globals.
Variables: This callable does not take explicit input variables.
Usage Example
from grilly.optim.adamw import AdamW
instance = AdamW(...)
result = instance._get_backend()
- load_state_dict(state_dict)
Load optimizer state from state_dict.
- Parameters
state_dict (dict[str, Any]) – Dictionary containing optimizer state
Dependencies: None detected from callable globals.
Variables: state_dict (dict[str, typing.Any], required).
Usage Example
from grilly.optim.base import Optimizer
instance = Optimizer(...)
# state_dict must be a dict, for example one returned by state_dict()
state = instance.state_dict()
instance.load_state_dict(state_dict=state)
- state_dict()
Return the state of the optimizer as a dict.
- Returns
Dictionary containing optimizer state
- Return type
dict[str, Any]
Dependencies: None detected from callable globals.
Variables: This callable does not take explicit input variables.
Usage Example
from grilly.optim.base import Optimizer
instance = Optimizer(...)
result = instance.state_dict()
- zero_grad()
Clear gradients for all parameters.
Note: In this implementation, gradients are expected to be stored in a separate structure (e.g., in the model’s backward pass). This method is provided for API compatibility.
Dependencies: None detected from callable globals.
Variables: This callable does not take explicit input variables.
Usage Example
from grilly.optim.base import Optimizer
instance = Optimizer(...)
result = instance.zero_grad()