grilly.optim.adamw
AdamW Optimizer
AdamW (Adam with decoupled weight decay) - more effective regularization than Adam.
Key difference from Adam:
- Adam: weight decay is added to the gradient before the moment updates (coupled)
- AdamW: weight decay is applied directly to the parameters after the Adam step (decoupled)
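The coupled/decoupled distinction is easiest to see on a single scalar parameter with a zero gradient. The sketch below is not grilly code: the helper names adam_coupled_step and adamw_decoupled_step are hypothetical, and only the first step (t = 1, moments starting at zero) is modeled.

```python
import math


def adam_coupled_step(param, grad, lr=1e-3, beta1=0.9, beta2=0.999,
                      eps=1e-8, weight_decay=0.01):
    """First step (t = 1) of Adam with coupled, L2-style weight decay."""
    grad = grad + weight_decay * param      # decay enters the gradient
    m = (1 - beta1) * grad                  # moments start at zero
    v = (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1)                 # bias correction for t = 1
    v_hat = v / (1 - beta2)
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)


def adamw_decoupled_step(param, grad, lr=1e-3, beta1=0.9, beta2=0.999,
                         eps=1e-8, weight_decay=0.01):
    """First step (t = 1) of AdamW with decoupled weight decay."""
    m = (1 - beta1) * grad                  # decay does not touch the moments
    v = (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1)
    v_hat = v / (1 - beta2)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)   # Adam step
    return param - lr * weight_decay * param                # decay on the parameter


coupled = adam_coupled_step(1.0, 0.0)
decoupled = adamw_decoupled_step(1.0, 0.0)
print(coupled, decoupled)
```

With a zero gradient, the coupled variant passes the decay term through Adam's normalization, so the parameter moves by roughly lr (to about 0.999). The decoupled variant shrinks the parameter by the factor 1 - lr * weight_decay (to about 0.99999), independent of the moment estimates.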
Reference: “Decoupled Weight Decay Regularization” (Loshchilov & Hutter, 2019)
Uses: adamw-update.glsl
Classes
AdamW – AdamW optimizer with decoupled weight decay.
Optimizer – Base class for all optimizers.
- class grilly.optim.adamw.AdamW(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, amsgrad=False, use_gpu=True)[source]
Bases: Optimizer
AdamW optimizer with decoupled weight decay.
Implements the AdamW algorithm:
- m = beta1 * m + (1 - beta1) * grad
- v = beta2 * v + (1 - beta2) * grad^2
- m_hat = m / (1 - beta1^t)
- v_hat = v / (1 - beta2^t)
- param = param - lr * m_hat / (sqrt(v_hat) + eps)  # Adam step
- param = param - lr * weight_decay * param  # Decoupled weight decay
This decoupling improves generalization compared to Adam’s coupled weight decay.
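As a rough illustration, the update rules listed for this class can be written as a stateful step on one scalar parameter. This is a plain-Python sketch, not the grilly implementation or its GLSL shader: the helper adamw_step and the state dictionary layout are assumptions made for the example.

```python
import math


def adamw_step(param, grad, state, lr=1e-3, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=0.01):
    """Apply one AdamW update to a scalar parameter (hypothetical helper)."""
    beta1, beta2 = betas
    state["t"] += 1
    t = state["t"]
    # Exponential moving averages of the gradient and its square.
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    # Bias correction for the zero-initialised moments.
    m_hat = state["m"] / (1 - beta1 ** t)
    v_hat = state["v"] / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)   # Adam step
    param = param - lr * weight_decay * param               # decoupled weight decay
    return param


# Minimise f(x) = x**2, whose gradient is 2 * x.
state = {"t": 0, "m": 0.0, "v": 0.0}
x = 1.0
history = [x]
for _ in range(200):
    x = adamw_step(x, 2.0 * x, state, lr=0.01)
    history.append(x)
```

The weight decay line reads the parameter after the Adam step and never feeds into m or v, which is what makes the decay decoupled.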
Initialize AdamW optimizer.
- Parameters
params (Iterator[numpy.ndarray]) – Iterator of parameter arrays to optimize
lr (float) – Learning rate (default: 1e-3)
betas (tuple) – Coefficients for computing running averages (default: (0.9, 0.999))
eps (float) – Term added to denominator for numerical stability (default: 1e-8)
weight_decay (float) – Decoupled weight decay coefficient (default: 0.01)
amsgrad (bool) – Whether to use AMSGrad variant (default: False)
use_gpu (bool) – Whether to use GPU acceleration (default: True)
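The amsgrad flag refers to the AMSGrad variant, which normalizes by the running maximum of the second-moment estimate instead of its current value. The sketch below shows only that idea. The helper amsgrad_denominators is hypothetical, and whether grilly tracks the maximum of the bias-corrected or the raw estimate is not stated here, so the bias-corrected form is an assumption.

```python
import math


def amsgrad_denominators(v_hats, eps=1e-8):
    """Return AMSGrad denominators for a sequence of second-moment estimates."""
    v_max = 0.0
    denominators = []
    for v_hat in v_hats:
        v_max = max(v_max, v_hat)           # the normaliser never shrinks
        denominators.append(math.sqrt(v_max) + eps)
    return denominators


v_hats = [4.0, 1.0, 9.0]
plain = [math.sqrt(v) + 1e-8 for v in v_hats]   # denominators without AMSGrad
ams = amsgrad_denominators(v_hats)
```

When the second-moment estimate drops (the second entry), the plain denominator falls to about 1 while the AMSGrad denominator stays at about 2, so the effective step size does not grow again.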
- __init__(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, amsgrad=False, use_gpu=True)[source]
Dependencies: None detected from callable globals.
Variables:
params (collections.abc.Iterator[numpy.ndarray], required); lr (float, optional, default 0.001); betas (tuple, optional, default (0.9, 0.999)); eps (float, optional, default 1e-08); weight_decay (float, optional, default 0.01); amsgrad (bool, optional, default False); use_gpu (bool, optional, default True).
Usage Example
import numpy as np
from grilly.optim.adamw import AdamW
params = [np.zeros(1, dtype=np.float32)]
# Constructing the class calls __init__; it does not need to be called directly.
optimizer = AdamW(iter(params), lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, amsgrad=False, use_gpu=True)
- _get_backend()[source]
Get or create backend instance
Dependencies: None detected from callable globals.
Variables: This callable does not take explicit input variables.
Usage Example
import numpy as np
from grilly.optim.adamw import AdamW
optimizer = AdamW(iter([np.zeros(1, dtype=np.float32)]))
backend = optimizer._get_backend()
- step(closure=None, gradients=None)[source]
Perform a single optimization step.
- Parameters
closure – Optional closure that reevaluates the model and returns loss
gradients – Optional dict mapping parameter IDs to gradients. If None, tries to get gradients from param.grad attribute.
Dependencies: numpy.
Variables:
closure (Any, optional, default None); gradients (Any, optional, default None).
Usage Example
import numpy as np
from grilly.optim.adamw import AdamW
optimizer = AdamW(iter([np.zeros(1, dtype=np.float32)]))
result = optimizer.step(closure=None, gradients=None)
- _adamw_update_gpu(backend, param, grad, exp_avg, exp_avg_sq, lr, beta1, beta2, eps, weight_decay, beta1_t, beta2_t, amsgrad)[source]
GPU-accelerated AdamW update using adamw-update.glsl shader.
Dependencies: numpy.
Variables:
backend (Any, required); param (Any, required); grad (Any, required); exp_avg (Any, required); exp_avg_sq (Any, required); lr (Any, required); beta1 (Any, required); beta2 (Any, required); eps (Any, required); weight_decay (Any, required); beta1_t (Any, required); beta2_t (Any, required); amsgrad (Any, required).
Usage Example
import numpy as np
from grilly.optim.adamw import AdamW
optimizer = AdamW(iter([np.zeros(1, dtype=np.float32)]))
# Internal helper: the None values below are placeholders for real arrays and scalars.
result = optimizer._adamw_update_gpu(backend=None, param=None, grad=None, exp_avg=None, exp_avg_sq=None, lr=None, beta1=None, beta2=None, eps=None, weight_decay=None, beta1_t=None, beta2_t=None, amsgrad=None)
- load_state_dict(state_dict)
Load optimizer state from state_dict.
- Parameters
state_dict (dict[str, Any]) – Dictionary containing optimizer state
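Dependencies: None detected from callable globals.
Variables:
state_dict (dict[str, typing.Any], required).
Usage Example
import numpy as np
from grilly.optim.adamw import AdamW
optimizer = AdamW(iter([np.zeros(1, dtype=np.float32)]))
# load_state_dict expects a dict, such as one returned by state_dict().
saved = optimizer.state_dict()
optimizer.load_state_dict(state_dict=saved)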
- state_dict()
Return the state of the optimizer as a dict.
- Returns
Dictionary containing optimizer state
- Return type
dict[str, Any]
Dependencies: None detected from callable globals.
Variables: This callable does not take explicit input variables.
Usage Example
import numpy as np
from grilly.optim.adamw import AdamW
optimizer = AdamW(iter([np.zeros(1, dtype=np.float32)]))
state = optimizer.state_dict()
- zero_grad()
Clear gradients for all parameters.
Note: In this implementation, gradients are expected to be stored in a separate structure (e.g., in the model’s backward pass). This method is provided for API compatibility.
Dependencies: None detected from callable globals.
Variables: This callable does not take explicit input variables.
Usage Example
import numpy as np
from grilly.optim.adamw import AdamW
optimizer = AdamW(iter([np.zeros(1, dtype=np.float32)]))
optimizer.zero_grad()