Training and Optimization
Training a model in Grilly
Grilly training uses an explicit gradient flow:
1. Run the forward pass through the modules.
2. Compute the loss value.
3. Build the gradient of the loss with respect to the model output.
4. Call backward on the modules/containers.
5. Call the optimizer step.
Unlike frameworks that rely on global autograd by default, many Grilly code paths expose a direct backward method on each module.
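To make the idea of a per-module backward method concrete, here is a minimal NumPy-only sketch. `TinyLinear` and its attribute names are hypothetical and are not Grilly's implementation; the sketch only illustrates the pattern of caching the input in forward and returning the input gradient from backward.

```python
import numpy as np


class TinyLinear:
    """Hypothetical stand-in for a module with an explicit backward method."""

    def __init__(self, in_features, out_features, seed=0):
        rng = np.random.default_rng(seed)
        init = 0.01 * rng.standard_normal((in_features, out_features))
        self.weight = init.astype(np.float32)
        self.bias = np.zeros(out_features, dtype=np.float32)
        self.weight_grad = np.zeros_like(self.weight)
        self.bias_grad = np.zeros_like(self.bias)
        self._x = None

    def forward(self, x):
        # Cache the input; backward needs it to form the weight gradient.
        self._x = x
        return x @ self.weight + self.bias

    def backward(self, grad_out):
        # Accumulate parameter gradients, then return the gradient with
        # respect to the input so the previous module can continue the chain.
        self.weight_grad += self._x.T @ grad_out
        self.bias_grad += grad_out.sum(axis=0)
        return grad_out @ self.weight.T


layer = TinyLinear(4, 3)
x = np.ones((2, 4), dtype=np.float32)
out = layer.forward(x)
grad_in = layer.backward(np.ones_like(out))
```

A container such as a sequential stack would call backward on its modules in reverse order, feeding each returned gradient into the module before it.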
Canonical loop
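import numpy as np
import grilly.nn as nn
import grilly.optim as optim

# Small MLP: 128 input features -> 256 hidden units -> 10 outputs
model = nn.Sequential(
    nn.Linear(128, 256),
    nn.GELU(),
    nn.Linear(256, 10),
)
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# Synthetic float32 batch: 64 samples with random regression targets
x = np.random.randn(64, 128).astype(np.float32)
y = np.random.randn(64, 10).astype(np.float32)

# Forward pass and mean-squared-error loss
pred = model(x)
loss = np.mean((pred - y) ** 2)

# Gradient of the MSE loss with respect to the model output
grad_out = (2.0 / y.size) * (pred - y)

# Clear stale gradients, backpropagate, then update the parameters
model.zero_grad()
model.backward(grad_out)
optimizer.step()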
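The expression for grad_out follows from the loss being a mean over N = y.size elements: the derivative of the loss with respect to each prediction is 2 * (pred - y) / N. The sketch below checks that formula against central finite differences using NumPy only. It runs in float64 for a tight comparison and does not touch Grilly; the `mse` helper is introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
pred = rng.standard_normal((4, 3))
y = rng.standard_normal((4, 3))


def mse(p, target):
    # Same loss as in the loop above: mean of squared differences.
    return np.mean((p - target) ** 2)


# Analytic gradient, identical in form to grad_out in the loop above.
grad_out = (2.0 / y.size) * (pred - y)

# Central finite differences, one output element at a time.
eps = 1e-6
numeric = np.zeros_like(pred)
for idx in np.ndindex(pred.shape):
    plus = pred.copy()
    minus = pred.copy()
    plus[idx] += eps
    minus[idx] -= eps
    numeric[idx] = (mse(plus, y) - mse(minus, y)) / (2 * eps)

max_abs_diff = np.max(np.abs(numeric - grad_out))
```

The same kind of check is useful whenever a different loss is swapped in, since the output gradient is written by hand in this style of training loop.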
Optimizers
grilly.optim includes:
Adam, AdamW
SGD
NLMS
NaturalGradient
scheduler utilities (StepLR, CosineAnnealingLR, etc.)
The optimizer layer can use GPU-backed update paths when they are available and falls back to CPU updates otherwise.
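As a reference for what an optimizer update computes, here is a NumPy sketch of a single step of the standard Adam rule with bias correction. The function name `adam_step` and its defaults are assumptions for illustration; this is not Grilly's implementation, and the source does not describe how its GPU or CPU paths are written.

```python
import numpy as np


def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One textbook Adam update; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v


param = np.array([1.0, -2.0], dtype=np.float32)
grad = np.array([0.5, -0.5], dtype=np.float32)
m = np.zeros_like(param)
v = np.zeros_like(param)
param, m, v = adam_step(param, grad, m, v, t=1)
```

On the first step the bias-corrected moments reduce to the gradient and its square, so each parameter moves by roughly lr in the direction opposite to the sign of its gradient.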
Gradient lifecycle
Typical parameter lifecycle:
param.grad is written during backward.
optimizer.step() consumes the gradient and updates the parameter.
zero_grad() clears or resets gradients for the next step.
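The lifecycle above can be sketched with a toy parameter and a plain SGD update. `Param`, `sgd_step`, and `zero_grad` below are hypothetical helpers, not Grilly classes, and resetting the gradient to zeros (rather than to None) is an assumption, since the source only says gradients are cleared or reset.

```python
import numpy as np


class Param:
    """Hypothetical parameter holder with a .grad slot."""

    def __init__(self, data):
        self.data = np.asarray(data, dtype=np.float32)
        self.grad = np.zeros_like(self.data)


def sgd_step(params, lr):
    # Consume each gradient to update its parameter in place.
    for p in params:
        p.data -= lr * p.grad


def zero_grad(params):
    # Reset gradients so the next backward pass starts clean (assumed: zeros).
    for p in params:
        p.grad = np.zeros_like(p.data)


w = Param([1.0, 2.0])
w.grad += np.array([0.5, -0.5], dtype=np.float32)  # written during backward
sgd_step([w], lr=0.1)                              # consumed by the step
zero_grad([w])                                     # cleared for the next step
```

If backward accumulates into the gradient, as in this sketch, skipping the reset would mix gradients from consecutive batches.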
Numerical and runtime tips
Keep all training arrays in float32.
Verify that gradients are finite before calling optimizer.step().
Start with small learning rates for custom module stacks.
For long runs, checkpoint model/optimizer state periodically.
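The first two tips can be sketched with NumPy alone. The helper names `as_float32` and `all_finite` are hypothetical; how gradients are collected from a Grilly model is not shown in this section, so the example operates on plain lists of arrays.

```python
import numpy as np


def as_float32(*arrays):
    # Cast every input to float32 so the whole pipeline shares one dtype.
    return tuple(np.asarray(a, dtype=np.float32) for a in arrays)


def all_finite(grads):
    # True only if no gradient array contains NaN or +/-inf.
    return all(np.isfinite(g).all() for g in grads)


x, y = as_float32(np.random.randn(8, 4), np.random.randn(8, 2))

good = [np.ones((4, 2), dtype=np.float32)]
bad = [np.array([1.0, np.nan], dtype=np.float32)]
ok_good = all_finite(good)
ok_bad = all_finite(bad)
```

In a training loop, a failed check is a reasonable point to skip the update for that batch and log the offending step, rather than letting non-finite values reach the parameters.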