Training and Optimization

Training a model in Grilly

Grilly training uses explicit gradient flow:

  1. Forward pass through modules.

  2. Compute loss value.

  3. Build the gradient of the loss with respect to the model output.

  4. Call backward on modules/containers.

  5. Call optimizer step.

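Unlike frameworks that rely on global autograd by default, many Grilly code paths expose a backward method directly on each module. You compute the gradient of the loss with respect to the model output yourself and pass it to backward.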

Canonical loop

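import numpy as np
import grilly.nn as nn
import grilly.optim as optim

# Two-layer MLP: 128 input features, 10 outputs.
model = nn.Sequential(
    nn.Linear(128, 256),
    nn.GELU(),
    nn.Linear(256, 10),
)
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# Random float32 batch of 64 samples with regression targets.
x = np.random.randn(64, 128).astype(np.float32)
y = np.random.randn(64, 10).astype(np.float32)

# Steps 1-3: forward pass, MSE loss, and its gradient with respect to pred.
pred = model(x)
loss = np.mean((pred - y) ** 2)
grad_out = (2.0 / y.size) * (pred - y)  # d(loss)/d(pred) for mean squared error

# Steps 4-5: clear old gradients, backpropagate, update parameters.
model.zero_grad()
model.backward(grad_out)
optimizer.step()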
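
Because the output gradient is written by hand in this loop, it can be worth checking the formula numerically before trusting it. The sketch below uses only NumPy (no Grilly calls) and compares the analytic expression for grad_out against central finite differences on a small array. The helper name mse and the array sizes are illustrative choices, not part of Grilly.

```python
import numpy as np

rng = np.random.default_rng(0)
# float64 keeps finite-difference rounding error small; training itself uses float32.
pred = rng.standard_normal((4, 3))
y = rng.standard_normal((4, 3))


def mse(p):
    """Mean squared error over every element, matching np.mean((pred - y) ** 2)."""
    return np.mean((p - y) ** 2)


# Analytic gradient of the mean squared error with respect to pred.
analytic = (2.0 / y.size) * (pred - y)

# Central finite differences, perturbing one element at a time.
eps = 1e-6
numeric = np.zeros_like(pred)
for idx in np.ndindex(*pred.shape):
    up = pred.copy()
    down = pred.copy()
    up[idx] += eps
    down[idx] -= eps
    numeric[idx] = (mse(up) - mse(down)) / (2.0 * eps)

max_abs_error = float(np.max(np.abs(analytic - numeric)))
print(f"max abs error: {max_abs_error:.2e}")
```

The same pattern applies to any custom loss: write the loss as a function of the prediction, derive the gradient by hand, and compare the two on a tiny input.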

Optimizers

grilly.optim includes:

  • Adam, AdamW

  • SGD

  • NLMS

  • NaturalGradient

  • scheduler utilities (StepLR, CosineAnnealingLR, etc.)

The optimizers can use GPU-backed update paths when available and fall back to CPU otherwise.
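
To make the scheduler names listed above concrete, here is a minimal sketch of the standard step-decay and cosine-annealing formulas in plain Python. The functions step_lr and cosine_lr and their parameter names are hypothetical and written for illustration. They are not Grilly's StepLR or CosineAnnealingLR classes, whose constructor arguments are not described here.

```python
import math


def step_lr(step, base_lr, step_size, gamma=0.1):
    """Step decay: multiply the rate by gamma once every step_size steps."""
    return base_lr * gamma ** (step // step_size)


def cosine_lr(step, base_lr, t_max, eta_min=0.0):
    """Cosine annealing from base_lr down to eta_min over t_max steps."""
    step = min(step, t_max)  # hold at eta_min once the schedule has finished
    cosine = math.cos(math.pi * step / t_max)
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + cosine)


# Print a few points of each schedule for a base rate of 1e-3.
for step in (0, 25, 50, 75, 100):
    print(step, step_lr(step, 1e-3, step_size=50), cosine_lr(step, 1e-3, t_max=100))
```

Step decay changes the rate abruptly at fixed intervals, while cosine annealing lowers it smoothly. Which one suits a given run is an empirical choice.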

Gradient lifecycle

Typical parameter lifecycle:

  • param.grad is written during backward.

  • optimizer.step() consumes the gradient and updates the parameter.

  • zero_grad() clears or resets gradients for the next step.
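
The lifecycle above can be sketched with a toy parameter in NumPy. ToyParam, toy_backward, toy_sgd_step, and toy_zero_grad are hypothetical stand-ins written for this example and are not Grilly classes or functions. The sketch also assumes that backward accumulates into the gradient buffer, which is one common design. Whether a given Grilly module accumulates or overwrites is not stated here.

```python
import numpy as np


class ToyParam:
    """Hypothetical parameter holding a value and a gradient buffer."""

    def __init__(self, value):
        self.data = np.array(value, dtype=np.float32)  # copy, so updates stay local
        self.grad = np.zeros_like(self.data)


def toy_backward(param, grad):
    # Assumption: gradients accumulate, so stale values must be cleared each step.
    param.grad += np.asarray(grad, dtype=np.float32)


def toy_sgd_step(param, lr):
    # Plain SGD update: move against the gradient.
    param.data -= np.float32(lr) * param.grad


def toy_zero_grad(param):
    # Reset the buffer in place so the next backward starts from zero.
    param.grad.fill(0.0)


w = ToyParam([1.0, 2.0])
toy_backward(w, [0.5, -1.0])  # gradient is written
toy_sgd_step(w, lr=0.1)       # gradient is consumed, parameter is updated
toy_zero_grad(w)              # gradient is cleared for the next step
print(w.data, w.grad)
```

Under the accumulation assumption, skipping the reset would add the next step's gradient on top of the old one, which is why the canonical loop calls zero_grad() on every iteration.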

Numerical and runtime tips

  1. Keep all training arrays in float32.

  2. Verify that gradients are finite (no NaN or Inf) before calling optimizer.step().

  3. Start with small learning rates for custom module stacks.

  4. For long runs, checkpoint model/optimizer state periodically.
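
The first two tips can be sketched with a pair of small NumPy helpers. as_float32 and grads_are_finite are hypothetical names written for this example and are not part of Grilly. They operate on plain arrays, so how the gradients are gathered from a model is left to the caller.

```python
import numpy as np


def as_float32(array):
    """Convert input data to float32 (no copy if it is already a float32 array)."""
    return np.asarray(array, dtype=np.float32)


def grads_are_finite(grads):
    """Return True only if every gradient array is free of NaN and Inf."""
    return all(np.isfinite(g).all() for g in grads)


good = [as_float32([0.1, -0.2]), as_float32([[1.0, 2.0]])]
bad = [as_float32([0.1, np.nan]), as_float32([np.inf])]

print(grads_are_finite(good))
print(grads_are_finite(bad))
```

In a training loop, one option is to call optimizer.step() only when the check passes and to skip or log the batch otherwise. Collecting the arrays from param.grad on each parameter is an assumption based on the gradient lifecycle described above.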