Training and Optimization
=========================

Training a model in Grilly
--------------------------

Grilly training uses explicit gradient flow:

1. Run the forward pass through the modules.
2. Compute the loss value.
3. Build the gradient of the loss with respect to the model output.
4. Call ``backward`` on the modules or containers.
5. Call the optimizer's ``step``.

Unlike frameworks that enable global autograd by default, many Grilly code
paths expose a direct ``backward`` method on each module.
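
The five steps above can be sketched without Grilly at all. The NumPy-only
example below uses a hypothetical ``TinyLinear`` class (not part of Grilly, and
its attribute names are assumptions) to show what a per-module ``backward`` has
to do: reuse the input cached during the forward pass, write the parameter
gradients, and return the gradient with respect to its input.

.. code-block:: python

   import numpy as np


   class TinyLinear:
       """Hypothetical stand-in for a module with an explicit backward method."""

       def __init__(self, in_features, out_features, seed=0):
           rng = np.random.default_rng(seed)
           scale = 0.1
           self.weight = (rng.standard_normal((in_features, out_features)) * scale).astype(np.float32)
           self.bias = np.zeros(out_features, dtype=np.float32)
           self.weight_grad = np.zeros_like(self.weight)
           self.bias_grad = np.zeros_like(self.bias)
           self._x = None

       def forward(self, x):
           self._x = x  # cache the input; backward needs it
           return x @ self.weight + self.bias

       def backward(self, grad_out):
           # Write the parameter gradients for this module.
           self.weight_grad = self._x.T @ grad_out
           self.bias_grad = grad_out.sum(axis=0)
           # Return the gradient with respect to the input for the previous module.
           return grad_out @ self.weight.T


   rng = np.random.default_rng(1)
   x = rng.standard_normal((8, 4)).astype(np.float32)
   y = rng.standard_normal((8, 2)).astype(np.float32)
   layer = TinyLinear(4, 2)

   pred = layer.forward(x)                   # 1. forward pass
   loss = np.mean((pred - y) ** 2)           # 2. loss value
   grad_out = (2.0 / y.size) * (pred - y)    # 3. gradient w.r.t. the output
   grad_in = layer.backward(grad_out)        # 4. backward
   layer.weight -= 0.1 * layer.weight_grad   # 5. optimizer step (plain SGD here)
   layer.bias -= 0.1 * layer.bias_grad

In a container such as ``nn.Sequential``, the value returned by one module's
``backward`` would be passed as ``grad_out`` to the module before it.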

Canonical loop
--------------

.. code-block:: python

   import numpy as np
   import grilly.nn as nn
   import grilly.optim as optim

   model = nn.Sequential(
       nn.Linear(128, 256),
       nn.GELU(),
       nn.Linear(256, 10),
   )
   optimizer = optim.Adam(model.parameters(), lr=1e-3)

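   # Random inputs and regression targets, kept in float32.
   x = np.random.randn(64, 128).astype(np.float32)
   y = np.random.randn(64, 10).astype(np.float32)

   # Forward pass and mean-squared-error loss.
   pred = model(x)
   loss = np.mean((pred - y) ** 2)
   # Gradient of the MSE loss with respect to pred.
   grad_out = (2.0 / y.size) * (pred - y)

   # Clear old gradients, backpropagate, then update the parameters.
   model.zero_grad()
   model.backward(grad_out)
   optimizer.step()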
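
The ``grad_out`` expression in the loop is the analytic derivative of the mean
squared error with respect to ``pred``. One way to confirm the
``2.0 / y.size`` factor is a finite-difference check in plain NumPy. The sketch
below does not use Grilly, and ``mse`` is a helper defined only for this
example.

.. code-block:: python

   import numpy as np


   def mse(pred, y):
       return np.mean((pred - y) ** 2)


   rng = np.random.default_rng(0)
   # float64 is used here only so that the finite difference is accurate.
   pred = rng.standard_normal((3, 2))
   y = rng.standard_normal((3, 2))

   # Same formula as in the training loop.
   analytic = (2.0 / y.size) * (pred - y)

   # Central differences, one element of pred at a time.
   eps = 1e-6
   numeric = np.zeros_like(pred)
   for idx in np.ndindex(pred.shape):
       bumped_up = pred.copy()
       bumped_down = pred.copy()
       bumped_up[idx] += eps
       bumped_down[idx] -= eps
       numeric[idx] = (mse(bumped_up, y) - mse(bumped_down, y)) / (2 * eps)

   max_abs_error = np.max(np.abs(analytic - numeric))

The same pattern is a reasonable first test for any hand-written loss gradient
before it is passed to ``model.backward``.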

Optimizers
----------

``grilly.optim`` includes:

- ``Adam``, ``AdamW``
- ``SGD``
- ``NLMS``
- ``NaturalGradient``
- scheduler utilities (``StepLR``, ``CosineAnnealingLR``, etc.)

Optimizers can use GPU-backed update paths when a GPU backend is available and
fall back to CPU updates otherwise.
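
For orientation, the following is a minimal NumPy sketch of the textbook Adam
update with bias correction. It illustrates the algorithm only; the function
name ``adam_step`` and its signature are assumptions made for this example, and
Grilly's own CPU or GPU update paths may differ in detail.

.. code-block:: python

   import numpy as np


   def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
       """One textbook Adam update; t is the 1-based step count."""
       m = beta1 * m + (1.0 - beta1) * grad          # first-moment estimate
       v = beta2 * v + (1.0 - beta2) * grad * grad   # second-moment estimate
       m_hat = m / (1.0 - beta1 ** t)                # bias correction
       v_hat = v / (1.0 - beta2 ** t)
       new_param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
       return new_param.astype(param.dtype), m, v


   # Minimize sum(param ** 2), whose gradient is 2 * param.
   param = np.array([1.0, -2.0, 3.0], dtype=np.float32)
   initial_loss = float(np.sum(param ** 2))
   m = np.zeros_like(param)
   v = np.zeros_like(param)
   for t in range(1, 201):
       grad = 2.0 * param
       param, m, v = adam_step(param, grad, m, v, t, lr=0.05)
   final_loss = float(np.sum(param ** 2))

On the first step the bias-corrected update has magnitude close to ``lr`` for
every element regardless of the gradient scale, which is one reason a small
learning rate is a sensible starting point.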

Gradient lifecycle
------------------

A typical parameter goes through this lifecycle on every step:

- ``param.grad`` is written during backward.
- ``optimizer.step()`` consumes the gradient and updates the parameter.
- ``zero_grad()`` clears or resets the gradients for the next step.
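
The lifecycle above can be traced with a small NumPy sketch. The ``Param``
class and the ``backward``, ``sgd_step``, and ``zero_grad`` helpers are
hypothetical and exist only for this illustration; Grilly's parameter and
optimizer classes may store and reset gradients differently.

.. code-block:: python

   import numpy as np


   class Param:
       """Hypothetical parameter container holding a value and its gradient."""

       def __init__(self, data):
           self.data = np.asarray(data, dtype=np.float32)
           self.grad = None


   def backward(param, grad):
       # Backward writes the gradient onto the parameter.
       param.grad = np.asarray(grad, dtype=np.float32)


   def sgd_step(params, lr):
       # The step consumes whatever gradient is currently stored.
       for p in params:
           if p.grad is not None:
               p.data = p.data - lr * p.grad


   def zero_grad(params):
       # Clearing prevents a stale gradient from being reused.
       for p in params:
           p.grad = None


   w = Param([1.0, 2.0])
   backward(w, [0.5, -0.5])   # w.grad is written
   sgd_step([w], lr=0.1)      # w.data is updated from w.grad
   zero_grad([w])             # w.grad is cleared
   sgd_step([w], lr=0.1)      # no gradient stored, so w.data is unchanged

Skipping the clearing step in this sketch would make the second ``sgd_step``
apply the old gradient again, which is the kind of silent error the lifecycle
is meant to prevent.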

Numerical and runtime tips
--------------------------

1. Keep all training arrays in ``float32``.
2. Verify that gradients are finite before calling ``optimizer.step()``.
3. Start with small learning rates for custom module stacks.
4. For long runs, checkpoint the model and optimizer state periodically.
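
The first two tips can be enforced with a couple of NumPy helpers. This is a
minimal sketch: ``all_finite`` and ``as_float32`` are names chosen for this
example, and how the gradient arrays are collected from a Grilly model is left
out because it depends on the module API.

.. code-block:: python

   import numpy as np


   def all_finite(arrays):
       """Return True if no array contains NaN or +/-Inf."""
       return all(np.isfinite(a).all() for a in arrays)


   def as_float32(array):
       """Cast to float32; no copy is made when the dtype already matches."""
       return np.asarray(array, dtype=np.float32)


   good = [np.ones((2, 2), dtype=np.float32), np.zeros(3, dtype=np.float32)]
   bad = [np.ones(2, dtype=np.float32), np.array([1.0, np.nan], dtype=np.float32)]

   x = as_float32(np.arange(4))          # integer input becomes float32
   should_step_good = all_finite(good)   # safe to call the optimizer step
   should_step_bad = all_finite(bad)     # skip the step and investigate

When the check fails, skipping the update and logging the offending step is
usually safer than letting a NaN propagate into the parameters.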

Related advanced learning ops
-----------------------------

Beyond standard optimizers, Grilly includes GPU-accelerated learning kernels:

- Fisher information updates and EWC penalties.
- NLMS prediction and updates.
- Whitening transforms.
- Contrastive and specialized loss helpers.
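
As a reference point for the NLMS entry, here is a minimal NumPy sketch of the
textbook normalized least-mean-squares rule for a linear predictor. The
function ``nlms_update`` and its parameters ``mu`` and ``eps`` are assumptions
made for this illustration and are not Grilly's kernel interface.

.. code-block:: python

   import numpy as np


   def nlms_update(w, x, target, mu=0.5, eps=1e-8):
       """One NLMS step: predict, measure the error, take a normalized step."""
       pred = float(w @ x)
       error = target - pred
       # Dividing by the input energy makes the step size scale-invariant.
       w_new = w + mu * error * x / (eps + x @ x)
       return w_new, pred, error


   rng = np.random.default_rng(0)
   true_w = np.array([0.5, -1.0, 2.0])   # weights the filter should recover
   w = np.zeros(3)
   for step in range(500):
       x = rng.standard_normal(3)
       target = float(true_w @ x)        # noise-free target
       w, pred, error = nlms_update(w, x, target)

With noise-free targets and ``0 < mu < 2`` the weights converge toward
``true_w``; with noisy targets a smaller ``mu`` trades convergence speed for a
lower steady-state error.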