Attention, Transformers, and Decoding
=====================================

Attention stack
---------------

Grilly provides both module-level and backend-level attention:

- Module API: `nn.MultiheadAttention`, `nn.FlashAttention2`
- Backend API: `backend.attention.*`, including `backend.attention.flash_attention2(...)`

Use the module API when building end-to-end model graphs.
Use the backend API when calling kernels directly, for example in benchmarks.

Why Flash Attention 2 matters
-----------------------------

Flash-style attention reduces memory pressure for long sequences by computing
attention in tiled blocks rather than materializing full intermediate matrices.
In Grilly this can improve throughput and reduce out-of-memory (OOM) risk at
practical sequence lengths, especially on consumer GPUs.

Transformer-facing modules
--------------------------

`grilly.nn` also includes transformer-oriented components:

- `TransformerEncoderLayer`
- `TransformerDecoderLayer`
- `RoPE`
- `ProsodyModulatedAttention`
- decoding modules (`GreedyDecoder`, `SampleDecoder`)

Basic attention example
-----------------------

.. code-block:: python

    import numpy as np
    import grilly.nn as nn

    attn = nn.MultiheadAttention(embed_dim=256, num_heads=8)

    # The last dimension of the input matches embed_dim.
    x = np.random.randn(4, 32, 256).astype(np.float32)

    # Self-attention: query, key, and value are the same array.
    out, weights = attn(query=x, key=x, value=x)
    print(out.shape, weights.shape)

Backend flash attention example
-------------------------------

.. code-block:: python

    import numpy as np
    import grilly

    backend = grilly.Compute()

    # Random float32 inputs; q, k, and v share the same 4-D shape.
    q = np.random.randn(2, 8, 64, 64).astype(np.float32)
    k = np.random.randn(2, 8, 64, 64).astype(np.float32)
    v = np.random.randn(2, 8, 64, 64).astype(np.float32)

    y = backend.attention.flash_attention2(q, k, v)
    print(y.shape)

Decoding usage
--------------

Decoder modules convert logits into token decisions:

- greedy decoding for deterministic output
- sampled decoding for stochastic generation

They fit naturally after a transformer's output projection head.
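
The difference between greedy and sampled decoding can be illustrated with a minimal NumPy sketch. It deliberately does not call `GreedyDecoder` or `SampleDecoder`, since their signatures are not shown in this section; the helper names `greedy_step` and `sample_step` and the `temperature` parameter are assumptions made for illustration only.

```python
import numpy as np

def greedy_step(logits):
    """Pick the highest-scoring token id for each row of logits (deterministic)."""
    return np.argmax(logits, axis=-1)

def sample_step(logits, temperature=1.0, rng=None):
    """Draw one token id per row from softmax(logits / temperature)."""
    rng = np.random.default_rng() if rng is None else rng
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled = scaled - scaled.max(axis=-1, keepdims=True)  # stabilize exp()
    probs = np.exp(scaled)
    probs /= probs.sum(axis=-1, keepdims=True)
    # One categorical draw per row (hypothetical helper, not a Grilly API).
    return np.array([rng.choice(probs.shape[-1], p=row) for row in probs])

# Two rows of logits over a three-token vocabulary.
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0]], dtype=np.float32)

greedy_ids = greedy_step(logits)
sampled_ids = sample_step(logits, temperature=0.8, rng=np.random.default_rng(0))
print(greedy_ids, sampled_ids)
```

Greedy decoding returns the same ids on every call, while the sampled ids depend on the random generator and on the temperature: lower values concentrate probability on the top-scoring token.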
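
The tiled computation described under "Why Flash Attention 2 matters" can be sketched in plain NumPy. This is not Grilly's kernel: it is a simplified, single-head illustration that tiles over keys and values only and keeps a running softmax maximum and denominator, so that no full score matrix is ever stored. The function names and the `block_size` parameter are assumptions.

```python
import numpy as np

def reference_attention(q, k, v):
    """Standard attention that materializes the full (n_q, n_k) score matrix."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def tiled_attention(q, k, v, block_size=16):
    """Same result, but visits keys and values one block at a time."""
    n_q, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((n_q, v.shape[-1]))
    row_max = np.full((n_q, 1), -np.inf)  # running max score per query
    row_sum = np.zeros((n_q, 1))          # running softmax denominator
    for start in range(0, k.shape[0], block_size):
        k_blk = k[start:start + block_size]
        v_blk = v[start:start + block_size]
        scores = (q @ k_blk.T) * scale    # only (n_q, block_size) at a time
        new_max = np.maximum(row_max, scores.max(axis=-1, keepdims=True))
        correction = np.exp(row_max - new_max)  # rescale earlier partial sums
        p = np.exp(scores - new_max)
        row_sum = row_sum * correction + p.sum(axis=-1, keepdims=True)
        out = out * correction + p @ v_blk
        row_max = new_max
    return out / row_sum

rng = np.random.default_rng(0)
q = rng.standard_normal((64, 32))
k = rng.standard_normal((64, 32))
v = rng.standard_normal((64, 32))

max_diff = np.abs(tiled_attention(q, k, v) - reference_attention(q, k, v)).max()
print(max_diff)
```

The two functions agree up to floating-point rounding, but the tiled version never holds more than one `(n_q, block_size)` block of scores, which is where the memory saving for long sequences comes from.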