Tutorial 04: Attention and Memory Workflow
==========================================

Goal: combine attention output with memory retrieval.

Step 1: Prepare attention inputs
--------------------------------

.. code-block:: python

   import numpy as np
   import grilly

   backend = grilly.Compute()

   batch = 4
   heads = 8
   seq = 32
   head_dim = 64

   q = np.random.randn(batch, heads, seq, head_dim).astype(np.float32)
   k = np.random.randn(batch, heads, seq, head_dim).astype(np.float32)
   v = np.random.randn(batch, heads, seq, head_dim).astype(np.float32)

Step 2: Compute attention output
--------------------------------

.. code-block:: python

   attn_out = backend.attention.flash_attention2(q, k, v)
   print("attention output:", attn_out.shape)

Step 3: Build memory database
-----------------------------

.. code-block:: python

   query = np.random.randn(1, 256).astype(np.float32)
   database = np.random.randn(5000, 256).astype(np.float32)

Step 4: Retrieve nearest vectors
--------------------------------

.. code-block:: python

   distances = backend.faiss.compute_distances(query, database)
   topk_values, topk_indices = backend.faiss.topk(distances, k=8)
   retrieved = database[topk_indices[0]]

Step 5: Use retrieved context
-----------------------------

At this point you can:

1. Concatenate retrieved vectors with model state.
2. Inject retrieved context before the next decoder/FFN block.
3. Re-rank or route candidates with additional similarity passes.

Step 6: Cleanup
---------------

.. code-block:: python

   backend.cleanup()