Tutorial 10: Multimodal Fusion
==============================

Goal: fuse vision and text features with Grilly multimodal modules.

Step 1: Prepare modality tensors
--------------------------------

.. code-block:: python

    import numpy as np

    batch = 2
    vision_tokens = 64
    text_tokens = 32
    d_model = 256

    vision = np.random.randn(batch, vision_tokens, d_model).astype(np.float32)
    text = np.random.randn(batch, text_tokens, d_model).astype(np.float32)

Step 2: Bottleneck fusion
-------------------------

.. code-block:: python

    import grilly.nn as nn

    fusion = nn.BottleneckFusion(d_model=d_model, num_bottlenecks=32, num_heads=8)
    fused = fusion(vision, text)
    print("bottleneck fused:", fused.shape)

Step 3: Cross-modal attention fusion
------------------------------------

.. code-block:: python

    cross = nn.CrossModalAttentionFusion(d_model=d_model, num_heads=8, num_encoder_layers=2)
    cross_out = cross(vision, text)
    print("cross fused:", cross_out.shape)

Step 4: Perceiver IO on variable input
--------------------------------------

.. code-block:: python

    perceiver = nn.PerceiverIO(input_dim=d_model, latent_dim=512, num_latents=128, num_heads=8)
    latent = perceiver(vision)
    print("latent shape:", latent.shape)

Step 5: Expand to full VLM stack
--------------------------------

For larger projects, move from these fusion blocks into:

- `nn.VisionLanguageModel`
- `nn.VLMLayer`
- `nn.FlamingoFusion`
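
For intuition about what the fusion modules in Steps 2 to 4 build on, the following is a minimal NumPy-only sketch of single-head cross-attention, in which tokens from one modality query tokens from another. It is not Grilly's implementation: the helper names ``softmax`` and ``cross_attention`` are hypothetical, and the sketch leaves out the learned query/key/value projections, multiple heads, and normalization that a real module would normally include. The last part illustrates, under the same assumptions, how a small set of latent queries yields a fixed-size output regardless of how many input tokens there are.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context):
    """Single-head scaled dot-product attention (hypothetical helper).

    queries: (batch, n_q, d), context: (batch, n_c, d).
    Returns the attended output (batch, n_q, d) and weights (batch, n_q, n_c).
    No learned projections are applied; this is an illustration only.
    """
    d = queries.shape[-1]
    scores = queries @ context.transpose(0, 2, 1) / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ context, weights


rng = np.random.default_rng(0)
vision = rng.standard_normal((2, 64, 256)).astype(np.float32)
text = rng.standard_normal((2, 32, 256)).astype(np.float32)

# Text tokens attend over vision tokens.
text_attended, weights = cross_attention(text, vision)
print("text attended:", text_attended.shape)
print("attention weights:", weights.shape)

# A small set of latent queries attends over both modalities at once,
# giving an output whose size depends only on the number of latents.
latents = rng.standard_normal((2, 8, 256)).astype(np.float32)
both = np.concatenate([vision, text], axis=1)
latent_out, _ = cross_attention(latents, both)
print("latent summary:", latent_out.shape)
```

Each row of ``weights`` sums to one, so every text token receives a convex combination of the vision tokens. Swapping the argument order gives the reverse direction, vision attending to text.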