Tutorial 10: Multimodal Fusion

Goal: fuse vision and text features with Grilly multimodal modules.

Step 1: Prepare modality tensors

import numpy as np

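batch = 2
vision_tokens = 64  # sequence length of the vision stream
text_tokens = 32    # sequence length of the text stream
d_model = 256       # feature width shared by both modalities

# Random stand-ins for encoder outputs, shaped (batch, tokens, d_model).
vision = np.random.randn(batch, vision_tokens, d_model).astype(np.float32)
text = np.random.randn(batch, text_tokens, d_model).astype(np.float32)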

Step 2: Bottleneck fusion

import grilly.nn as nn

fusion = nn.BottleneckFusion(d_model=d_model, num_bottlenecks=32, num_heads=8)
fused = fusion(vision, text)
print("bottleneck fused:", fused.shape)
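
To make the idea behind bottleneck fusion concrete, here is a minimal NumPy sketch. It is not Grilly's implementation: it assumes single-head attention with no learned projections, and the helper names (softmax, attend, bottleneck_fusion_sketch) are hypothetical. It only illustrates the general pattern in which the two modalities exchange information through a small set of shared bottleneck tokens instead of attending to each other directly.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, context):
    """Single-head scaled dot-product attention with no learned projections."""
    scale = np.sqrt(queries.shape[-1])
    scores = queries @ context.transpose(0, 2, 1) / scale  # (B, Q, C)
    return softmax(scores) @ context                       # (B, Q, D)

def bottleneck_fusion_sketch(vision, text, bottleneck):
    # 1. The bottleneck tokens gather information from each modality.
    from_vision = attend(bottleneck, vision)
    from_text = attend(bottleneck, text)
    shared = 0.5 * (from_vision + from_text)
    # 2. Each modality reads back from the shared bottleneck only (residual update).
    vision_out = vision + attend(vision, shared)
    text_out = text + attend(text, shared)
    return vision_out, text_out, shared

rng = np.random.default_rng(0)
vision = rng.standard_normal((2, 64, 256)).astype(np.float32)
text = rng.standard_normal((2, 32, 256)).astype(np.float32)
bottleneck = rng.standard_normal((2, 32, 256)).astype(np.float32)  # 32 bottleneck tokens

v, t, s = bottleneck_fusion_sketch(vision, text, bottleneck)
print(v.shape, t.shape, s.shape)  # (2, 64, 256) (2, 32, 256) (2, 32, 256)
```

Because all cross-modal traffic passes through the 32 bottleneck tokens, the attention cost scales with the bottleneck size rather than with the product of the two sequence lengths. What nn.BottleneckFusion actually returns (a single tensor or one per modality) should be checked against the Grilly documentation.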

Step 4: Perceiver IO on variable-length input

perceiver = nn.PerceiverIO(input_dim=d_model, latent_dim=512, num_latents=128, num_heads=8)
latent = perceiver(vision)
print("latent shape:", latent.shape)
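
The reason a Perceiver-style module copes with inputs of different lengths can be sketched in plain NumPy. The code below is an illustration under stated assumptions, not Grilly's code: it uses a single attention head, random weights in place of learned ones, and a hypothetical helper latent_cross_attention. A fixed set of latent vectors acts as the queries, so the output size is set by num_latents and latent_dim, not by the number of input tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def latent_cross_attention(inputs, latents, w_k, w_v):
    """Latent queries read from an input sequence of any length."""
    keys = inputs @ w_k                                  # (B, N, latent_dim)
    values = inputs @ w_v                                # (B, N, latent_dim)
    scale = np.sqrt(latents.shape[-1])
    scores = latents @ keys.transpose(0, 2, 1) / scale   # (B, num_latents, N)
    return softmax(scores) @ values                      # (B, num_latents, latent_dim)

rng = np.random.default_rng(0)
input_dim, latent_dim, num_latents = 256, 512, 128

# Random stand-ins for the learned latent array and key/value projections.
latents = rng.standard_normal((num_latents, latent_dim)) * 0.02
w_k = rng.standard_normal((input_dim, latent_dim)) * 0.02
w_v = rng.standard_normal((input_dim, latent_dim)) * 0.02

short_out = latent_cross_attention(rng.standard_normal((2, 64, input_dim)), latents, w_k, w_v)
long_out = latent_cross_attention(rng.standard_normal((2, 200, input_dim)), latents, w_k, w_v)
print(short_out.shape, long_out.shape)  # (2, 128, 512) (2, 128, 512)
```

Both calls produce the same output shape even though the inputs have 64 and 200 tokens. Whether nn.PerceiverIO returns the latent array in exactly this layout is an assumption to verify against the printed shape.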

Step 5: Expand to full VLM stack

For larger projects, move on from these fusion blocks to the full VLM stack:

nn.VisionLanguageModel
nn.VLMLayer
nn.FlamingoFusion
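
As background for the last entry, Flamingo-style models inject vision features into the text stream through tanh-gated cross-attention. The NumPy sketch below shows only that gating idea. It assumes single-head attention without learned projections, the helper names are hypothetical, and it makes no claim about how nn.FlamingoFusion is implemented or called.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, context):
    """Single-head scaled dot-product attention with no learned projections."""
    scale = np.sqrt(queries.shape[-1])
    scores = queries @ context.transpose(0, 2, 1) / scale  # (B, Q, C)
    return softmax(scores) @ context                       # (B, Q, D)

def gated_cross_attention(text, vision, alpha):
    """Text tokens attend to vision tokens; tanh(alpha) scales the update."""
    attended = attend(text, vision)           # (B, text_tokens, D)
    return text + np.tanh(alpha) * attended   # residual with a scalar gate

rng = np.random.default_rng(0)
text = rng.standard_normal((2, 32, 256))
vision = rng.standard_normal((2, 64, 256))

closed = gated_cross_attention(text, vision, alpha=0.0)  # gate closed: tanh(0) = 0
opened = gated_cross_attention(text, vision, alpha=1.0)  # gate partly open
print(np.allclose(closed, text), np.allclose(opened, text))  # True False
```

With the gate at zero the layer passes the text features through unchanged, which lets a pretrained language model start from its original behavior and learn how much visual information to admit.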