Tutorial 10: Multimodal Fusion
Goal: fuse vision and text features with Grilly multimodal modules.
Step 1: Prepare modality tensors
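import numpy as np
# Both modalities share the batch size and feature width (d_model);
# only the number of tokens differs.
batch = 2
vision_tokens = 64
text_tokens = 32
d_model = 256
# Random stand-ins for encoder outputs, shaped (batch, tokens, d_model).
vision = np.random.randn(batch, vision_tokens, d_model).astype(np.float32)
text = np.random.randn(batch, text_tokens, d_model).astype(np.float32)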
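Before passing the tensors to a fusion module, it can help to confirm that they agree on batch size and feature width. The sketch below is plain NumPy; `check_modalities` is a hypothetical helper written for this tutorial, not part of Grilly, and it assumes that fusion modules expect matching batch and `d_model` dimensions.

```python
import numpy as np

def check_modalities(vision, text):
    """Hypothetical helper: check that two (batch, tokens, d_model) arrays line up."""
    for name, arr in (("vision", vision), ("text", text)):
        if arr.ndim != 3:
            raise ValueError(f"{name} must be 3-D (batch, tokens, d_model), got {arr.shape}")
    if vision.shape[0] != text.shape[0]:
        raise ValueError("batch sizes differ")
    if vision.shape[2] != text.shape[2]:
        raise ValueError("feature widths (d_model) differ")
    # Token counts may differ freely between modalities.
    return vision.shape[0], vision.shape[2]

vision = np.zeros((2, 64, 256), dtype=np.float32)
text = np.zeros((2, 32, 256), dtype=np.float32)
print(check_modalities(vision, text))  # (2, 256)
```

If the feature widths differ in your own data, project one modality to the other's width before fusing.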
Step 2: Bottleneck fusion
import grilly.nn as nn
fusion = nn.BottleneckFusion(d_model=d_model, num_bottlenecks=32, num_heads=8)
fused = fusion(vision, text)
print("bottleneck fused:", fused.shape)
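To build intuition for what a bottleneck-style fusion does, here is a minimal NumPy sketch of the general idea: a small set of bottleneck tokens gathers information from both modalities, and each modality then reads back only from that summary. This is a simplified illustration under stated assumptions (single head, no learned projections, random bottleneck tokens); it is not a description of how `nn.BottleneckFusion` is implemented.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, context):
    """Single-head scaled dot-product attention without learned projections."""
    d = queries.shape[-1]
    scores = queries @ context.transpose(0, 2, 1) / np.sqrt(d)  # (batch, q_len, k_len)
    return softmax(scores, axis=-1) @ context                    # (batch, q_len, d)

rng = np.random.default_rng(0)
batch, vision_tokens, text_tokens, d_model, num_bottlenecks = 2, 64, 32, 256, 32
vision = rng.standard_normal((batch, vision_tokens, d_model)).astype(np.float32)
text = rng.standard_normal((batch, text_tokens, d_model)).astype(np.float32)

# Assumption: random bottleneck tokens; a trained module would learn these.
bottleneck = rng.standard_normal((1, num_bottlenecks, d_model)).astype(np.float32)
bottleneck = np.broadcast_to(bottleneck, (batch, num_bottlenecks, d_model))

# 1) The bottleneck tokens collect information from both modalities.
both = np.concatenate([vision, text], axis=1)   # (batch, 96, d_model)
summary = attend(bottleneck, both)              # (batch, 32, d_model)

# 2) Each modality reads back only from the compact summary (residual update).
vision_updated = vision + attend(vision, summary)
text_updated = text + attend(text, summary)

print("summary:", summary.shape)
print("vision updated:", vision_updated.shape)
print("text updated:", text_updated.shape)
```

Because all cross-modal traffic passes through `num_bottlenecks` tokens, the attention cost grows with the bottleneck size rather than with the product of the two sequence lengths.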
Step 3: Cross-modal attention fusion
cross = nn.CrossModalAttentionFusion(d_model=d_model, num_heads=8, num_encoder_layers=2)
cross_out = cross(vision, text)
print("cross fused:", cross_out.shape)
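The core operation behind cross-modal attention can be sketched in NumPy: queries come from one modality, keys and values from the other. The example below lets vision tokens attend to text tokens with 8 heads. The function `multihead_cross_attention` and the random weight matrices are assumptions made for illustration; Grilly's `nn.CrossModalAttentionFusion` may combine such layers differently.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multihead_cross_attention(queries, context, w_q, w_k, w_v, w_o, num_heads):
    """Queries from one modality attend to keys/values from another."""
    batch, q_len, d_model = queries.shape
    k_len = context.shape[1]
    head_dim = d_model // num_heads

    def split(x, length):
        # (batch, length, d_model) -> (batch, heads, length, head_dim)
        return x.reshape(batch, length, num_heads, head_dim).transpose(0, 2, 1, 3)

    q = split(queries @ w_q, q_len)
    k = split(context @ w_k, k_len)
    v = split(context @ w_v, k_len)

    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(head_dim)  # (batch, heads, q_len, k_len)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                        # (batch, heads, q_len, head_dim)
    merged = heads.transpose(0, 2, 1, 3).reshape(batch, q_len, d_model)
    return merged @ w_o, weights

rng = np.random.default_rng(0)
batch, vision_tokens, text_tokens, d_model, num_heads = 2, 64, 32, 256, 8
vision = rng.standard_normal((batch, vision_tokens, d_model)).astype(np.float32)
text = rng.standard_normal((batch, text_tokens, d_model)).astype(np.float32)

# Assumption: random projection weights standing in for learned parameters.
w_q, w_k, w_v, w_o = (
    (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)).astype(np.float32)
    for _ in range(4)
)

out, weights = multihead_cross_attention(vision, text, w_q, w_k, w_v, w_o, num_heads)
print("vision attending to text:", out.shape)
print("attention weights:", weights.shape)
```

The output keeps the query modality's sequence length, so swapping the first two arguments gives text tokens enriched with vision context instead.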
Step 4: Perceiver IO on variable-length input
perceiver = nn.PerceiverIO(input_dim=d_model, latent_dim=512, num_latents=128, num_heads=8)
latent = perceiver(vision)
print("latent shape:", latent.shape)
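The reason a Perceiver-style encoder copes with inputs of different lengths is that a fixed set of latent vectors queries the input, so the result always has the size of the latent array. The NumPy sketch below illustrates only that encoding step. `encode_to_latents` and the random weights are assumptions for illustration, and the output shape of Grilly's `nn.PerceiverIO` may differ from what is shown here.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def encode_to_latents(inputs, latents, w_k, w_v):
    """Latent vectors (queries) cross-attend to inputs of any length."""
    latent_dim = latents.shape[-1]
    keys = inputs @ w_k      # (batch, n_tokens, latent_dim)
    values = inputs @ w_v    # (batch, n_tokens, latent_dim)
    # latents broadcast over the batch: (batch, num_latents, n_tokens)
    scores = latents @ keys.transpose(0, 2, 1) / np.sqrt(latent_dim)
    return latents + softmax(scores, axis=-1) @ values  # (batch, num_latents, latent_dim)

rng = np.random.default_rng(0)
batch, input_dim, latent_dim, num_latents = 2, 256, 512, 128

# Assumption: random latents and projections standing in for learned parameters.
latents = rng.standard_normal((num_latents, latent_dim)).astype(np.float32)
w_k = (rng.standard_normal((input_dim, latent_dim)) / np.sqrt(input_dim)).astype(np.float32)
w_v = (rng.standard_normal((input_dim, latent_dim)) / np.sqrt(input_dim)).astype(np.float32)

shapes = {}
for n_tokens in (32, 64, 200):
    inputs = rng.standard_normal((batch, n_tokens, input_dim)).astype(np.float32)
    shapes[n_tokens] = encode_to_latents(inputs, latents, w_k, w_v).shape
    print(n_tokens, "tokens ->", shapes[n_tokens])
```

Every input length maps to the same `(batch, num_latents, latent_dim)` array, which is why later layers can run at a cost independent of the input size.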
Step 5: Expand to a full VLM stack
For larger projects, move from these fusion blocks into:
nn.VisionLanguageModel
nn.VLMLayer
nn.FlamingoFusion