Multimodal, Capsule, and VLM Systems
Multimodal module families
grilly.nn.multimodal includes several fusion modules and VLM building blocks:
BottleneckFusion
PerceiverIO
CrossModalAttentionFusion
ImageBindFusion
PerceiverResampler
FlamingoFusion
VisionLanguageModel and VLMLayer
These modules allow cross-modal reasoning across text, vision, and other feature streams.
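As a rough illustration of what cross-modal reasoning means at the tensor level, the sketch below implements single-head cross-attention in plain NumPy, with text tokens querying vision tokens. It is a conceptual sketch only: the cross_attention helper is hypothetical, it has no learned projections, and it is not a description of how any grilly module is implemented.

```python
import numpy as np

def cross_attention(queries, context):
    """Single-head scaled dot-product cross-attention without learned projections.

    queries: (batch, n_q, d), context: (batch, n_c, d)
    Returns the attended output (batch, n_q, d) and the weights (batch, n_q, n_c).
    """
    d = queries.shape[-1]
    scores = queries @ context.transpose(0, 2, 1) / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # stabilize the softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ context, weights

rng = np.random.default_rng(0)
text = rng.standard_normal((2, 32, 256)).astype(np.float32)
vision = rng.standard_normal((2, 64, 256)).astype(np.float32)

# Each text token becomes a weighted mix of vision tokens.
attended, weights = cross_attention(text, vision)
print(attended.shape)  # (2, 32, 256)
print(weights.shape)   # (2, 32, 64)
```

The output keeps the query stream's length (32 text tokens) while drawing its content from the vision stream, which is the basic mechanism the fusion modules build on in different ways.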
When to use which approach
Start with BottleneckFusion when you need efficient two-stream fusion.
Use PerceiverIO for variable-length multimodal input handling.
Use the VisionLanguageModel stack when building end-to-end VLM-style systems.
Design choices
The multimodal subsystem is intentionally pluralistic:
Multiple fusion architectures are provided because no single strategy wins across all modality/sequence regimes.
Bottleneck and resampler options target memory/compute efficiency.
VLM-oriented layers are exposed alongside lower-level fusion blocks so teams can choose between rapid assembly and custom architecture work.
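To make the efficiency argument for bottleneck-style fusion concrete, the following back-of-the-envelope sketch counts query-key pairs for one attention step. Both helper functions are hypothetical, and the count assumes a simplified model in which bottleneck tokens read once from both streams; it ignores heads, write-back steps, and anything specific to grilly's internals.

```python
def joint_attention_pairs(n_vision, n_text):
    """Query-key pairs when all tokens of both streams attend to each other."""
    total = n_vision + n_text
    return total * total

def bottleneck_attention_pairs(n_vision, n_text, num_bottlenecks):
    """Query-key pairs when only the bottleneck tokens read from both streams."""
    return num_bottlenecks * (n_vision + n_text)

# 64 vision tokens, 32 text tokens, 32 bottleneck tokens.
full = joint_attention_pairs(64, 32)
narrow = bottleneck_attention_pairs(64, 32, 32)
print(full, narrow)  # 9216 3072
```

Under these assumptions the joint cost grows quadratically with the combined token count, while the bottleneck cost grows linearly for a fixed number of bottleneck tokens, so the gap widens as the streams get longer.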
Example: simple multimodal fusion
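import numpy as np
import grilly.nn as nn
# Two-stream fusion module with 32 bottleneck tokens and 8 attention heads.
fusion = nn.BottleneckFusion(d_model=256, num_bottlenecks=32, num_heads=8)
# Inputs are shaped (batch, tokens, features); the feature size matches d_model=256.
vision = np.random.randn(2, 64, 256).astype(np.float32)
text = np.random.randn(2, 32, 256).astype(np.float32)
fused = fusion(vision, text)
print(fused.shape)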
Design considerations
Keep modality embeddings in compatible dimensions before fusion.
Monitor sequence lengths: self-attention cost grows quadratically with token count, and cross-attention cost grows with the product of the two stream lengths.
Use pooling or resampling to reduce large modality streams early.
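The considerations above can be sketched in plain NumPy as a preprocessing step that runs before any fusion module. This is a minimal sketch under stated assumptions: mean_pool_tokens and project are hypothetical helpers, the input feature sizes (768 and 512) are made up for illustration, and the random weight matrices stand in for learned projections.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_pool_tokens(tokens, window):
    """Average non-overlapping windows along the token axis.

    Assumes the token count is divisible by `window`.
    """
    batch, n, d = tokens.shape
    return tokens.reshape(batch, n // window, window, d).mean(axis=2)

def project(tokens, weight):
    """Linear projection of the feature axis: (batch, n, d_in) -> (batch, n, d_out)."""
    return tokens @ weight

# Hypothetical encoder outputs with mismatched feature sizes.
vision = rng.standard_normal((2, 256, 768)).astype(np.float32)
text = rng.standard_normal((2, 32, 512)).astype(np.float32)

# Random stand-ins for learned projections into a shared d_model of 256.
w_vision = (rng.standard_normal((768, 256)) / np.sqrt(768)).astype(np.float32)
w_text = (rng.standard_normal((512, 256)) / np.sqrt(512)).astype(np.float32)

# Pool the long vision stream first, then project, so the projection
# runs on 64 tokens instead of 256.
vision_small = project(mean_pool_tokens(vision, window=4), w_vision)
text_proj = project(text, w_text)
print(vision_small.shape)  # (2, 64, 256)
print(text_proj.shape)     # (2, 32, 256)
```

After this step both streams share the same feature size and the vision stream is four times shorter, which matches the input shapes used in the fusion example earlier in this section.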