Multimodal, Capsule, and VLM Systems

Multimodal module families

grilly.nn.multimodal includes several fusion modules and vision-language building blocks:

  • BottleneckFusion

  • PerceiverIO

  • CrossModalAttentionFusion

  • ImageBindFusion

  • PerceiverResampler

  • FlamingoFusion

  • VisionLanguageModel and VLMLayer

These modules allow cross-modal reasoning across text, vision, and other feature streams.

When to use which approach

  1. Start with BottleneckFusion when you need efficient two-stream fusion.

  2. Use PerceiverIO for variable-length multimodal input handling.

  3. Use the VisionLanguageModel stack when building end-to-end VLM-style systems.
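
To make the variable-length point in item 2 concrete, the following NumPy-only sketch shows the general idea behind latent cross-attention: a fixed set of latent queries attends over an input of any length, so the output size does not depend on the number of input tokens. It is an illustration of the technique, not grilly's PerceiverIO implementation; the function name latent_cross_attention is hypothetical, and learned query/key/value projections are omitted for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def latent_cross_attention(latents, inputs):
    """Single-head cross-attention (hypothetical helper, no learned projections).

    latents: (num_latents, d)    used as queries
    inputs:  (batch, tokens, d)  used as keys and values
    returns: (batch, num_latents, d)
    """
    d = latents.shape[-1]
    scores = np.einsum("ld,bnd->bln", latents, inputs) / np.sqrt(d)
    weights = softmax(scores, axis=-1)  # normalize over input tokens
    return np.einsum("bln,bnd->bld", weights, inputs)

rng = np.random.default_rng(0)
latents = rng.standard_normal((16, 64)).astype(np.float32)
short_input = rng.standard_normal((2, 50, 64)).astype(np.float32)
long_input = rng.standard_normal((2, 300, 64)).astype(np.float32)

out_short = latent_cross_attention(latents, short_input)
out_long = latent_cross_attention(latents, long_input)
print(out_short.shape, out_long.shape)  # both are (2, 16, 64)
```

Because the latent count fixes the output length, downstream layers see the same shape whether a stream carries 50 tokens or 300.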

Design choices

The multimodal subsystem is intentionally pluralistic:

  1. Multiple fusion architectures are provided because no single strategy wins across all modality/sequence regimes.

  2. Bottleneck and resampler options target memory/compute efficiency.

  3. VLM-oriented layers are exposed alongside lower-level fusion blocks so teams can choose between rapid assembly and custom architecture work.
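
A rough way to see why bottleneck-style fusion saves compute is to count attention scores. The sketch below compares full self-attention over the concatenated streams with a simplified bottleneck scheme in which a small set of bottleneck tokens reads from all tokens and all tokens then read back from the bottleneck. This counting model is an assumption made for illustration; it does not describe how grilly's BottleneckFusion or PerceiverResampler are implemented, and both function names are hypothetical.

```python
def full_attention_scores(n_vision, n_text):
    # Every token attends to every token in the concatenated sequence.
    n = n_vision + n_text
    return n * n

def bottleneck_attention_scores(n_vision, n_text, n_bottleneck):
    # Assumed scheme: bottleneck tokens attend to all tokens (write step),
    # then all tokens attend to the bottleneck tokens (read step).
    return 2 * n_bottleneck * (n_vision + n_text)

full = full_attention_scores(n_vision=1024, n_text=128)
bottleneck = bottleneck_attention_scores(n_vision=1024, n_text=128, n_bottleneck=32)
print(full, bottleneck, full / bottleneck)
```

Under this model the bottleneck cost grows linearly with total token count rather than quadratically, which is why the savings matter most for long streams.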

Example: simple multimodal fusion

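import numpy as np
import grilly.nn as nn

# Two-stream fusion configured with 32 bottlenecks and 8 attention heads.
fusion = nn.BottleneckFusion(d_model=256, num_bottlenecks=32, num_heads=8)

# Random stand-in features; the last dimension matches d_model=256.
vision = np.random.randn(2, 64, 256).astype(np.float32)  # batch of 2, 64 tokens
text = np.random.randn(2, 32, 256).astype(np.float32)    # batch of 2, 32 tokens

fused = fusion(vision, text)
print(fused.shape)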

Design considerations

  • Keep modality embeddings in compatible dimensions before fusion.

  • Monitor sequence lengths: attention cost grows with the token count of each stream, even when a bottleneck or resampler is used.

  • Use pooling or resampling to reduce large modality streams early.
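
The first and third considerations can be sketched with plain NumPy: project a wider modality down to the shared feature width, then mean-pool groups of tokens to shorten the stream before fusion. This is a minimal sketch under stated assumptions, not a grilly API; project and mean_pool_tokens are hypothetical helpers, the projection weight is random where a real model would learn it, and the input sizes are arbitrary.

```python
import numpy as np

def project(x, weight):
    """Linear projection: (batch, tokens, d_in) @ (d_in, d_model)."""
    return x @ weight

def mean_pool_tokens(x, factor):
    """Average non-overlapping groups of `factor` tokens.

    Trailing tokens that do not fill a complete group are dropped.
    """
    batch, tokens, dim = x.shape
    groups = tokens // factor
    trimmed = x[:, : groups * factor]
    return trimmed.reshape(batch, groups, factor, dim).mean(axis=2)

rng = np.random.default_rng(0)
vision = rng.standard_normal((2, 196, 768)).astype(np.float32)  # wide, long stream
text = rng.standard_normal((2, 32, 256)).astype(np.float32)     # already at d_model=256

# Random stand-in for a learned projection from 768 to 256 features.
w_vision = (rng.standard_normal((768, 256)) / np.sqrt(768)).astype(np.float32)

vision_small = mean_pool_tokens(project(vision, w_vision), factor=4)
print(vision_small.shape, text.shape)  # (2, 49, 256) (2, 32, 256)
```

After this step both streams share the same feature width, and the vision stream is four times shorter, so any later attention over it is correspondingly cheaper.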