Multimodal, Capsule, and VLM Systems
====================================

Multimodal module families
--------------------------

`grilly.nn.multimodal` includes several fusion strategies:

- `BottleneckFusion`
- `PerceiverIO`
- `CrossModalAttentionFusion`
- `ImageBindFusion`
- `PerceiverResampler`
- `FlamingoFusion`
- `VisionLanguageModel` and `VLMLayer`

These modules support cross-modal reasoning across text, vision, and other
feature streams.

Capsule-related components
--------------------------

Grilly also includes capsule-inspired modules and cognitive encoders,
including the capsule projection and semantic encoding paths used in
experimental cognitive systems.

When to use which approach
--------------------------

1. Start with `BottleneckFusion` when you need efficient two-stream fusion.
2. Use `PerceiverIO` to handle variable-length multimodal inputs.
3. Use the `VisionLanguageModel` stack when building end-to-end VLM-style
   systems.

Design choices
--------------

The multimodal subsystem is intentionally pluralistic:

1. Multiple fusion architectures are provided because no single strategy wins
   across all modality/sequence regimes.
2. Bottleneck and resampler options target memory/compute efficiency.
3. VLM-oriented layers are exposed alongside lower-level fusion blocks so
   teams can choose between rapid assembly and custom architecture work.

Example: simple multimodal fusion
---------------------------------

.. code-block:: python

    import numpy as np
    import grilly.nn as nn

    # Fuse two token streams through a small set of shared bottleneck tokens.
    fusion = nn.BottleneckFusion(d_model=256, num_bottlenecks=32, num_heads=8)

    # Inputs are (batch, tokens, d_model); both streams share d_model=256.
    vision = np.random.randn(2, 64, 256).astype(np.float32)
    text = np.random.randn(2, 32, 256).astype(np.float32)

    fused = fusion(vision, text)
    print(fused.shape)

Design considerations
---------------------

- Keep modality embeddings in compatible dimensions before fusion.
- Monitor sequence lengths because attention cost still scales with token
  counts.
- Use pooling or resampling to reduce large modality streams early.
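
The first and last considerations can be sketched in plain NumPy, without
relying on any Grilly API. The helpers `project_to_dim` and
`mean_pool_tokens` are hypothetical names introduced only for this
illustration, and the random projection weights stand in for a learned layer.
The sketch assumes inputs shaped `(batch, tokens, features)`:

.. code-block:: python

    import numpy as np


    def project_to_dim(x, d_model, rng):
        """Linearly project (batch, tokens, d_in) to (batch, tokens, d_model)."""
        d_in = x.shape[-1]
        # Random weights are a placeholder for a learned projection (assumption).
        w = (rng.standard_normal((d_in, d_model)) / np.sqrt(d_in)).astype(np.float32)
        return x @ w


    def mean_pool_tokens(x, window):
        """Average non-overlapping token windows: (B, T, D) -> (B, T // window, D)."""
        b, t, d = x.shape
        t_trim = (t // window) * window  # drop a ragged tail, if any
        return x[:, :t_trim].reshape(b, t // window, window, d).mean(axis=2)


    rng = np.random.default_rng(0)
    vision = rng.standard_normal((2, 256, 512)).astype(np.float32)  # long, wide stream
    text = rng.standard_normal((2, 32, 256)).astype(np.float32)

    # Match the text embedding width, then shrink the token axis before fusion.
    vision = project_to_dim(vision, d_model=256, rng=rng)
    vision = mean_pool_tokens(vision, window=4)

    print(vision.shape, text.shape)

After these two steps both streams share the last dimension, and the vision
stream carries a quarter of its original tokens, so any attention-based
fusion applied afterwards sees proportionally fewer vision tokens.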