
MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts (2407.21770v3)

Published 31 Jul 2024 in cs.AI and cs.LG

Abstract: We introduce MoMa, a novel modality-aware mixture-of-experts (MoE) architecture designed for pre-training mixed-modal, early-fusion LLMs. MoMa processes images and text in arbitrary sequences by dividing expert modules into modality-specific groups. These groups exclusively process designated tokens while employing learned routing within each group to maintain semantically informed adaptivity. Our empirical results reveal substantial pre-training efficiency gains through this modality-specific parameter allocation. Under a 1-trillion-token training budget, the MoMa 1.4B model, featuring 4 text experts and 4 image experts, achieves impressive FLOPs savings: 3.7x overall, with 2.6x for text and 5.2x for image processing compared to a compute-equivalent dense baseline, measured by pre-training loss. This outperforms the standard expert-choice MoE with 8 mixed-modal experts, which achieves 3x overall FLOPs savings (3x for text, 2.8x for image). Combining MoMa with mixture-of-depths (MoD) further improves pre-training FLOPs savings to 4.2x overall (text: 3.4x, image: 5.3x), although this combination hurts performance in causal inference due to increased sensitivity to router accuracy. These results demonstrate MoMa's potential to significantly advance the efficiency of mixed-modal, early-fusion LLM pre-training, paving the way for more resource-efficient and capable multimodal AI systems.

References (42)
  1. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
  2. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, 2024. URL https://arxiv.org/abs/2403.05530.
  3. GPT-4 technical report, 2024. URL https://arxiv.org/abs/2303.08774.
  4. Unified-io 2: Scaling autoregressive multimodal models with vision, language, audio, and action, 2023. URL https://arxiv.org/abs/2312.17172.
  5. Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models, 2024. URL https://arxiv.org/abs/2405.09818.
  6. GShard: Scaling giant models with conditional computation and automatic sharding, 2020. URL https://arxiv.org/abs/2006.16668.
  7. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity, 2022. URL https://arxiv.org/abs/2101.03961.
  8. Unified scaling laws for routed language models, 2022. URL https://arxiv.org/abs/2202.01169.
  9. Mixtral of experts, 2024. URL https://arxiv.org/abs/2401.04088.
  10. Mixture-of-depths: Dynamically allocating compute in transformer-based language models, 2024. URL https://arxiv.org/abs/2404.02258.
  11. Multimodal contrastive learning with LIMoE: the language-image mixture of experts, 2022. URL https://arxiv.org/abs/2206.02770.
  12. Foundations and trends in multimodal machine learning: Principles, challenges, and open questions, 2023. URL https://arxiv.org/abs/2209.03430.
  13. VLMo: Unified vision-language pre-training with mixture-of-modality-experts, 2022. URL https://arxiv.org/abs/2111.02358.
  14. Image as a foreign language: Beit pretraining for all vision and vision-language tasks. arXiv preprint arXiv:2208.10442, 2022a.
  15. Scaling vision-language models with sparse mixture of experts. arXiv preprint arXiv:2303.07226, 2023.
  16. Mixture-of-experts with expert choice routing, 2022. URL https://arxiv.org/abs/2202.09368.
  17. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.
  18. Sparse upcycling: Training mixture-of-experts from dense checkpoints, 2023. URL https://arxiv.org/abs/2212.05055.
  19. Image as a foreign language: Beit pretraining for all vision and vision-language tasks, 2022b. URL https://arxiv.org/abs/2208.10442.
  20. Swin transformer v2: Scaling up capacity and resolution, 2022a. URL https://arxiv.org/abs/2111.09883.
  21. Layerskip: Enabling early exit inference and self-speculative decoding, 2024. URL https://arxiv.org/abs/2404.16710.
  22. OpenMoE: An early effort on open mixture-of-experts language models, 2024. URL https://arxiv.org/abs/2402.01739.
  23. Gumbel-attention for multi-modal machine translation, 2022b. URL https://arxiv.org/abs/2103.08862.
  24. How does selective mechanism improve self-attention networks?, 2020. URL https://arxiv.org/abs/2005.00979.
  25. PyTorch FSDP: Experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277, 2023.
  26. MegaBlocks: Efficient sparse training with mixture-of-experts. Proceedings of Machine Learning and Systems, 5:288–304, 2023.
  27. PyTorch 2: Faster machine learning through dynamic Python bytecode transformation and graph compilation. 2024.
  28. Efficient large scale language modeling with mixtures of experts. CoRR, abs/2112.10684, 2021. URL https://arxiv.org/abs/2112.10684.
  29. Obelisc: An open web-scale filtered dataset of interleaved image-text documents. arXiv preprint arXiv:2306.16527, 2023.
  30. PIQA: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 7432–7439, 2020.
  31. SocialIQA: Commonsense reasoning about social interactions. arXiv preprint arXiv:1904.09728, 2019.
  32. HellaSwag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019.
  33. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021.
  34. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457, 2018.
  35. Can a suit of armor conduct electricity? a new dataset for open book question answering. arXiv preprint arXiv:1809.02789, 2018.
  36. BoolQ: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044, 2019.
  37. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
  38. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE international conference on computer vision, pages 2641–2649, 2015.
  39. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017.
  40. Perceiver io: A general architecture for structured inputs & outputs. arXiv preprint arXiv:2107.14795, 2021.
  41. Nüwa: Visual synthesis pre-training for neural visual world creation. CoRR, abs/2111.12417, 2021. URL https://arxiv.org/abs/2111.12417.
  42. CM3: A causal masked multimodal model of the internet. arXiv preprint arXiv:2201.07520, 2022.

Summary

  • The paper introduces a novel MoMa architecture that divides experts into modality-specific groups, enhancing efficiency and integration of mixed-modal inputs.
  • It employs hierarchical routing (modality-based partitioning followed by learned intra-group routing) and can be combined with mixture-of-depths, achieving 3.7× FLOPs savings on its own and 4.2× with MoD relative to a compute-matched dense baseline.
  • The results imply significant advances in scalable multimodal pre-training, setting a foundation for efficient future AI systems combining text and image modalities.

Mixture of Modality-Aware Experts (MoMa)

The paper "MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts" (2407.21770) presents MoMa, a sophisticated architecture aimed at enhancing the efficiency of mixed-modal, early-fusion LLM pre-training. The architecture integrates images and text in arbitrary sequences via expert modules divided into modality-specific groups. These groups are responsible for processing designated tokens, leveraging learned routing within each group to maintain semantic adaptivity.

Introduction

Emerging auto-regressive mixed-modal foundation models, including Gemini and GPT-4, have demonstrated substantial potential in applications that require processing mixed-modal inputs and generating mixed-modal outputs, such as visual question answering. Traditional approaches fuse modality-specific encoders or decoders, which can limit a model's ability to integrate information across modalities and to generate interleaved multimodal content. To circumvent this limitation, Chameleon introduced a single transformer architecture that reasons over and generates both modalities with a next-token prediction objective. Although Chameleon demonstrates strong vision and language capabilities, scaling it to larger capacities incurs substantial computational cost.

Routed sparse architectures have previously proven effective for scaling LLMs and vision models individually. Applying them to mixed-modal early-fusion models presents unique opportunities because of inherent modality heterogeneity: text and image tokens differ in information density and redundancy patterns. This motivates modality-aware sparsity (MaS), which introduces modality-specific modules to capture the features of each modality precisely while preserving cross-modality integration through shared self-attention mechanisms.

Model Architecture

MoMa extends the standard Mixture-of-Experts (MoE) architecture by introducing a width scaling approach with modality-aware block sparsity:

  • Modality-Specific Expert Groups: Each MoE layer's experts are divided into modality-specific groups, specialized in processing tokens from their designated modality. This allows for improved efficiency, specialization, and cross-modal integration.
  • Hierarchical Routing: Routing proceeds in two stages: tokens are first partitioned by modality, then routed within each modality group by a learned expert-choice router. Expert-choice routing keeps expert utilization balanced by construction, which stabilizes optimization and simplifies training (a minimal sketch follows Figure 1).
  • Inference Strategy: Expert-choice routing selects tokens jointly across a sequence, which breaks causality during autoregressive decoding. MoMa therefore trains auxiliary routers that predict each token's expert assignment from its own representation, preserving causal generation at inference time.

Figure 1: Overview of the proposed multimodal early-fusion architecture.
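
To make the two-stage routing concrete, the sketch below shows one way a modality-aware MoE layer with expert-choice routing inside each group could be implemented in PyTorch. The class names (`ModalityAwareMoE`, `ExpertChoiceGroup`), the capacity heuristic, and the boolean `is_image` mask are illustrative assumptions, not the paper's actual implementation, which relies on optimized sparse kernels and distributed training.

```python
import torch
import torch.nn as nn


class ExpertChoiceGroup(nn.Module):
    """A group of feed-forward experts for one modality (illustrative sketch).

    Expert-choice routing: each expert selects its own top-k tokens, so expert
    load is balanced by construction.
    """

    def __init__(self, dim, hidden_dim, num_experts, capacity_factor=1.0):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim))
             for _ in range(num_experts)]
        )
        self.router = nn.Linear(dim, num_experts)  # learned intra-group router
        self.capacity_factor = capacity_factor

    def forward(self, x):  # x: (n_tokens, dim) -- tokens of a single modality
        n = x.shape[0]
        if n == 0:
            return x
        scores = self.router(x).softmax(dim=-1)           # (n_tokens, num_experts)
        k = max(1, int(self.capacity_factor * n / len(self.experts)))
        out = torch.zeros_like(x)                         # unselected tokens rely on the residual path
        for e, expert in enumerate(self.experts):
            picked = scores[:, e].topk(min(k, n)).indices  # expert e chooses its tokens
            out[picked] += scores[picked, e].unsqueeze(-1) * expert(x[picked])
        return out


class ModalityAwareMoE(nn.Module):
    """Stage 1: route by modality. Stage 2: learned routing inside each group."""

    def __init__(self, dim, hidden_dim, n_text_experts=4, n_image_experts=4):
        super().__init__()
        self.text_group = ExpertChoiceGroup(dim, hidden_dim, n_text_experts)
        self.image_group = ExpertChoiceGroup(dim, hidden_dim, n_image_experts)

    def forward(self, x, is_image):  # x: (tokens, dim); is_image: (tokens,) bool mask
        out = torch.empty_like(x)
        out[~is_image] = self.text_group(x[~is_image])
        out[is_image] = self.image_group(x[is_image])
        return out
```

In a full transformer block, a module of this kind would replace the dense feed-forward sub-layer, while the self-attention sub-layer remains shared across modalities and provides the cross-modal integration described above.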

Mixture of Depths

Sparsity along the depth dimension is explored with the mixture-of-depths (MoD) technique, in which individual tokens may skip the attention and feed-forward computation of selected layers. MoD routing is applied before the modality-specific split, further improving training efficiency while addressing depth-scaling challenges (a minimal sketch follows Figure 2).

Figure 2: Architecture of a transformer layer combining MoMa with mixture-of-depths (MoD).
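
The following is a minimal sketch of how a mixture-of-depths layer could be wrapped around a transformer block: a lightweight linear router scores every token, only the top fraction (the layer's capacity) is processed by the block, and the rest pass through unchanged on the residual path. `MoDLayer`, the `capacity` value, and the use of the raw router score as a multiplicative gate are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn


class MoDLayer(nn.Module):
    """Mixture-of-depths wrapper (illustrative sketch).

    `block` computes the residual update of a transformer layer (attention + FFN
    without the outer skip connection); only the top-`capacity` fraction of tokens
    per sequence is sent through it, and the remaining tokens keep their input
    representation.
    """

    def __init__(self, block: nn.Module, dim: int, capacity: float = 0.25):
        super().__init__()
        self.block = block
        self.router = nn.Linear(dim, 1)   # per-token score for this layer
        self.capacity = capacity

    def forward(self, x):                 # x: (batch, seq, dim)
        b, s, d = x.shape
        k = max(1, int(self.capacity * s))
        scores = self.router(x).squeeze(-1)              # (batch, seq)
        top = scores.topk(k, dim=-1).indices             # tokens routed through the block
        idx = top.unsqueeze(-1).expand(-1, -1, d)        # (batch, k, dim)
        selected = torch.gather(x, 1, idx)
        update = self.block(selected)                    # heavy compute only on selected tokens
        gate = torch.gather(scores, 1, top).unsqueeze(-1)
        # Scaling the update by the router score keeps the routing decision differentiable.
        return x.scatter(1, idx, selected + gate * update)
```

Because the top-k selection considers the whole sequence, it is non-causal during training; at inference the model relies on auxiliary routers, and the increased sensitivity to router accuracy reported when MoMa is combined with MoD reflects this dependence.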

Empirical Results

Extensive FLOPs-controlled experiments compare MoMa with dense and sparse baseline architectures. MoMa achieves substantial pre-training efficiency gains of up to 3.7× overall FLOPs savings, outperforming a standard expert-choice MoE with eight modality-agnostic experts (3× overall). Combining MoMa with MoD pushes the savings to 4.2× overall, although causal-inference performance becomes more sensitive to router accuracy.

Figure 3: Scaling of performance with compute.

Implications and Future Directions

MoMa's design delivers notable gains in resource efficiency for multimodal pre-training. The results highlight its scaling potential and motivate further work on combining width and depth scaling, improving router learning, and extending the approach to additional modalities and tasks.

In summary, MoMa represents a significant step toward more efficient pre-training of multimodal foundation models. Future work should explore more sophisticated routing and sparsity strategies to maximize performance across a broader range of modalities, tasks, and applications.

Conclusion

The introduction of MoMa demonstrates promising improvements over conventional multimodal models, particularly in computational efficiency and resource savings. By employing modality awareness and effective scaling techniques, MoMa offers a valuable foundation for future model development, paving the way for further multimodal AI research and applications.
