MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT (2402.16840v1)
Abstract: "Bigger the better" has been the predominant trend in recent LLMs development. However, LLMs do not suit well for scenarios that require on-device processing, energy efficiency, low memory footprint, and response efficiency. These requisites are crucial for privacy, security, and sustainable deployment. This paper explores the "less is more" paradigm by addressing the challenge of designing accurate yet efficient Small LLMs (SLMs) for resource constrained devices. Our primary contribution is the introduction of an accurate and fully transparent open-source 0.5 billion (0.5B) parameter SLM, named MobiLlama, catering to the specific needs of resource-constrained computing with an emphasis on enhanced performance with reduced resource demands. MobiLlama is a SLM design that initiates from a larger model and applies a careful parameter sharing scheme to reduce both the pre-training and the deployment cost. Our work strives to not only bridge the gap in open-source SLMs but also ensures full transparency, where complete training data pipeline, training code, model weights, and over 300 checkpoints along with evaluation codes is available at : https://github.com/mbzuai-oryx/MobiLlama.