MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series (2405.19327v4)
Abstract: Large language models (LLMs) have made great strides in recent years, achieving unprecedented performance across diverse tasks. However, due to commercial interests, the most competitive models, such as GPT, Gemini, and Claude, are gated behind proprietary interfaces, and their training details remain undisclosed. Recently, many institutions have open-sourced strong LLMs, such as LLaMA-3, that are comparable to existing closed-source models, but typically only the model weights are released, while most other details (e.g., intermediate checkpoints, the pre-training corpus, and training code) stay undisclosed. To improve the transparency of LLMs, the research community has begun releasing truly open LLMs (e.g., Pythia, Amber, OLMo) that provide more of these details (e.g., the pre-training corpus and training code). Such models have greatly advanced the scientific study of large models, including their strengths, weaknesses, biases, and risks. However, we observe that existing truly open LLMs still lag behind state-of-the-art LLMs of similar size on reasoning, knowledge, and coding tasks. To this end, we open-source MAP-Neo, a highly capable and transparent bilingual LLM with 7B parameters trained from scratch on 4.5T high-quality tokens. MAP-Neo is the first fully open-sourced bilingual LLM whose performance is comparable to existing state-of-the-art LLMs. Moreover, we open-source all details needed to reproduce MAP-Neo, including the cleaned pre-training corpus, the data cleaning pipeline, intermediate checkpoints, and a well-optimized training/evaluation framework. Finally, we hope MAP-Neo will strengthen the open research community and inspire further innovation and improvement of LLMs.
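The data cleaning pipeline referenced above is detailed in the body of the paper; purely as orientation, the sketch below illustrates MinHash-based near-deduplication, one standard ingredient of such pipelines (see the Broder, 1997 and deduplication entries in the references). It is a self-contained toy in Python rather than the MAP-Neo implementation, and the shingle size, number of permutations, and example documents are illustrative assumptions.

```python
# Minimal, self-contained sketch of MinHash near-deduplication (Broder, 1997),
# a standard step in pre-training data cleaning pipelines. Illustrative only:
# shingle size and permutation count are arbitrary choices, not MAP-Neo settings.
import hashlib
import random

NUM_PERM = 64        # length of the MinHash signature
SHINGLE_SIZE = 5     # character 5-grams as document features
MERSENNE_P = (1 << 61) - 1

random.seed(0)
# Each (a, b) pair defines one hash permutation h(x) = (a*x + b) mod p.
PERMS = [(random.randrange(1, MERSENNE_P), random.randrange(MERSENNE_P))
         for _ in range(NUM_PERM)]


def shingles(text: str) -> set[int]:
    """Hash every character n-gram of the document to a 64-bit integer."""
    grams = {text[i:i + SHINGLE_SIZE]
             for i in range(max(1, len(text) - SHINGLE_SIZE + 1))}
    return {int.from_bytes(hashlib.sha1(g.encode()).digest()[:8], "big")
            for g in grams}


def minhash(features: set[int]) -> list[int]:
    """Signature = minimum permuted hash value under each permutation."""
    return [min((a * f + b) % MERSENNE_P for f in features) for a, b in PERMS]


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of agreeing signature slots estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / NUM_PERM


docs = [
    "MAP-Neo is a transparent bilingual large language model series.",
    "MAP-Neo is a transparent bilingual large language model series",   # near-duplicate
    "Unrelated sentence about mathematics and coding benchmarks.",
]
sigs = [minhash(shingles(d)) for d in docs]
print(estimated_jaccard(sigs[0], sigs[1]))  # close to 1.0 -> drop one copy
print(estimated_jaccard(sigs[0], sigs[2]))  # close to 0.0 -> keep both
```

At corpus scale, pairwise signature comparison as above is replaced by locality-sensitive hashing over the signatures so that only candidate duplicate pairs are compared (cf. the Gionis et al., 1999 entry in the references).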
- Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- The de-democratization of ai: Deep learning and the compute divide in artificial intelligence research. arXiv preprint arXiv:2010.15581, 2020.
- AI@Meta. Llama 3 model card. 2024. URL https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.
- Anthropic. The Claude 3 model family: Opus, Sonnet, Haiku. Claude 3 model card, 2024.
- Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
- Llemma: An open language model for mathematics.
- Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
- Cosmopedia, 2024. URL https://huggingface.co/datasets/HuggingFaceTB/cosmopedia.
- Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pp. 2397–2430. PMLR, 2023.
- Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pp. 7432–7439, 2020.
- Nougat: Neural optical understanding for academic documents, 2023.
- Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345, 1952. ISSN 00063444. URL http://www.jstor.org/stable/2334029.
- Andrei Z Broder. On the resemblance and containment of documents. In Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No. 97TB100171), pp. 21–29. IEEE, 1997.
- Chinesewebtext: Large-scale high-quality chinese web text extracted with effective evaluation model, 2023a.
- Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
- Theoremqa: A theorem-driven question answering dataset. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023b.
- Agent-flan: Designing data and methods of effective agent tuning for large language models. arXiv preprint arXiv:2403.12881, 2024.
- Language models as science tutors. arXiv preprint arXiv: 2402.11111, 2024.
- Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017.
- Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1–53, 2024.
- Boolq: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044, 2019.
- Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.
- Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
- Open innovation and within-industry diversification in small and medium enterprises: The case of open source software firms. Research policy, 43(5):891–902, 2014.
- Together Computer. Redpajama: an open dataset for training large language models, 2023. URL https://github.com/togethercomputer/RedPajama-Data.
- OpenCompass Contributors. Opencompass: A universal evaluation platform for foundation models. https://github.com/open-compass/opencompass, 2023.
- Data colonialism: Rethinking big data’s relation to the contemporary subject. Television & New Media, 20(4):336–349, 2019.
- DeepSeek-AI. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954, 2024. URL https://github.com/deepseek-ai/DeepSeek-LLM.
- Composerx: Multi-agent symbolic music composition with llms. arXiv preprint arXiv:2404.18081, 2024.
- Chinese tiny llm: Pretraining a chinese-centric large language model, 2024.
- Alpacafarm: A simulation framework for methods that learn from human feedback, 2023.
- Length-controlled alpacaeval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475, 2024.
- Identifying and characterizing highly similar notes in big clinical note datasets. Journal of biomedical informatics, 82:63–69, 2018.
- Openllama: An open reproduction of llama, May 2023. URL https://github.com/openlm-research/open_llama.
- Similarity search in high dimensions via hashing. In VLDB, volume 99, pp. 518–529, 1999.
- Olmo: Accelerating the science of language models. arXiv preprint arXiv:2402.00838, 2024.
- Deduplication of scholarly documents using locality sensitive hashing and word embeddings. 2020.
- Wanjuan: A comprehensive multimodal dataset for advancing english and chinese large models, 2023.
- Pile of law: Learning responsible data filtering from the law and a 256GB open-source legal dataset, 2022. URL https://arxiv.org/abs/2207.00220.
- Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
- Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021.
- Scaling laws and interpretability of learning from repeated data. arXiv preprint arXiv:2205.10487, 2022.
- Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
- Minicpm: Unveiling the potential of small language models with scalable training strategies. arXiv preprint arXiv:2404.06395, 2024.
- Layoutlmv3: Pre-training for document ai with unified text and image masking. In Proceedings of the 30th ACM International Conference on Multimedia, pp. 4083–4091, 2022.
- C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models. Advances in Neural Information Processing Systems, 36, 2024.
- Paul Jaccard. The distribution of the flora in the alpine zone. 1. New phytologist, 11(2):37–50, 1912.
- Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
- Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551, 2017.
- Fasttext.zip: Compressing text classification models. arXiv: Computation and Language, November 2016.
- Jean Kaddour. The minipile challenge for data-efficient language models. arXiv preprint arXiv:2304.08442, 2023.
- Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
- Connected components in mapreduce and beyond. In Proceedings of the ACM Symposium on Cloud Computing, pp. 1–13, 2014.
- The stack: 3 TB of permissively licensed source code. Preprint, 2022.
- Hdltex: Hierarchical deep learning for text classification. In 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA). IEEE, 2017.
- Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226, 2018.
- Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466, 2019.
- Deduplicating training data makes language models better. arXiv preprint arXiv:2107.06499, 2021.
- Pp-structurev2: A stronger document analysis system. arXiv preprint arXiv:2210.05391, 2022.
- Cmmlu: Measuring massive multitask language understanding in chinese. arXiv preprint arXiv:2306.09212, 2023a.
- From live data to high-quality benchmarks: The arena-hard pipeline, April 2024. URL https://lmsys.org/blog/2024-04-19-arena-hard/.
- Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, 2023b.
- Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023a. URL https://openreview.net/forum?id=1qvx610Cu7.
- Storm-7b, April 2024. URL https://huggingface.co/jieliu/Storm-7B.
- Alignbench: Benchmarking chinese alignment of large language models, 2023b.
- Llm360: Towards fully transparent open-source llms, 2023c.
- The flan collection: Designing data and methods for effective instruction tuning. arXiv preprint arXiv:2301.13688, 2023.
- Starcoder 2 and the stack v2: The next generation, 2024.
- Yayi 2: Multilingual open-source large language models. arXiv preprint arXiv:2312.14862, 2023.
- Can a suit of armor conduct electricity? a new dataset for open book question answering. arXiv preprint arXiv:1809.02789, 2018.
- Nam Pham. tiny-strange-textbooks (revision 6f304f1), 2024. URL https://huggingface.co/datasets/nampdn-ai/tiny-strange-textbooks.
- Culturax: A cleaned, enormous, and multilingual dataset for large language models in 167 languages. arXiv preprint arXiv:2309.09400, 2023.
- Openwebmath: An open dataset of high-quality mathematical web text, 2023.
- The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116, 2023. URL https://arxiv.org/abs/2306.01116.
- Mupt: A generative symbolic music pretrained transformer. arXiv preprint arXiv:2404.06393, 2024.
- Scaling language models: Methods, analysis & insights from training gopher, 2022.
- Direct preference optimization: Your language model is secretly a reward model, 2023.
- Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.
- Know what you don’t know: Unanswerable questions for squad. arXiv preprint arXiv:1806.03822, 2018.
- Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.
- Paola Ricaurte. Data epistemologies, the coloniality of power, and resistance. Television & New Media, 20(4):350–365, 2019.
- Ronsor. Bigknow2022: Bringing language models up to speed. https://github.com/RyokoAI/BigKnow2022, 2023.
- Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021.
- Design and implementation of the sun network filesystem. In Proceedings of the summer 1985 USENIX conference, pp. 119–130, 1985.
- Socialiqa: Commonsense reasoning about social interactions. arXiv preprint arXiv:1904.09728, 2019.
- BLOOM: A 176B-Parameter Open-Access Multilingual Language Model. working paper or preprint, November 2023. URL https://inria.hal.science/hal-03850124.
- Proximal policy optimization algorithms, 2017.
- Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.
- Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- BIGPATENT: A large-scale dataset for abstractive and coherent summarization. arXiv preprint arXiv:1906.03741, 2019. URL http://arxiv.org/abs/1906.03741.
- Democratizing llms: An exploration of cost-performance trade-offs in self-refined open-source models. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023.
- Noam Shazeer. Fast transformer decoding: One write-head is all you need, 2019.
- Noam Shazeer. Glu variants improve transformer, 2020.
- SlimPajama: A 627B token cleaned and deduplicated version of RedPajama. https://www.cerebras.net/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama, 2023. URL https://huggingface.co/datasets/cerebras/SlimPajama-627B.
- Dolma: An Open Corpus of Three Trillion Tokens for Language Model Pretraining Research. arXiv preprint, 2024. URL https://arxiv.org/abs/2402.00159.
- Open innovation practices in smes and large enterprises. Small business economics, 41:537–562, 2013.
- Roformer: Enhanced transformer with rotary position embedding, 2023.
- Commonsenseqa: A question answering challenge targeting commonsense knowledge. arXiv preprint arXiv:1811.00937, 2018.
- Teknium. Openhermes 2.5: An open dataset of synthetic data for generalist llm assistants, 2023. URL https://huggingface.co/datasets/teknium/OpenHermes-2.5.
- Culturay: A large cleaned multilingual dataset of 75 languages, 2024.
- Llama: Open and efficient foundation language models. ARXIV, 2023a.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv: 2307.09288, 2023b.
- Attention is all you need, 2023.
- Not just bigger: Towards better-quality web corpora. In Proceedings of the seventh Web as Corpus Workshop (WAC7), pp. 44–52, 2012.
- Weaver: Foundation models for creative writing. arXiv preprint arXiv: 2401.17268, 2024a.
- Mmlu-pro: Towards more robust and challenging multi-task language understanding evaluation. Manuscript in preparation, 2024b.
- Rolellm: Benchmarking, eliciting, and enhancing role-playing abilities of large language models. arXiv preprint arXiv: 2310.00746, 2023.
- Vary: Scaling up the vision vocabulary for large vision-language models. arXiv preprint arXiv:2312.06109, 2023a.
- Skywork: A more open bilingual foundation model, 2023b.
- A systematic evaluation of large language models of code. In Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming, pp. 1–10, 2022.
- Some things are more cringe than others: Preference optimization with the pairwise cringe loss. arXiv preprint arXiv:2312.16682, 2023.
- Llm agents for psychology: A study on gamified assessments. arXiv preprint arXiv: 2402.12326, 2024.
- Yi: Open foundation models by 01.AI. arXiv preprint arXiv:2403.04652, 2024.
- Llasmol: Advancing large language models for chemistry with a large-scale, comprehensive, high-quality instruction tuning dataset. arXiv preprint arXiv:2402.09391, 2024.
- Chatmusician: Understanding and generating music intrinsically with llm. arXiv preprint arXiv:2402.16153, 2024.
- Mammoth: Building math generalist models through hybrid instruction tuning. arXiv preprint arXiv:2309.05653, 2023.
- Mammoth2: Scaling instructions from the web. arXiv preprint arXiv:2405.03548, 2024.
- Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In 9th USENIX symposium on networked systems design and implementation (NSDI 12), pp. 15–28, 2012.
- Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019.
- Root mean square layer normalization, 2019.
- Chinese open instruction generalist: A preliminary release. arXiv preprint arXiv:2304.07987, 2023a.
- Don’t trust chatgpt when your question is not in english: A study of multilingual abilities and types of llms. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 7915–7927, 2023b.
- Automathtext: Autonomous data selection with language models for mathematical texts. arXiv preprint arXiv:2402.07625, 2024.
- Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36, 2024a.
- Opencodeinterpreter: Integrating code generation with execution and refinement. arXiv preprint arXiv:2402.14658, 2024b.
- Starling-7b: Improving llm helpfulness & harmlessness with rlaif, November 2023.
- Structlm: Towards building generalist models for structured knowledge grounding, 2024a.
- Chuxin: 1.6 b technical report. arXiv preprint arXiv:2405.04828, 2024b.