Wanjuan: A Comprehensive Multimodal Dataset For Advancing English And Chinese Large Models

He Conghui, Jin Zhenjiang, Xu Chao, Qiu Jiantao, Wang Bin, Li Wei, Yan Hang, Wang Jiaqi, Lin Dahua. arXiv 2023

[Paper]    
Ethics And Bias GPT Model Architecture Multimodal Models Training Techniques

The rise in popularity of ChatGPT and GPT-4 has significantly accelerated the development of large models, leading to the creation of numerous impressive large language models (LLMs) and multimodal large language models (MLLMs). These cutting-edge models owe their remarkable performance to high-quality data. However, the details of the training data used in leading paradigms are often kept confidential. This lack of transparency, coupled with the scarcity of open-source data, impedes further development within the community. In response, this paper presents “Wan Juan”, a large-scale multimodal dataset composed of both Chinese and English data collected from a wide range of web sources. The dataset incorporates text, image-text, and video modalities, with a total volume exceeding 2TB. It was used in the training of InternLM, a model that demonstrated significant advantages in multi-dimensional evaluations when compared to models of a similar scale. All data can be accessed at https://opendatalab.org.cn/WanJuan1.0.
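As a rough illustration of working with a bilingual web-scale text corpus like the one described above, the sketch below parses JSON-lines records and filters them by language tag. The record layout (`id`, `content`, `lang` fields) is a hypothetical example for illustration only, not WanJuan's documented schema.

```python
import json

# Hypothetical record layout: one JSON object per line with "id",
# "content", and "lang" fields. These field names are assumptions
# for illustration, not the dataset's documented schema.
records = [
    '{"id": "0001", "content": "An example English passage.", "lang": "en"}',
    '{"id": "0002", "content": "一段中文示例文本。", "lang": "zh"}',
]

def iter_text_records(lines, lang=None):
    """Yield parsed records, optionally filtered by a language tag."""
    for line in lines:
        rec = json.loads(line)
        if lang is None or rec.get("lang") == lang:
            yield rec

# Keep only the Chinese-tagged records.
zh_ids = [r["id"] for r in iter_text_records(records, lang="zh")]
print(zh_ids)  # → ['0002']
```

In practice, a corpus of this size would be streamed from disk file by file rather than held in memory, but the per-record parse-and-filter loop stays the same.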

Similar Work