
Unlocking Emergent Modularity In Large Language Models

Qiu Zihan, Huang Zeyu, Fu Jie. 2023

[Paper] [Code]    
Fine Tuning, Has Code, Model Architecture, Pretraining Methods, Reinforcement Learning, Training Techniques, Transformer

Modular Neural Networks (MNNs) demonstrate various advantages over monolithic models. Existing MNNs are generally *explicit*: their modular architectures are pre-defined, and each module is expected to implement a distinct function. Recent work reveals that *implicit* modularity, termed *Emergent Modularity*, exists in standard pre-trained transformers, and that such modular structures emerge spontaneously during the early pre-training phase. Despite the benefits of modularity, most Language Models (LMs) are still treated as monolithic models in the pre-train-and-fine-tune paradigm, leaving their emergent modularity locked and underutilized. In this work, we focus on unlocking the emergent modularity in LMs and show that standard LMs can be fine-tuned as their Mixture-of-Experts (MoE) counterparts without introducing any extra parameters. Such MoEs are derived from emergent modularity and are referred to as Emergent MoEs (EMoE). Our experiments demonstrate that fine-tuning EMoE improves both in-domain and out-of-domain downstream generalization compared with vanilla fine-tuning. Our analysis and ablation studies further show that EMoE is robust to various configurations and scales up to Large Language Models (i.e., Llama2-7B and Llama-30B). Code is available at https://github.com/qiuzh20/EMoE.
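To make the core idea concrete, the sketch below (a hypothetical illustration, not the authors' released implementation) shows one way a dense transformer FFN can be viewed as an MoE without adding parameters: the FFN's hidden neurons are partitioned into expert groups, and each token is routed to the top-k groups whose key vectors (rows of the existing up-projection) it matches best. The class name `EmergentMoEFFN`, the contiguous neuron grouping, and the `num_experts`/`top_k` defaults are assumptions made for illustration; EMoE's actual grouping and gating details are in the linked repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmergentMoEFFN(nn.Module):
    """Hypothetical sketch: treat a dense transformer FFN as a Mixture-of-Experts
    by partitioning its hidden neurons into `num_experts` groups and routing each
    token to the top-k groups. No new parameters are introduced; the gate scores
    are derived from the existing up-projection (key) weights."""

    def __init__(self, dense_ffn: nn.Sequential, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        # Assumed layout of the dense FFN: [Linear(d, d_ff), activation, Linear(d_ff, d)].
        self.w_in, self.act, self.w_out = dense_ffn[0], dense_ffn[1], dense_ffn[2]
        d_ff = self.w_in.out_features
        assert d_ff % num_experts == 0
        self.num_experts, self.top_k = num_experts, top_k
        self.expert_size = d_ff // num_experts
        # Expert "centroids": mean of each group's key vectors (rows of w_in).
        # A real grouping could come from clustering; we split contiguously here
        # purely for illustration.
        keys = self.w_in.weight.detach()                                   # (d_ff, d_model)
        self.register_buffer(
            "centroids",
            keys.view(num_experts, self.expert_size, -1).mean(dim=1),      # (E, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:                    # x: (..., d_model)
        # Gate: similarity between the token and each expert centroid.
        gate_logits = F.linear(x, self.centroids)                          # (..., E)
        _, topk_idx = gate_logits.topk(self.top_k, dim=-1)
        # Build a 0/1 mask that keeps only hidden units of the selected experts.
        mask = torch.zeros_like(gate_logits).scatter_(-1, topk_idx, 1.0)
        hidden = self.act(self.w_in(x))                                    # (..., d_ff)
        hidden = hidden * mask.repeat_interleave(self.expert_size, dim=-1)
        return self.w_out(hidden)
```

Because the gate centroids are buffers computed from the existing weights, the module has exactly the same parameter count as the original FFN; only the sparsity pattern of the hidden activations changes, which is what allows a standard LM to be fine-tuned "as" an MoE.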

Similar Work