
Well-read Students Learn Better: On The Importance Of Pre-training Compact Models

Iulia Turc, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. arXiv 2019

[Paper]
Tags: BERT, Distillation, Efficiency and Optimization, Fine Tuning, Model Architecture, Pretraining Methods, Quantization, RAG, Reinforcement Learning, Training Techniques

Recent developments in natural language representations have been accompanied by large and expensive models that leverage vast amounts of general-domain text through self-supervised pre-training. Due to the cost of applying such models to down-stream tasks, several model compression techniques on pre-trained language representations have been proposed (Sun et al., 2019; Sanh, 2019). However, surprisingly, the simple baseline of just pre-training and fine-tuning compact models has been overlooked. In this paper, we first show that pre-training remains important in the context of smaller architectures, and fine-tuning pre-trained compact models can be competitive to more elaborate methods proposed in concurrent work. Starting with pre-trained compact models, we then explore transferring task knowledge from large fine-tuned models through standard knowledge distillation. The resulting simple, yet effective and general algorithm, Pre-trained Distillation, brings further improvements. Through extensive experiments, we more generally explore the interaction between pre-training and distillation under two variables that have been under-studied: model size and properties of unlabeled task data. One surprising observation is that they have a compound effect even when sequentially applied on the same data. To accelerate future research, we will make our 24 pre-trained miniature BERT models publicly available.
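The Pre-trained Distillation recipe described in the abstract pre-trains a compact student, distills task knowledge from a large fine-tuned teacher via standard knowledge distillation, and can then fine-tune the student on labeled data. The sketch below illustrates only the distillation step, assuming Hugging Face-style classification models whose outputs expose `.logits`; the names `student`, `teacher`, `batch`, and the `temperature` hyperparameter are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of the distillation step in Pre-trained Distillation:
# a pre-trained compact student is trained to match the softened output
# distribution of a large fine-tuned teacher on (unlabeled) task data.
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, batch, optimizer, temperature=1.0):
    """One optimization step of standard knowledge distillation."""
    teacher.eval()
    with torch.no_grad():
        # Soft labels from the fine-tuned teacher (no gradient needed).
        teacher_logits = teacher(**batch).logits

    student_logits = student(**batch).logits

    # KL divergence between softened teacher and student distributions,
    # scaled by T^2 as in standard distillation setups.
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch, `student` would be one of the small pre-trained BERT models and `teacher` a large BERT already fine-tuned on the target task; an optional fine-tuning pass on labeled task data can follow distillation.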

Similar Work