
Can Perplexity Predict Fine-tuning Performance? An Investigation Of Tokenization Effects On Sequential Language Models For Nepali

Luitel Nishant, Bekoju Nirajan, Sah Anand Kumar, Shakya Subarna. arXiv 2024

[Paper]    
BERT Fine Tuning GPT Model Architecture Pretraining Methods RAG Tokenization Training Techniques Transformer

Recent language models use subwording mechanisms to handle Out-of-Vocabulary (OOV) words seen at test time, and their generation capacity is generally measured with perplexity, an intrinsic metric. It is known that increasing subword granularity decreases perplexity. However, studies of how subwording affects the understanding capacity of language models have been few and limited to a handful of languages. To reduce this gap, we use six different tokenization schemes to pretrain relatively small language models in Nepali and use the learned representations to fine-tune on several downstream tasks. Although the byte-level BPE algorithm has been used in recent models such as GPT and RoBERTa, we show that it is, on average, sub-optimal compared to algorithms such as SentencePiece in fine-tuning performance for Nepali. Additionally, while similar recent studies have focused on BERT-based language models, we pretrain and fine-tune sequential transformer-based language models.
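The abstract's point about subword granularity can be illustrated with a minimal sketch (not the paper's code): train two of the tokenizer families it compares, SentencePiece (unigram) and byte-level BPE, on a Nepali corpus and compare how finely each splits the same sentence. The corpus file `corpus.txt`, the vocabulary size of 8000, and the example sentence are illustrative assumptions; the paper's actual six schemes and settings are not reproduced here.

```python
# Hedged sketch: compare SentencePiece (unigram) vs byte-level BPE on Nepali text.
# `corpus.txt` and vocab_size=8000 are assumptions, not the paper's configuration.
import math

import sentencepiece as spm
from tokenizers import ByteLevelBPETokenizer

# 1. Train a SentencePiece unigram model on a plain-text Nepali corpus.
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="nep_sp", vocab_size=8000, model_type="unigram"
)
sp = spm.SentencePieceProcessor(model_file="nep_sp.model")

# 2. Train a byte-level BPE tokenizer (the scheme used by GPT-2 / RoBERTa).
bbpe = ByteLevelBPETokenizer()
bbpe.train(files=["corpus.txt"], vocab_size=8000, min_frequency=2)

# Tokenize the same sentence with both schemes and compare segmentation granularity.
sentence = "नेपाली भाषा सुन्दर छ।"
sp_tokens = sp.encode(sentence, out_type=str)
bbpe_tokens = bbpe.encode(sentence).tokens
print(len(sp_tokens), "SentencePiece tokens:", sp_tokens)
print(len(bbpe_tokens), "byte-level BPE tokens:", bbpe_tokens)

# Perplexity is exp(mean negative log-likelihood per token), so a tokenizer that
# splits text into more (finer) tokens spreads the same corpus likelihood over
# more predictions and tends to report a lower per-token perplexity -- which is
# why perplexities are not directly comparable across tokenization schemes.
def perplexity(total_nll: float, num_tokens: int) -> float:
    return math.exp(total_nll / num_tokens)
```

This is why the paper treats perplexity as an intrinsic metric whose relationship to downstream fine-tuning performance has to be tested empirically rather than assumed.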

Similar Work