AutoPureData: Automated Filtering of Web Data for LLM Fine-tuning

Vadlapati Praneeth. Arxiv 2024

Ethics And Bias, Fine Tuning, Pretraining Methods, Responsible AI, Training Techniques

Up-to-date and reliable Large Language Models (LLMs) are consistently sought after. Typically, LLMs are trained on a fixed dataset and then deployed, so their training data steadily becomes outdated. Enabling automatic training of AI on web data raises significant concerns about data quality and safety because of bias, spam, and other unsafe or unwanted text. Pure data is essential for producing reliable models, and training a model on impure data may result in undesirable outcomes. This research proposes a system that collects web data and automatically filters out unwanted text with the assistance of existing trusted AI models. In the experiment, a small sample of web data was collected and filtered, demonstrating the system’s effectiveness in purifying the data.
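The collect-then-filter idea described above can be illustrated with a minimal sketch. This is not the paper's implementation: `filter_web_data`, `stub_judge`, and `BLOCKED_TERMS` are hypothetical names, and the keyword check is only a placeholder for the verdict of a trusted AI model, which a real system would obtain by calling that model.

```python
from typing import Callable, Iterable, List

# Hypothetical placeholder list; a real system would ask a trusted model instead.
BLOCKED_TERMS = ("buy now", "click here")


def stub_judge(text: str) -> bool:
    """Return True if the text looks acceptable (stand-in for a trusted model)."""
    lowered = text.lower()
    return not any(term in lowered for term in BLOCKED_TERMS)


def filter_web_data(samples: Iterable[str], judge: Callable[[str], bool]) -> List[str]:
    """Keep only non-empty samples that the judge accepts."""
    kept = []
    for text in samples:
        cleaned = " ".join(text.split())  # collapse runs of whitespace
        if cleaned and judge(cleaned):
            kept.append(cleaned)
    return kept


if __name__ == "__main__":
    collected = [
        "LLMs are trained on  fixed datasets.",
        "BUY NOW: cheap pills, click here!",
        "   ",
    ]
    for row in filter_web_data(collected, stub_judge):
        print(row)
```

Passing the judge as a callable keeps the filtering loop independent of whichever model supplies the verdict, so the stub can be swapped for a real classifier without changing the loop.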
