Dataset Distillation for LLMs

COS 597G: Understanding Large Language Models @ Princeton University

There is a vast amount of textual data available from sources such as websites, news articles, and academic manuscripts, and it can be easily accessed through datasets like Common Crawl, which is nearly 541 TB in size. However, training large language models (LLMs) from scratch on such large datasets is computationally infeasible in many settings. While there have been efforts to make LLM training more accessible, there has not yet been much exploration into distilling large-scale textual data as a solution. Although we find that the dataset distillation approaches proposed in the vision community may be ineffective in the LLM setting, we hope that our work will help motivate 1) theoretical investigations into the minimum amount of information required to train performant LLMs, 2) investigations into understanding the compositionality of compact linguistic knowledge, and 3) investigations into other, more effective dataset distillation methods.
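
For readers unfamiliar with the vision-side methods we evaluate, the sketch below illustrates the core idea of gradient-matching dataset distillation adapted to text: a small set of synthetic inputs, kept in continuous embedding space, is optimized so that the gradient it induces on a model matches the gradient from real text batches. This is a minimal illustration only; the model, sizes, and hyperparameters are placeholders and not the configuration used in our experiments.

```python
# Minimal sketch of gradient-matching dataset distillation adapted to language
# modeling. Synthetic inputs live in embedding space (continuous), so they can be
# optimized by gradient descent even though real text is discrete.
# All sizes and hyperparameters are illustrative placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model, seq_len = 1000, 64, 16
num_synthetic = 8  # number of distilled "examples"

class TinyLM(nn.Module):
    """Tiny causal LM stand-in: embedding -> one transformer layer -> LM head."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.block = nn.TransformerEncoderLayer(d_model, nhead=4, dropout=0.0,
                                                batch_first=True)
        self.head = nn.Linear(d_model, vocab_size)
        # Causal attention mask (-inf above the diagonal).
        self.mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")),
                               diagonal=1)

    def forward_from_embeddings(self, emb):            # emb: (B, T, d_model)
        return self.head(self.block(emb, src_mask=self.mask))

    def forward(self, tokens):                         # tokens: (B, T)
        return self.forward_from_embeddings(self.embed(tokens))

model = TinyLM()
# The embedding table is excluded: the synthetic inputs bypass it entirely.
params = [p for n, p in model.named_parameters() if not n.startswith("embed")]

syn_emb = torch.randn(num_synthetic, seq_len, d_model, requires_grad=True)
syn_tokens = torch.randint(0, vocab_size, (num_synthetic, seq_len))  # fixed targets
opt = torch.optim.Adam([syn_emb], lr=1e-2)

def lm_loss(logits, tokens):
    # Next-token prediction: shift logits against the target tokens.
    return F.cross_entropy(logits[:, :-1].reshape(-1, vocab_size),
                           tokens[:, 1:].reshape(-1))

def grad_match_loss(g_real, g_syn):
    # Sum of cosine distances between per-parameter gradients.
    return sum(1 - F.cosine_similarity(gr.flatten(), gs.flatten(), dim=0)
               for gr, gs in zip(g_real, g_syn))

for step in range(100):
    # Stand-in for a batch of real text; a full implementation would also
    # re-sample or update the model weights across iterations.
    real_tokens = torch.randint(0, vocab_size, (32, seq_len))
    g_real = torch.autograd.grad(lm_loss(model(real_tokens), real_tokens), params)
    g_syn = torch.autograd.grad(lm_loss(model.forward_from_embeddings(syn_emb),
                                        syn_tokens), params, create_graph=True)
    loss = grad_match_loss(g_real, g_syn)
    opt.zero_grad()
    loss.backward()  # gradients flow through g_syn into the synthetic embeddings
    opt.step()
```

In the vision setting the synthetic examples are pixel arrays; here they are soft embeddings, which already hints at one difficulty of transferring these methods to LLMs, since distilled embeddings need not correspond to any discrete token sequence.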


Presentation

Report