Dataset Distillation for LLMs

COS 597G: Understanding Large Language Models @ Princeton University

Training large language models on datasets like Common Crawl (541 TB) is computationally infeasible in many settings. We explore distilling large-scale textual data into a much smaller synthetic training set as a solution. While distillation approaches from the vision community may not transfer directly to LLMs, this work motivates investigation into the minimum information required to train performant LLMs and into how compact linguistic knowledge composes.
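To make the core idea concrete, below is a minimal sketch of bilevel dataset distillation in the style of Wang et al. (2018), adapted to text by learning soft input embeddings (continuous vectors sidestep the non-differentiability of discrete tokens). Everything here, the toy next-token model, the sizes, and the learning rates, is an illustrative assumption, not the method from the report.

# Minimal dataset-distillation sketch for text (illustrative only).
# Synthetic soft embeddings are optimized so that a model trained
# one step on them fits the real data well.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab, d, n_real, n_syn = 100, 32, 512, 16

# Toy "real" next-token data and a tiny embedding + linear model.
real_x = torch.randint(vocab, (n_real,))
real_y = torch.randint(vocab, (n_real,))
embed = torch.nn.Embedding(vocab, d)
head = torch.nn.Linear(d, vocab)

# Learnable synthetic dataset: soft input embeddings plus fixed labels.
syn_x = torch.randn(n_syn, d, requires_grad=True)
syn_y = torch.randint(vocab, (n_syn,))
opt = torch.optim.Adam([syn_x], lr=1e-2)

for step in range(200):
    # Inner step: one differentiable SGD update of the head on the
    # synthetic data (the embedding table is frozen for simplicity).
    w, b = head.weight, head.bias
    syn_loss = F.cross_entropy(syn_x @ w.T + b, syn_y)
    gw, gb = torch.autograd.grad(syn_loss, (w, b), create_graph=True)
    w1, b1 = w - 0.1 * gw, b - 0.1 * gb

    # Outer step: the updated head should fit the *real* data; this
    # gradient flows back through the update into the synthetic inputs.
    logits = embed(real_x) @ w1.T + b1
    outer_loss = F.cross_entropy(logits, real_y)
    opt.zero_grad()
    outer_loss.backward()
    opt.step()
    embed.zero_grad()  # discard unused model gradients
    head.zero_grad()

The soft-embedding relaxation is one reason vision-style distillation does not carry over directly: pixels are continuous and can be optimized by gradient descent, whereas text tokens are discrete and must be relaxed or decoded back into the vocabulary.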

Report

Download the report (PDF).

Presentation

Download the slides (PDF).