
Language Modeling Datasets

There are 3 language modeling datasets in our directory. Each entry links to its source, paper, and download. Browse the full list below or filter by language.

Language modeling is the task of predicting the next token in a sequence, the core pre-training objective behind most autoregressive LLMs. We catalog 3 datasets for it.
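
As a rough illustration of next-token prediction, the sketch below uses a toy bigram counter instead of a neural network. The corpus, the function names (`train_bigram`, `predict_next`), and the whitespace tokenization are all assumptions made for this example and are not tied to any dataset in the directory.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def predict_next(counts, prev):
    """Return the most frequent follower of `prev`, or None if `prev` was never seen."""
    followers = counts.get(prev)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical toy corpus, split on whitespace (an assumption for illustration).
corpus = "the cat sat on the mat and the cat slept".split()
model = train_bigram(corpus)

print(predict_next(model, "the"))  # prints "cat" ("cat" follows "the" twice, "mat" once)
```

A real language model replaces the count table with a learned probability distribution over the whole vocabulary, but the task is the same: given the tokens so far, score the candidates for the next one.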

Updated June 2026

What languages do language modeling datasets cover?


Frequently asked questions