What are Text Corpora datasets used for?

Text Corpora datasets are collections of labelled or raw data used to train, fine-tune, and evaluate models on the text corpora task. This page lists 154 such datasets, each linking to its source and paper.

Which Text Corpora dataset is best for benchmarking?

Five of these Text Corpora datasets are tracked as benchmarks, including The Semantic Scholar Open Research Corpus (S2ORC). See the Benchmarks section for model leaderboards.

How many Text Corpora datasets are there?

We catalog 154 Text Corpora datasets in one searchable directory.

Text Corpora Datasets

There are 154 text corpora datasets in our directory, 5 of which are benchmarks. Each links to its source, paper, and download. Browse the full list below or filter by language.

Text Corpora is a machine-learning task covered in our directory. We catalog 154 datasets for it.

Updated June 2026

What languages do text corpora datasets cover?

English datasets (25), Multi-Lingual datasets (10), Arabic datasets (6), Portuguese datasets (4), Zulu datasets (2), Finnish datasets (2), Catalan datasets (2), Spanish datasets (2), Vietnamese datasets (2), Afrikaans datasets (1), Amharic datasets (1), Azerbaijani datasets (1), Belarusian datasets (1), Bulgarian datasets (1), Bengali datasets (1), Breton datasets (1), Bosnian datasets (1), Czech datasets (1)

Explore other dataset tasks

General NLP (297), Text Generation (137), Question Answering (130), Classification (45), Reading Comprehension (43), Text Classification (33), Machine Translation (25), Sentiment Analysis (21), Dialogue (21), Visual Question Answering (21), Text To Image (20), Image To Text (18)
