allenai/c4

Text Generation | Fill Mask | AF, AM, AR | odc-by

allenai/c4 is a text generation and fill-mask dataset in AF, AM, AR from allenai, with 1,837,702,356 records in Parquet format. It is distributed under the odc-by license, falls in the 10B<n<100B size category, and has been downloaded 1,029,647 times.

About allenai/c4

C4 Dataset Summary: A colossal, cleaned version of Common Crawl's web crawl corpus, based on the Common Crawl dataset (https://commoncrawl.org). This is the processed version of Google's C4 dataset. We prepared five variants of th...

Details

Task: Text Generation, Fill Mask
Language: AF, AM, AR
Format: Parquet
Rows / instances: 1,837,702,356
Size: 10B<n<100B
Creator: allenai
Year: 2026
License: odc-by
Downloads: 1,029,647
Likes: 601
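
With roughly 1.8 billion rows, the dataset is usually easier to inspect by streaming than by downloading it in full. The following is a minimal sketch, not an official recipe: it assumes the Hugging Face `datasets` library is installed, that a configuration named `af` exists for the AF language listed above, and that each record carries a `text` field. `preview` and `stream_c4` are hypothetical helper names introduced only for this illustration.

```python
from itertools import islice

REPO_ID = "allenai/c4"  # dataset identifier as shown on this page
CONFIG = "af"           # assumption: config name matching the AF language code


def preview(records, n=3, width=80):
    """Return the text of the first n records, each cut to `width` characters."""
    # Assumption: every record is a dict-like object with a "text" field.
    return [str(r.get("text", ""))[:width] for r in islice(records, n)]


def stream_c4(config=CONFIG, split="train"):
    """Open the dataset lazily so rows are fetched only as they are read."""
    # Imported here so `preview` stays usable without the library installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config, split=split, streaming=True)
```

Calling `preview(stream_c4())` would fetch only the first few records over the network. If the assumed configuration name or field name does not match the published dataset, adjust `CONFIG` or the key used in `preview` accordingly.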