legacy-datasets/mc4

Text Generation | Fill Mask | AF, AM, AR

legacy-datasets/mc4 is a dataset for text generation and fill-mask tasks in Afrikaans (AF), Amharic (AM), and Arabic (AR), distributed in Parquet format.

About legacy-datasets/mc4

mC4 is a colossal, cleaned version of Common Crawl's web crawl corpus (https://commoncrawl.org). This is the version of Google's mC4 dataset processed by AllenAI.
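As a rough illustration of how a corpus like this might be read, the sketch below streams a few records with the Hugging Face `datasets` library. It assumes that the repository id matches the page title, that the language codes listed here double as configuration names, and that a `train` split exists; none of this is confirmed by this page, and the repository may require extra options or may no longer resolve. `build_load_kwargs` and `stream_records` are hypothetical helper names introduced only for this example.

```python
from itertools import islice
from typing import Any, Dict, Iterator

# Assumed repository id, taken from the page title.
REPO_ID = "legacy-datasets/mc4"

# Language codes listed on this page, assumed to be valid configuration names.
LANGUAGES = ("af", "am", "ar")


def build_load_kwargs(language: str, split: str = "train") -> Dict[str, Any]:
    """Return keyword arguments for datasets.load_dataset (hypothetical helper)."""
    code = language.strip().lower()
    if code not in LANGUAGES:
        raise ValueError(f"unsupported language code: {language!r}")
    # streaming=True reads records lazily instead of downloading the full corpus.
    return {"path": REPO_ID, "name": code, "split": split, "streaming": True}


def stream_records(language: str, limit: int = 3) -> Iterator[Dict[str, Any]]:
    """Yield up to `limit` records; needs network access and the `datasets` package."""
    # Imported lazily so build_load_kwargs stays usable without the dependency.
    from datasets import load_dataset

    dataset = load_dataset(**build_load_kwargs(language))
    yield from islice(dataset, limit)
```

Calling `stream_records("af")` would then yield the first few Afrikaans records, provided the assumptions above hold. Streaming is used here because the corpus is described as colossal, so a full download is rarely what a first look needs.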

Details

Task: Text Generation, Fill Mask
Language: AF, AM, AR
Format: Parquet
Rows / instances: N/A
Creator: legacy-datasets
Year: 2022