CC Net

Text Corpora | Multi-Lingual

CC Net is a multilingual text corpus dataset from Wenzek et al., distributed in JSON format. The record count is listed only as "A LOT!".

About CC Net

A cleaned and deduplicated version of the Common Crawl corpus. The pipeline that produces it preserves the structure of documents and filters them based on their distance to Wikipedia.
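
The deduplicate-then-filter idea described above can be illustrated with a small sketch. This is not the authors' pipeline. It assumes one JSON object per line with two hypothetical fields: "text" (the document body, one paragraph per line) and "score" (a precomputed number standing in for distance to Wikipedia, where lower means closer). The function names and the threshold are likewise made up for illustration.

```python
import hashlib
import json


def normalize(paragraph: str) -> str:
    # Lowercase and collapse whitespace so near-identical copies hash the same.
    return " ".join(paragraph.lower().split())


def dedup_paragraphs(text: str, seen: set) -> str:
    """Drop paragraphs (lines) whose normalized hash has already been seen."""
    kept = []
    for paragraph in text.split("\n"):
        norm = normalize(paragraph)
        if not norm:
            continue
        digest = hashlib.sha1(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(paragraph)  # keep the original casing and spacing
    return "\n".join(kept)


def clean_records(lines, max_score: float):
    """Yield deduplicated records whose hypothetical 'score' is <= max_score."""
    seen = set()
    for line in lines:
        record = json.loads(line)
        text = dedup_paragraphs(record["text"], seen)
        if text and record["score"] <= max_score:
            yield {**record, "text": text}


# Toy input: hypothetical field names, invented values.
sample = [
    json.dumps({"text": "Hello world\nCookie notice", "score": 1.0}),
    json.dumps({"text": "cookie  NOTICE\nA new paragraph", "score": 2.0}),
    json.dumps({"text": "Far from the reference", "score": 9.0}),
]
result = list(clean_records(sample, max_score=5.0))
```

In this toy run the repeated "cookie notice" paragraph is dropped from the second record, and the third record is filtered out because its score exceeds the threshold. Keeping the surviving paragraphs in their original order is one simple way to preserve document structure.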

Details

Task: Text Corpora
Language: Multi-Lingual
Format: JSON
Rows / instances: A LOT!
Creator: Wenzek et al.
Year: 2019
