Question 1

What is the CC Net dataset?

Accepted Answer

CC Net is a cleaned and deduplicated dataset derived from the Common Crawl corpus. Its pipeline preserves the structure of documents and filters the data based on their distance to Wikipedia.
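
As a rough illustration of the two steps named in this answer (deduplication that keeps document structure, then filtering by distance to Wikipedia), here is a minimal toy sketch. It is not the cc_net implementation: the `normalize` rule, the `wiki_distance` scoring function, and the `max_distance` threshold are all hypothetical placeholders supplied by the caller.

```python
from __future__ import annotations

import hashlib
from typing import Callable, Iterable, Iterator


def normalize(paragraph: str) -> str:
    # Hypothetical normalization: lowercase and collapse whitespace.
    return " ".join(paragraph.lower().split())


def dedup_and_filter(
    documents: Iterable[list[str]],
    wiki_distance: Callable[[str], float],
    max_distance: float,
) -> Iterator[list[str]]:
    """Drop repeated paragraphs, then drop documents scored as too far from Wikipedia.

    Each document is a list of paragraphs, so paragraph order (structure)
    is preserved in whatever survives.
    """
    seen: set[str] = set()
    for paragraphs in documents:
        kept = []
        for p in paragraphs:
            digest = hashlib.sha1(normalize(p).encode("utf-8")).hexdigest()
            if digest in seen:
                continue  # paragraph already seen in this or an earlier document
            seen.add(digest)
            kept.append(p)
        # wiki_distance is an assumed stand-in for a real quality score.
        if kept and wiki_distance("\n".join(kept)) <= max_distance:
            yield kept
```

In this sketch a lower `wiki_distance` means "closer to Wikipedia"; any real scoring model would have to be plugged in by the caller.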

Question 2

Is CC Net a benchmark?

Accepted Answer

CC Net is a dataset for training or evaluation; it isn't tracked as a standard LLM benchmark in our catalog.

Question 3

Where can I download CC Net?

Accepted Answer

CC Net is available at its source: https://github.com/facebookresearch/cc_net.
