commoncrawl/host-index-testing-v2
Text Generation, English
commoncrawl/host-index-testing-v2 is an English text-generation dataset distributed in Parquet format. It falls in the 10B<n<100B size category and has been downloaded 11.8K times.
About commoncrawl/host-index-testing-v2
Common Crawl Host Index v2
GitHub: https://github.com/commoncrawl/cc-host-index
For each crawl, we generate a Host Index, which aggregates information about each web host visited during the crawl. The
information is aggregated from the Common Cr...
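
To illustrate what "aggregating information about each web host" can mean, here is a minimal sketch in Python. It is not the Host Index pipeline or its schema: the record fields (`url`, `status`) and the summary columns (`fetches`, `ok`) are hypothetical names chosen for the example, and the sample records are made up.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def aggregate_by_host(records):
    """Group per-URL fetch records into one summary row per host.

    `records` is an iterable of dicts with hypothetical keys
    "url" and "status"; this is an assumed shape, not the real schema.
    """
    summary = defaultdict(lambda: {"fetches": 0, "ok": 0})
    for rec in records:
        host = urlsplit(rec["url"]).hostname
        if host is None:
            continue  # skip records without a parseable host
        row = summary[host]
        row["fetches"] += 1
        if rec["status"] == 200:
            row["ok"] += 1
    return dict(summary)


# Hypothetical sample records, not real crawl data.
sample = [
    {"url": "https://example.com/", "status": 200},
    {"url": "https://example.com/missing", "status": 404},
    {"url": "https://example.org/", "status": 200},
]
host_rows = aggregate_by_host(sample)
```

Each key of `host_rows` is a host name and each value is that host's summary row, which is the general shape of a per-host index: one row per host, built from many per-URL observations.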
Details
- Task
- Text Generation
- Language
- English
- Format
- Parquet
- Rows / instances
- N/A
- Size
- 10B<n<100B
- Creator
- commoncrawl
- Year
- 2025
- Downloads
- 11796
- Likes
- 0