anisoleai/fineweb-tokenized
Text Generation · EN · odc-by
anisoleai/fineweb-tokenized is an English (EN) text generation dataset distributed in Parquet format under the odc-by license. It falls in the n>1T size category and has been downloaded 150.8K times.
About anisoleai/fineweb-tokenized
FineWeb Tokenized
> 4 trillion tokens of the pre-tokenized data the 🌐 web has to offer
What is it?
This is a pre-tokenized version of the HuggingFaceFW/fineweb dataset (currently in-progress, tokenization of the ~15 trill...
Details
- Task: Text Generation
- Language: EN
- Format: Parquet
- Rows / instances: N/A
- Size: n>1T
- Creator: anisoleai
- Year: 2026
- License: odc-by
- Downloads: 150849
- Likes: 2
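Because the dataset is distributed as Parquet and sits in the n>1T size category, downloading it in full is rarely practical. The following is a minimal sketch of streaming a few rows instead. It assumes the Hugging Face `datasets` library is installed and that a `train` split exists; neither the split name nor the column names are stated in this card, so the sketch prints the columns rather than assuming them. The helper names `peek`, `stream_rows`, and `main` are hypothetical.

```python
from itertools import islice

REPO_ID = "anisoleai/fineweb-tokenized"  # dataset id taken from this card


def peek(rows, n=3):
    """Return the first n items of any iterable, reading no more than n."""
    return list(islice(rows, n))


def stream_rows(repo_id=REPO_ID, split="train"):
    """Open the dataset lazily instead of downloading every Parquet shard.

    Assumption: a 'train' split exists; the card does not confirm this.
    """
    # Imported here so that peek() stays usable without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id, split=split, streaming=True)


def main():
    for row in peek(stream_rows(), 2):
        # Column names are not documented in this card, so inspect them first.
        print(list(row.keys()))
```

Calling `main()` needs network access and fetches only the rows it reads. Once the column names are known, the same iterator can be fed directly into a training data loader.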