codeparrot/codeparrot-clean
General NLP, English
Created by codeparrot in 2022, codeparrot/codeparrot-clean is an English-language General NLP dataset distributed in Parquet format. With 44.1K downloads and 88 likes, it sees active community use and falls in the 1M<n<10M size category.
About codeparrot/codeparrot-clean
CodeParrot 🦜 Dataset Cleaned
What is it?
A dataset of Python files from GitHub. This is the deduplicated version of the codeparrot dataset.
Processing
The original dataset contains a lot of duplicated and noisy data. Therefore, th...
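The description above is cut off before it says how duplicates were removed, so the following is only an illustrative sketch of one common approach: exact-match deduplication on a hash of each file's content with whitespace stripped. It is an assumption, not a statement of the pipeline actually used for this dataset, and the names `content_hash` and `deduplicate` are hypothetical.

```python
import hashlib
import re


def content_hash(source: str) -> str:
    """Hash a file's text after removing all whitespace (assumed normalization)."""
    normalized = re.sub(r"\s+", "", source)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(files):
    """Keep the first occurrence of each distinct whitespace-normalized file."""
    seen = set()
    unique = []
    for source in files:
        digest = content_hash(source)
        if digest not in seen:
            seen.add(digest)
            unique.append(source)
    return unique


# Toy input: the first two entries differ only in whitespace.
files = [
    "def add(a, b):\n    return a + b\n",
    "def add(a, b):\n        return a + b\n\n",
    "print('hello')\n",
]
unique = deduplicate(files)
print(len(unique))
```

Hashing keeps memory use proportional to the number of distinct files rather than their total size, but it only catches exact duplicates after normalization; near-duplicates with small edits would need a different technique.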
Details
- Task: General NLP
- Language: English
- Format: Parquet
- Rows / instances: N/A
- Size: 1M<n<10M
- Creator: codeparrot
- Year: 2022
- Downloads: 44117
- Likes: 88