codeparrot/codeparrot-clean

General NLP, English

Created by codeparrot in 2022, codeparrot/codeparrot-clean is an English-language General NLP dataset distributed in Parquet format. It has 44.1K downloads and 88 likes, and it falls in the 1M<n<10M size category.

About codeparrot/codeparrot-clean

CodeParrot 🦜 Dataset Cleaned

What is it?
A dataset of Python files from GitHub. This is the deduplicated version of the codeparrot dataset.

Processing
The original dataset contains a lot of duplicated and noisy data. Therefore, th...
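The description above says the dataset is a deduplicated version of the original, but the processing details are cut off. As a rough illustration only, the sketch below shows one common way to drop exact duplicates from a collection of source files: hash each file's text with whitespace removed and keep the first file seen per hash. The function names (`content_key`, `deduplicate`), the whitespace normalization, and the choice of SHA-256 are all assumptions made for this example, not a statement of how codeparrot/codeparrot-clean was actually built.

```python
import hashlib
import re

# Matches any run of whitespace (spaces, tabs, newlines).
_WHITESPACE = re.compile(r"\s+")


def content_key(source: str) -> str:
    """Return a hash of the text with all whitespace removed.

    Hypothetical normalization: files that differ only in spacing or
    indentation map to the same key.
    """
    stripped = _WHITESPACE.sub("", source)
    return hashlib.sha256(stripped.encode("utf-8")).hexdigest()


def deduplicate(files):
    """Keep the first file seen for each content key; drop later duplicates."""
    seen = set()
    unique = []
    for source in files:
        key = content_key(source)
        if key not in seen:
            seen.add(key)
            unique.append(source)
    return unique


# Small made-up sample: the first two entries differ only in indentation.
sample = [
    "def add(a, b):\n    return a + b\n",
    "def add(a, b):\n\treturn a + b\n",
    "def sub(a, b):\n    return a - b\n",
]
unique_files = deduplicate(sample)
```

Removing all whitespace is a deliberately aggressive normalization for Python, where indentation carries meaning, so a real pipeline might prefer a gentler rule. The sketch only catches exact matches after normalization; near-duplicate detection would need a different technique.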

Details

Task: General NLP
Language: English
Format: Parquet
Rows / instances: N/A
Size: 1M<n<10M
Creator: codeparrot
Year: 2022
Downloads: 44117
Likes: 88
