
codeparrot/github-code-clean

General NLP | English | apache-2.0

The codeparrot/github-code-clean dataset is an English General NLP resource published by codeparrot in 2022. It has 45.8K downloads and 142 likes. It is released under the apache-2.0 license and is tagged with the 10M<n<100M size category.

About codeparrot/github-code-clean

The GitHub Code clean dataset is a more filtered version of the codeparrot/github-code dataset. It consists of 115M code files from GitHub in 32 programming languages with 60 extensions, totaling almost 1TB of text data.
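Because the dataset is close to 1TB, reading it as a stream is usually more practical than downloading it in full. The following is a minimal sketch, not an official usage recipe. It assumes the third-party Hugging Face `datasets` package is installed and that each record carries a "language" field. The field name and the helper names `stream_dataset` and `take_language` are illustrative assumptions, and depending on the installed `datasets` version, extra arguments to `load_dataset` may be required.

```python
from itertools import islice
from typing import Iterable, Iterator

DATASET_ID = "codeparrot/github-code-clean"


def stream_dataset(dataset_id: str = DATASET_ID):
    """Open the dataset in streaming mode so records are fetched lazily."""
    # Imported lazily: requires the third-party `datasets` package.
    from datasets import load_dataset

    return load_dataset(dataset_id, split="train", streaming=True)


def take_language(records: Iterable[dict], language: str, limit: int) -> Iterator[dict]:
    """Yield at most `limit` records whose "language" field equals `language`.

    The "language" field name is an assumption about the record schema.
    """
    matches = (r for r in records if r.get("language") == language)
    return islice(matches, limit)
```

Under these assumptions, `take_language(stream_dataset(), "Python", 3)` would yield up to three Python records without materializing the rest of the dataset.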

Details

Task: General NLP
Language: English
Format: Parquet
Rows / instances: N/A
Size: 10M<n<100M
Creator: codeparrot
Year: 2022
License: apache-2.0
Downloads: 45842
Likes: 142
