Fill Mask Datasets
Fill Mask is the task of predicting masked-out words in a sentence, the pre-training objective behind models like BERT. This directory catalogs 9 datasets for it. Each entry links to its source, paper, and download; the full list is below.
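To make the task concrete, the sketch below shows one simple way to turn a sentence into a fill-mask training example: some tokens are hidden, and the hidden originals become the prediction targets. It is an illustration only, not how any dataset listed here was prepared. The function name `mask_tokens`, the whitespace tokenization, and the 15% default masking rate are assumptions made for the example; BERT's actual procedure also sometimes substitutes a random token or leaves the chosen token unchanged, which this sketch omits.

```python
import random

# "[MASK]" is the mask symbol used by BERT; other tokenizers use different strings.
MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Hide each token independently with probability mask_prob.

    Returns (masked, labels): `masked` is the model input, and `labels`
    holds the original token at masked positions and None elsewhere.
    Hypothetical helper for illustration, not a library API.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)  # the model sees the mask symbol
            labels.append(tok)         # and must predict the original token
        else:
            masked.append(tok)         # unmasked tokens pass through
            labels.append(None)        # and contribute no prediction target
    return masked, labels


if __name__ == "__main__":
    sentence = "the quick brown fox jumps over the lazy dog".split()
    masked, labels = mask_tokens(sentence, mask_prob=0.3, seed=0)
    print(" ".join(masked))
    print(labels)
```

A model trained on this objective receives `masked` as input and is scored only at the positions where `labels` is not None.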
Updated June 2026
- legacy-datasets/mc4: Text Generation, Fill Mask (languages: AF, AM, AR)
- uonlp/CulturaX: Text Generation, Fill Mask (languages: AF, ALS, AM)
- defunct-datasets/the_pile_books3: Text Generation, Fill Mask (languages: EN)
- JeanKaddour/minipile: Text Generation, Fill Mask (languages: EN)
- proj-persona/PersonaHub: Text Generation, Text Classification, Token Classification, Fill Mask, Table Question Answering (languages: EN, ZH)
- wikimedia/wikisource: Text Generation, Fill Mask (languages: AR, AS, AZ)
- JanosAudran/financial-reports-sec: Fill Mask, Text Classification (languages: EN)
- defunct-datasets/amazon_us_reviews: Summarization, Text Generation, Fill Mask, Text Classification (languages: EN)
- facebook/kilt_tasks: Fill Mask, Question Answering, Text Classification, Text Generation, Text Retrieval (languages: EN)