Text Corpora Datasets
Text Corpora is a machine-learning task covered in our directory. We catalog 154 datasets for it, 5 of which are benchmarks. Each entry links to its source, paper, and download. Browse the full list below or filter by language.
Updated June 2026
- SigmaLaw-ABSA | Text Corpora, Sentiment Analysis | English
- CC100 | Text Corpora | Zulu
- Historical Portuguese Corpora (HPC) | Text Corpora, Text Classification | Portuguese
- ITD - Dataset de Acordãos do STF de 2010 a 2018 | Text Corpora | Portuguese
- FI News Corpus | Text Corpora | Finnish
- Lex2Kids | Text Corpora | Portuguese
- CC100-Afrikaans | Text Corpora | Afrikaans
- CC100-Amharic | Text Corpora | Amharic
- CC100-Arabic | Text Corpora | Arabic
- CC100-Azerbaijani | Text Corpora | Azerbaijani
- CC100-Belarusian | Text Corpora | Belarusian
- CC100-Bulgarian | Text Corpora | Bulgarian
- CC100-Bengali | Text Corpora | Bengali
- CC100-Breton | Text Corpora | Breton
- CC100-Bosnian | Text Corpora | Bosnian
- CC100-Catalan | Text Corpora | Catalan
- CC100-Czech | Text Corpora | Czech
- CC100-Welsh | Text Corpora | Welsh
- CC100-Danish | Text Corpora | Danish
- CC100-German | Text Corpora | German
- CC100-Greek | Text Corpora | Greek
- CC100-English | Text Corpora | English
- CC100-Esperanto | Text Corpora | Esperanto
- CC100-Spanish | Text Corpora | Spanish
- CC100-Estonian | Text Corpora | Estonian
- CC100-Basque | Text Corpora | Basque
- CC100-Finnish | Text Corpora | Finnish
- CC100-French | Text Corpora | French
- CC100-Frisian | Text Corpora | Frisian
- CC100-Irish | Text Corpora | Irish
- CC100-Scottish Gaelic | Text Corpora | Scottish Gaelic
- CC100-Galician | Text Corpora | Galician
- CC100-Guarani | Text Corpora | Guarani
- CC100-Gujarati | Text Corpora | Gujarati
- CC100-Hausa | Text Corpora | Hausa
- CC100-Hebrew | Text Corpora | Hebrew
- CC100-Hindi | Text Corpora | Hindi
- CC100-Hindi Romanized | Text Corpora | Hindi Romanized
- CC100-Croatian | Text Corpora | Croatian
- CC100-Haitian | Text Corpora | Haitian
- CC100-Hungarian | Text Corpora | Hungarian
- CC100-Armenian | Text Corpora | Armenian
- CC100-Indonesian | Text Corpora | Indonesian
- CC100-Italian | Text Corpora | Italian
- CC100-Japanese | Text Corpora | Japanese
- CC100-Javanese | Text Corpora | Javanese
- CC100-Georgian | Text Corpora | Georgian
- CC100-Kazakh | Text Corpora | Kazakh
- CC100-Khmer | Text Corpora | Khmer
- CC100-Kannada | Text Corpora | Kannada
- CC100-Korean | Text Corpora | Korean
- CC100-Kurdish | Text Corpora | Kurdish
- CC100-Kyrgyz | Text Corpora | Kyrgyz
- CC100-Latin | Text Corpora | Latin
- CC100-Ganda | Text Corpora | Ganda
- CC100-Limburgish | Text Corpora | Limburgish
- CC100-Lingala | Text Corpora | Lingala
- CC100-Lao | Text Corpora | Lao
- CC100-Lithuanian | Text Corpora | Lithuanian
- CC100-Latvian | Text Corpora | Latvian
- CC100-Macedonian | Text Corpora | Macedonian
- CC100-Malayalam | Text Corpora | Malayalam
- CC100-Mongolian | Text Corpora | Mongolian
- CC100-Marathi | Text Corpora | Marathi
- CC100-Malay | Text Corpora | Malay
- CC100-Burmese | Text Corpora | Burmese
- CC100-Burmese (Zawgyi) | Text Corpora | Burmese (Zawgyi)
- CC100-Nepali | Text Corpora | Nepali
- CC100-Dutch | Text Corpora | Dutch
- CC100-Norwegian | Text Corpora | Norwegian
- CC100-Northern Sotho | Text Corpora | Northern Sotho
- CC100-Oromo | Text Corpora | Oromo
- CC100-Oriya | Text Corpora | Oriya
- CC100-Punjabi | Text Corpora | Punjabi
- CC100-Polish | Text Corpora | Polish
- CC100-Pashto | Text Corpora | Pashto
- CC100-Portuguese | Text Corpora | Portuguese
- CC100-Quechua | Text Corpora | Quechua
- CC100-Romanian | Text Corpora | Romanian
- CC100-Russian | Text Corpora | Russian
- CC100-Sanskrit | Text Corpora | Sanskrit
- CC100-Sinhala | Text Corpora | Sinhala
- CC100-Sardinian | Text Corpora | Sardinian
- CC100-Sindhi | Text Corpora | Sindhi
- CC100-Slovak | Text Corpora | Slovak
- CC100-Slovenian | Text Corpora | Slovenian
- CC100-Somali | Text Corpora | Somali
- CC100-Albanian | Text Corpora | Albanian
- CC100-Serbian | Text Corpora | Serbian
- CC100-Swati | Text Corpora | Swati
- CC100-Sundanese | Text Corpora | Sundanese
- CC100-Swedish | Text Corpora | Swedish
- CC100-Swahili | Text Corpora | Swahili
- CC100-Tamil | Text Corpora | Tamil
- CC100-Tamil Romanized | Text Corpora | Tamil Romanized
- CC100-Telugu | Text Corpora | Telugu
- CC100-Thai | Text Corpora | Thai
- CC100-Tagalog | Text Corpora | Tagalog
- CC100-Tswana | Text Corpora | Tswana
- CC100-Turkish | Text Corpora | Turkish
- CC100-Uyghur | Text Corpora | Uyghur
- CC100-Ukrainian | Text Corpora | Ukrainian
- CC100-Urdu | Text Corpora | Urdu
- CC100-Uzbek | Text Corpora | Uzbek
- CC100-Vietnamese | Text Corpora | Vietnamese
- CC100-Wolof | Text Corpora | Wolof
- CC100-Xhosa | Text Corpora | Xhosa
- CC100-Yiddish | Text Corpora | Yiddish
- CC100-Yoruba | Text Corpora | Yoruba
- CC100-Chinese (Simplified) | Text Corpora | Chinese (Simplified)
- CC100-Chinese (Traditional) | Text Corpora | Chinese (Traditional)
- CC100-Zulu | Text Corpora | Zulu
- CC Net | Text Corpora | Multi-Lingual
- NLP Chinese Corpus | Text Corpora | Chinese
- UIT-SPC | Text Corpora | Vietnamese
- OpenWebTextCorpus | Text Corpora | English
- ArxivPapers | Text Corpora | English
- NIPS Papers | Text Corpora | English
- Saudi Newspapers Corpus | Text Corpora | Arabic
- Igbo Text | Text Corpora, Machine Translation | Igbo, English
- Urhobo Text | Text Corpora, Machine Translation | Urhobo, English
- Arabic in Business and Management Corpora (ABMC) | Text Corpora | Arabic
- Leipzig Corpora Collection | Text Corpora | Multi-Lingual
- BuGL | Text Corpora | English
- NELA-GT-2019 | Text Corpora, Classification | English
- Khaleej-2004 Corpus | Text Corpora | Arabic
- Watan-2004 Corpus | Text Corpora | Arabic
- Parallel Arabic DIalectal Corpus (PADIC) | Text Corpora | Arabic
- Wikipedia News Corpus | Text Corpora | English
- DOGC | Text Corpora, Machine Translation | Catalan, Spanish
- ECB Corpus | Text Corpora, Machine Translation | Multi-Lingual
- Eubookshop | Text Corpora, Machine Translation | Multi-Lingual
- WMT 19 Multiple Datasets | Text Corpora, Machine Translation | Multi-Lingual
- Wikipedia | Text Corpora | English
- Groningen Meaning Bank | Text Corpora | English
- Kensho Derived Wikimedia Dataset (KDWD) | Text Corpora, Knowledge Base | English
- Parallel Meaning Bank | Text Corpora | Multi-Lingual
- Portuguese Newswire Corpus | Text Corpora | Portuguese (Brazil)
- ABC Australia News Corpus | Text Corpora | English
- arXiv Bulk Data | Text Corpora | English
- CommonCrawl | Text Corpora | Multi-Lingual
- EBM PICO | Text Corpora | English
- European Parliament Proceedings (Europarl) | Text Corpora, Machine Translation | Multi-Lingual
- Cornell Newsroom | Text Corpora, Summarization | English
- Enron Email Dataset | Text Corpora | English
- Guttenberg Book Corpus | Text Corpora | Multi-Lingual
- One Week of Global News Feeds | Text Corpora | Multi-Lingual
- Ubuntu Dialogue Corpus | Text Corpora, Dialogue | English
- WikiHow | Text Corpora, Summarization | English
- The Semantic Scholar Open Research Corpus (S2ORC) | Text Corpora, Knowledge Base | English | Benchmark
- ACL Anthology Reference Corpus (ACL ARC) | Text Corpora | English | Benchmark
- Self-Annotated Reddit Corpus (SARC) | Text Corpora, Sarcasm Detection | English | Benchmark
- COVID-19 Open Research Dataset (CORD-19) | Text Corpora | English | Benchmark
- Open Research Corpus | Text Corpora | English | Benchmark
What languages do text corpora datasets cover?
- English datasets (25)
- Multi-Lingual datasets (10)
- Arabic datasets (6)
- Portuguese datasets (4)
- Zulu datasets (2)
- Finnish datasets (2)
- Catalan datasets (2)
- Spanish datasets (2)
- Vietnamese datasets (2)
- Afrikaans datasets (1)
- Amharic datasets (1)
- Azerbaijani datasets (1)
- Belarusian datasets (1)
- Bulgarian datasets (1)
- Bengali datasets (1)
- Breton datasets (1)
- Bosnian datasets (1)
- Czech datasets (1)