Symato/cc
General NLP, VI
Symato/cc is a general NLP dataset in Vietnamese (VI) from Symato, distributed in Parquet format.
About Symato/cc
What is Symato CC?
Symato CC aims to download all WARC data from Common Crawl and extract the Vietnamese content in Markdown and plaintext formats.
Vietnamese makes up about 1% of Common Crawl, so extracting all of it should yield a large corpus (roughly 10 TB of plaintext).
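The card does not say how Vietnamese text is identified, so the following is only a minimal sketch of the filtering step described above, not Symato's actual method. It assumes a simple heuristic: a page is treated as Vietnamese when enough of its words contain a mark that is distinctive of Vietnamese orthography (horn, hook above, dot below, or the letter đ). The function names and the 0.1 threshold are hypothetical, and a real pipeline would more likely use a trained language identifier.

```python
import unicodedata

# Combining marks that are distinctive of Vietnamese orthography:
# horn (U+031B), hook above (U+0309), dot below (U+0323).
# Assumption: these three marks plus the letter "đ" are a good-enough signal.
VI_MARKS = {"\u031b", "\u0309", "\u0323"}


def vietnamese_word_ratio(text: str) -> float:
    """Return the fraction of whitespace-separated words that contain
    a distinctive Vietnamese mark or the letter 'đ'."""
    words = text.lower().split()
    if not words:
        return 0.0
    hits = 0
    for word in words:
        # NFD splits each accented letter into its base letter and combining marks.
        decomposed = unicodedata.normalize("NFD", word)
        if "đ" in decomposed or any(ch in VI_MARKS for ch in decomposed):
            hits += 1
    return hits / len(words)


def is_probably_vietnamese(text: str, threshold: float = 0.1) -> bool:
    """Heuristic filter; the threshold is an illustrative guess, not a tuned value."""
    return vietnamese_word_ratio(text) >= threshold
```

Applied to the plaintext extracted from each WARC response record, a check like this would keep pages such as "Tôi yêu tiếng Việt và đọc sách mỗi ngày" and drop English or French pages. Short or mixed-language pages are where a heuristic this simple is most likely to fail.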
Main contributors
...
Details
- Task: General NLP
- Language: VI
- Format: Parquet
- Rows / instances: N/A
- Creator: Symato
- Year: 2023