Here’s a list of datasets that we use in our papers.
Sentence Entailment Dataset in Filipino
First benchmark dataset for sentence entailment in the low-resource Filipino language. Constructed by exploiting the structure of news articles. Contains 600,000 premise-hypothesis pairs in a 70-15-15 split for training, validation, and testing. Originally published in Cruz et al. (2021).
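As a quick check on the split sizes, the 70-15-15 split over 600,000 pairs can be worked out as below. This is plain arithmetic on the figures quoted above. The helper name `split_sizes` is hypothetical, and the released files may not match these round numbers exactly.

```python
from typing import List, Sequence


def split_sizes(total: int, percentages: Sequence[int]) -> List[int]:
    """Return the number of examples per split for integer percentages."""
    if sum(percentages) != 100:
        raise ValueError("percentages must sum to 100")
    # Integer arithmetic avoids floating-point rounding surprises.
    return [total * p // 100 for p in percentages]


train, validation, test = split_sizes(600_000, (70, 15, 15))
print(train, validation, test)  # 420000 90000 90000
```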
Large Scale Unlabeled Corpora in Filipino
Text dataset with 39 million tokens in the training split. The TL infix in the dataset name (WikiText-TL-39) refers to Tagalog [1]. Inspired by the original WikiText Long Term Dependency dataset (Merity et al., 2016). Published in Cruz & Cheng (2019).
Fake News Filipino Dataset
Low-Resource Fake News Detection Corpora in Filipino
The first dataset of its kind for Filipino. Contains 3,206 expert-labeled news samples, half real and half fake. Published in Cruz et al. (2020).
Hate Speech Dataset
Text Classification Dataset in Filipino
Contains 10k tweets (training set) that are labeled as hate speech or non-hate speech. Released with 4,232 validation and 4,232 testing samples. Collected during the 2016 Philippine Presidential Elections and originally used in Cabasag et al. (2019). Published in Cruz & Cheng (2020).
Filipino Dengue Dataset
Low-Resource Multiclass Text Classification Dataset in Filipino
Benchmark dataset for low-resource multiclass classification, with 4,015 training, 500 validation, and 500 testing examples. There are five classes, and each sample can belong to more than one of them. Collected as tweets and originally used in Livelo & Cheng (2018). Published in Cruz & Cheng (2020).
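Because a sample in the dengue dataset can carry several of the five labels at once, targets for it are commonly encoded as multi-hot vectors rather than a single class index. The sketch below illustrates that encoding only. The class names are placeholders (the actual label names are not listed on this page), and `to_multi_hot` is a hypothetical helper, not part of the released dataset.

```python
from typing import Iterable, List

# Placeholder label names: the real five class names are not given here.
CLASSES = ["class_0", "class_1", "class_2", "class_3", "class_4"]


def to_multi_hot(labels: Iterable[str], classes: List[str] = CLASSES) -> List[int]:
    """Encode a set of labels as a 0/1 vector with one slot per class."""
    active = set(labels)
    unknown = active - set(classes)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1 if name in active else 0 for name in classes]


print(to_multi_hot(["class_1", "class_3"]))  # [0, 1, 0, 1, 0]
print(to_multi_hot([]))                      # [0, 0, 0, 0, 0]
```

A vector like this pairs naturally with a per-class sigmoid output, since the classes are not mutually exclusive.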
We provide a number of pretrained models in Filipino that you can use for your projects.
Tagalog ELECTRA models that we used in Cruz et al. (2021) to test the NewsPH-NLI dataset. Our models are available on HuggingFace. Use the discriminator models for downstream tasks. If you need a mask-filling model, use the generator models instead.
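The discriminator-versus-generator choice can be sketched with the `transformers` Auto classes as follows. This is a minimal sketch under stated assumptions: the checkpoint naming pattern is inferred from the demo model name `electra-tagalog-small-uncased-discriminator-newsphnli` listed on this page, the `namespace` argument (the HuggingFace account hosting the models) is left for you to fill in, and the helper names are hypothetical. Check the model listing on HuggingFace for the exact repository IDs.

```python
def electra_checkpoint(task: str, size: str = "small", casing: str = "uncased") -> str:
    """Pick the generator for mask filling, the discriminator for everything else."""
    role = "generator" if task == "fill-mask" else "discriminator"
    # Assumed naming pattern; verify against the actual model listing.
    return f"electra-tagalog-{size}-{casing}-{role}"


def load_model(task: str, namespace: str, num_labels: int = 2):
    """Load the matching checkpoint; `namespace` is the account that hosts it."""
    # Imported lazily so the naming helper works without transformers installed.
    from transformers import AutoModelForMaskedLM, AutoModelForSequenceClassification

    repo_id = f"{namespace}/{electra_checkpoint(task)}"
    if task == "fill-mask":
        return AutoModelForMaskedLM.from_pretrained(repo_id)
    return AutoModelForSequenceClassification.from_pretrained(repo_id, num_labels=num_labels)


print(electra_checkpoint("text-classification"))  # electra-tagalog-small-uncased-discriminator
print(electra_checkpoint("fill-mask"))            # electra-tagalog-small-uncased-generator
```

Calling `load_model` downloads weights, so it is left uncalled here. The sequence-classification head it attaches is newly initialized and needs finetuning before use.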
Tagalog BERT models for general-purpose finetuning. The checkpoints available on HuggingFace are the improved models used in Cruz & Cheng (2020). Older versions of the models, used in Cruz et al. (2019) and Cruz & Cheng (2019), are available upon request.
The Tagalog GPT-2 model used to benchmark our fake news detection system (Cruz et al., 2020). We make available an improved version of our GPT-2 model, trained with NewsPH in addition to WikiText-TL-39.
Here’s a list of other models used for demos, presentations, or proofs-of-concept.
roberta-tagalog-small – Demo version of RoBERTa trained with WikiText-TL-39.
electra-tagalog-small-uncased-discriminator-newsphnli – ELECTRA model finetuned on NewsPH-NLI for demo purposes.
Checkpoints for translation systems used for WMT submissions and for research papers are available here.
[1] It is worth clarifying that the difference between “Filipino” and “Tagalog” is more sociopolitical than sociolinguistic. Given that the Philippines is home to well over a hundred spoken languages, Commonwealth Act No. 184 of 1936 created a national committee whose purpose was to “develop a national language” for the country. The result is what is known as “Filipino,” which is based on the Tagalog language. In practice, Filipino is identical to Tagalog, with the addition of the letters f, j, c, x, and z, plus loanwords. For more information, see this link.