Blaise Cruz | Resources

Datasets

Here’s a list of datasets that we use in our papers.

TLUnified Large Scale Corpus
Large Scale Unlabeled Corpora in Filipino
Pretraining corpus constructed as an improvement over the older WikiText-TL-39 in terms of both scale and topic variety. Originally published in Cruz & Cheng (2021).
download bibtex
NewsPH-NLI Dataset
Sentence Entailment Dataset in Filipino
First benchmark dataset for sentence entailment in the low-resource Filipino language. Constructed by exploiting the structure of news articles. Contains 600,000 premise-hypothesis pairs, in a 70-15-15 split for training, validation, and testing. Originally published in Cruz et al. (2021).
download bibtex
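As a quick check of what the 70-15-15 split of NewsPH-NLI implies, the sketch below works out the size of each split from the 600,000 total. The helper name `split_sizes` is made up for illustration, and the per-split counts are only the arithmetic implied by the stated ratio; the released files may differ slightly.

```python
def split_sizes(total, ratios):
    """Return integer split sizes for percentage ratios that sum to 100."""
    if sum(ratios) != 100:
        raise ValueError("ratios must sum to 100")
    sizes = [total * r // 100 for r in ratios]
    # Give any rounding remainder to the first (training) split.
    sizes[0] += total - sum(sizes)
    return sizes


# 600,000 pairs in a 70-15-15 split, as described for NewsPH-NLI.
train, valid, test = split_sizes(600_000, (70, 15, 15))
print(train, valid, test)  # 420000 90000 90000
```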
WikiText-TL-39
Large Scale Unlabeled Corpora in Filipino
Text dataset with 39 million tokens in the training split. The TL infix refers to Tagalog¹. Inspired by the original WikiText Long Term Dependency dataset (Merity et al., 2016). Published in Cruz & Cheng (2019).
download bibtex
Fake News Filipino Dataset
Low-Resource Fake News Detection Corpora in Filipino
The first of its kind. Contains 3,206 expertly-labeled news samples, half of which are real and half of which are fake. Published in Cruz et al. (2020).
download bibtex
Hate Speech Dataset
Text Classification Dataset in Filipino
Contains 10k tweets (training set) that are labeled as hate speech or non-hate speech. Released with 4,232 validation and 4,232 testing samples. Collected during the 2016 Philippine Presidential Elections and originally used in Cabasag et al. (2019). Published in Cruz & Cheng (2020).
download (raw) download (processed) bibtex
Filipino Dengue Dataset
Low-Resource Multiclass Text Classification Dataset in Filipino
Benchmark dataset for low-resource multiclass classification, with 4,015 training, 500 validation, and 500 testing examples. Each sample is labeled against five classes and can belong to more than one of them at once. Collected as tweets and originally used in Livelo & Cheng (2018). Published in Cruz & Cheng (2020).
download (raw) download (processed) bibtex
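Because a Dengue sample can belong to several of the five classes at once, its target is usually represented as a 0/1 vector rather than a single class index. The sketch below shows one way to do that encoding. The class names are placeholders (the description above only states that there are five classes), and `to_multi_hot` and `from_multi_hot` are illustrative helpers, not part of the released dataset.

```python
# Placeholder label names: the description only says there are five classes.
CLASSES = ["class_a", "class_b", "class_c", "class_d", "class_e"]


def to_multi_hot(labels, classes=CLASSES):
    """Encode a collection of class names as a 0/1 vector, one slot per class."""
    unknown = set(labels) - set(classes)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1 if name in labels else 0 for name in classes]


def from_multi_hot(vector, classes=CLASSES):
    """Recover the class names whose slot is set in a 0/1 vector."""
    return [name for name, flag in zip(classes, vector) if flag]


print(to_multi_hot({"class_b", "class_d"}))  # [0, 1, 0, 1, 0]
print(to_multi_hot(set()))                   # [0, 0, 0, 0, 0]
print(from_multi_hot([1, 0, 0, 0, 1]))       # ['class_a', 'class_e']
```

A vector like this pairs naturally with a per-class sigmoid output and a binary cross-entropy loss, instead of the softmax used when each sample has exactly one class.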

Pretrained Transformers

We provide a number of pretrained models in Filipino that you can use for your projects.

RoBERTa Models

Tagalog RoBERTa models released as part of Cruz & Cheng (2021) as an improvement over our previous BERT and ELECTRA models. Trained on the TLUnified dataset, which is larger in scale and covers a wider variety of topics than WikiText-TL-39. Our models are available on HuggingFace. bibtex

ELECTRA Models

Tagalog ELECTRA models that we used in Cruz et al. (2021) to test the NewsPH-NLI dataset. Our models are available on HuggingFace. Use the discriminator models for downstream tasks, unless you need a mask-filling model, in which case you should use the generator models. bibtex
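The discriminator-versus-generator rule above can be sketched as a small helper. Everything named here is an assumption for illustration: the checkpoint naming pattern is modelled on the demo checkpoint `electra-tagalog-small-uncased-discriminator-newsphnli` listed further down this page, and `repo_prefix` stands in for the HuggingFace account name, which this page does not state. The `load` function relies on the `transformers` library and network access, so it is defined but not called.

```python
def electra_variant(task):
    """Map a task type to the ELECTRA half recommended for it."""
    # Mask filling needs the generator; everything else uses the discriminator.
    return "generator" if task == "fill-mask" else "discriminator"


def checkpoint_name(task, size="small", cased=False):
    """Build a checkpoint name from a hypothetical naming pattern."""
    casing = "cased" if cased else "uncased"
    return f"electra-tagalog-{size}-{casing}-{electra_variant(task)}"


def load(task, repo_prefix=""):
    """Load tokenizer and model; requires `transformers` and network access."""
    from transformers import (
        AutoModelForMaskedLM,
        AutoModelForSequenceClassification,
        AutoTokenizer,
    )

    # repo_prefix is a placeholder for the account namespace, e.g. "<user>/".
    name = repo_prefix + checkpoint_name(task)
    if task == "fill-mask":
        model_cls = AutoModelForMaskedLM
    else:
        model_cls = AutoModelForSequenceClassification
    return AutoTokenizer.from_pretrained(name), model_cls.from_pretrained(name)


print(checkpoint_name("text-classification"))
print(checkpoint_name("fill-mask"))
```

Check the actual model names on the HuggingFace page before calling `load`, since the pattern above is a guess rather than a documented convention.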

BERT Models

Tagalog BERT models for general purpose finetuning. The checkpoints available on HuggingFace are the improved models used in Cruz & Cheng (2020). Older versions of the models used in Cruz et al. (2019) and Cruz & Cheng (2019) are available upon request. bibtex

GPT-2 Models

The Tagalog GPT-2 model used to benchmark our fake news detection system in Cruz et al. (2020). We make available an improved version of our GPT-2 model trained with NewsPH in addition to WikiText-TL-39. bibtex

gpt2-tagalog

Others

Here’s a list of other models used for demos, presentations, or proofs-of-concept.

roberta-tagalog-small – Demo version of RoBERTa trained with WikiText-TL-39.
electra-tagalog-small-uncased-discriminator-newsphnli – ELECTRA model finetuned on NewsPH-NLI for demo purposes.

Translation Models

Checkpoints for translation systems used for WMT submissions and for research papers are available here.

Coming Soon

¹ It is worth clarifying that the difference between “Filipino” and “Tagalog” is more sociopolitical than sociolinguistic. Given that the Philippines is home to well over a hundred spoken languages, Commonwealth Act No. 184 of 1936 created a national committee whose purpose was to “develop a national language” for the country. The result is what is known as “Filipino,” which is based on the Tagalog language. In practice, Filipino is identical to Tagalog, with the addition of the letters f, j, c, x, and z, plus loanwords. For more information, see this link.