Open-sourced datasets, pretrained transformers, and translation systems.


Here’s a list of datasets that we use in our papers.

  • TLUnified Large Scale Corpus
    Large Scale Unlabeled Corpora in Filipino
    Pretraining corpus constructed as an improvement over the older WikiText-TL-39 in terms of both scale and topic variety. Originally published in Cruz & Cheng (2021).
    download   bibtex

  • NewsPH-NLI Dataset
    Sentence Entailment Dataset in Filipino
    First benchmark dataset for sentence entailment in the low-resource Filipino language. Constructed by exploiting the structure of news articles. Contains 600,000 premise-hypothesis pairs in a 70-15-15 split for training, validation, and testing. Originally published in Cruz et al. (2021).
    download   bibtex

  • WikiText-TL-39
    Large Scale Unlabeled Corpora in Filipino
    Text dataset with 39 million tokens in the training split. The TL infix refers to Tagalog [1]. Inspired by the original WikiText Long Term Dependency dataset (Merity et al., 2016). Published in Cruz & Cheng (2019).
    download   bibtex

  • Fake News Filipino Dataset
    Low-Resource Fake News Detection Corpora in Filipino
    The first of its kind. Contains 3,206 expertly-labeled news samples, half of which are real and half of which are fake. Published in Cruz et al. (2020).
    download   bibtex

  • Hate Speech Dataset
    Text Classification Dataset in Filipino
    Contains 10k tweets (training set) that are labeled as hate speech or non-hate speech. Released with 4,232 validation and 4,232 testing samples. Collected during the 2016 Philippine Presidential Elections and originally used in Cabasag et al. (2019). Published in Cruz & Cheng (2020).
    download (raw)   download (processed)   bibtex

  • Filipino Dengue Dataset 
    Low-Resource Multiclass Text Classification Dataset in Filipino
    Benchmark dataset for low-resource multiclass classification, with 4,015 training, 500 testing, and 500 validation examples. Each sample is labeled with one or more of five classes. Collected as tweets and originally used in Livelo & Cheng (2018). Published in Cruz & Cheng (2020).
    download (raw)   download (processed)   bibtex
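
As a quick sanity check after downloading, the split sizes implied by the descriptions above can be worked out as follows. This is only a sketch: it assumes the stated totals are exact and that the 70-15-15 ratio applies exactly, so the released files may differ slightly. The helper name `split_sizes` is ours, not part of any released code.

```python
def split_sizes(total, ratios):
    """Return per-split counts for integer percentage ratios."""
    if sum(ratios.values()) != 100:
        raise ValueError("ratios must sum to 100")
    return {name: total * pct // 100 for name, pct in ratios.items()}


# NewsPH-NLI: 600,000 pairs in a 70-15-15 split (assumed exact).
newsph_nli = split_sizes(600_000, {"train": 70, "validation": 15, "test": 15})

# Fake News Filipino: 3,206 samples, half real and half fake.
fake_news_per_class = 3_206 // 2

# Filipino Dengue: training + testing + validation examples.
dengue_total = 4_015 + 500 + 500

print(newsph_nli)
print(fake_news_per_class, dengue_total)
```

Comparing these numbers against the row counts of the downloaded files is an easy way to catch a truncated download.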

Pretrained Transformers

We provide a number of pretrained models in Filipino that you can use for your projects.

RoBERTa Models

Tagalog RoBERTa models that we released as part of Cruz & Cheng (2021) as an improvement over our previous BERT and ELECTRA models. Trained on the TLUnified dataset, which is larger in scale and covers a wider variety of topics than WikiText-TL-39. Our models are available on HuggingFace. bibtex


ELECTRA Models

Tagalog ELECTRA models that we used in Cruz et al. (2021) to test the NewsPH-NLI dataset. Our models are available on HuggingFace. Use the discriminator models for downstream tasks, unless you need a mask-filling model, in which case you should use the generator models. bibtex
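
The discriminator-versus-generator choice can be sketched with the Hugging Face `transformers` Auto classes. This is a minimal illustration, not our training code: the helper names `electra_variant` and `load_electra` are hypothetical, and `model_id` is left as a parameter because the exact checkpoint names should be taken from the HuggingFace model pages.

```python
def electra_variant(task):
    """Map a task to the ELECTRA checkpoint type recommended above."""
    # Mask filling needs the generator; everything else uses the discriminator.
    return "generator" if task == "fill-mask" else "discriminator"


def load_electra(model_id, task, num_labels=2):
    """Load a tokenizer and a model head suited to the task.

    model_id must name a checkpoint of the matching variant
    (a generator checkpoint for "fill-mask", a discriminator otherwise).
    """
    # Imported lazily so electra_variant works without transformers installed.
    from transformers import (
        AutoModelForMaskedLM,
        AutoModelForSequenceClassification,
        AutoTokenizer,
    )

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    if electra_variant(task) == "generator":
        model = AutoModelForMaskedLM.from_pretrained(model_id)
    else:
        # Assumes a classification task; the head is newly initialized
        # and must be finetuned before use.
        model = AutoModelForSequenceClassification.from_pretrained(
            model_id, num_labels=num_labels
        )
    return tokenizer, model
```

For example, finetuning on a three-way entailment task would call `load_electra(model_id, "classification", num_labels=3)` with a discriminator checkpoint.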

BERT Models

Tagalog BERT models for general-purpose finetuning. The checkpoints available on HuggingFace are the improved models used in Cruz & Cheng (2020). Older versions of the models, used in Cruz et al. (2019) and Cruz & Cheng (2019), are available upon request. bibtex

GPT-2 Models

The Tagalog GPT-2 model used to benchmark our fake news detection system in Cruz et al. (2020). We make available an improved version of our GPT-2 model, trained with NewsPH in addition to WikiText-TL-39. bibtex


Here’s a list of other models used for demos, presentations, or proofs-of-concept.

Translation Models

Checkpoints for translation systems used for WMT submissions and for research papers are available here.

Coming Soon

  1. It is worth clarifying that the difference between “Filipino” and “Tagalog” is more sociopolitical than sociolinguistic. Because the Philippines is home to well over 100 spoken languages, Commonwealth Act No. 184 of 1936 created a national committee whose purpose was to “develop a national language” for the country. The result is what is known as “Filipino,” which is based on the Tagalog language. In practice, Filipino is identical to Tagalog, with the addition of the letters f, j, c, x, and z, plus loanwords. For more information, see this link