## Datasets

Here’s a list of datasets that we use in our papers.

## Pretrained Transformers

We provide a number of pretrained models in Filipino that you can use for your projects.

#### RoBERTa Models

Tagalog RoBERTa models released as part of Cruz & Cheng (2021) as an improvement over our previous BERT and ELECTRA models. They were trained on the TLUnified dataset, which is larger in scale and more topically varied than WikiText-TL-39. Our models are available on HuggingFace.
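As a minimal sketch, the checkpoints can be loaded with the HuggingFace `transformers` library. Note that the model identifier `jcblaise/roberta-tagalog-base` below is an assumption; check the hub for the exact name.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# NOTE: the model identifier is an assumption; verify it on the HuggingFace hub.
model_name = "jcblaise/roberta-tagalog-base"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Encode a short Tagalog sentence and run a forward pass.
inputs = tokenizer("Magandang araw!", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Logits over the vocabulary for each token position.
print(outputs.logits.shape)
```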

#### ELECTRA Models

Tagalog ELECTRA models used in Cruz et al. (2021) to test the NewsPH-NLI dataset. Our models are available on HuggingFace. Use the discriminator models for downstream tasks, unless you need a mask-filling model, in which case you should use the generator models.
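To illustrate the distinction, a generator checkpoint can be dropped into a `fill-mask` pipeline. The model identifier below is an assumption; check the hub for the exact name.

```python
from transformers import pipeline

# NOTE: the generator model identifier is an assumption; verify it on HuggingFace.
fill_mask = pipeline(
    "fill-mask", model="jcblaise/electra-tagalog-base-cased-generator"
)

# The generator predicts candidate tokens for the [MASK] position.
predictions = fill_mask("Ang Pilipinas ay isang [MASK] sa Asya.")
for p in predictions[:3]:
    print(p["token_str"], round(p["score"], 4))
```

For classification or other downstream tasks, load the corresponding discriminator checkpoint instead.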

#### BERT Models

Tagalog BERT models for general-purpose finetuning. The checkpoints available on HuggingFace are the improved models used in Cruz & Cheng (2020). Older versions of the models used in Cruz et al. (2019) and Cruz & Cheng (2019) are available upon request.
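A finetuning setup typically wraps the checkpoint with a freshly initialized task head; a minimal sketch with `transformers` follows. The model identifier and the two-label setup are assumptions for illustration.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# NOTE: the model identifier is an assumption; verify it on the HuggingFace hub.
model_name = "jcblaise/bert-tagalog-base-cased"

tokenizer = AutoTokenizer.from_pretrained(model_name)
# Attach a freshly initialized 2-way classification head for finetuning.
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

inputs = tokenizer("Masarap ang pagkain.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

print(logits.shape)  # one row of two (untrained) class scores
```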

#### GPT-2 Models

The Tagalog GPT-2 model used to benchmark our fake news detection system in Cruz et al. (2020). We also make available an improved version of our GPT-2 model trained on NewsPH in addition to WikiText-TL-39.
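A short sketch of sampling from the model with the `text-generation` pipeline; the model identifier below is an assumption, so check the hub for the exact name.

```python
from transformers import pipeline, set_seed

# NOTE: the model identifier is an assumption; verify it on the HuggingFace hub.
generator = pipeline("text-generation", model="jcblaise/gpt2-tagalog")
set_seed(42)  # make the sampled continuation reproducible

# The output includes the prompt followed by the generated continuation.
outputs = generator("Ang bagong pangulo ng Pilipinas", max_new_tokens=20)
print(outputs[0]["generated_text"])
```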

#### Others

Here’s a list of other models used for demos, presentations, or proofs-of-concept.

## Translation Models

Checkpoints for translation systems used for WMT submissions and for research papers are available here.

Coming Soon

1. It is worth clearing up that the difference between “Filipino” and “Tagalog” is more sociopolitical than linguistic. Given that the Philippines is home to more than 170 spoken languages, Commonwealth Act No. 184 of 1936 created a national committee whose purpose was to “develop a national language” for the country. The result is what is known as “Filipino,” which is based on the Tagalog language. In practice, Filipino is identical to Tagalog, with the addition of the letters f, j, c, x, and z, plus a number of loanwords. For more information, see this link.