MyanBERTa: A Pre-trained Language Model For Myanmar

Model Description

MyanBERTa is a BERT-based pre-trained language model for Myanmar. It was pre-trained for 528K steps on a word-segmented Myanmar dataset of 5,992,299 sentences (136M words). The tokenizer is a byte-level BPE tokenizer with a vocabulary of 30,522 subword units, learned after word segmentation was applied to the text.
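To make the tokenization step concrete, here is a minimal sketch of byte-level BPE learned over text that has already been split into words. It is a toy, standard-library illustration of the general technique, not the tokenizer training code used for MyanBERTa. The function names (`learn_byte_bpe`, `merge_pair`, `encode_word`), the four-word corpus, and the merge count are all assumptions made for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_byte_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merges over the UTF-8 bytes of pre-segmented words."""
    # Each word starts as a tuple of single-byte symbols. Because words are
    # processed separately, merges never cross a word boundary.
    vocab = Counter(tuple(bytes([b]) for b in w.encode("utf-8")) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Apply the chosen merge to every word in the working vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode_word(word, merges):
    """Split one word into byte-level subword tokens by replaying merges in learned order."""
    symbols = tuple(bytes([b]) for b in word.encode("utf-8"))
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical, already word-segmented toy corpus (not the MyanBERTa training data).
corpus = ["မြန်မာ", "မြန်မာ", "မြန်မာ", "စာ"]
merges = learn_byte_bpe(corpus, num_merges=20)
for word in ["မြန်မာ", "စာ"]:
    tokens = encode_word(word, merges)
    print(word, len(tokens), [t.decode("utf-8", errors="replace") for t in tokens])
```

Working on bytes means any input can be encoded without an unknown-token fallback, and segmenting into words first keeps learned subword units from spanning two words. A production tokenizer would stop merging once the target vocabulary size is reached (30,522 units in MyanBERTa's case) rather than after a fixed small number of merges.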

Cite this work as:
Aye Mya Hlaing, Win Pa Pa, "MyanBERTa: A Pre-trained Language Model For Myanmar", in Proceedings of the 2022 International Conference on Communication and Computer Research (ICCR2022), November 2022, Seoul, Republic of Korea.


The model is available on the Hugging Face Hub.


26 Jul., 2022