MyanBERTa: A Pre-trained Language Model For Myanmar
Model Description
This model is a BERT-based Myanmar pre-trained language model. MyanBERTa was pre-trained for 528K steps on a word-segmented Myanmar dataset consisting of 5,992,299 sentences (136M words). As the tokenizer, a byte-level BPE tokenizer with a vocabulary of 30,522 subword units, learned after word segmentation, is applied.
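As a rough illustration of the tokenizer setup described above, a byte-level BPE tokenizer can be trained on a word-segmented corpus with the Hugging Face tokenizers library. This is a minimal sketch, not the original training script; the corpus file name and the special tokens are illustrative assumptions.

```python
from tokenizers import ByteLevelBPETokenizer

# Minimal sketch: learn a byte-level BPE vocabulary on a word-segmented corpus.
# "myanmar_segmented.txt" and the special tokens are assumptions for illustration.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["myanmar_segmented.txt"],   # one word-segmented sentence per line
    vocab_size=30522,                  # vocabulary size reported for MyanBERTa
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("myanberta_tokenizer")  # writes vocab.json and merges.txt
```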
Cite this work as:
Aye Mya Hlaing, Win Pa Pa, "MyanBERTa: A Pre-trained Language Model For Myanmar", In Proceedings of the 2022 International Conference on Communication and Computer Research (ICCR2022), November 2022, Seoul, Republic of Korea.
Download
The model is available on the Hugging Face Hub.
26 July 2022
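As a usage sketch, the model and tokenizer can be loaded with the transformers library. The repository ID below is an assumption; verify the exact ID on the Hugging Face Hub.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# Assumed Hub repository ID; check the Hugging Face Hub for the exact name.
model_id = "UCSYNLP/MyanBERTa"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Example: masked-token prediction with the pre-trained model.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
```

Since the tokenizer was learned on word-segmented text, input sentences should be word segmented in the same way as the pre-training data before being passed to the model.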