MyanBERTa: A Pre-trained Language Model For Myanmar
Model Description
This model is a BERT-based Myanmar pre-trained language model. MyanBERTa was pre-trained for 528K steps on a word-segmented Myanmar dataset consisting of 5,992,299 sentences (136M words). As the tokenizer, a byte-level BPE tokenizer with a vocabulary of 30,522 subword units, learned after word segmentation, is applied.
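As a rough illustration of the tokenizer setup described above, a byte-level BPE tokenizer can be trained on a word-segmented corpus with the Hugging Face tokenizers library. This is a minimal sketch, not the original training script; the corpus file name and the special tokens are illustrative assumptions.

```python
from tokenizers import ByteLevelBPETokenizer

# Minimal sketch: learn a byte-level BPE vocabulary on a word-segmented corpus.
# "myanmar_segmented.txt" and the special tokens are assumptions for illustration.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["myanmar_segmented.txt"],   # one word-segmented sentence per line
    vocab_size=30522,                  # vocabulary size reported for MyanBERTa
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("myanberta_tokenizer")  # writes vocab.json and merges.txt
```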
Cite this work as:
Aye Mya Hlaing, Win Pa Pa, "MyanBERTa: A Pre-trained Language Model For Myanmar", In Proceedings of the 2022 International Conference on Communication and Computer Research (ICCR2022), November 2022, Seoul, Republic of Korea.
Download
The model is available on the Hugging Face Hub.
26 July 2022
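As a usage sketch, the model and tokenizer can be loaded with the transformers library. The repository ID below is an assumption; verify the exact ID on the Hugging Face Hub.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# Assumed Hub repository ID; check the Hugging Face Hub for the exact name.
model_id = "UCSYNLP/MyanBERTa"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Example: masked-token prediction with the pre-trained model.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
```

Since the tokenizer was learned on word-segmented text, input sentences should be word segmented in the same way as the pre-training data before being passed to the model.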