Myanmar Word Segmentation

Word segmentation is not a trivial task for Myanmar text. As in several other Asian languages, Myanmar text does not use white space to delimit words the way English does. Segmentation is also essential because it is the fundamental first step of linguistic processing.

It may also be necessary to allow multiple correct segmentations of the same text, depending on the requirements of further Natural Language Processing steps, such as Machine Translation from Myanmar to other languages.

This system uses a combined model that pairs a bigram model with a word juncture model.
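The bigram half of such a model can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the romanized placeholder words, the toy corpus, the add-one smoothing, and the function names train_bigrams, log_prob and best_segmentation are hypothetical and are not taken from the system itself. The word juncture component is not shown, because the text does not describe how it is computed.

```python
import math
from collections import Counter

# Hypothetical pre-segmented training sentences (romanized placeholders,
# not real system data).
CORPUS = [
    ["kyaungtha", "thwa", "de"],
    ["kyaungtha", "la", "de"],
    ["kyaung", "thwa", "de"],
]


def train_bigrams(corpus):
    """Count context words and word pairs, with sentence boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence + ["</s>"]
        unigrams.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent word pairs
    vocab = {word for sentence in corpus for word in sentence} | {"</s>"}
    return unigrams, bigrams, len(vocab)


def log_prob(words, unigrams, bigrams, vocab_size):
    """Log probability of a segmentation under an add-one smoothed bigram model."""
    tokens = ["<s>"] + words + ["</s>"]
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        total += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
    return total


def best_segmentation(candidates, model):
    """Pick the candidate segmentation with the highest bigram score."""
    return max(candidates, key=lambda words: log_prob(words, *model))


model = train_bigrams(CORPUS)
candidates = [
    ["kyaungtha", "thwa", "de"],
    ["kyaung", "tha", "thwa", "de"],
]
print(best_segmentation(candidates, model))
```

With this toy corpus the first candidate scores higher, because its word pairs were seen in training while the second contains an unseen word. A real system would train on a much larger corpus and would probably use a different smoothing scheme.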

Word segmentation is necessary for high-level language analysis, including named entity recognition and syntactic parsing, which are used in many Natural Language Processing (NLP) applications such as machine translation systems.

Myanmar Word Segmentation Version 1.0 combines longest matching with a bigram method, using a pre-segmented corpus of 50,000 words. Version 1.0 does not include Named Entity Recognition. It was launched on August 3, 2011.
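As a rough illustration of longest matching, the sketch below scans left to right and always takes the longest lexicon entry that starts at the current position. It assumes the input has already been split into syllables, and it uses romanized placeholder syllables and a tiny hypothetical lexicon. The function name longest_match and the max_len parameter are illustrative and do not come from Version 1.0.

```python
def longest_match(syllables, lexicon, max_len=4):
    """Greedy left-to-right longest matching over a list of syllables.

    At each position, try the longest span (up to max_len syllables) whose
    concatenation is in the lexicon. A single syllable is always accepted
    as a fallback so the scan never stalls on unknown material.
    """
    words = []
    i = 0
    while i < len(syllables):
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = "".join(syllables[i:i + n])
            if n == 1 or candidate in lexicon:
                words.append(candidate)
                i += n
                break
    return words


# Hypothetical lexicon and input (romanized placeholders, not real system data).
LEXICON = {"kyaung", "kyaungtha", "thwa", "de"}
print(longest_match(["kyaung", "tha", "thwa", "de"], LEXICON))
```

Greedy matching commits to the first long match it finds, so it can choose badly when a shorter word would leave a better continuation. This is one reason to combine it with a statistical method such as bigrams.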