Abstract:
Today we live in a world with more multilingual speakers than monolingual ones. There is an ever-increasing amount of information on the World Wide Web written in different languages.
Ethiopia is a multilingual country par excellence, and multiple languages are used as media of administration, education and mass communication. However, these textual contents may not be expressed in a single, uniform format. To use such textual resources for various purposes, language identification (LID) is an important preprocessing task for understanding, organizing and analyzing them. LID is the detection of the natural language of an input text. It is also the first necessary step for any language-dependent natural language processing task. Although text-based LID has been studied extensively, there is still no comprehensive understanding of the factors that determine its identification accuracy, such as the length of the text fragment to be identified, the amount and variety of available training data, the classification algorithm, and the embedding technique used. LID for very closely related languages is another unsolved problem.
Current LID applications and models are unable to accurately identify the language of a given text written in the Ge'ez script because of the similarity among these languages. The Ethiopic script is an alpha-syllabary, or abugida (“አቡጊዳ”), writing system used for several languages spoken in Ethiopia and Eritrea.
In this work, we present a LID model for six typologically and phylogenetically related low-resource Ethiopian languages that use the Ge'ez script as their writing system, namely Amharic, Awngi, Ge'ez, Guragigna, Tigrigna and Xamtanga. The corpus was collected automatically from various sources, including Ethiopian mass media websites, social media, Bibles and related publications. We used the chars2vec embedding technique as the feature representation and a deep neural network (DNN) model for classification. To train and evaluate the proposed LID model, we conducted several experiments with sample texts of different lengths using the best hyperparameter settings. Finally,
the proposed LID model correctly identified the languages with an accuracy of more than 99% for
texts longer than 50 characters and an accuracy of 77.68% for texts only 5 characters long. The developed model also performed well on out-of-vocabulary texts. In cases where languages
are closely related and texts are very short, the identification performance of the proposed model
was relatively poor. Therefore, it would be of interest to continue exploring LID models that handle closely related languages and very short texts in the future.
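The abstract describes a pipeline of character-level embeddings followed by a DNN classifier over six language labels. The sketch below is an illustration only, not the authors' implementation: it substitutes a generic trainable character-embedding layer for chars2vec, and the layer sizes, sequence length, helper functions and toy data are all assumptions introduced for the example.

```python
# Minimal sketch of a character-level LID classifier (character embedding + DNN),
# loosely following the pipeline described in the abstract. All sizes, helpers
# and sample data below are illustrative assumptions, not the authors' setup.
import numpy as np
import tensorflow as tf

LANGUAGES = ["Amharic", "Awngi", "Ge'ez", "Guragigna", "Tigrigna", "Xamtanga"]
MAX_LEN = 50       # longest character sequence considered (assumption)
VOCAB_SIZE = 512   # character-vocabulary size (assumption)
EMBED_DIM = 50     # character-embedding dimension (assumption)

def encode(text, char_to_id):
    """Map a string to a fixed-length sequence of character ids (0 = padding, 1 = unknown)."""
    ids = [char_to_id.get(ch, 1) for ch in text[:MAX_LEN]]
    return ids + [0] * (MAX_LEN - len(ids))

def build_model():
    """Character embedding followed by a small feed-forward DNN with a softmax over the six languages."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(MAX_LEN,), dtype="int32"),
        tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True),
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(len(LANGUAGES), activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

if __name__ == "__main__":
    # Toy stand-in data; a real Ge'ez-script corpus with language labels is required.
    char_to_id = {}
    samples = ["ሰላም ለዓለም", "ሰላም ንዓለም"]   # illustrative short strings
    for s in samples:
        for ch in s:
            char_to_id.setdefault(ch, len(char_to_id) + 2)
    X = np.array([encode(s, char_to_id) for s in samples])
    y = np.array([0, 4])                    # illustrative labels: Amharic, Tigrigna
    model = build_model()
    model.fit(X, y, epochs=1, verbose=0)
    pred = np.argmax(model.predict(X[:1], verbose=0), axis=-1)[0]
    print(LANGUAGES[int(pred)])
```

In a sketch like this, averaging the character embeddings keeps the classifier length-independent, which mirrors why very short inputs (e.g. 5 characters) give the model far less signal than inputs of 50 or more characters.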