Albanian Language Identification in Text Documents

(1901.04216)
Published Jan 14, 2019 in cs.IR and cs.CL

Abstract

In this work we investigate the accuracy of standard and state-of-the-art language identification methods in identifying Albanian in written text documents. A dataset consisting of news articles written in Albanian has been constructed for this purpose. We noticed a considerable decrease in accuracy when using test documents that lack the Albanian alphabet letters "Ë" and "Ç", and created a custom training corpus that solved this problem, achieving an accuracy of more than 99%. Based on our experiments, the best-performing language identification methods for Albanian use a naïve Bayes classifier and n-gram based classification features.
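As a rough illustration of the kind of method the abstract names, the sketch below trains a multinomial naïve Bayes classifier on character n-gram counts. It is not the authors' implementation: the toy sentences, the labels "sq" and "en", the n-gram range of 1 to 3, the add-one smoothing, and the helper names are all assumptions made for this example. Adding copies of the Albanian training text with "ë" and "ç" replaced by "e" and "c" is one plausible way to make a training corpus robust to documents that lack those letters; the abstract does not say how the custom corpus was actually built.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_min=1, n_max=3):
    """Return all character n-grams of length n_min..n_max (hypothetical feature choice)."""
    padded = f" {text.lower()} "
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]


def strip_albanian_diacritics(text):
    """Replace ë/ç with e/c to imitate text typed without the Albanian letters."""
    return text.translate(str.maketrans("ëËçÇ", "eEcC"))


class NgramNaiveBayes:
    """Multinomial naive Bayes over character n-gram counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.counts = defaultdict(Counter)  # label -> n-gram counts
        self.docs = Counter(labels)         # label -> number of training documents
        for text, label in zip(texts, labels):
            self.counts[label].update(char_ngrams(text))
        self.vocab = {g for c in self.counts.values() for g in c}
        self.totals = {label: sum(c.values()) for label, c in self.counts.items()}
        return self

    def predict(self, text):
        n_docs = sum(self.docs.values())
        vocab_size = len(self.vocab)
        scores = {}
        for label in self.docs:
            score = math.log(self.docs[label] / n_docs)  # log prior
            for gram in char_ngrams(text):
                if gram in self.vocab:  # skip n-grams never seen in training
                    score += math.log((self.counts[label][gram] + 1)
                                      / (self.totals[label] + vocab_size))
            scores[label] = score
        return max(scores, key=scores.get)


# Toy training data (made up for this sketch, not the paper's corpus).
albanian = [
    "Unë jam student në universitet.",
    "Çfarë po bën sot?",
    "Kjo është një ditë e bukur.",
    "Shqipëria është një vend në Evropë.",
]
english = [
    "I am a student at the university.",
    "What are you doing today?",
    "This is a beautiful day.",
    "Albania is a country in Europe.",
]

# Assumption: augment Albanian text with diacritic-stripped copies.
albanian_augmented = albanian + [strip_albanian_diacritics(s) for s in albanian]

texts = albanian_augmented + english
labels = ["sq"] * len(albanian_augmented) + ["en"] * len(english)
model = NgramNaiveBayes().fit(texts, labels)

print(model.predict("Kjo eshte nje dite e bukur"))
print(model.predict("This is a beautiful day"))
```

With only a handful of sentences per language this is a toy; a realistic setup would train on a large corpus per language and evaluate on held-out documents.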
