Autonomy Stemmer Algorithm for Legal and Illegal Affix Detection use Finite-State Automata Method

Ana Tsalitsatun Ni'mah; Dwi Ari Suryaningrum; Agus Zainal Arifin

doi:10.25042/epi-ije.022019.09

Ana Tsalitsatun Ni'mah, Informatics Department, Faculty of Information Technology, Institut Teknologi Sepuluh Nopember
Dwi Ari Suryaningrum, Informatics Department, Faculty of Information Technology, Institut Teknologi Sepuluh Nopember
Agus Zainal Arifin, Informatics Department, Faculty of Information Technology, Institut Teknologi Sepuluh Nopember

Keywords: Autonomy Stemmer, Confix Stripping Stemmer, Finite State Method, Porter Indonesian Language, Regular Expression, Stemming

Abstract

Stemming is the process of separating words from their affixes to obtain the root word. It is commonly used as a preprocessing step in text-based applications. Research on Indonesian stemming falls into two types: stemming without a dictionary and stemming with a dictionary. Stemming without a dictionary sometimes removes affixes incorrectly, which results in over-stemming or under-stemming, while stemming with a dictionary is relatively slow and cannot remove affixes from compound words. This study proposes a new dictionary-free stemming algorithm that detects legal and illegal affixes in Indonesian using the Finite-State Automata method. The technique is a rule-based stemmer built on Indonesian morphology and expressed with regular expressions. Testing used 118 news documents containing 15792 words. In the first test, the autonomy stemmer produced the correct word for 10449 of the words processed, an average accuracy of 66%. In the second test, it achieved an average speed of 0.0051 seconds. In the third test, it was able to remove affixes from compound words.
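
To make the idea of a dictionary-free, rule-based stemmer concrete, the following is a minimal Python sketch that strips one prefix and one suffix with a single regular expression. It is not the autonomy stemmer itself: the affix inventory, the three-letter minimum stem length, and the names `PATTERN` and `strip_affixes` are assumptions chosen for illustration. The pattern uses only regular constructs, so the set of words it accepts could in principle be recognized by a finite-state automaton, although Python's `re` module itself works by backtracking.

```python
import re

# Assumed, deliberately tiny affix inventory (illustration only,
# not the rule set used by the autonomy stemmer).
PATTERN = re.compile(
    r"^(?P<prefix>meng|mem|men|me|ber|di|ter)?"  # optional prefix
    r"(?P<stem>[a-z]{3,}?)"                      # shortest stem of 3+ letters
    r"(?P<suffix>kan|an|i)?$"                    # optional suffix
)


def strip_affixes(word):
    """Return the stem found by PATTERN, or the word unchanged if no match."""
    match = PATTERN.match(word.lower())
    return match.group("stem") if match else word


if __name__ == "__main__":
    for w in ["membacakan", "dibaca", "makanan", "meja", "berlari"]:
        print(w, "->", strip_affixes(w))
```

With these assumed rules, "membacakan" and "dibaca" both reduce to "baca", and "meja" is left intact because removing "me-" would leave a stem shorter than three letters. "berlari" is reduced to "lar" instead of "lari", a small instance of the over-stemming that the abstract attributes to stemming without a dictionary.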