Autonomy Stemmer Algorithm for Legal and Illegal Affix Detection Use Finite-State Automata Method

Stemming is the process of separating a word from its affixes to obtain its base word. It is commonly used as a preprocessing step in text-based applications. Research on Indonesian stemming falls into two types: stemming without a dictionary and stemming with a dictionary. Stemming without a dictionary sometimes removes affixes incorrectly, which results in over stemming or under stemming, while stemming with a dictionary takes a relatively long time and cannot remove affixes from compound words. This study proposes a new dictionary-free stemming algorithm that detects legal and illegal affixes in Indonesian using the Finite-State Automata method. The technique is a rule-based stemmer built on Indonesian morphology and expressed with regular expressions. Testing was carried out on 118 news documents containing 15,792 words. In the first test, the autonomy stemmer algorithm produced 10,449 correct words out of the total number of words processed, an average accuracy of 66%. In the second test, it reached an average speed of 0.0051 seconds. In the third test, it was able to remove affixes from compound words.


Introduction
Stemming is the process of separating a base word from its affixes based on a morphological mapping of the variants of affixed words [1]. In informatics, stemming is used in text processing, typically for information retrieval, translation, and similar tasks [2][3][4]. Morphology, the process of word formation, is central to stemming algorithms [2,5,6]. In Indonesian, the words that undergo morphological processes are affixed words, reduplicated words, and compound words. Stemming for English mainly has to handle one type of affix, the suffix, whereas Indonesian morphology has several types of affixes, namely prefixes, infixes, suffixes, combined prefixes and suffixes (confixes), and foreign affixes.
The first Indonesian stemming algorithm was developed by Nazief and Adriani [7] and is called Confix Stripping (CS). The CS algorithm removes affixes by referring to an Indonesian dictionary at each step. Arifin and Setiono [8] developed the algorithm further and simplified its affix rules. Tala [1] conducted research on stemming Indonesian without a dictionary. His stemmer is based on the Porter algorithm, a stemming algorithm for English, which Tala applied to Indonesian. Furthermore, Putra et al. [7] studied various types of stemming in Indonesian. Their study presents several Indonesian stemming algorithms, including Confix Stripping (Nazief and Adriani), Modified Confix Stripping (Arifin and Setiono), Vega, etc. Adriani et al. [9] carried out further research on Confix Stripping and concluded that it was the most stable algorithm for stemming Indonesian at that time. Arifin et al. [10] then developed the Confix Stripping algorithm into the Enhanced Confix Stripping Stemmer (ECS), which modifies the rules in Confix Stripping. Apart from these, there are many other studies on Indonesian stemming [2,3,5,6].
Research on stemming includes work that removes affixes with the brute force (table lookup) or dictionary-based technique, and work that uses affix removal techniques. The basis for Indonesian stemming with affix removal is the Indonesian Porter stemming algorithm [1], while the basis for dictionary-based stemming is Confix Stripping [7]. Agusta [12] compared the Porter stemming algorithm for Indonesian with the Confix Stripping stemming algorithm and reported several findings. Among other things, the Porter stemming algorithm takes less time than the Confix Stripping algorithm, but its accuracy is lower, with an average difference of 20%. The dictionary strongly influences the results of Confix Stripping: the more complete the dictionary, the more accurate the stemming results. According to Tahitoe and Purwitasari [11], the Enhanced Confix Stripping algorithm that they developed is still unable to stem compound words. According to Widjaja and Hansun [6], the Indonesian Porter stemming algorithm also has drawbacks, namely over stemming and under stemming, which reduce the efficiency and performance of the stemming algorithm [6].
*Corresponding author. Tel.: +62 857 3612 4000. Jl. Raya Telang-Kamal, Bangkalan, Indonesia, Postcode 69162.
This study proposes a new dictionary-free stemming algorithm that detects legal and illegal affixes in Indonesian using the Finite-State Automata method. The purpose of this study is to obtain stemming results with high accuracy and speed without relying on a dictionary during affix removal, so that affixes can also be removed from compound words.

Finite-state automata and regular expression
An automaton is a sequence of processes that automatically receives input and produces discrete output. The input sequence it receives is a string of a language recognized by the automaton. If the input sequence is accepted and recognized, the machine produces output [5].
A Finite-State Machine (FSM) is an abstract machine, defined mathematically, with discrete inputs and outputs. It can recognize the simplest class of languages (regular languages) and can be implemented in practice, with the system always in one of a finite number of internal configurations called states [5]. An FSM works by reading its input from a tape, one character at a time from left to right, using a read head controlled by a finite state control box that holds the finite set of states. The FSM is always in a condition called the initial state when it starts to read a tape. The state changes when the next character is read. When the head arrives at the end of the tape and the machine is in a final state, the string on the tape is said to be accepted by the FSM (a string belongs to the language of the FSM if the FSM accepts it). The language of an FSM can be stated compactly as a regular expression.
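As a rough illustration of the tape-reading behaviour described above, the following minimal sketch simulates a deterministic finite-state machine. The state names, the transition table, the "any" catch-all edge, and the toy language (the letters "di" followed by at least one more character) are assumptions made for illustration and are not taken from this study.

```python
def run_fsm(transitions, start, finals, tape):
    """Read the tape left to right, one character at a time.

    Returns True when the machine is in a final state after the last
    character, i.e. the string is accepted.
    """
    state = start
    for ch in tape:
        # Prefer an edge for the exact character, then a catch-all "any" edge.
        key = (state, ch) if (state, ch) in transitions else (state, "any")
        if key not in transitions:
            return False  # no transition defined: the string is rejected
        state = transitions[key]
    return state in finals


# Hypothetical machine: accepts "di" followed by at least one more character.
TRANSITIONS = {
    ("q0", "d"): "q1",
    ("q1", "i"): "q2",
    ("q2", "any"): "q3",
    ("q3", "any"): "q3",
}

print(run_fsm(TRANSITIONS, "q0", {"q3"}, "dibaca"))  # accepted
print(run_fsm(TRANSITIONS, "q0", {"q3"}, "baca"))    # rejected
```

The same language could be written as the regular expression `di.+`, which is the sense in which an FSM can be stated as a regular expression.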
Regular expressions, often referred to as regex, are formulas for describing patterns in sentences or strings. Regex is very helpful in finding such patterns, so that experiments with every possible pattern need not be done. Regular expressions are used by many word processors, text editors, and other tools to search for and manipulate text based on a certain pattern. At a low level, a regex can search for a word fragment. At a high level, a regex can control the data by searching, deleting, and changing it [5].
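The three uses mentioned above (searching, deleting, and changing) can be sketched with Python's standard `re` module. The sample text and the patterns are hypothetical and only illustrate the mechanics; they are not the formulas used in this study.

```python
import re

text = "membaca dibaca bacaan"

# Searching: find every word that starts with the letters "di".
found = re.findall(r"\bdi\w+", text)

# Deleting: drop a leading "di" from each word that has it.
deleted = re.sub(r"\bdi(\w+)", r"\1", text)

# Changing: mark the ending "an" at the end of a word.
changed = re.sub(r"(\w+)an\b", r"\1+AN", text)

print(found)
print(deleted)
print(changed)
```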

Stemming
Stemming methods can be classified into three techniques [13]: rule-based, statistical, and hybrid, as shown in Fig. 1.

Rule-based stemmer
This stemmer is more accurate than the other stemmer techniques because it takes the rules of the language into account during stemming. Rule-based stemmers fall into three categories: the brute force method, the affix removal method, and the morphological method.
The brute force method, also known as the table lookup technique, carries out stemming on the basis of a search table that contains a collection of base words, that is, a base word dictionary.
The affix removal method deletes the suffix or prefix of a word to turn it into a base word. Most stemmers currently in use take this approach. The affix removal method is based on two principles: iteration and longest match [14]. The method starts at the end of the word and works towards the beginning. Within one class of affixes, no more than one removal is permitted. Stemming algorithms that use this approach include Lovins and M. F. Porter [14]. For Bahasa Indonesia, the stemming algorithm that uses this technique is the Indonesian Porter stemmer. Removal occurs only once per pass.
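The two principles named above, iteration and longest match, can be sketched as follows. The suffix classes, their order, and the minimum stem length of three letters are assumptions for illustration; the sketch handles suffixes only and is not the Lovins or Porter algorithm.

```python
def remove_longest(word, suffixes, min_stem=3):
    """Remove the longest suffix of one class that matches the end of word."""
    matches = [s for s in suffixes
               if word.endswith(s) and len(word) - len(s) >= min_stem]
    if not matches:
        return word
    return word[: -len(max(matches, key=len))]


def affix_removal(word, suffix_classes):
    """Iterate over the classes, allowing at most one removal per class."""
    for suffixes in suffix_classes:
        word = remove_longest(word, suffixes)
    return word


# Hypothetical classes, ordered from the outermost ending inwards.
CLASSES = [("lah", "kah", "pun"), ("ku", "mu", "nya"), ("i", "an", "kan")]

print(affix_removal("bacakan", CLASSES))     # "-kan" wins over "-an"
print(affix_removal("bukunyalah", CLASSES))  # "-lah", then "-nya"
```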
Morphological methods are stemming techniques that use the morphological rules of the language to remove affixes. Unlike the affix removal method, this method allows several affixes to be removed simultaneously in one deletion process.

Statistical stemmer
The lexicon technique groups words according to similarity. Stemming is done by finding the closest distance in meaning among the words that have been collected. The corpus technique is similar to the lexicon technique; the difference is that the lexicon technique collects words based on meaning, whereas the corpus technique collects words that are morphologically similar or similarly written.
The N-gram method was introduced in 1974 by Adamson and Boreham. An n-gram is a sequence of n consecutive letters; with n = 2 it is called a digram, that is, a pair of consecutive letters [14]. This approach associates pairs of words on the basis of the unique digrams they share, and measures the association with the Dice coefficient. For example, the terms "information" and "informative" break into digrams as follows [14]:
information => in nf fo or rm ma at ti io on
unique digrams = in nf fo or rm ma at ti io on
informative => in nf fo or rm ma at ti iv ve
unique digrams = in nf fo or rm ma at ti iv ve
Thus, "information" has ten digrams, all of which are unique, and "informative" also has ten digrams, all of which are unique. The two words share eight unique digrams: in, nf, fo, or, rm, ma, at, and ti. Once the unique digrams of a pair of words have been identified and counted, a similarity measure based on them is calculated. The similarity measure used is the Dice coefficient, expressed in Eq. 1:
$$S = \frac{2C}{A + B} \qquad (1)$$
where A is the number of unique digrams in the first word, B is the number of unique digrams in the second word, and C is the number of unique digrams shared by the two words. For the example above, the Dice coefficient is (2 x 8) / (10 + 10) = 0.80. Such a similarity measure is determined for all pairs of terms in the database. Once the similarities have been calculated, the words are grouped with their partners. The value of the Dice coefficient suggests that the base word for this pair lies in the first 8 digrams [14].
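The worked example can be reproduced with a short sketch. The function names are hypothetical, and returning 0 for words shorter than two letters is an assumption added to avoid division by zero.

```python
def unique_digrams(word):
    """Return the set of unique pairs of consecutive letters in a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def dice_coefficient(first, second):
    """Compute 2C / (A + B) over the unique digrams of two words."""
    a = unique_digrams(first)
    b = unique_digrams(second)
    if not a and not b:
        return 0.0  # assumption: no digrams at all means no similarity
    return 2 * len(a & b) / (len(a) + len(b))


print(sorted(unique_digrams("information") & unique_digrams("informative")))
print(dice_coefficient("information", "informative"))  # 0.8
```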

Hybrid stemmer
The hybrid technique combines several techniques, for example the table lookup technique with affix removal. A stemming algorithm that uses this technique is Confix Stripping. Confix Stripping removes affixes based on Indonesian morphology and matches the result against an Indonesian dictionary table, with the deletion process following the affix removal rules one by one.

Indonesian morphology
The technique used in the algorithm of this study is based on the word grammar in the Indonesian grammar guidebook from the Ministry of Education and Culture [15]. A basic prefix is a prefix in its most basic form that has not undergone any change. There are six of them, namely meng-, peng-, ber-, di-, ter-, and se-. Some basic prefixes change form, following several rules, when they are attached to certain base words.
per- remains per- when attached to a base word that begins with a consonant. Example: per- + tanda becomes pertanda (sign). per- changes into pel- when attached to the base word ajar. Example: per- + ajar becomes pelajar (student). per- changes into pe- when attached to a base word that begins with r followed by a vowel, or to the base words tani and tinju. Example: per- + tani becomes petani (farmer). As an exception, per- remains per- when the suffix -an is also attached.
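The rules for per- can be sketched as a small function. The function name is hypothetical, "r followed by a vowel" is one reading of the rule, and applying the -an exception only to the listed words tani and tinju is an assumption, since the text does not say how the exception interacts with the other cases.

```python
VOWELS = set("aiueo")
PE_WORDS = {"tani", "tinju"}  # base words listed in the text


def attach_per(base, suffix=""):
    """Attach per- to a base word, choosing per-, pel-, or pe-."""
    if base == "ajar":
        prefix = "pel"
    elif base in PE_WORDS:
        # Assumption: the -an exception is applied to the listed words only.
        prefix = "per" if suffix == "an" else "pe"
    elif base[:1] == "r" and base[1:2] in VOWELS:
        prefix = "pe"
    else:
        prefix = "per"
    return prefix + base + suffix


for base in ("tanda", "ajar", "tani"):
    print(base, "->", attach_per(base))
print("tani + an ->", attach_per("tani", "an"))
```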
The basic suffixes are the suffixes in their most basic form. Unlike the prefixes, they do not develop or change. There are only three, namely -an, -kan, and -i.
The rules for combining prefixes and suffixes are given in the table of legal and illegal affixes. In line with these rules, the elements of a combination of words, commonly called a compound word, including special terms, are written separately. However, if the combination of words takes a prefix and a suffix at the same time, its elements are written as one word. The base form tanggung jawab (responsibility) must therefore also be written as one word when it takes a prefix and a suffix at once: the correct form is pertanggungjawaban (accountability), not a form that separates its elements. Combined affixes on phrases follow the same rules as affixes on compound words.

Indonesian stemming
Indonesian stemming was first developed by Nazief and Adriani in 1996. Their stemmer checks a base word dictionary at each step of affix removal. Indonesian stemming without a dictionary was later developed by Tala in 2005. Stemming without a dictionary relies only on affix removal techniques, as Porter's stemmer does.

Nazief and Adriani (confix stripping)
The Nazief and Adriani [9] stemming algorithm was developed on the basis of a dictionary lookup table of base words and Indonesian morphological rules, which group affixes into prefixes, infixes, suffixes, and confixes (combined prefixes and suffixes). The algorithm uses a dictionary of base words and supports recoding, that is, the rearrangement of words that have been over stemmed [11].
The Indonesian morphology rules used in the Confix Stripping algorithm are grouped into the following categories [11]:
a) Inflection suffixes are a group of suffixes that do not change the form of the base word.
1) Particles (P), which include "-lah", "-kah", "-tah", and "-pun".
2) Possessive pronouns (PP), which include "-ku", "-mu", and "-nya".
b) Derivation suffixes (DS) are the original Indonesian suffixes that are added directly to base words, namely "-i", "-kan", and "-an".
c) Derivation prefixes (DP) are prefixes that can be attached directly to a pure base word, or to a base word that has already received up to 2 prefixes.
1) Prefixes that can change form ("me-", "be-", "pe-", and "te-").
2) Prefixes that do not change form ("di-", "ke-", and "se-").
These rules are used in the stemming algorithm by Nazief and Adriani. However, not all affix combinations are allowed by Confix Stripping [9]. Some affix combinations that are not allowed can be seen in Table 2.
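A check against a table of disallowed combinations can be sketched as a simple lookup. The pairs below are the combinations commonly listed for the Nazief and Adriani algorithm; they are an assumption here and should be checked against Table 2, which is not reproduced in this text. The function name is hypothetical.

```python
# Assumed contents of the disallowed-combination table (prefix -> suffixes).
DISALLOWED = {
    "be": {"i"},
    "di": {"an"},
    "ke": {"i", "kan"},
    "me": {"an"},
    "se": {"i", "kan"},
    "te": {"an"},
}


def is_disallowed(prefix, suffix):
    """Return True when the prefix-suffix pair is in the assumed table."""
    return suffix.strip("-") in DISALLOWED.get(prefix.strip("-"), set())


print(is_disallowed("di-", "-an"))   # True under the assumed table
print(is_disallowed("di-", "-kan"))  # False under the assumed table
```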
The Confix Stripping algorithm has the following processes [16]:
a) Look up the word to be stemmed in the dictionary. If it is found, the word is assumed to be a root word and the algorithm stops.
b) Remove the inflection suffixes. Inflection suffixes always occur in sequence: the algorithm first removes the particle (P) {"-kah", "-lah", "-tah", or "-pun"}, and then the possessive pronoun (PP) {"-ku", "-mu", or "-nya"}.
c) Remove the derivation suffix ("-i", "-an", or "-kan"). If the word is found in the dictionary, the algorithm stops. If not, go to step c1.
1) If "-an" has been removed and the last letter of the word is "k", the "k" is also removed. If the word is found in the dictionary, the algorithm stops. If it is not found, do step c2.
2) The removed suffix ("-i", "-an", or "-kan") is restored; proceed to step d.
d) Remove the derivation prefix. If a suffix was removed in step c, go to step d1; otherwise go to step d2.
1) Check the table of prefix-suffix combinations that are not permitted. If the combination is found, the algorithm stops; if not, go to step d2.
2) For i = 1 to 3, determine the prefix type and then remove the prefix. If the root word has not been found, do step e; if it has, the algorithm stops. Note: if the second prefix is the same as the first prefix, the algorithm stops.
e) Recoding.
f) If all steps have been completed without success, the initial word is assumed to be the root word. The process is complete.
After a number of experiments and analyses, several words were found that could not be stemmed by the Confix Stripping stemmer. The analysis behind the Enhanced Confix Stripping Stemmer algorithm attributes these failures to the following:
a) Missing prefix removal rules for words of the form "mem+p...", "men+s...", and "peng+k...". This affects the words "mempromosikan", "memproteksi", "mensyaratkan", "mensyukuri", and "pengkajian".
b) Inadequate prefix removal rules for words of the form "menge+base word" and "penge+base word", as in the words "mengerem" and "pengeboman".
c) Elements in some base words that resemble an affix. Words like "pelanggan", "perpolitikan", and "pelaku" fail to be stemmed because the endings "-an", "-kan", and "-ku" should not be removed.
To correct the errors above, the ECS Stemmer algorithm makes several improvements:
a) Modify and add to the rules.
b) Add an additional algorithm to overcome suffix removals that should not have been performed. This algorithm is called the Return Suffix loop and is run if the recoding process fails. It works as follows:
1) Restore all prefixes that have been removed, resulting in the word model [DP+[DP+[DP]]] + base word. Prefix removal followed by a dictionary search is then performed on the word restored to that model.
2) Restore the suffixes in the order of the Indonesian word model. The return starts from DS ("-i", "-kan", "-an"), then PP ("-ku", "-mu", "-nya"), and finally P ("-lah", "-kah", "-tah", "-pun"). For each return, do steps 3) to 5) below. For the "-kan" suffix specifically, "k" is restored first, followed by "an".
3) Check the base word dictionary. If the word is found, the process stops. If not, remove the prefix according to the rules.
4) Perform recoding if needed.
5) If the lookup in the base word dictionary still fails after recoding, the removed prefixes are restored again.
This algorithm still has several disadvantages that must be corrected:
• Removal of affixes from compound words that have combined affixes.
• Over stemming and under stemming.
• The speed of the stemming process.

Ledy Agusta (porter)
According to Milutinovich, Porter's stemmer algorithm was first devised in 1979 by Martin Porter in the computer lab. The Porter stemming algorithm removes morphological and inflectional suffixes from English words. The Porter algorithm, which was originally developed for English, was developed for Indonesian by Frakes [6]. Porter's stemmer works well for English [17]. It has become the standard stemmer for English, and the same stemming approach has been adopted for other languages, i.e. Romance (French, Italian, Portuguese and Spanish), Germanic (Dutch and German), Scandinavian (Danish, Norwegian and Swedish), Finnish, and Russian [18]. The Porter stemmer is a linear step stemmer: applying morphological rules sequentially allows affixes to be removed gradually [19].
The steps of this algorithm are as follows [12]:
a) Remove the particle.
b) Remove the possessive pronoun.
c) Remove the first prefix. If there is none, proceed to step d; if there is one, proceed to step e.
d) Remove the second prefix, then proceed to step f.
e) Remove the suffix. If no suffix is found, the word is assumed to be the root word; if one is found, proceed to step g.
f) Remove the suffix. The final word is then assumed to be the base word.
g) Remove the second prefix. The final word is then assumed to be the base word.
This algorithm still has several disadvantages, namely over stemming and under stemming.
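The control flow of steps a) to g) can be sketched as below. The affix lists are reduced, hypothetical examples, the minimum stem length of three letters is an assumption, and the sketch omits the syllable conditions and letter substitutions of a full Porter-style stemmer, so it should be read as an outline of the branching rather than as the algorithm in [12].

```python
PARTICLES = ("kah", "lah", "pun")
POSSESSIVE_PRONOUNS = ("nya", "ku", "mu")
FIRST_PREFIXES = ("meng", "men", "mem", "me", "peng", "pen", "pem",
                  "di", "ter", "ke")
SECOND_PREFIXES = ("ber", "be", "per", "pe")
SUFFIXES = ("kan", "an", "i")
MIN_STEM = 3  # assumption: never cut a word below three letters


def strip_suffix(word, suffixes):
    """Cut the first listed suffix that matches; report whether one was cut."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            return word[: -len(suffix)], True
    return word, False


def strip_prefix(word, prefixes):
    """Cut the first listed prefix that matches; report whether one was cut."""
    for prefix in prefixes:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM:
            return word[len(prefix):], True
    return word, False


def porter_like_stem(word):
    word, _ = strip_suffix(word, PARTICLES)               # step a
    word, _ = strip_suffix(word, POSSESSIVE_PRONOUNS)     # step b
    word, had_first = strip_prefix(word, FIRST_PREFIXES)  # step c
    if not had_first:
        word, _ = strip_prefix(word, SECOND_PREFIXES)     # step d
        word, _ = strip_suffix(word, SUFFIXES)            # step f
        return word
    word, had_suffix = strip_suffix(word, SUFFIXES)       # step e
    if had_suffix:
        word, _ = strip_prefix(word, SECOND_PREFIXES)     # step g
    return word


print(porter_like_stem("dibacakan"))
print(porter_like_stem("berlari"))
```

With these lists, "dibacakan" reduces to "baca", while "berlari" is cut to "lar" because the final "i" of the base word "lari" is mistaken for a suffix, which is an instance of the over stemming mentioned above.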

Autonomy stemmer
The ECS Stemmer algorithm combines a lookup table of base words with affix removal techniques. The Indonesian Porter algorithm uses affix removal techniques only. Both use Indonesian morphology as the basis for removal.
The modification that we propose uses the Indonesian morphology rules as the reference for removing affixes by expressing them as regular expressions. The steps of the proposed stemming algorithm can be seen in Fig. 2, with the following explanation:
a) Analyze the word to be stemmed. If it has 3 characters or fewer, the process is complete.
b) Analyze the word to be stemmed. If it has an illegal affix combination according to Table 1, the process is complete.
c) Analyze the word to be stemmed. If its structure matches a regular expression formula in Table 3, remove the affix.
The formula is read as follows: a prefix is found, other than the letter e, followed by the letters a to z with a minimum of 3 characters, and the word may end in the suffix -an, a possessive pronoun, or a particle. The stemming process is then carried out, and the output is the word with its prefix and endings removed. f) The remaining formulas are read in almost the same way.
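Steps a) to c) can be sketched with a single regular expression. Tables 1 and 3 are not reproduced in this text, so the illegal pairs, the pattern, and the function name below are hypothetical stand-ins and not the formulas of this study. The sketch parses the word before checking the pair so that an ending such as -kan is not mistaken for -an.

```python
import re

# Hypothetical stand-in for Table 1: (prefix, suffix) pairs treated as illegal.
ILLEGAL_PAIRS = {("di", "an"), ("ke", "i"), ("se", "kan")}

# Hypothetical stand-in for one formula of Table 3:
# optional prefix, a stem of at least 3 letters, optional endings.
AFFIX_PATTERN = re.compile(
    r"^(?P<prefix>di|ke|se)?"
    r"(?P<stem>[a-z]{3,}?)"
    r"(?P<suffix>kan|an|i)?"
    r"(?P<pronoun>ku|mu|nya)?"
    r"(?P<particle>lah|kah|tah|pun)?$"
)


def autonomy_like_stem(word):
    word = word.lower()
    if len(word) <= 3:                                       # step a
        return word
    match = AFFIX_PATTERN.match(word)
    if match is None:
        return word
    if (match["prefix"], match["suffix"]) in ILLEGAL_PAIRS:  # step b
        return word
    return match["stem"]                                     # step c


for w in ("dibacakan", "kesehatan", "kelahi", "bukunya"):
    print(w, "->", autonomy_like_stem(w))
```

Because the stem group is matched lazily, the pattern prefers the shortest stem that still leaves a valid ending, which is a design choice of this sketch and one possible source of over stemming without a dictionary.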

Result and Discussion
This chapter presents the results of our study and discusses them. The experiment consists of several processes. First, 118 news documents are input, and preprocessing is applied to punctuation and conjunctions. After preprocessing, the program splits each paragraph into a table of individual words for the stemming process. This yields 15,792 words to be processed by the system. The next step of the experiment is to classify the results of the stemming. The trial process in this study is explained in full below:

News Document Dataset Input
The dataset used in this experiment consists of 118 news documents crawled from online news sites. After preprocessing, the dataset yields 15,792 words to be stemmed.

Classification of Result of Stem Errors and Fixes
Errors and improvements in the stemming results are classified manually by categorizing the types of words that were stemmed. Word truth is predicted by matching each stemmed word against an Indonesian dictionary dataset. If the stemmed word is not found in the dictionary database, its word truth count is 0. Here are some of the classifications:
Economy (Ekonominya) Ekonomi Ekonom Ekonomi Ekonomi
a) Improvements in stemming compound words
One of the aims of our research on modifying the stemmer is to correct errors in stemming compound words. Some of them are summarized in Table IV, which lists 9 examples of compound words with combined affixes. The results show that the proposed algorithm obtains the best results among the algorithms compared.
b) Improvements to over stemming and under stemming
Some cases of over stemming and under stemming can be corrected by the proposed algorithm, as shown in Table V. The table lists only 6 words that the proposed algorithm stems successfully; the other words are included in Appendix A.
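The word truth scoring described above can be sketched as a dictionary lookup that assigns 1 to a stem found in the dictionary and 0 otherwise. The function name, the toy dictionary, and the sample stems are hypothetical; the study uses a full Indonesian dictionary dataset.

```python
def score_stems(stems, dictionary):
    """Return (correct, incorrect, accuracy) for a list of stemmed words."""
    truth = [1 if stem.lower() in dictionary else 0 for stem in stems]
    correct = sum(truth)
    accuracy = correct / len(truth) if truth else 0.0
    return correct, len(truth) - correct, accuracy


# Toy dictionary and stemmer outputs, for illustration only.
toy_dictionary = {"ekonomi", "baca"}
correct, incorrect, accuracy = score_stems(
    ["Ekonomi", "Ekonom", "baca"], toy_dictionary
)
print(correct, incorrect, round(accuracy, 2))
```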

Calculation of Average Results
This section describes how the average stemming results are calculated. Three calculations are made: the number of true words, the average processing speed, and the percentage of true words. Each is explained below:

a) Average word truth
Word truth is counted over the 118 documents processed, which produced 15,792 words. The modification algorithm obtains 10,449 correct words, the ECS algorithm obtains 11,530 correct words, and the Porter algorithm obtains 9,043 correct words. The modification algorithm produces 5,343 incorrect words, the ECS algorithm produces 4,262 incorrect words, and the Porter algorithm produces 6,749 incorrect words. These results can be seen in Table 6 and Fig. 3.

b) Average process speed
The average speed obtained when processing the 118 news documents with 15,792 words can be seen in Table 6. The Modification algorithm reaches 0.0051 seconds, ECS reaches 1.9195 seconds, and the Porter algorithm reaches 0.0039 seconds.

c) Percentage of the word truth
We summarize the overall averages in Table 6. The Modification algorithm that we propose reaches 66% word truth, the ECS algorithm reaches 73%, and the Porter algorithm reaches 57%. We obtain a percentage of no more than 70% because the dataset is not made up entirely of standard-language documents, so many stemming results are nonstandard words that cannot be matched in the dictionary during the word truth calculation.
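The reported figures can be cross-checked with a few lines of arithmetic. The counts and timings are those reported above; the variable and function names are hypothetical, and the time ratio is a derived value that is not stated in the study.

```python
TOTAL_WORDS = 15792
CORRECT = {"Modification": 10449, "ECS": 11530, "Porter": 9043}
SECONDS = {"Modification": 0.0051, "ECS": 1.9195, "Porter": 0.0039}


def accuracy_percent(correct, total=TOTAL_WORDS):
    """Percentage of true words, rounded to the nearest whole percent."""
    return round(100 * correct / total)


for name, correct in CORRECT.items():
    incorrect = TOTAL_WORDS - correct
    print(f"{name}: {correct} correct, {incorrect} incorrect, "
          f"{accuracy_percent(correct)}%")

# Ratio of the reported ECS time to the reported Modification time.
ratio = SECONDS["ECS"] / SECONDS["Modification"]
print(f"ECS time / Modification time = {ratio:.0f}")
```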

Conclusion
This study set out to obtain stemming results with high accuracy and speed without relying on a dictionary during affix removal, so that affixes can also be removed from compound words. In the trial, the proposed algorithm obtained 10,449 true words, an accuracy of 66%, while Porter obtained 9,043 true words, an accuracy of 57%. The second objective of this study was a stemming time faster than that of ECS: the proposed algorithm needed 0.0051 seconds, while ECS needed 1.9195 seconds to process the 15,792 words. This study has several shortcomings that call for further development, namely improvements to over stemming and under stemming in words that have foreign affixes, Sanskrit affixes, and infixes.