research

HSRA: Hindi stopword removal algorithm

Abstract

In the last few years, electronic documents have been the main source of data in many research areas such as Web Mining, Information Retrieval, Artificial Intelligence, and Natural Language Processing. Text processing plays a vital role in handling structured and unstructured data from the web. Preprocessing is the main step in any text processing system. One significant preprocessing technique is the elimination of function words, also known as stopwords, which affects the performance of text processing tasks. An efficient stopword removal technique is therefore required in all text processing tasks. In this paper, we propose a stopword removal algorithm for the Hindi language that uses the concept of a Deterministic Finite Automaton (DFA). A large number of existing stopword removal techniques are based on a dictionary containing a stopword list: a pattern matching technique is applied, and the matched patterns, which are stopwords, are removed from the document. This is time consuming because the search process takes a long time, which makes the method inefficient and expensive. In comparison, our algorithm has been tested on 200 documents and achieved 99% accuracy while also being time efficient.
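To make the DFA idea concrete, the following is a minimal sketch of how a stopword list could be compiled into a trie-shaped automaton and used to filter tokens. It is an illustration under stated assumptions, not the HSRA implementation: the seven-word list `SAMPLE_STOPWORDS`, the whitespace tokenizer, and the function names `build_dfa`, `is_stopword`, and `remove_stopwords` are all hypothetical.

```python
# Hypothetical, tiny stopword sample; the list used in the paper is not given here.
SAMPLE_STOPWORDS = ["का", "की", "के", "है", "और", "में", "से"]


def build_dfa(stopwords):
    """Compile a word list into a trie-shaped DFA.

    States are integers and state 0 is the start state.
    transitions[state][char] gives the next state.
    """
    transitions = [{}]
    accepting = set()
    for word in stopwords:
        state = 0
        for ch in word:
            if ch not in transitions[state]:
                transitions.append({})
                transitions[state][ch] = len(transitions) - 1
            state = transitions[state][ch]
        accepting.add(state)  # the state reached at the end of a stopword
    return transitions, accepting


def is_stopword(token, transitions, accepting):
    """Run the token through the DFA, one character per transition."""
    state = 0
    for ch in token:
        nxt = transitions[state].get(ch)
        if nxt is None:
            return False  # no transition defined: implicit dead state
        state = nxt
    return state in accepting


def remove_stopwords(text, transitions, accepting):
    """Assumption: tokens are separated by whitespace."""
    tokens = text.split()
    kept = [t for t in tokens if not is_stopword(t, transitions, accepting)]
    return " ".join(kept)


if __name__ == "__main__":
    dfa, final_states = build_dfa(SAMPLE_STOPWORDS)
    print(remove_stopwords("राम और श्याम घर में है", dfa, final_states))
```

In this sketch each token is checked in time proportional to its length, independent of the number of stopwords, which is the kind of saving over repeated list searching that the abstract alludes to. Real Hindi text would likely also need Unicode normalization and punctuation handling, which are omitted here.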
