322 research outputs found
Stemmer for Serbian language
In linguistic morphology and information retrieval, stemming is the process
of reducing inflected (or sometimes derived) words to their stem, base or root
form; generally a written word form. This work presents a suffix-stripping
stemmer for Serbian, a highly inflectional language. Comment: 16 pages, 8 figures, code included
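As a rough illustration of what suffix stripping means in general, the following minimal sketch removes the longest matching suffix from a word while keeping a minimum stem length. The suffix list, the `MIN_STEM_LEN` threshold and the function name `strip_suffix` are assumptions made for this example; they use English-like suffixes and are not the rules of the Serbian stemmer described above.

```python
# Hypothetical, English-like suffix list used only for illustration.
# A real stemmer for a highly inflectional language needs a far larger rule set.
SUFFIXES = ["ations", "ation", "ing", "ers", "er", "es", "ed", "s"]

# Assumed lower bound on stem length, so short words are not over-stripped.
MIN_STEM_LEN = 3


def strip_suffix(word: str) -> str:
    """Return the word with its longest matching suffix removed."""
    word = word.lower()
    # Try longer suffixes first so "ers" wins over "s".
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    for w in ["stemming", "stemmers", "Words", "bus"]:
        print(w, "->", strip_suffix(w))
```

Trying the longest suffix first is a common design choice in suffix strippers, since it avoids leaving part of a longer ending attached to the stem.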
A stemming algorithm for Latvian
The thesis covers construction, application and evaluation of a stemming algorithm for
advanced information searching and retrieval in Latvian databases. Its aim is to examine
the following two questions:
Is it possible to apply to Latvian a suffix removal algorithm originally designed
for English?
Can stemming in Latvian produce the same or better information retrieval results
than manual truncation?
To address these questions, the thesis characterises the role and importance of automatic word conflation
for both document indexing and information retrieval. A review of the
literature, which analyses and evaluates different types of stemming techniques and
the development of stemming algorithms over time, justifies applying this
advanced IR method to Latvian as well. A comparative analysis of the morphological structure
of English and Latvian led to the selection of Porter's suffix
removal algorithm as the basis for the Latvian stemmer.
An extensive list of Latvian stopwords, including conjunctions, particles and adverbs,
was compiled and added to the initial stemmer in order to eliminate insignificant words
from further processing. A number of modifications specific to the
Latvian language were made to the structure and rules of the original stemming
algorithm.
Analysis of word stemming based on a Latvian electronic dictionary and Latvian text
fragments confirmed that the suffix removal technique can also be applied successfully
to Latvian. An evaluation study of user search statements revealed that the
stemming algorithm can, to a certain extent, improve the effectiveness of information
retrieval.
A Performance Evaluation of Classifiers Employ Language Dependent Tools for Indonesian Text
This paper evaluates the performance of Maximum
Entropy (MaxEnt), Support Vector Machine (SVM) and Naïve
Bayes (NB) techniques for Indonesian text classification. The performance
of the MaxEnt and SVM techniques is compared against
the baseline NB technique. We also investigate the effect that language-dependent
tools such as Indonesian stemming and stop word
removal have on the text classification performance of these techniques.
To date, there has been no experimental report on the
effect of an Indonesian stemmer on text classification accuracy.
From our experiments, we conclude that maximum entropy
performs better than the other classifiers in general. Language-dependent
tools such as stemming and stop word removal have
only a small effect on the accuracy of text classification. However, the
stemmed approach scored the highest average accuracy, and the
reduction in the dimension of the feature vectors used in classification
makes this approach a viable step in the pre-processing stage.
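The dimension reduction mentioned above can be made concrete with a small sketch that counts distinct bag-of-words features before and after stop word removal and stemming. Everything here is an assumption for illustration: the toy English documents, the stop word set, and the deliberately crude `toy_stem` function stand in for the Indonesian tools used in the paper, which are not specified in the abstract.

```python
def vocabulary(docs, stop_words=frozenset(), stem=lambda token: token):
    """Collect the set of distinct features (tokens) over all documents."""
    vocab = set()
    for doc in docs:
        for token in doc.lower().split():
            if token in stop_words:
                continue  # dropped features never enter the vector space
            vocab.add(stem(token))
    return vocab


def toy_stem(token):
    """Hypothetical stand-in stemmer: drops a trailing 's' from longer tokens."""
    if token.endswith("s") and len(token) > 3:
        return token[:-1]
    return token


# Toy corpus and stop word list, assumed for this example only.
DOCS = ["the cat chases the cats", "a dog chased the dogs"]
STOP_WORDS = frozenset({"the", "a"})

raw = vocabulary(DOCS)
reduced = vocabulary(DOCS, stop_words=STOP_WORDS, stem=toy_stem)

if __name__ == "__main__":
    print(len(raw), "features before,", len(reduced), "features after")
```

Each distinct token is one dimension of the feature vector, so merging variants and dropping stop words shrinks the vector directly.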
Viewing morphology as an inference process
Morphology is the area of linguistics concerned with the internal structure of words. Information retrieval has generally not paid much attention to word structure, other than to account for some of the variability in word forms via the use of stemmers. We report on our experiments to determine the importance of morphology, and the effect that it has on performance. We found that grouping morphological variants makes a significant improvement in retrieval performance. Improvements are seen by grouping inflectional as well as derivational variants. We also found that performance was enhanced by recognizing lexical phrases. We describe the interaction between morphology and lexical ambiguity, and how resolving that ambiguity will lead to further improvements in performance.
Arabic stemmers and their effectiveness on the information retrieval system
Arabic is a Semitic language with a complex morphology, so using a stemming algorithm in an information retrieval system is almost always beneficial. An Arabic stemmer has been implemented and included in the information retrieval system developed at the Information Science Research Institute at the University of Nevada, Las Vegas. The stemmer is written in the Ruby language; it removes affixes and then matches the remaining word against patterns of the same length. The retrieval experiment uses the TREC collection, which consists of over a million documents. We will test the effectiveness of the Arabic stemmer using recall/precision measurements and compare the results with those of other stemmers.
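For readers unfamiliar with the recall/precision measurement mentioned above, here is a minimal sketch of the standard set-based definitions for a single query. The function name `precision_recall` and the document identifiers are assumptions for illustration; the abstract does not describe the evaluation code actually used.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant documents retrieved / all documents retrieved
    recall    = relevant documents retrieved / all relevant documents
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical document identifiers.
    p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
    print(f"precision={p:.2f} recall={r:.2f}")
```

Comparing two stemmers then amounts to running the same queries with each and comparing these figures, usually averaged over many queries.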
A light weight stemmer for Bengali and its use in spelling checker
Includes bibliographical references (page 6). Stemming is an operation that splits a word into the constituent root part and affix without doing complete morphological analysis. It is used to improve the performance of spelling checkers and information retrieval applications, where morphological analysis would be too computationally expensive. For spelling checkers specifically, using stemming may drastically reduce the dictionary size, often a bottleneck for mobile and embedded devices. This paper presents a computationally inexpensive stemming algorithm for Bengali, which handles suffix removal in a domain-independent way. The evaluation of the proposed algorithm in a Bengali spelling checker indicates that it can be effectively used in information retrieval applications in general. Md. Zahurul Islam, Md. Nizam Uddin, Mumit Kha
How effective is stemming and decompounding for German text retrieval?
Acquired under the Swiss National Licences (http://www.nationallizenzen.ch
- …