164 research outputs found
Viewing morphology as an inference process
Abstract: Morphology is the area of linguistics concerned with the internal structure of words. Information retrieval has generally not paid much attention to word structure, other than to account for some of the variability in word forms via the use of stemmers. We report on our experiments to determine the importance of morphology, and the effect that it has on performance. We found that grouping morphological variants makes a significant improvement in retrieval performance. Improvements are seen by grouping inflectional as well as derivational variants. We also found that performance was enhanced by recognizing lexical phrases. We describe the interaction between morphology and lexical ambiguity, and how resolving that ambiguity will lead to further improvements in performance.
Morphological variation of Arabic queries
Although it has been shown in test collection based studies that stemming improves retrieval effectiveness in an information retrieval system, morphological variation across queries searching on the same topic is less well understood. This work examines the broad morphological variation that searchers of an Arabic retrieval system put into their queries. In this study, 15 native Arabic speakers were asked to generate queries, and the morphological variants of query words were collated across users. Queries composed of either the commonest or rarest variants of each word were submitted to a retrieval system, and the effectiveness of the searches was measured. It was found that queries composed of the more popular morphological variants were more likely to retrieve relevant documents than those composed of less popular variants.
The Porter stemming algorithm: then and now
Purpose: In 1980, Porter presented a simple algorithm for stemming English language words. This paper summarises the main features of the algorithm, and highlights its role not just in modern information retrieval research, but also in a range of related subject domains.
Design: Review of literature and research involving use of the Porter algorithm.
Findings: The algorithm has been widely adopted and extended so that it has become the standard approach to word conflation for information retrieval in a wide range of languages.
Value: The 1980 paper in Program by Porter describing his algorithm has been highly cited. This paper provides a context for the original paper as well as an overview of its subsequent use
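As a brief illustration of the kind of suffix-stripping rules Porter's algorithm applies, the following Python sketch implements only Step 1a of the 1980 algorithm (the plural-suffix rules); the full algorithm has several further steps governed by a measure-based condition system, so this is a fragment, not a complete stemmer.

```python
def porter_step1a(word):
    """Step 1a of the Porter stemmer: handle plural-style suffixes.

    Rules (longest match first):
      SSES -> SS, IES -> I, SS -> SS (unchanged), S -> (removed)
    """
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word
```

Note that Step 1a maps "ponies" to "poni" rather than "pony"; later steps and the shared-stem grouping make such non-word stems harmless for retrieval, since conflation only requires variants to map to the same string.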
A finite-state approach to Arabic broken noun morphology
In this paper, a finite-state computational approach to Arabic broken plural noun morphology is introduced. The paper considers the derivational aspect of the approach, and how generalizations about dependencies in the broken plural noun derivational system of Arabic are captured and handled computationally in this finite-state approach. The approach will be implemented using the Xerox finite-state tool.
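The abstract does not give the rule set, but the root-and-pattern interdigitation that underlies broken plurals can be sketched (outside any finite-state toolkit) as filling the consonants of a root into a vocalic template. The `apply_pattern` helper and the templates below are illustrative assumptions in transliteration, not the paper's actual xfst rules.

```python
def apply_pattern(root, pattern):
    """Interdigitate a consonantal root into a CV template.

    'C' slots in the pattern are filled left-to-right with the
    root consonants; other characters (vowels) are copied verbatim.
    """
    consonants = iter(root)
    return "".join(next(consonants) if ch == "C" else ch for ch in pattern)

# Root k-t-b ("write"): singular vs. broken plural templates (transliterated)
singular = apply_pattern("ktb", "CiCaaC")  # kitaab "book"
plural = apply_pattern("ktb", "CuCuC")     # kutub "books"
```

A finite-state implementation expresses the same mapping as a transducer composition rather than a procedure, which is what lets the dependencies between singular and plural templates be captured declaratively.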
NewsMe: A case study for adaptive news systems with open user model
Adaptive news systems have become important in recent years, and a lot of work has been put into developing their adaptation processes. We describe here an adaptive news system application, which uses an open user model and allows users to manipulate their interest profiles. We also present a study of the system. Our results showed that user profile manipulation should be used with caution. © 2007 IEEE
DCU@FIRE2010: term conflation, blind relevance feedback, and cross-language IR with manual and automatic query translation
For the first participation of Dublin City University (DCU) in the FIRE 2010 evaluation campaign, information retrieval (IR) experiments on English, Bengali, Hindi, and Marathi documents were performed to investigate term conflation (different stemming approaches and indexing word prefixes), blind relevance feedback, and manual and automatic query translation. The experiments are based on BM25 and on language modeling (LM) for IR. Results show that term conflation always improves mean average precision (MAP) compared to indexing unprocessed word forms, but different approaches seem to work best for different languages. For example, in monolingual Marathi experiments indexing 5-prefixes outperforms our corpus-based stemmer, while in Hindi the corpus-based stemmer achieves a higher MAP. For Bengali, the LM retrieval model achieves a much higher MAP than BM25 (0.4944 vs. 0.4526). In all experiments using BM25, blind relevance feedback yields considerably higher MAP in comparison to experiments without it. Bilingual IR experiments (English→Bengali and English→Hindi) are based on query translations obtained from native speakers and the Google Translate web service. For the automatically translated queries, MAP is slightly (but not significantly) lower compared to experiments with manual query translations. The bilingual English→Bengali (English→Hindi) experiments achieve 81.7%-83.3% (78.0%-80.6%) of the best corresponding monolingual experiments.
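The BM25 function used in these runs is the standard Okapi formulation; a minimal sketch of the per-term BM25 score, with the commonly used defaults k1 = 1.2 and b = 0.75, is shown below. The parameter values are illustrative textbook defaults, not DCU's tuned settings.

```python
import math

def bm25_term_score(tf, df, N, dl, avgdl, k1=1.2, b=0.75):
    """Okapi BM25 contribution of one query term to one document.

    tf: term frequency in the document, df: document frequency,
    N: number of documents, dl/avgdl: document length / average length.
    """
    idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
    tf_norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
    return idf * tf_norm
```

The document score is the sum of these per-term contributions over the query terms; term conflation (stemming or prefix indexing) changes which surface forms count toward the same `tf`, which is why it affects MAP.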
Query Expansion with Locally-Trained Word Embeddings
Continuous space word embeddings have received a great deal of attention in the natural language processing and machine learning communities for their ability to model term similarity and other relationships. We study the use of term relatedness in the context of query expansion for ad hoc information retrieval. We demonstrate that word embeddings such as word2vec and GloVe, when trained globally, underperform corpus- and query-specific embeddings for retrieval tasks. These results suggest that other tasks benefiting from global embeddings may also benefit from local embeddings.
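A minimal sketch of the expansion step, assuming embeddings have already been trained (whether globally or on query-local documents): each query term is expanded with its nearest neighbours by cosine similarity. The toy vectors below are invented purely for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def expand_term(term, embeddings, k=2):
    """Return the k terms nearest to `term` in embedding space."""
    sims = sorted(
        ((cosine(embeddings[term], vec), t)
         for t, vec in embeddings.items() if t != term),
        reverse=True,
    )
    return [t for _, t in sims[:k]]

# Toy, hand-made vectors for illustration only
toy = {
    "car":   (0.9, 0.1, 0.0),
    "auto":  (0.8, 0.2, 0.0),
    "truck": (0.7, 0.3, 0.1),
    "fruit": (0.0, 0.1, 0.9),
}
```

The paper's point is that the embedding matrix fed into a step like this matters: neighbours from a query-specific corpus tend to be topically tighter than neighbours from a single global model.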
DCU@FIRE-2012: rule-based stemmers for Bengali and Hindi
For the participation of Dublin City University (DCU) in the FIRE-2012 Morpheme Extraction Task (MET), we investigated rule-based stemming approaches for Bengali and Hindi IR. The MET task itself is an attempt to obtain a fair and direct comparison between various stemming approaches, measured by comparing the retrieval effectiveness obtained by each on the same dataset. Linguistic knowledge was used to manually craft the rules for removing the commonly occurring plural suffixes for Hindi and Bengali. Additionally, rules for removing classifiers and case markers in Bengali were also formulated. Our rule-based stemming approaches produced the best and the second-best retrieval effectiveness for the Hindi and Bengali datasets respectively.
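The suffix lists themselves are not given in the abstract, but the mechanics of such a rule-based stemmer can be sketched as longest-match suffix stripping with a minimum-stem-length guard. The transliterated Bengali suffixes below (plural and case markers such as -der, -ra) are illustrative examples only, not DCU's actual rule set.

```python
# Illustrative transliterated Bengali suffixes (NOT the paper's rule set):
# -der (genitive plural), -ra (nominative plural), -ke (objective), -ta (classifier)
SUFFIXES = ["der", "ra", "ke", "ta"]

def strip_suffix(word, suffixes, min_stem=2):
    """Remove the longest matching suffix, keeping at least min_stem characters."""
    for s in sorted(suffixes, key=len, reverse=True):
        if word.endswith(s) and len(word) - len(s) >= min_stem:
            return word[:-len(s)]
    return word
```

The minimum-stem guard is what keeps short function words from being truncated to nothing, a standard safeguard in hand-crafted suffix strippers.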