DCU@FIRE2010: term conflation, blind relevance feedback, and cross-language IR with manual and automatic query translation
For the first participation of Dublin City University (DCU)
in the FIRE 2010 evaluation campaign, information retrieval
(IR) experiments on English, Bengali, Hindi, and Marathi
documents were performed to investigate term conflation
(different stemming approaches and indexing word prefixes),
blind relevance feedback, and manual and automatic query
translation. The experiments are based on BM25 and on
language modeling (LM) for IR. Results show that term conflation always improves mean average precision (MAP)
compared to indexing unprocessed word forms, but different approaches seem to work best for different languages. For example, in monolingual Marathi experiments, indexing 5-prefixes outperforms our corpus-based stemmer; in Hindi,
the corpus-based stemmer achieves a higher MAP. For Bengali, the LM retrieval model achieves a much higher MAP
than BM25 (0.4944 vs. 0.4526). In all experiments using
BM25, blind relevance feedback yields considerably higher
MAP in comparison to experiments without it. Bilingual IR experiments (English→Bengali and English→Hindi) are
based on query translations obtained from native speakers
and the Google Translate web service. For the automatically
translated queries, MAP is slightly (but not significantly)
lower compared to experiments with manual query translations. The bilingual English→Bengali (English→Hindi)
experiments achieve 81.7%-83.3% (78.0%-80.6%) of the MAP of the best
corresponding monolingual experiments
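The blind relevance feedback step mentioned in this abstract can be illustrated with a minimal Python sketch. The abstract does not say how feedback terms are scored, so everything here is an assumption: the function name `expand_query`, the weighting (number of feedback documents containing a term times a smoothed IDF), and the default sizes of 10 documents and 20 terms are illustrative only and are not the authors' method.

```python
import math
from collections import Counter


def expand_query(query_terms, ranked_docs, collection_size, doc_freq,
                 num_docs=10, num_terms=20):
    """Blind relevance feedback sketch (hypothetical helper).

    ranked_docs: list of token lists, best-ranked first, from an initial run.
    doc_freq: term -> number of collection documents containing the term.
    The top num_docs documents are assumed relevant without any judgement.
    """
    feedback = ranked_docs[:num_docs]

    # Count in how many feedback documents each term occurs.
    fb_df = Counter()
    for doc in feedback:
        fb_df.update(set(doc))

    def weight(term):
        # Assumed scoring: feedback document count times a smoothed IDF.
        idf = math.log((collection_size + 1) / (doc_freq.get(term, 0) + 1))
        return fb_df[term] * idf

    # Rank candidate terms that are not already in the query.
    candidates = [t for t in fb_df if t not in query_terms]
    candidates.sort(key=lambda t: (-weight(t), t))
    return list(query_terms) + candidates[:num_terms]
```

The expanded query would then be run through the retrieval model a second time; the sketch leaves out the initial and final retrieval passes.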
Sub-word indexing and blind relevance feedback for English, Bengali, Hindi, and Marathi IR
The Forum for Information Retrieval Evaluation (FIRE) provides document collections, topics, and relevance assessments for information retrieval (IR) experiments on Indian languages. Several research questions are explored in this paper: 1. how to create a simple, language-independent corpus-based stemmer, 2. how to identify sub-words and which types of sub-words are suitable as indexing units, and 3. how to apply blind relevance feedback on sub-words and how feedback term selection is affected by the type of the indexing unit. More than 140 IR experiments are conducted using the BM25 retrieval model on the topic titles and descriptions (TD) for the FIRE 2008 English, Bengali, Hindi, and Marathi document collections. The major findings are: The corpus-based stemming approach is effective as a knowledge-light term conflation step and useful when few language-specific resources are available. For English, the corpus-based stemmer performs nearly as well as the Porter stemmer and significantly better than the baseline of indexing words when combined with query expansion. In combination with blind relevance feedback, it also performs significantly better than the baseline for Bengali and Marathi IR. Sub-words such as consonant-vowel sequences and word prefixes can yield similar or better performance in comparison to word indexing. There is no best performing method for all languages. For English, indexing using the Porter stemmer performs best; for Bengali and Marathi, overlapping 3-grams obtain the best result; and for Hindi, 4-prefixes yield the highest MAP. However, in combination with blind relevance feedback using 10 documents and 20 terms, 6-prefixes for English and 4-prefixes for Bengali, Hindi, and Marathi IR yield the highest MAP. Sub-word identification is a general case of decompounding. It results in one or more index terms for a single word form and increases the number of index terms but decreases their average length. The corresponding retrieval experiments show that relevance feedback on sub-words benefits from selecting a larger number of index terms in comparison with retrieval on word forms. Similarly, selecting the number of relevance feedback terms depending on the ratio of word vocabulary size to sub-word vocabulary size almost always slightly increases information retrieval effectiveness compared to using a fixed number of terms for different languages
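Two of the sub-word indexing units named in this abstract, k-prefixes and overlapping n-grams, can be sketched in a few lines of Python. This is a hedged illustration rather than the paper's implementation: the function names are hypothetical, short words are kept whole by assumption, and Python slices by Unicode code point, which may not match the character unit used for Bengali, Hindi, or Marathi text (dependent vowel signs are separate code points).

```python
def k_prefix(word, k):
    """Return the first k characters of a word as its single index term."""
    return word[:k]


def overlapping_ngrams(word, n):
    """Return all overlapping character n-grams of a word.

    Words no longer than n are kept whole (an assumption of this sketch).
    """
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def index_terms(tokens, unit="prefix", size=4):
    """Map a token list to index terms for the chosen sub-word unit."""
    terms = []
    for token in tokens:
        if unit == "prefix":
            terms.append(k_prefix(token, size))
        else:
            terms.extend(overlapping_ngrams(token, size))
    return terms
```

With `unit="prefix"` each word yields exactly one index term, whereas the n-gram setting yields several shorter terms per word, which matches the abstract's remark that sub-word identification increases the number of index terms while decreasing their average length.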
Thesaurus-based index term extraction for agricultural documents
This paper describes a new algorithm for automatically extracting index terms from documents relating to the domain of agriculture. The domain-specific Agrovoc thesaurus developed by the FAO is used both as a controlled vocabulary and as a knowledge base for semantic matching. The automatically assigned terms are evaluated against a manually indexed 200-item sample of the FAO’s document repository, and the performance of the new algorithm is compared with a state-of-the-art system for keyphrase extraction
How effective is stemming and decompounding for German text retrieval?
Acquired under the Swiss National Licences (http://www.nationallizenzen.ch)
A Case Study of Using Domain Analysis for the Conflation Algorithms Domain
This paper documents the domain engineering process for much
of the conflation algorithms domain. Empirical data on the process and
products of domain engineering were collected. Six conflation
algorithms of four different types (three affix removal, one successor
variety, one table lookup, and one n-gram) were analyzed. Products of
the analysis include a generic architecture, reusable components, a
little language, and an application generator that extends the scope of
the domain analysis beyond previous generators. The application
generator produces source code not only for affix removal stemmers but
also for successor variety, table lookup, and n-gram stemmers. The
performance of the stemmers generated automatically was compared with
the stemmers developed manually in terms of stem similarity, source
and executable sizes, and development and execution times. All five
stemmers generated by the application generator produced more than
99.9% identical stems with the manually developed stemmers. Some of
the generated stemmers were as efficient as their manual equivalents
and some were not
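As an illustration of the n-gram type of conflation algorithm listed in this abstract, the sketch below computes the Dice coefficient over the unique character bigrams of two words, a measure commonly used for n-gram conflation. The abstract does not describe the n-gram stemmer that was analyzed, so the choice of bigrams, the Dice measure, and the function names are assumptions.

```python
def bigrams(word):
    """Return the set of unique adjacent character pairs in a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def dice(a, b):
    """Dice coefficient over the unique bigrams of two words."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        # Words shorter than two characters have no bigrams.
        return 1.0 if a == b else 0.0
    return 2 * len(x & y) / (len(x) + len(y))
```

Words whose similarity exceeds a chosen threshold would be placed in the same conflation class; the threshold and the clustering step are left out of this sketch.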
From Word to Sense Embeddings: A Survey on Vector Representations of Meaning
Over the past years, distributed semantic representations have proved to be
effective and flexible keepers of prior knowledge to be integrated into
downstream applications. This survey focuses on the representation of meaning.
We start from the theoretical background behind word vector space models and
highlight one of their major limitations: the meaning conflation deficiency,
which arises from representing a word with all its possible meanings as a
single vector. Then, we explain how this deficiency can be addressed through a
transition from the word level to the more fine-grained level of word senses
(in its broader acceptation) as a method for modelling unambiguous lexical
meaning. We present a comprehensive overview of the wide range of techniques in
the two main branches of sense representation, i.e., unsupervised and
knowledge-based. Finally, this survey covers the main evaluation procedures and
applications for this type of representation, and provides an analysis of four
of its important aspects: interpretability, sense granularity, adaptability to
different domains and compositionality. Comment: 46 pages, 8 figures. Published in Journal of Artificial Intelligence
Researc
Context-Aware Stemming algorithm for semantically related root words
There is a growing interest in the use of context-awareness as a technique for developing pervasive computing applications that are
flexible and adaptable for users. In this context, however, information retrieval (IR) is often defined in terms of location and delivery
of documents to a user to satisfy their information need. In most cases, morphological variants of words have similar semantic
interpretations and can be considered as equivalent for the purpose of IR applications. Consequently, document indexing will also be
more meaningful if semantically related root words are used instead of stems. The popular Porter’s stemmer was studied with the aim
to produce intelligible stems. In this paper, we propose the Context-Aware Stemming (CAS) algorithm, which is a modified version of
the extensively used Porter’s stemmer. Counting only meaningful generated stems as valid stemmer output, the results
show that the modified algorithm significantly reduces the error rate of Porter’s algorithm from 76.7% to 6.7% without compromising
the efficacy of Porter’s algorithm
An Overview of Context Capturing Techniques in NLP
In NLP, context identification has become a prominent way to overcome syntactic and semantic ambiguities. Ambiguities remain an unsolved problem but can be reduced to a certain level. This ambiguity reduction helps to improve the quality of several NLP processes, such as text translation, text simplification, text retrieval, and word sense disambiguation. Context identification, also known as contextualization, takes place in the preprocessing phase of NLP processes. The essence of this identification is to uniquely represent a word or a phrase to improve the decision-making during the transfer phase of the NLP processes. The improved decision-making helps to improve the quality of the output. This paper tries to provide an overview of different context-capturing mechanisms used in NLP