301 research outputs found
Sub-word indexing and blind relevance feedback for English, Bengali, Hindi, and Marathi IR
The Forum for Information Retrieval Evaluation (FIRE) provides document collections, topics, and relevance assessments for information retrieval (IR) experiments on Indian languages. This paper explores several research questions: 1. how to create a simple, language-independent corpus-based stemmer, 2. how to identify sub-words and which types of sub-words are suitable as indexing units, and 3. how to apply blind relevance feedback on sub-words and how feedback term selection is affected by the type of indexing unit. More than 140 IR experiments are conducted using the BM25 retrieval model on the topic titles and descriptions (TD) for the FIRE 2008 English, Bengali, Hindi, and Marathi document collections. The major findings are as follows. The corpus-based stemming approach is effective as a knowledge-light term conflation step and is useful when few language-specific resources are available. For English, the corpus-based stemmer performs nearly as well as the Porter stemmer and significantly better than the baseline of indexing words when combined with query expansion. In combination with blind relevance feedback, it also performs significantly better than the baseline for Bengali and Marathi IR.
Sub-words such as consonant-vowel sequences and word prefixes can yield similar or better performance in comparison to word indexing. No single method performs best for all languages. For English, indexing using the Porter stemmer performs best; for Bengali and Marathi, overlapping 3-grams obtain the best result; and for Hindi, 4-prefixes yield the highest MAP. However, in combination with blind relevance feedback using 10 documents and 20 terms, 6-prefixes for English and 4-prefixes for Bengali, Hindi, and Marathi IR yield the highest MAP. Sub-word identification is a general case of decompounding. It results in one or more index terms for a single word form, which increases the number of index terms but decreases their average length. The corresponding retrieval experiments show that relevance feedback on sub-words benefits from selecting a larger number of index terms in comparison with retrieval on word forms. Similarly, selecting the number of relevance feedback terms depending on the ratio of word vocabulary size to sub-word vocabulary size almost always slightly increases information retrieval effectiveness compared to using a fixed number of terms for different languages.
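As a rough illustration of two of the sub-word indexing units named in the abstract above (overlapping character n-grams and fixed-length word prefixes), the following minimal Python sketch may help. The function names, the whitespace tokenization, and the handling of words shorter than the unit length are assumptions made for this example; they are not taken from the paper, which does not spell out these details here.

```python
def overlapping_ngrams(word, n=3):
    """Return the overlapping character n-grams of a word.

    Assumption: words no longer than n are kept whole as a single unit.
    """
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def k_prefix(word, k=4):
    """Return the first k characters of a word (the whole word if shorter)."""
    return word[:k]


def index_terms(text, unit="3gram"):
    """Map a text to index terms using one hypothetical sub-word unit.

    Tokenization is a plain whitespace split (an assumption). Note that
    Python slices strings by code point, so for Indic scripts a unit may
    cut through a consonant-vowel cluster; a real system would need
    script-aware segmentation.
    """
    terms = []
    for word in text.lower().split():
        if unit == "3gram":
            terms.extend(overlapping_ngrams(word, 3))
        elif unit == "4prefix":
            terms.append(k_prefix(word, 4))
        else:
            terms.append(word)  # fall back to plain word indexing
    return terms
```

With these definitions, a single word form produces several short index terms under 3-gram indexing and exactly one truncated term under prefix indexing, which matches the abstract's remark that sub-word identification increases the number of index terms while decreasing their average length.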
Overview of Digital Library Components and Developments
Digital libraries are being built upon a firm foundation of prior work as the high-end information systems of the future. A component architecture approach is becoming popular, with well-established support for key components like the repository, especially through the Open Archives Initiative. We consider digital objects, metadata, harvesting, indexing, searching, browsing, rights management, linking, and powerful interfaces. Flexible interaction will be possible through a variety of architectures, using buses, agents, and other technologies. The field as a whole is undergoing rapid growth, supported by advances in storage, processing, networking, algorithms, and interaction. There are many initiatives and developments, including those supporting education, and these will certainly be of benefit in Latin America.
Context-Aware Stemming algorithm for semantically related root words
There is a growing interest in the use of context-awareness as a technique for developing pervasive computing applications that are
flexible and adaptable for users. In this context, however, information retrieval (IR) is often defined in terms of location and delivery
of documents to a user to satisfy their information need. In most cases, morphological variants of words have similar semantic
interpretations and can be considered equivalent for the purpose of IR applications. Consequently, document indexing will also be
more meaningful if semantically related root words are used instead of stems. The popular Porter's stemmer was studied with the aim
of producing intelligible stems. In this paper, we propose the Context-Aware Stemming (CAS) algorithm, which is a modified version of
the extensively used Porter's stemmer. Considering only generated meaningful stemming words as the stemmer output, the results
show that the modified algorithm significantly reduces the error rate of Porter's algorithm from 76.7% to 6.7% without compromising
the efficacy of Porter's algorithm.
Foundation, Implementation and Evaluation of the MorphoSaurus System: Subword Indexing, Lexical Learning and Word Sense Disambiguation for Medical Cross-Language Information Retrieval
In everyday medical practice, which involves a great deal of documentation and search work, the majority of textually encoded information is now available electronically. The development of powerful methods for efficient retrieval is therefore of primary importance.
Judged from the perspective of medical sublanguage, common text retrieval systems lack morphological functionality (inflection, derivation, and compounding), lexical-semantic functionality, and the ability to analyze large document collections across languages.
This doctoral thesis presents the theoretical foundations of the MorphoSaurus system (an acronym for morpheme thesaurus). Its methodological core is a thesaurus organized around morphemes of medical expert and lay language, whose entries are linked across languages by semantic relations. Building on this, a procedure is presented that segments (complex) words into morphemes, which are then replaced by language-independent, concept-class-like symbols. The resulting representation is the basis for cross-language, morpheme-oriented text retrieval.
In addition to the core technology, a method for the automatic acquisition of lexicon entries is presented, by which existing morpheme lexicons are extended to further languages. Taking cross-language phenomena into account then leads to a novel method for resolving semantic ambiguities.
The performance of morpheme-oriented text retrieval is tested empirically in extensive, standardized evaluations and compared with common approaches.
Adaptive Visualization for Focused Personalized Information Retrieval
The new trend on the Web has totally changed today's information access environment. The traditional information overload problem has evolved beyond quantitative growth into a qualitative one. The mode of producing and consuming information is changing, and we need a new paradigm for accessing information. Personalized search is one of the most promising answers to this problem. However, it still follows the old interaction model and representation method of classic information retrieval approaches. This limitation can harm the potential of personalized search, with which users are intended to interact with the system, learn and investigate the problem, and collaborate with the system to reach the final goal. This dissertation proposes to incorporate interactive visualization into personalized search in order to overcome the limitation. By combining personalized search and interactive visualization, we expect our approach to help users better explore the information space and locate relevant information more efficiently. We extended a well-known visualization framework called VIBE (Visual Information Browsing Environment) and implemented Adaptive VIBE, so that it can fit into the personalized search environment. We tested the effectiveness of this adaptive visualization method and investigated its strengths and weaknesses by conducting a full-scale user study. We also tried to enrich the user models with named entities, considering the possibility that traditional keyword-based user models could harm the effectiveness of the system in the context of interactive information retrieval. The results of the user study showed that Adaptive VIBE could improve the precision of the personalized search system and could help users find a more diverse set of information.
The named-entity-based user model integrated into Adaptive VIBE improved the precision of user annotations while maintaining the level of diverse information discovery.
Surfing the modeling of POS taggers in low-resource scenarios
The recent trend toward the application of deep structured techniques has revealed the limits of huge models in natural language processing. This has reawakened the interest in traditional machine learning algorithms, which have proved still to be competitive in certain contexts, particularly in low-resource settings. In parallel, model selection has become an essential task to boost performance at reasonable cost, even more so when we talk about processes involving domains where the training and/or computational resources are scarce. Against this backdrop, we evaluate the early estimation of learning curves as a practical mechanism for selecting the most appropriate model in scenarios characterized by the use of non-deep learners in resource-lean settings. On the basis of a formal approximation model previously evaluated under conditions of wide availability of training and validation resources, we study the reliability of such an approach in a different and much more demanding operational environment. Using as a case study the generation of POS taggers for Galician, a language belonging to the Western Ibero-Romance group, the experimental results are consistent with our expectations. Ministerio de Ciencia e Innovación | Ref. PID2020-113230RB-C21. Ministerio de Ciencia e Innovación | Ref. PID2020-113230RB-C22. Xunta de Galicia | Ref. ED431C 2020/1
Spatial and temporal-based query disambiguation for improving web search
Queries submitted to search engines are ambiguous in nature due to users' irrelevant input, which poses real challenges to web search engines both in understanding a query and in returning results. A lot of irrelevant and ambiguous information creates disappointment among users. Thus, this research proposes an ambiguity evolvement process followed by an integrated use of spatial and temporal features to alleviate the imprecision of search results. To enhance the effectiveness of web information retrieval, the study develops an enhanced Adaptive Disambiguation Approach for web search queries to overcome the problems caused by ambiguous queries. A query classification method was used to filter search results to overcome the imprecision. An algorithm was utilized for finding the similarity of the search results based on spatial and temporal features. Users' selections from the web results facilitated the recording of implicit feedback, which was then utilized for web search improvement. Performance evaluation was conducted on the data sets GISQC_DS, AMBIENT and MORESQUE, comprising ambiguous queries, to confirm the effectiveness of the proposed approach in comparison to a well-known temporal evaluation and two-box search methods. The implemented prototype focuses on ambiguous queries to be classified by spatial or temporal features. Spatial queries target location information, whereas temporal queries target time in years. In conclusion, the study used search results in the context of Spatial Information Retrieval (S-IR) along with temporal information. Experimental results show that the use of spatial and temporal features in combination can significantly improve performance in terms of precision (92%), accuracy (93%), recall (95%), and f-measure (93%). Moreover, the use of implicit feedback has a significant impact on the search results, which has been demonstrated through experimental evaluation. SHAHID KAMA
Mining Meaning from Wikipedia
Wikipedia is a goldmine of information; not just for its many readers, but
also for the growing community of researchers who recognize it as a resource of
exceptional scale and utility. It represents a vast investment of manual effort
and judgment: a huge, constantly evolving tapestry of concepts and relations
that is being applied to a host of tasks.
This article provides a comprehensive description of this work. It focuses on
research that extracts and makes use of the concepts, relations, facts and
descriptions found in Wikipedia, and organizes the work into four broad
categories: applying Wikipedia to natural language processing; using it to
facilitate information retrieval and information extraction; and as a resource
for ontology building. The article addresses how Wikipedia is being used as is,
how it is being improved and adapted, and how it is being combined with other
structures to create entirely new resources. We identify the research groups
and individuals involved, and how their work has developed in the last few
years. We provide a comprehensive list of the open-source software they have
produced. Comment: An extensive survey of re-using information in Wikipedia in natural
language processing, information retrieval and extraction and ontology
building. Accepted for publication in International Journal of Human-Computer
Studie