180 research outputs found

    Sub-word indexing and blind relevance feedback for English, Bengali, Hindi, and Marathi IR

    The Forum for Information Retrieval Evaluation (FIRE) provides document collections, topics, and relevance assessments for information retrieval (IR) experiments on Indian languages. Several research questions are explored in this paper: 1. how to create a simple, language-independent corpus-based stemmer, 2. how to identify sub-words and which types of sub-words are suitable as indexing units, and 3. how to apply blind relevance feedback on sub-words and how feedback term selection is affected by the type of the indexing unit. More than 140 IR experiments are conducted using the BM25 retrieval model on the topic titles and descriptions (TD) for the FIRE 2008 English, Bengali, Hindi, and Marathi document collections. The major findings are: The corpus-based stemming approach is effective as a knowledge-light term conflation step and useful when few language-specific resources are available. For English, the corpus-based stemmer performs nearly as well as the Porter stemmer and significantly better than the baseline of indexing words when combined with query expansion. In combination with blind relevance feedback, it also performs significantly better than the baseline for Bengali and Marathi IR. Sub-words such as consonant-vowel sequences and word prefixes can yield similar or better performance in comparison to word indexing. There is no best performing method for all languages. For English, indexing using the Porter stemmer performs best; for Bengali and Marathi, overlapping 3-grams obtain the best result; and for Hindi, 4-prefixes yield the highest MAP. However, in combination with blind relevance feedback using 10 documents and 20 terms, 6-prefixes for English and 4-prefixes for Bengali, Hindi, and Marathi IR yield the highest MAP. Sub-word identification is a general case of decompounding. It results in one or more index terms for a single word form and increases the number of index terms but decreases their average length.
The corresponding retrieval experiments show that relevance feedback on sub-words benefits from selecting a larger number of index terms in comparison with retrieval on word forms. Similarly, selecting the number of relevance feedback terms depending on the ratio of word vocabulary size to sub-word vocabulary size almost always slightly increases information retrieval effectiveness compared to using a fixed number of terms for different languages.
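
To make the two sub-word indexing units named in the abstract concrete (overlapping character n-grams and fixed-length word prefixes), here is a minimal Python sketch. It is an illustration only, not the paper's implementation: the function names, whitespace tokenization, lowercasing, and the choice to keep short words whole are all assumptions. Python slices strings by code point, so for Bengali, Hindi, and Marathi a real system would probably need to segment on grapheme clusters instead.

```python
def char_ngrams(word: str, n: int = 3) -> list[str]:
    """Return the overlapping character n-grams of a word.

    Assumption: words no longer than n are kept whole as a single term.
    """
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def k_prefix(word: str, k: int = 4) -> str:
    """Truncate a word to its first k characters (a 'k-prefix')."""
    return word[:k]


def index_terms(text: str, unit: str = "3gram") -> list[str]:
    """Turn text into index terms for a given (hypothetical) unit name."""
    words = text.lower().split()  # naive whitespace tokenization (assumption)
    if unit == "3gram":
        return [gram for w in words for gram in char_ngrams(w, 3)]
    if unit == "4prefix":
        return [k_prefix(w, 4) for w in words]
    return words  # fall back to plain word indexing
```

For example, `char_ngrams("retrieval", 3)` produces seven index terms for one word form, which matches the abstract's remark that sub-word identification increases the number of index terms while decreasing their average length.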

    Incremental clustering of news reports

    When an event occurs in the real world, numerous news reports describing this event start to appear on different news sites within a few minutes of the event occurrence. This may result in a huge amount of information for users, and automated processes may be required to help manage this information. In this paper, we describe a clustering system that can cluster news reports from disparate sources into event-centric clusters, i.e., clusters of news reports describing the same event. A user can identify any RSS feed as a source of news they would like to receive, and our clustering system can cluster reports received from the separate RSS feeds as they arrive without knowing the number of clusters in advance. Our clustering system was designed to function well in an online incremental environment. In evaluating our system, we found that it is very good at fine-grained clustering, but performs rather poorly at coarser-grained clustering. Peer-reviewed.
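
The abstract does not say which algorithm the system uses. As a generic illustration of clustering reports as they arrive without fixing the number of clusters in advance, the following Python sketch implements single-pass threshold clustering over bag-of-words vectors. The class name, the cosine similarity measure, the summed-count centroids, and the threshold value are all assumptions made for this example, not details taken from the paper.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class IncrementalClusterer:
    """Single-pass clustering: each report joins the most similar
    existing cluster, or starts a new one if none is similar enough."""

    def __init__(self, threshold: float = 0.3):
        self.threshold = threshold          # hypothetical value
        self.centroids: list[Counter] = []  # summed term counts per cluster
        self.members: list[list[str]] = []

    def add(self, report: str) -> int:
        """Assign one incoming report and return its cluster index."""
        vec = Counter(report.lower().split())
        best, best_sim = -1, 0.0
        for i, centroid in enumerate(self.centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = i, sim
        if best >= 0 and best_sim >= self.threshold:
            self.centroids[best].update(vec)  # fold report into the cluster
            self.members[best].append(report)
            return best
        self.centroids.append(vec)            # no match: open a new cluster
        self.members.append([report])
        return len(self.centroids) - 1
```

In a scheme like this, the threshold controls granularity: a higher value yields more, tighter clusters, while a lower value merges reports into broader ones.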

    Concept-based Interactive Query Expansion Support Tool (CIQUEST)

    This report describes a three-year project (2000-03) undertaken in the Information Studies Department at The University of Sheffield and funded by Resource, The Council for Museums, Archives and Libraries. The overall aim of the research was to provide user support for query formulation and reformulation in searching large-scale textual resources including those of the World Wide Web. More specifically the objectives were: to investigate and evaluate methods for the automatic generation and organisation of concepts derived from retrieved document sets, based on statistical methods for term weighting; and to conduct user-based evaluations on the understanding, presentation and retrieval effectiveness of concept structures in selecting candidate terms for interactive query expansion. The TREC test collection formed the basis for the seven evaluative experiments conducted in the course of the project. These formed four distinct phases in the project plan. In the first phase, a series of experiments was conducted to investigate further techniques for concept derivation and hierarchical organisation and structure. The second phase was concerned with user-based validation of the concept structures. The results of phases 1 and 2 informed the design of the test system, and the user interface was developed in phase 3. The final phase entailed a user-based summative evaluation of the CiQuest system. The main findings demonstrate that concept hierarchies can effectively be generated from sets of retrieved documents and displayed to searchers in a meaningful way. The approach provides the searcher with an overview of the contents of the retrieved documents, which in turn facilitates the viewing of documents and selection of the most relevant ones. Concept hierarchies are a good source of terms for query expansion and can improve precision. The extraction of descriptive phrases as an alternative source of terms was also effective.
With respect to presentation, cascading menus were easy to browse for selecting terms and for viewing documents. In conclusion, the project dissemination programme and future work are outlined.

    Effects of Using Culture-laden Texts on Culture-specific Translation Skills in Arab Students

    This work explores the possibility of translating culture-laden texts from English into Arabic, relying on a corpus of literary texts representing the local cultures of 16th- and 18th-century England. Furthermore, it taps into the applicability of SL cultural texts to help students appropriately render a culture-specific lexicon. The present study makes use of an experimental research design to check whether the use of literary texts may help EFL learners enhance their culture-specific translation skills in a traditional instruction setting. Finally, its findings indicate that the use of culture-laden literary texts can be efficient in introducing the culturally loaded lexicon of the English language, thereby inducing improvements in the translational writing skills of EFL learners of translation.

    Social impact retrieval: measuring author influence on information retrieval

    The increased presence of technologies collectively referred to as Web 2.0 means that the entire process of new media production and dissemination has moved away from an author-centric approach. Casual web users and browsers are increasingly able to play a more active role in the information creation process. This means that the traditional ways in which information sources may be validated and scored must adapt accordingly. In this thesis we propose a new way in which to look at a user's contributions to the network in which they are present, using these interactions to provide a measure of authority and centrality to the user. This measure is then used to attribute a query-independent interest score to each of the contributions the author makes, enabling us to provide other users with relevant information which has been of greatest interest to a community of like-minded users. This is done through the development of two algorithms: AuthorRank and MessageRank. We present two real-world user experiments which focused on multimedia annotation and browsing systems that we built; these systems were novel in themselves, bringing together video and text browsing, as well as free-text annotation. Using these systems as examples of real-world applications for our approaches, we then look at a larger-scale experiment based on the author and citation networks of a ten-year period of the ACM SIGIR conference on information retrieval between 1997 and 2007. We use the citation context of SIGIR publications as a proxy for annotations, constructing large social networks between authors. Against these networks we show the effectiveness of incorporating user-generated content, or annotations, to improve information retrieval.

    Towards effective cross-lingual search of user-generated internet speech

    The very rapid growth in user-generated social spoken content on online platforms is creating new challenges for Spoken Content Retrieval (SCR) technologies. There are many potential choices for how to design a robust SCR framework for UGS content, but the current lack of detailed investigation means that there is a lack of understanding of the specific challenges, and little or no guidance available to inform these choices. This thesis investigates the challenges of effective SCR for UGS content, and proposes novel SCR methods that are designed to cope with the challenges of UGS content. The work presented in this thesis can be divided into three areas of contribution as follows. The first contribution of this work is critiquing the issues and challenges that influence the effectiveness of searching UGS content in both mono-lingual and cross-lingual settings. The second contribution is to develop an effective Query Expansion (QE) method for UGS. This research reports that the variation in the length, quality and structure of the relevant documents encountered in UGS content can harm the effectiveness of QE techniques across different queries. Seeking to address this issue, this work examines the utilisation of Query Performance Prediction (QPP) techniques for improving QE in UGS, and presents a novel framework specifically designed for predicting the effectiveness of QE. Thirdly, this work extends the utilisation of QPP in UGS search to improve cross-lingual search for UGS by predicting the translation effectiveness. The thesis proposes novel methods to estimate the quality of translation for cross-lingual UGS search. An empirical evaluation demonstrates the quality of the proposed method on alternative translation outputs extracted from several Machine Translation (MT) systems developed for this task.
The research then shows how this framework can be integrated in cross-lingual UGS search to find relevant translations for improved retrieval performance.

    Approximate information filtering in structured peer-to-peer networks

    Today's content providers are naturally distributed and produce large amounts of information every day, making peer-to-peer data management a promising approach offering scalability, adaptivity to dynamics, and failure resilience. In such systems, subscribing with a continuous query is of equal importance as one-time querying since it allows the user to cope with the high rate of information production and avoid the cognitive overload of repeated searches. In the information filtering setting users specify continuous queries, thus subscribing to newly appearing documents satisfying the query conditions. Contrary to existing approaches providing exact information filtering functionality, this doctoral thesis introduces the concept of approximate information filtering, where users subscribe to only a few selected sources most likely to satisfy their information demand. This way, efficiency and scalability are enhanced by trading a small reduction in recall for lower message traffic. This thesis contains the following contributions: (i) the first architecture to support approximate information filtering in structured peer-to-peer networks, (ii) novel strategies to select the most appropriate publishers by taking into account correlations among keywords, (iii) a prototype implementation for approximate information retrieval and filtering, and (iv) a digital library use case to demonstrate the integration of retrieval and filtering in a unified system.
German abstract, in English translation: Today's content providers are distributed and produce huge amounts of data every day. Data management in peer-to-peer networks therefore becomes a promising approach that offers scalability, adaptability to dynamics, and failure resilience. For such systems, subscribing with continuous queries is as important as one-time queries, since it allows the user to cope with the high data rate while avoiding the overload of repeated searching. In the information filtering scenario, users specify continuous queries and thereby subscribe to new documents that satisfy the query. In contrast to existing approaches for exact information filtering, this doctoral thesis introduces the concept of approximate information filtering. A user subscribes to only a few selected sources that are most likely to satisfy the query. Efficiency and scalability are improved by trading recall for lower message traffic. This work comprises the following contributions: (i) the first architecture for approximate information filtering in structured peer-to-peer networks, (ii) strategies for choosing the best providers, taking keyword correlations into account, (iii) a prototype that realizes approximate information retrieval and filtering, and (iv) a use case for digital libraries that demonstrates both functionalities in a unified system.

    Toward higher effectiveness for recall-oriented information retrieval: A patent retrieval case study

    Research in information retrieval (IR) has largely been directed towards tasks requiring high precision. Recently, other IR applications which can be described as recall-oriented IR tasks have received increased attention in the IR research domain. Prominent among these IR applications are patent search and legal search, where users are typically ready to check hundreds or possibly thousands of documents in order to find any possible relevant document. The main concerns in this kind of application are very different from those in standard precision-oriented IR tasks, where users tend to be focused on finding an answer to their information need that can typically be addressed by one or two relevant documents. For precision-oriented tasks, mean average precision continues to be used as the primary evaluation metric for almost all IR applications. For recall-oriented IR applications the nature of the search task, including objectives, users, queries, and document collections, is different from that of standard precision-oriented search tasks. In this research study, two dimensions in IR are explored for the recall-oriented patent search task. The study includes IR system evaluation and multilingual IR for patent search. In each of these dimensions, current IR techniques are studied, and novel techniques developed especially for this kind of recall-oriented IR application are proposed and investigated experimentally in the context of patent retrieval. The techniques developed in this thesis provide a significant contribution toward evaluating the effectiveness of recall-oriented IR in general and particularly patent search, and improving the efficiency of multilingual search for this kind of task.