
    High-performance Processing of Text Queries with Tunable Pruned Term and Term Pair Indexes


    Indexing methods for web archives

    There have been numerous efforts recently to digitize previously published content and to preserve born-digital content, leading to the widespread growth of large text repositories. Web archives are such continuously growing text collections, containing versions of documents that span long time periods. Web archives present many opportunities for historical, cultural and political analyses. Consequently, there is a growing need for tools which can efficiently access and search them. In this work, we are interested in indexing methods for supporting text-search workloads over web archives, such as time-travel queries and phrase queries. To this end we make the following contributions:
    • Time-travel queries are keyword queries with a temporal predicate, e.g., “mpii saarland” @ [06/2009], which return versions of documents from the past. We introduce a novel index organization strategy, called index sharding, for efficiently supporting time-travel queries without incurring additional index-size blowup (see the sketch after this abstract). We also propose index-maintenance approaches which scale to such continuously growing collections.
    • We develop query-optimization techniques for time-travel queries, called partition selection, which maximize recall at any given query-execution stage.
    • We propose indexing methods to support phrase queries, e.g., “to be or not to be that is the question”. We index multi-word sequences and devise novel query-optimization methods over the indexed sequences to efficiently answer phrase queries.
    We demonstrate the superior performance of our approaches over existing methods by extensive experimentation on real-world web archives.
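    The index-sharding idea lends itself to a compact illustration. The following is a minimal sketch, assuming a postings format in which each document version carries a validity interval and shards partition the timeline; the names (Posting, TimeShard, time_travel_query) and the conjunctive query semantics are illustrative assumptions, not the thesis's actual data structures.

        # Minimal sketch of a time-travel keyword lookup over a time-sharded
        # index. Posting format, shard layout, and all names are assumptions
        # for illustration, not the data structures proposed in the thesis.
        from bisect import bisect_right
        from dataclasses import dataclass, field

        @dataclass
        class Posting:
            doc_id: str
            valid_from: int  # version is valid during [valid_from, valid_to)
            valid_to: int

        @dataclass
        class TimeShard:
            start: int       # time window covered by this shard
            end: int
            postings: dict = field(default_factory=dict)  # term -> [Posting]

            def lookup(self, term, t):
                # Keep only document versions whose validity interval contains t.
                return {p.doc_id for p in self.postings.get(term, [])
                        if p.valid_from <= t < p.valid_to}

        def time_travel_query(shards, terms, t):
            """Answer a keyword query with temporal predicate @ t.

            Assumes shards are sorted by start time and t falls inside the
            sharded timeline; only the single shard whose window contains t
            is touched, and the query terms are intersected conjunctively.
            """
            starts = [s.start for s in shards]
            shard = shards[bisect_right(starts, t) - 1]
            result = None
            for term in terms:
                docs = shard.lookup(term, t)
                result = docs if result is None else result & docs
            return result or set()

    Partitioning the timeline means a query touches the postings of one shard rather than scanning validity intervals across the whole archive, which is one intuition behind supporting time-travel queries without blowing up per-query work.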

    Efficient and effective retrieval using Higher-Order proximity models

    Information Retrieval systems are widely used to retrieve documents that are relevant to a user's information need. Systems leveraging proximity heuristics to estimate the relevance of a document have been shown to be effective. However, the computational cost of proximity-based models is rarely considered, which is an important concern over large-scale document collections. Large-scale collections also make collection-based evaluation challenging, since only a small number of documents can be judged on a limited budget. Effectiveness, efficiency and reliable evaluation are coherent components that should be considered when developing a good retrieval system. This thesis makes contributions to all three aspects.
    Many proximity-based retrieval models are effective, but it is also important to find efficient solutions for extracting proximity features, especially for models using higher-order proximity statistics. We therefore propose a one-pass algorithm based on the PlaneSweep approach. We demonstrate that the new one-pass algorithm reduces the cost of capturing a full dependency relation of a query, regardless of the input representations. Although our proposed methods can capture higher-order proximity features efficiently, the trade-offs between effectiveness and efficiency when using proximity-based models remain largely unexplored. We consider different variants of proximity statistics and demonstrate that using local proximity statistics can achieve an improved trade-off between effectiveness and efficiency.
    Another important aspect in IR is reliable system comparison. We conduct a series of experiments that explore the interaction between pooling and evaluation depth, interactions between evaluation metrics and evaluation depth, and correlations between two different evaluation metrics. We show that different evaluation configurations on large test collections, where only a limited number of relevance labels are available, can lead to different system-comparison conclusions. We also demonstrate the pitfalls of choosing an arbitrary evaluation depth regardless of the metrics employed and the pooling depth of the test collections. Lastly, we provide suggestions on evaluation configurations for reliable comparisons of retrieval systems on large test collections. On such collections, a shallow judgment pool may be employed because assessment budgets are often limited, which can lead to an imprecise evaluation of system performance, especially when a deep evaluation metric is used. We propose a framework for estimating deep-metric scores on shallow judgment pools. Starting from an initial shallow judgment pool, rank-level estimators are designed to estimate the effectiveness gain at each rank. Based on the rank-level estimations, we propose an optimization framework to obtain a more precise score estimate.
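    As a rough illustration of the plane-sweep idea the abstract builds on, the sketch below scans sorted per-term position lists left to right and emits candidate windows covering all query terms in a single pass; the function name min_cover_windows and the exact window semantics are simplifying assumptions, not the thesis's one-pass algorithm for higher-order proximity statistics.

        # Minimal plane-sweep sketch: one left-to-right pass over sorted
        # per-term position lists, yielding candidate spans that contain at
        # least one occurrence of every query term.
        import heapq

        def min_cover_windows(position_lists):
            """position_lists: one sorted list of token positions per term.
            Yields (start, end) candidate covering spans in one pass."""
            # Heap entries: (position, term index, index within that list).
            heap = [(pl[0], i, 0) for i, pl in enumerate(position_lists) if pl]
            if len(heap) < len(position_lists):
                return  # some term never occurs: no covering window exists
            heapq.heapify(heap)
            right = max(pos for pos, _, _ in heap)
            while True:
                left, term, idx = heapq.heappop(heap)  # leftmost occurrence
                yield (left, right)                    # current candidate span
                if idx + 1 == len(position_lists[term]):
                    return                             # that term is exhausted
                nxt = position_lists[term][idx + 1]
                right = max(right, nxt)
                heapq.heappush(heap, (nxt, term, idx + 1))

        # Positions of three query terms in one document:
        # list(min_cover_windows([[0, 9], [1, 6], [2]]))
        # -> [(0, 2), (1, 9), (2, 9)]; the smallest span, (0, 2), could feed
        # a proximity feature such as a minimal-window score.

    Each document is processed in time roughly linear in the total number of term occurrences (with a logarithmic factor for the heap), which is why a one-pass sweep is attractive when higher-order dependency features must be extracted at scale.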

    A corpus-based study of academic-collocation use and patterns in postgraduate Computer Science students’ writing

    Collocation has been considered a problematic area for L2 learners. Various studies have been conducted to investigate native speakers’ (NS) and non-native speakers’ (NNS) use of different types of collocations (e.g., Durrant and Schmitt, 2009; Laufer and Waldman, 2011). These studies have indicated that, unlike NS, NNS rely on a limited set of collocations and tend to overuse them. This raises the question: if NNS tend to overuse a limited set of collocations in their academic writing, would their use of academic collocations in a specific discipline (Computer Science in this study) vary from that of NS and expert writers? This study has three main aims. First, it investigates the use of lexical academic collocations in NNS and NS Computer Science students’ MSc dissertations and compares their uses with those of expert writers in published research articles. Second, it explores the factors behind the over/underuse of the 24 shared lexical collocations across the corpora. Third, it develops awareness-raising activities that could be used to help non-expert NNS students with collocation over/underuse problems. For this purpose, a corpus of 600,000 words was compiled from 55 dissertations (26 written by NS and 29 by NNS). For comparison purposes, a reference corpus of 600,269 words was compiled from 63 research articles from prestigious, high-impact-factor Computer Science academic journals. The Academic Word List (AWL) (Coxhead, 2000) was used to develop lists of the most frequent academic words in the student corpora, whose collocations were then examined. Quantitative analysis was carried out by comparing the 100 most frequent noun and verb collocations from each of the student corpora with the reference corpus. The results reveal that both NNS (52%) and NS (78%) students overuse noun collocations compared to the expert writers in the reference corpus; they underuse only a small number of noun collocations (8%). Surprisingly, neither NNS nor NS students significantly over/underused verb collocations compared to the reference corpus. To achieve the second aim, a mixed-methods approach was adopted. First, the variant patterns of the 24 noun collocations shared between the NNS and NS corpora were identified, to determine whether over/underuse of these collocations could be explained by differences in the number of patterns used. Approximately half of the 24 collocations appeared in additional patterns, including Noun + preposition + Noun and Noun + adjective + Noun, that were rarely found in the writing of experts. Second, a categorisation judgement task and semi-structured interviews were carried out with three Computer Scientists to elicit their views on the various factors likely to influence the writers’ noun-collocation choices across the corpora. The results demonstrate that three main factors could explain the variation: sub-discipline, topic, and genre. To achieve the third, pedagogical aim, a sample of awareness-raising activities was designed for the problematic over/underuse of some noun collocations. Using the corpus-based Data-Driven Learning (DDL) approach (Johns, 1991), three types of awareness-raising activities were developed: noticing collocations, noticing and identifying different patterns of the same collocation, and comparing and contrasting patterns between the NNS students’ corpus and the reference corpus.
    The results of this study suggest that academic collocation use in an ESP context (Computer Science) is related to factors other than students’ lack of knowledge of collocations. Expertise, genre variation, topic, and discipline-specific collocations proved to be important factors to consider in ESP. ESP teachers should therefore alert their students to the effect of these factors on academic collocation use in subject-specific disciplines. This has tangible implications for Applied Linguistics and for teaching practices.
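    The over/underuse comparison at the core of the quantitative analysis can be illustrated with a short sketch: normalise collocation counts per million words in each corpus, then flag collocations the student corpus uses markedly more or less often than the reference corpus. The per-million normalisation, the 1.5 ratio threshold, the example collocation, and the function names are illustrative assumptions; the thesis's actual statistical procedure is not detailed in this abstract.

        # Illustrative over/underuse comparison between a student corpus and
        # a reference corpus; thresholds and names are assumptions, not the
        # study's actual statistical procedure.

        def per_million(count, corpus_size):
            return count * 1_000_000 / corpus_size

        def classify_use(student_counts, ref_counts,
                         student_size, ref_size, ratio=1.5):
            """Return {'overuse': [...], 'underuse': [...], 'similar': [...]}."""
            out = {"overuse": [], "underuse": [], "similar": []}
            for colloc, count in student_counts.items():
                s = per_million(count, student_size)
                r = per_million(ref_counts.get(colloc, 0), ref_size)
                if r == 0 or s / r >= ratio:
                    out["overuse"].append(colloc)
                elif s / r <= 1 / ratio:
                    out["underuse"].append(colloc)
                else:
                    out["similar"].append(colloc)
            return out

        # e.g., a 600,000-word student corpus vs. the 600,269-word reference:
        # classify_use({"significant difference": 45},
        #              {"significant difference": 12}, 600_000, 600_269)
        # -> {'overuse': ['significant difference'], ...}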