8 research outputs found

    Proximity Full-Text Search with a Response Time Guarantee by Means of Additional Indexes

    Full-text search engines are important tools for information retrieval. Term proximity is an important factor in relevance scoring. In a proximity full-text search, we assume that a relevant document contains the query terms near each other, especially if the query terms are frequently occurring words. A methodology for high-performance full-text query execution is discussed. We build additional indexes to achieve better efficiency: for a word that occurs in the text, we include in the indexes some information about nearby words. What types of additional indexes do we use? How do we use them? These questions are discussed in this work. We present experimental results showing that the average search query execution time is 44-45 times lower than that required when using ordinary inverted indexes. This is a pre-print of the contribution "Veretennikov A.B. Proximity Full-Text Search with a Response Time Guarantee by Means of Additional Indexes" published in "Arai K., Kapoor S., Bhatia R. (eds) Intelligent Systems and Applications. IntelliSys 2018. Advances in Intelligent Systems and Computing, vol 868" published by Springer, Cham. The final authenticated version is available online at: https://doi.org/10.1007/978-3-030-01054-6_66. The work was supported by Act 211 of the Government of the Russian Federation, contract no. 02.A03.21.0006. Comment: Alexander B. Veretennikov. Chair of Calculation Mathematics and Computer Science, INSM. Ural Federal University
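    A minimal sketch of the general idea of such additional indexes (not the paper's exact data structures): alongside an ordinary positional inverted index, a second index is keyed by pairs of a frequently occurring word and a word appearing near it, so that proximity queries involving very common terms can avoid scanning their long posting lists. The FREQUENT set and WINDOW size below are illustrative assumptions.

```python
from collections import defaultdict

FREQUENT = {"the", "to", "be", "or", "not"}   # hypothetical set of frequent words
WINDOW = 3                                     # hypothetical proximity window

def build_indexes(docs):
    """docs: dict doc_id -> list of tokens."""
    ordinary = defaultdict(lambda: defaultdict(list))    # term -> doc -> positions
    additional = defaultdict(lambda: defaultdict(list))  # (frequent term, nearby term) -> doc -> positions
    for doc_id, tokens in docs.items():
        for i, tok in enumerate(tokens):
            ordinary[tok][doc_id].append(i)
            if tok in FREQUENT:
                # Record which words occur within WINDOW positions of the frequent word.
                for j in range(max(0, i - WINDOW), min(len(tokens), i + WINDOW + 1)):
                    if j != i:
                        additional[(tok, tokens[j])][doc_id].append(i)
    return ordinary, additional

docs = {1: "to be or not to be that is the question".split()}
ordinary, additional = build_indexes(docs)
# A proximity query like ("to", "be") can be answered from the additional index
# without merging the long posting lists of "to" and "be".
print(sorted(set(additional[("to", "be")][1])))   # [0, 4]
```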

    Out of the box phrase indexing

    We present a method for optimizing inverted-index-based search engines with respect to phrase querying performance. Our approach adds carefully selected two-term phrases to an existing index. While competitive previous work is mainly based on the analysis of query logs, our approach works out of the box and uses only the information already contained in the index. Even so, our method can compete with previous work in terms of querying performance and can even get ahead of it on difficult queries. Moreover, our selection process gives performance guarantees for arbitrary queries. In a further step, we propose to use a phrase index as a substitute for the positional index of an in-memory search engine containing just short documents. We confirm all of our considerations by experiments on a high-performance main-memory search engine. However, we believe that our approach can be applied to classical disk-based systems as well.
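    A minimal sketch of the general technique, assuming a simple positional inverted index (term -> doc -> sorted positions): two-term phrase lists are materialized for pairs whose individual posting lists are long, i.e. expensive to intersect at query time. The selection rule and threshold below are illustrative placeholders, not the paper's criterion.

```python
def phrase_postings(index, t1, t2):
    """Documents and positions where t1 is immediately followed by t2."""
    result = {}
    for doc in index.get(t1, {}).keys() & index.get(t2, {}).keys():
        pos2 = set(index[t2][doc])
        hits = [p for p in index[t1][doc] if p + 1 in pos2]
        if hits:
            result[doc] = hits
    return result

def select_and_add_phrases(index, candidate_pairs, cost_threshold):
    """Materialize phrase lists for pairs whose combined posting-list size is large."""
    phrase_index = {}
    for t1, t2 in candidate_pairs:
        cost = sum(len(p) for p in index.get(t1, {}).values()) + \
               sum(len(p) for p in index.get(t2, {}).values())
        if cost >= cost_threshold:
            phrase_index[(t1, t2)] = phrase_postings(index, t1, t2)
    return phrase_index

# Toy usage with an illustrative index:
index = {"new":  {1: [0, 7], 2: [3]},
         "york": {1: [1, 8], 2: [9]}}
phrases = select_and_add_phrases(index, [("new", "york")], cost_threshold=4)
print(phrases[("new", "york")])   # {1: [0, 7]}
```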

    Architecture for Efficient String Dictionaries in E-Learning

    E-Learning is a response to the new educational needs of society and an important development in Information and Communication Technologies. However, this trend presents many challenges, such as the lack of an architecture that allows unified management of the heterogeneous string dictionaries required by all the users of e-learning environments, which is the challenge we address in this paper. By this we mean the string dictionaries needed for information retrieval, content development, key-performance-indicator generation and course management applications. As an example, our approach can deal with the different indexing dictionaries required by the course contents and by the various online forums, which generate a huge number of messages with an unordered structure and a great variety of topics. Our architecture generates a single dictionary that is shared by all the stakeholders involved in the e-learning process. This work was supported in part by the Spanish Ministry of Economy and Competitiveness (MINECO) under Project SEQUOIA-UA (TIN2015-63502-C3-3-R), Project RESCATA (TIN2015-65100-R) and Project PROMETEO/2018/089; and in part by the Spanish Research Agency (AEI) and the European Regional Development Fund (FEDER) under Project CloudDriver4Industry (TIN2017-89266-R).
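    A minimal sketch of the core abstraction involved: a single string dictionary that maps each distinct term to a numeric ID (and back), shared by several components such as a content indexer and a forum-message indexer, so that term IDs are consistent across applications. This illustrates the concept only; it is not the architecture proposed in the paper.

```python
class StringDictionary:
    """Bidirectional mapping between strings and integer IDs."""
    def __init__(self):
        self._id_of = {}      # string -> id
        self._str_of = []     # id -> string

    def lookup_or_insert(self, s):
        if s not in self._id_of:
            self._id_of[s] = len(self._str_of)
            self._str_of.append(s)
        return self._id_of[s]

    def string(self, term_id):
        return self._str_of[term_id]

shared = StringDictionary()
# Two different applications map terms through the same shared dictionary.
course_ids = [shared.lookup_or_insert(t) for t in "e-learning course content".split()]
forum_ids  = [shared.lookup_or_insert(t) for t in "forum message about course".split()]
print(course_ids, forum_ids)
```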

    Real-time Text Queries with Tunable Term Pair Indexes

    Term proximity scoring is an established means in information retrieval for improving the result quality of full-text queries. Integrating such proximity scores into efficient query processing, however, has not been equally well studied. Existing methods make use of precomputed lists of documents in which tuples of terms, usually pairs, occur together, usually incurring a huge index size compared to term-only indexes. This paper introduces a joint framework for trading off index size and result quality, and provides optimization techniques for tuning precomputed indexes towards either maximal result quality or maximal query processing performance, given an upper bound on the index size. The framework also allows lists for pairs to be selectively materialized based on a query log to further reduce index size. Extensive experiments with two large text collections demonstrate runtime improvements of several orders of magnitude over existing text-based processing techniques with reasonable index sizes.
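    A minimal sketch of the general idea of materializing term-pair lists under an index-size budget: candidate pairs are ranked by an illustrative benefit estimate derived from a query log and added greedily until the size bound is reached. The benefit and size estimates below are placeholders, not the optimization used in the paper.

```python
def choose_pairs(candidates, size_budget):
    """
    candidates: list of (pair, benefit, size) where
      benefit ~ query-log frequency times per-query cost saved (illustrative),
      size    ~ length of the precomputed pair posting list.
    Returns the pairs to materialize within the size budget.
    """
    chosen, used = [], 0
    # Greedy by benefit per unit of index size (a standard knapsack heuristic).
    for pair, benefit, size in sorted(candidates, key=lambda c: c[1] / c[2], reverse=True):
        if used + size <= size_budget:
            chosen.append(pair)
            used += size
    return chosen

pairs = [(("new", "york"), 900.0, 120),
         (("term", "proximity"), 40.0, 15),
         (("the", "of"), 5.0, 5000)]
print(choose_pairs(pairs, size_budget=200))   # [('new', 'york'), ('term', 'proximity')]
```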

    MergedTrie: Efficient textual indexing

    The accessing and processing of textual information (i.e. the storing and querying of a set of strings) is especially important for many current applications (e.g. information retrieval and social networks), especially when working in the fields of Big Data or IoT, which require the handling of very large string dictionaries. Typical data structures for textual indexing are Hash Tables and some variants of Tries, such as the Double Trie (DT). In this paper, we propose an extension of the DT that we have called MergedTrie. It improves the DT compression by merging both Tries into a single one and by segmenting the indexed term into two fixed-length parts in order to balance the new Trie. Thus, a higher overlap of both prefixes and suffixes is obtained. Moreover, we propose a new implementation of Tries that achieves better compression rates than the Double-Array representation usually chosen for implementing Tries. Our proposal also overcomes the limitation of static implementations that do not allow insertions and updates in their compact representations. Finally, our MergedTrie implementation experimentally improves on the efficiency of Hash Tables, DTs, the Double-Array, the Crit-bit, Directed Acyclic Word Graphs (DAWG), and Acyclic Deterministic Finite Automata (ADFA) data structures, requiring less space than the original text to be indexed. This study has been partially funded by the SEQUOIA-UA (TIN2015-63502-C3-3-R) and the RESCATA (TIN2015-65100-R) projects of the Spanish Ministry of Economy and Competitiveness (MINECO).
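    A rough sketch of the underlying idea (not the published data structure): each term is split into two parts, and the front part and the reversed back part are both inserted into the same trie, so that common prefixes and common suffixes of the vocabulary share nodes; a term is then identified by the pair of trie nodes reached by its two halves. The splitting rule below (half the term length) is a simplification of the fixed-length segmentation described in the paper.

```python
class TrieNode:
    __slots__ = ("children", "node_id")
    def __init__(self, node_id):
        self.children = {}
        self.node_id = node_id

class SingleTrie:
    def __init__(self):
        self.root = TrieNode(0)
        self.count = 1
    def insert(self, s):
        node = self.root
        for ch in s:
            if ch not in node.children:
                node.children[ch] = TrieNode(self.count)
                self.count += 1
            node = node.children[ch]
        return node.node_id

def index_term(trie, term):
    half = len(term) // 2
    front, back = term[:half], term[half:]
    # Both halves go into the same trie; the back half is reversed so that
    # terms with common suffixes also share trie paths.
    return (trie.insert(front), trie.insert(back[::-1]))

trie = SingleTrie()
term_ids = {t: index_term(trie, t) for t in ["indexing", "indexed", "boxing", "boxed"]}
print(term_ids)
```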

    Making Sense of Social Events by Event Monitoring, Visualization and Underlying Community Profiling

    With the prevalence of intelligent devices, social networks have been playing an increasingly important role in our daily life. Various social networks (e.g., Twitter, Facebook) provide convenient platforms for users to explore the world. In this thesis, we study the problem of multi-perspective analysis of social events detected from social networks. In particular, we aim to make sense of social events from the following three perspectives: 1) what these social events are about; 2) how these events evolve over time; 3) who is involved in the discussions on these events. We mainly work on two categories of social data: user-generated contents such as tweets and Facebook posts, and users' interactions such as the follow and reply behaviours among users. On the one hand, the posts reveal valuable information that describes the evolution of miscellaneous social events, which is crucial for people to understand the world. On the other hand, users' interactions demonstrate users' relationships with each other and thus provide opportunities for analysing the underlying communities behind the social events. However, it is not practical to manually detect social events, monitor event evolution or profile the underlying communities from the massive amount of social data generated every day. Hence, how to efficiently and effectively extract, manage and analyse the useful information in social data for multi-perspective understanding of social events is of great importance.
    Social data is a dynamic source of information which enables people to stay informed of what is happening now and of who the active and influential users discussing these social events are. For one thing, social data is generated by people worldwide at all times, which can enable the identification of events even before they reach the mainstream media. Moreover, the continuous stream of social data reflects the event evolution and characterizes the events with changing opinions at different stages. This gives people an opportunity to respond to urgent events in a timely manner. For another, users are often not isolated in social networks. The interactions between users can be utilized to discover the communities that discuss each social event. Underlying community profiling provides answers to questions such as who is interested in these events, and which group of people are the most influential users in spreading certain event topics. These answers deepen our understanding of the social events by considering not only the events themselves but also the users behind them.
    The first research task in this thesis is to monitor and index the evolving events from social textual contents. Social data covers a wide variety of events which typically evolve over time. Although event detection has been actively studied, most existing approaches do not track the evolution of events, nor do they address the issue of efficient monitoring in the presence of a large number of events. In this task, we detect events based on the user-generated textual contents and design four event operations to capture the dynamics of events. Moreover, we propose a novel event indexing structure, called the Multi-layer Inverted List, to manage dynamic event databases for the acceleration of large-scale event search and update.
    The second research task is to explore multiple features for social event tracking and visualization. In addition to the textual contents utilized in the first task, social data contains various features, such as images and timestamps. The benefits of incorporating different features into event detection are twofold. First, these features provide supplemental information that facilitates the event detection model. Second, different features describe the detected events from different aspects, which enables users to gain a better understanding with more vivid visualizations. To improve event detection performance, we propose a novel generative probabilistic model which jointly models five different features. Event evolution tracking is achieved by applying maximum-weighted bipartite graph matching on the events discovered in consecutive periods. Events are then visualized by representative images selected based on our three defined criteria.
    The third research task is to detect and profile the underlying social communities in social events. Social data not only contains user-generated contents which describe event evolutions, but also comprises various information on the users who discuss these events, such as user attributes, user behaviours, and so on. Comprehensively utilizing this user information can help to group similar users into communities and enrich social event analysis from the community perspective. Motivated by the rich semantics about user behaviours hidden in social data, we extend the definition of a community to a group of users who are not only densely connected, but also have similar behaviours. Moreover, in addition to detecting the communities, we further profile each of the detected communities for social event analysis. A novel community profiling model is designed to detect and characterize a community by both a content profile (what a community is about) and a diffusion profile (how it interacts with others).
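    A minimal sketch of linking events across consecutive time periods by maximum-weight bipartite matching, here on a toy similarity defined as keyword overlap (Jaccard). The similarity measure and threshold are illustrative assumptions; the thesis uses its own event model and features.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def link_events(prev_events, curr_events, min_sim=0.2):
    """prev_events / curr_events: lists of keyword lists; returns matched index pairs."""
    sim = np.array([[jaccard(p, c) for c in curr_events] for p in prev_events])
    rows, cols = linear_sum_assignment(sim, maximize=True)  # maximum-weight matching
    # Pairs below the threshold are treated as event death/birth rather than evolution.
    return [(r, c) for r, c in zip(rows, cols) if sim[r, c] >= min_sim]

prev = [["earthquake", "japan", "tsunami"], ["election", "debate"]]
curr = [["election", "results", "debate"], ["earthquake", "aftershock", "japan"]]
print(link_events(prev, curr))   # [(0, 1), (1, 0)]
```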

    Indexing methods for web archives

    There have been numerous recent efforts to digitize previously published content and to preserve born-digital content, leading to the widespread growth of large text repositories. Web archives are such continuously growing text collections, which contain versions of documents spanning long time periods. Web archives present many opportunities for historical, cultural and political analyses. Consequently, there is a growing need for tools that can efficiently access and search them. In this work, we are interested in indexing methods for supporting text-search workloads over web archives, such as time-travel queries and phrase queries. To this end we make the following contributions:
    • Time-travel queries are keyword queries with a temporal predicate, e.g., “mpii saarland” @ [06/2009], which return versions of documents from the past. We introduce a novel index organization strategy, called index sharding, for efficiently supporting time-travel queries without incurring additional index-size blowup. We also propose index-maintenance approaches which scale to such continuously growing collections.
    • We develop query-optimization techniques for time-travel queries, called partition selection, which maximize recall at any given query-execution stage.
    • We propose indexing methods to support phrase queries, e.g., “to be or not to be that is the question”. We index multi-word sequences and devise novel query-optimization methods over the indexed sequences to efficiently answer phrase queries.
    We demonstrate the superior performance of our approaches over existing methods by extensive experimentation on real-world web archives.
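    A minimal sketch of evaluating a time-travel keyword query over an archive-style index in which each posting carries the validity interval of the document version it refers to. Index sharding and the partition-selection optimization from the thesis are not modelled here; only the temporal predicate is shown, and the index contents are made-up illustrative data.

```python
from datetime import date

# term -> list of (doc_id, version, valid_from, valid_to); illustrative data only
index = {
    "saarland": [("d1", 1, date(2008, 1, 1), date(2009, 5, 31)),
                 ("d1", 2, date(2009, 6, 1), date(2010, 3, 15))],
    "mpii":     [("d1", 1, date(2008, 1, 1), date(2009, 5, 31)),
                 ("d1", 2, date(2009, 6, 1), date(2010, 3, 15))],
}

def time_travel_query(index, terms, as_of):
    """Return (doc_id, version) pairs valid at `as_of` that contain all query terms."""
    per_term = []
    for t in terms:
        per_term.append({(doc, ver) for doc, ver, start, end in index.get(t, [])
                         if start <= as_of <= end})
    return set.intersection(*per_term) if per_term else set()

# The abstract's example query "mpii saarland" @ [06/2009]:
print(time_travel_query(index, ["mpii", "saarland"], date(2009, 6, 15)))  # {('d1', 2)}
```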