83,621 research outputs found

    Temporal search in web archives

    Web archives include both archives of contents originally published on the Web (e.g., the Internet Archive) and archives of contents published long ago that are now accessible on the Web (e.g., the archive of The Times). Thanks to the increased awareness that web-born contents are worth preserving and to improved digitization techniques, web archives have grown in number and size. To unfold their full potential, search techniques are needed that consider their inherent special characteristics. This work addresses three important problems toward this objective and makes the following contributions: (1) We present the Time-Travel Inverted indeX (TTIX), an extension of the inverted index, as an efficient solution to time-travel text search in web archives, allowing users to search only the parts of the web archive that existed at a user's time of interest. (2) To counter the negative effects that terminology evolution has on the quality of search results in web archives, we propose a novel technique for automatic query reformulation, so that old but highly relevant documents are retrieved in response to today's queries. (3) For temporal information needs, where the user is best satisfied by documents that refer to particular times, we describe a retrieval model that integrates temporal expressions contained in documents (e.g., "in the 1990s") seamlessly into a language modelling approach. Experiments for each of the proposed methods show their efficiency and effectiveness, respectively, and demonstrate the viability of our approach to search in web archives.
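To make the idea of time-travel text search concrete, here is a minimal sketch of an inverted index whose postings carry validity intervals. It is not the TTIX design from the thesis; the names `Posting`, `TimeTravelIndex`, `add_version` and `search`, the integer timestamps, and the whitespace tokenizer are all assumptions made for illustration. It assumes that the versions of one document have non-overlapping validity intervals.

```python
from collections import defaultdict
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class Posting:
    """One posting: a document version and the interval in which it existed."""
    doc_id: str
    valid_from: int    # time the version appeared (inclusive)
    valid_to: float    # time it was superseded (exclusive); math.inf if still current


class TimeTravelIndex:
    """Hypothetical inverted index with a validity interval on every posting."""

    def __init__(self):
        self._postings = defaultdict(list)

    def add_version(self, doc_id, text, valid_from, valid_to=math.inf):
        # Index each distinct term of this version once (naive whitespace tokenizer).
        for term in set(text.lower().split()):
            self._postings[term].append(Posting(doc_id, valid_from, valid_to))

    def search(self, query, at_time):
        """Return ids of documents whose version at `at_time` contains all query terms."""
        result = None
        for term in query.lower().split():
            # Keep only postings whose validity interval covers the time of interest.
            alive = {
                p.doc_id
                for p in self._postings.get(term, [])
                if p.valid_from <= at_time < p.valid_to
            }
            result = alive if result is None else result & alive
        return result or set()


index = TimeTravelIndex()
index.add_version("page1", "olympic games sydney", 2000, 2004)
index.add_version("page1", "olympic games athens", 2004)
index.add_version("page2", "sydney travel guide", 2002)

print(index.search("olympic sydney", at_time=2001))
print(index.search("olympic sydney", at_time=2005))
```

The first query matches `page1`, because its 2000 to 2004 version contains both terms; the second matches nothing, because by 2005 that version has been superseded. A production index would also need compact posting lists and relevance scoring, which this sketch leaves out.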

    Temporal models for mining, ranking and recommendation in the Web

    Due to their first-hand, diverse and evolution-aware reflection of nearly all areas of life, heterogeneous temporal datasets, i.e., the Web, collaborative knowledge bases and social networks, have emerged as gold mines for content analytics of many sorts. In those collections, time plays an essential role in many crucial information retrieval and data mining tasks, from user intent understanding and document ranking to advanced recommendations. There are two semantically close and important constituents when modeling along the time dimension: entities and events. Time crucially serves as the context for changes driven by happenings and phenomena (events) that relate to people, organizations or places (so-called entities) in our social lives. Thus, determining what users expect, or in other words, resolving the uncertainty confounded by temporal changes, is a compelling task to support consistent user satisfaction. In this thesis, we address the aforementioned issues and propose temporal models that capture the temporal dynamics of such entities and events to serve the end tasks. Specifically, we make the following contributions: (1) Query recommendation and document ranking in the Web - we address the issues of suggesting entity-centric queries and of ranking effectiveness around the time period of an associated event. In particular, we propose a multi-criteria optimization framework that facilitates the combination of multiple temporal models to smooth out the abrupt changes when transitioning between event phases for the former, and a probabilistic approach for search result diversification of temporally ambiguous queries for the latter. (2) Entity relatedness in Wikipedia - we study the long-term dynamics of Wikipedia as a global memory place for high-impact events, specifically the reviving memories of past events. Additionally, we propose a neural network-based approach to measure the temporal relatedness of entities and events. The model engages different latent representations of an entity (i.e., from time, link-based graph and content) and uses the collective attention from user navigation as the supervision. (3) Graph-based ranking and temporal anchor-text mining in Web Archives - we tackle the problem of discovering important documents along the time span of Web Archives, leveraging the link graph. Specifically, we combine the problems of relevance, temporal authority, diversity and time in a unified framework. The model accounts for the incomplete link structure and the natural time lag in Web Archives when mining the temporal authority. (4) Methods for enhancing predictive models at an early stage in the social media and clinical domains - we investigate several methods to control model instability and enrich the contexts of predictive models in the “cold-start” period. We demonstrate their effectiveness for the rumor detection and blood glucose prediction cases, respectively. Overall, the findings presented in this thesis demonstrate the importance of tracking the temporal dynamics surrounding salient events and entities for IR applications. We show that determining such changes in time-based patterns and trends in prevalent temporal collections can better satisfy user expectations and boost ranking and recommendation effectiveness over time.

    Ranking Archived Documents for Structured Queries on Semantic Layers

    Archived collections of documents (like newspaper and web archives) serve as important information sources in a variety of disciplines, including Digital Humanities, Historical Science, and Journalism. However, the absence of efficient and meaningful exploration methods remains a major hurdle to turning them into usable sources of information. A semantic layer is an RDF graph that describes metadata and semantic information about a collection of archived documents, and which in turn can be queried through a semantic query language (SPARQL). This allows running advanced queries that combine metadata of the documents (like publication date) with content-based semantic information (like entities mentioned in the documents). However, the results returned by such structured queries can be numerous, and they all match the query equally. In this paper, we deal with this problem and formalize the task of "ranking archived documents for structured queries on semantic layers". Then, we propose two ranking models for the problem at hand which jointly consider: i) the relativeness of documents to entities, ii) the timeliness of documents, and iii) the temporal relations among the entities. The experimental results on a new evaluation dataset show the effectiveness of the proposed models and allow us to understand their limitations.