287 research outputs found

    Learning to merge search results for efficient Distributed Information Retrieval

    Get PDF
    Merging search results from different servers is a major problem in Distributed Information Retrieval. We used Regression-SVM and Ranking-SVM which would learn a function that merges results based on information that is readily available: i.e. the ranks, titles, summaries and URLs contained in the results pages. By not downloading additional information, such as the full document, we decrease bandwidth usage. CORI and Round Robin merging were used as our baselines; surprisingly, our results show that the SVM-methods do not improve over those baselines

    A Decade of Shared Tasks in Digital Text Forensics at PAN

    Full text link
    [EN] Digital text forensics aims at examining the originality and credibility of information in electronic documents and, in this regard, to extract and analyze information about the authors of these documents. The research field has been substantially developed during the last decade. PAN is a series of shared tasks that started in 2009 and significantly contributed to attract the attention of the research community in well-defined digital text forensics tasks. Several benchmark datasets have been developed to assess the state-of-the-art performance in a wide range of tasks. In this paper, we present the evolution of both the examined tasks and the developed datasets during the last decade. We also briefly introduce the upcoming PAN 2019 shared tasks.We are indebted to many colleagues and friends who contributed greatly to PAN's tasks: Maik Anderka, Shlomo Argamon, Alberto Barrón-Cedeño, Fabio Celli, Fabio Crestani, Walter Daelemans, Andreas Eiselt, Tim Gollub, Parth Gupta, Matthias Hagen, Teresa Holfeld, Patrick Juola, Giacomo Inches, Mike Kestemont, Moshe Koppel, Manuel Montes-y-Gómez, Aurelio Lopez-Lopez, Francisco Rangel, Miguel Angel Sánchez-Pérez, Günther Specht, Michael Tschuggnall, and Ben Verhoeven. Our special thanks go to PAN¿s sponsors throughout the years and not least to the hundreds of participants.Potthast, M.; Rosso, P.; Stamatatos, E.; Stein, B. (2019). A Decade of Shared Tasks in Digital Text Forensics at PAN. Lecture Notes in Computer Science. 11438:291-300. https://doi.org/10.1007/978-3-030-15719-7_39S2913001143

    Time-based Microblog Distillation

    Get PDF
    This paper presents a simple approach for identifying relevant and reliable news from the Twitter stream, as soon as they emerge. The approach is based on a near-real time systems for sentiment analysis on Twitter, implemented by Fondazione Ugo Bordoni, and properly modified in order to detect the most representative tweets in a specified time slot. This work represents a first step towards the implementation of a prototype supporting journalists in discovering and finding news on Twitter

    Web document summarisation: a task-oriented evaluation

    Get PDF
    We present a query-biased summarisation interface for Web searching. The summarisation system has been specifically developed to act as a component in existing Web search interfaces. The summaries allow the user to more effectively assess the content of Web pages. We also present an experimental investigation of this approach. Our experimental results shows the system appears to be more useful and effective in helping users gauge document relevance than the traditional ranked titles/abstracts approach

    A Comparative Analysis of Retrievability and PageRank Measures

    Full text link
    The accessibility of documents within a collection holds a pivotal role in Information Retrieval, signifying the ease of locating specific content in a collection of documents. This accessibility can be achieved via two distinct avenues. The first is through some retrieval model using a keyword or other feature-based search, and the other is where a document can be navigated using links associated with them, if available. Metrics such as PageRank, Hub, and Authority illuminate the pathways through which documents can be discovered within the network of content while the concept of Retrievability is used to quantify the ease with which a document can be found by a retrieval model. In this paper, we compare these two perspectives, PageRank and retrievability, as they quantify the importance and discoverability of content in a corpus. Through empirical experimentation on benchmark datasets, we demonstrate a subtle similarity between retrievability and PageRank particularly distinguishable for larger datasets.Comment: Accepted at FIRE 202

    Search of spoken documents retrieves well recognized transcripts

    Get PDF
    This paper presents a series of analyses and experiments on spoken document retrieval systems: search engines that retrieve transcripts produced by speech recognizers. Results show that transcripts that match queries well tend to be recognized more accurately than transcripts that match a query less well. This result was described in past literature, however, no study or explanation of the effect has been provided until now. This paper provides such an analysis showing a relationship between word error rate and query length. The paper expands on past research by increasing the number of recognitions systems that are tested as well as showing the effect in an operational speech retrieval system. Potential future lines of enquiry are also described
    • …
    corecore