287 research outputs found
Learning to merge search results for efficient Distributed Information Retrieval
Merging search results from different servers is a major problem in Distributed Information Retrieval. We used Regression-SVM and Ranking-SVM which would learn a function that merges results based on information that is readily available: i.e. the ranks, titles, summaries and URLs contained in the results pages. By not downloading additional information, such as the full document, we decrease bandwidth usage. CORI and Round Robin merging were used as our baselines; surprisingly, our results show that the SVM-methods do not improve over those baselines
A Decade of Shared Tasks in Digital Text Forensics at PAN
[EN] Digital text forensics aims at examining the originality and
credibility of information in electronic documents and, in this regard, to extract and analyze information about the authors of these documents. The research field has been substantially developed during the last decade. PAN is a series of shared tasks that started in 2009 and significantly contributed to attract the attention of the research community in well-defined digital text forensics tasks. Several benchmark datasets have been developed to assess the state-of-the-art performance in a wide range of tasks. In this paper, we present the evolution of both the examined tasks and the developed datasets during the last decade. We also briefly introduce the upcoming PAN 2019 shared tasks.We are indebted to many colleagues and friends who contributed greatly to PAN's tasks: Maik Anderka, Shlomo Argamon, Alberto Barrón-Cedeño, Fabio Celli, Fabio Crestani, Walter Daelemans, Andreas Eiselt, Tim Gollub,
Parth Gupta, Matthias Hagen, Teresa Holfeld, Patrick Juola, Giacomo Inches, Mike
Kestemont, Moshe Koppel, Manuel Montes-y-GĂłmez, Aurelio Lopez-Lopez, Francisco
Rangel, Miguel Angel Sánchez-Pérez, Günther Specht, Michael Tschuggnall, and Ben
Verhoeven. Our special thanks go to PANÂżs sponsors throughout the years and not
least to the hundreds of participants.Potthast, M.; Rosso, P.; Stamatatos, E.; Stein, B. (2019). A Decade of Shared Tasks in Digital Text Forensics at PAN. Lecture Notes in Computer Science. 11438:291-300. https://doi.org/10.1007/978-3-030-15719-7_39S2913001143
Time-based Microblog Distillation
This paper presents a simple approach for identifying relevant and reliable news from the Twitter stream, as soon as they emerge. The approach is based on a near-real time systems for sentiment analysis on Twitter, implemented by Fondazione Ugo Bordoni, and properly modified in order to detect the most representative tweets in a specified time slot.
This work represents a first step towards the implementation of a prototype supporting journalists in discovering and finding news on
Twitter
Web document summarisation: a task-oriented evaluation
We present a query-biased summarisation interface for Web searching. The summarisation system has been specifically developed to act as a component in existing Web search interfaces. The summaries allow the user to more effectively assess the content of Web pages. We also present an experimental investigation of this approach. Our experimental results shows the system appears to be more useful and effective in helping users gauge document relevance than the traditional ranked titles/abstracts approach
A Comparative Analysis of Retrievability and PageRank Measures
The accessibility of documents within a collection holds a pivotal role in
Information Retrieval, signifying the ease of locating specific content in a
collection of documents. This accessibility can be achieved via two distinct
avenues. The first is through some retrieval model using a keyword or other
feature-based search, and the other is where a document can be navigated using
links associated with them, if available. Metrics such as PageRank, Hub, and
Authority illuminate the pathways through which documents can be discovered
within the network of content while the concept of Retrievability is used to
quantify the ease with which a document can be found by a retrieval model. In
this paper, we compare these two perspectives, PageRank and retrievability, as
they quantify the importance and discoverability of content in a corpus.
Through empirical experimentation on benchmark datasets, we demonstrate a
subtle similarity between retrievability and PageRank particularly
distinguishable for larger datasets.Comment: Accepted at FIRE 202
Search of spoken documents retrieves well recognized transcripts
This paper presents a series of analyses and experiments on spoken
document retrieval systems: search engines that retrieve transcripts produced by
speech recognizers. Results show that transcripts that match queries well tend to
be recognized more accurately than transcripts that match a query less well.
This result was described in past literature, however, no study or explanation of
the effect has been provided until now. This paper provides such an analysis
showing a relationship between word error rate and query length. The paper
expands on past research by increasing the number of recognitions systems that
are tested as well as showing the effect in an operational speech retrieval
system. Potential future lines of enquiry are also described
- …