Building Cultural Heritage Reference Collections from Social Media through Pooling Strategies: The Case of 2020’s Tensions Over Race and Heritage
Preprint of the article. [Abstract] Social networks constitute a valuable source for documenting heritage constitution processes or for obtaining a real-time snapshot of a cultural heritage research topic. Many heritage researchers use social networks as a social thermometer to study these processes, creating for this purpose collections that constitute born-digital archives which are potentially reusable, searchable, and of interest to other researchers or citizens. However, the retrieval and archiving techniques used on social networks within heritage studies are still semi-manual, which makes collection building time-consuming and hinders the reproducibility, evaluation, and opening up of the resulting collections. By combining Information Retrieval strategies with emerging archival techniques, some of these weaknesses can be overcome. Specifically, pooling is a well-known Information Retrieval method for extracting a sample of documents from an entire document set (posts, in the case of social networks), obtaining the most complete and unbiased set of relevant documents on a given topic. Using this approach, researchers can create a reference collection without annotating the entire corpus of documents or posts retrieved. This is especially useful in social media due to the large number of topics treated by the same user or in the same thread or post. We present a platform for applying pooling strategies combined with expert judgment to create cultural heritage reference collections from social networks in a customisable, reproducible, documented, and shareable way. The platform is validated by building a reference collection from a social network about the recent attacks on heritage entities motivated by anti-racist protests. This reference collection and the results obtained from its preliminary study are available for use.
This real application has allowed us to validate the platform and the pooling strategies for creating reference collections in heritage studies from social networks. This research has received financial support from: (i) Saving European Archaeology from the Digital Dark Age (SEADDA) 2019-2023, COST ACTION CA18128; (ii) the "Ministerio de Ciencia, Innovación y Universidades" of the Government of Spain and the ERDF (projects RTI2018-093336-B-C21 and RTI2018-093336-B-C22); (iii) Xunta de Galicia, "Consellería de Cultura, Educación e Universidade" (project GPC ED431B 2019/03); and (iv) Xunta de Galicia, "Consellería de Cultura, Educación e Universidade" and the ERDF ("Centro Singular de Investigación de Galicia" accreditation ED431G 2019/01).
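As a rough sketch of the pooling idea described in the abstract (the function, run names, and pool depth are illustrative, not part of the platform), a depth-k pool merges the top-k posts from each retrieval strategy into a single deduplicated set for expert judgment:

```python
# Depth-k pooling sketch: merge the top-k results of several retrieval
# runs into one deduplicated pool for manual relevance assessment.
def build_pool(runs, k=100):
    """runs: dict run_name -> list of doc ids ranked by score."""
    pool = []
    seen = set()
    for run in runs.values():
        for doc in run[:k]:
            if doc not in seen:
                seen.add(doc)
                pool.append(doc)
    return pool

runs = {
    "bm25":  ["p3", "p1", "p7", "p2"],
    "tfidf": ["p1", "p4", "p3", "p9"],
}
print(build_pool(runs, k=3))  # ['p3', 'p1', 'p7', 'p4']
```

Only the pooled posts are shown to assessors, which is what lets researchers avoid annotating the entire retrieved corpus.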
When to stop making relevance judgments? A study of stopping methods for building information retrieval test collections
This is the peer reviewed version of the following article: David E. Losada, Javier Parapar and Alvaro Barreiro (2019) When to Stop Making Relevance Judgments? A Study of Stopping Methods for Building Information Retrieval Test Collections. Journal of the Association for Information Science and Technology, 70 (1), 49-60, which has been published in final form at https://doi.org/10.1002/asi.24077. This article may be used for non-commercial purposes in accordance with Wiley Terms and Conditions for Use of Self-Archived Versions.
In information retrieval evaluation, pooling is a well-known technique to extract a sample of documents to be assessed for relevance. Given the pooled documents, a number of studies have proposed different prioritization methods to adjudicate documents for judgment. These methods follow different strategies to reduce the assessment effort. However, there is no clear guidance on how many relevance judgments are required for creating a reliable test collection. In this article we investigate and further develop methods to determine when to stop making relevance judgments. We propose a highly diversified set of stopping methods and provide a comprehensive analysis of the usefulness of the resulting test collections. Some of the stopping methods introduced here combine innovative estimates of recall with time series models used in financial trading. Experimental results on several representative collections show that some stopping methods can reduce up to 95% of the assessment effort and still produce a robust test collection. We demonstrate that the reduced set of judgments can be reliably employed to compare search systems using disparate effectiveness metrics such as Average Precision, NDCG, P@100, and Rank Biased Precision.
With all these measures, the correlations found between full pool rankings and reduced pool rankings are very high. This work received financial support from the (i) "Ministerio de Economía y Competitividad" of the Government of Spain and FEDER Funds under the research project TIN2015-64282-R, (ii) Xunta de Galicia (project GPC 2016/035), and (iii) Xunta de Galicia "Consellería de Cultura, Educación e Ordenación Universitaria" and the European Regional Development Fund (ERDF) through the following 2016-2019 accreditations: ED431G/01 ("Centro singular de investigación de Galicia") and ED431G/08S
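The paper's stopping methods combine recall estimates with time-series models; as a much simpler illustration of the general idea (the window size and cutoff here are arbitrary, not the paper's methods), one can stop judging when the rate of newly found relevant documents over a sliding window falls below a threshold:

```python
# Toy stopping rule: halt assessment when relevant documents become
# rare in the most recent window of judgments.
def stop_point(judgments, window=20, min_rate=0.05):
    """judgments: 0/1 relevance labels in adjudication order.
    Returns how many judgments would be made before stopping."""
    for i in range(window, len(judgments) + 1):
        if sum(judgments[i - window:i]) / window < min_rate:
            return i
    return len(judgments)

labels = [1] * 10 + [0] * 50   # relevance dries up after rank 10
print(stop_point(labels))      # 30: stops after 30 of 60 judgments
```

The design choice mirrored here is that stopping decisions use only the judgments already made, so the rule can run online during assessment.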
Design and implementation of an application for collecting and sampling web pages
Early Information Retrieval systems worked on collections of homogeneous quality, such as legal documents and medical articles. With the advent of the web, traditional information retrieval techniques proved ineffective because they were unable to distinguish the quality of documents; hence the need to devise algorithms able to select web pages based on both relevance and quality. Among these algorithms, a prominent place is held by link analysis algorithms, which try to infer the quality of web pages from the topological structure of the graph associated with the web. The work described in this report was carried out within a project whose goal is to evaluate the actual effectiveness of such algorithms.
Our work consisted in developing a web application that, given a suitable population of web pages, provides a set of features aimed at collecting judgments on the quality of those pages. The software performs a pre-processing of the results returned by search engines; to this end, three modules were developed: Interrogatore (Querier), which extracts the URLs from the results; Campionatore (Sampler), which, given a reasonable heuristic, filters the results returned by the Querier; and Downloader, which stores the pages to disk.
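As a minimal sketch of the link-analysis idea described above (the graph is a toy example, and the project's evaluated algorithms are not specified here), PageRank-style power iteration infers page quality from the link graph:

```python
# Minimal PageRank sketch: infer page "quality" from the link graph
# via power iteration. Graph and damping factor are illustrative.
def pagerank(links, d=0.85, iters=50):
    """links: dict mapping page -> list of outgoing links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1.0 - d) / n for p in pages}
        for p, outs in links.items():
            if not outs:  # dangling page: spread its rank uniformly
                for q in pages:
                    new[q] += d * rank[p] / n
            else:
                for q in outs:
                    new[q] += d * rank[p] / len(outs)
        rank = new
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))  # c (it receives the most in-links)
```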
A collaborative approach to IR evaluation
In this thesis we investigate two main problems: 1) inferring consensus from disparate inputs to improve the quality of crowd-contributed data; and 2) developing a reliable crowd-aided IR evaluation framework.
With regard to the first contribution, while many statistical label aggregation methods have been proposed, little comparative benchmarking has occurred in the community, making it difficult to determine the state of the art in consensus or to quantify novelty and progress, leaving modern systems to adopt simple control strategies. To aid the progress of statistical consensus and make state-of-the-art methods accessible, we develop a benchmarking framework in SQUARE, an open-source shared-task framework including benchmark datasets, defined tasks, standard metrics, and reference implementations with empirical results for several popular methods. Through the development of SQUARE we propose a crowd simulation model that emulates real crowd environments to enable rapid and reliable experimentation with collaborative methods under different crowd contributions. We apply the findings of the benchmark to develop reliable crowd-contributed test collections for IR evaluation.
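The usual baseline that such statistical consensus methods are measured against is simple majority voting; a minimal sketch (the data is illustrative, not from the SQUARE benchmarks):

```python
from collections import Counter

# Minimal majority-vote label aggregation, the common baseline that
# statistical consensus methods aim to improve on.
def majority_vote(labels_per_item):
    """labels_per_item: dict item -> list of worker labels."""
    consensus = {}
    for item, labels in labels_per_item.items():
        consensus[item] = Counter(labels).most_common(1)[0][0]
    return consensus

votes = {"d1": [1, 1, 0], "d2": [0, 0, 1, 0], "d3": [1]}
print(majority_vote(votes))  # {'d1': 1, 'd2': 0, 'd3': 1}
```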
As our second contribution, we describe a collaborative model for distributing relevance judging tasks between trusted assessors and crowd judges. Building on prior work's hypothesis that judging disagreements occur on borderline documents, we train a logistic regression model to predict assessor disagreement and prioritize judging tasks by expected disagreement. Judgments are generated from different crowd models and intelligently aggregated. Given a priority queue, a judging budget, and a cost ratio for expert vs. crowd judging, critical judging tasks are assigned to trusted assessors, with the crowd supplying the remaining judgments. Results on two TREC datasets show that a significant judging burden can be confidently shifted to the crowd, achieving high rank correlation, often at lower cost than exclusive use of trusted assessors.
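A rough sketch of the budget-splitting idea (the function and parameter names are hypothetical; the thesis uses disagreement predicted by a trained classifier): send the most disputed tasks to experts while the leftover budget still covers crowd judgments for everything else:

```python
# Illustrative expert/crowd task routing under a fixed judging budget.
def route_tasks(disagreement, budget, expert_cost, crowd_cost):
    """disagreement: dict task -> predicted assessor disagreement."""
    ordered = sorted(disagreement, key=disagreement.get, reverse=True)
    expert = []
    for task in ordered:
        remaining = len(ordered) - len(expert) - 1
        # take an expert judgment only if the rest can still be crowd-judged
        if budget - expert_cost >= remaining * crowd_cost:
            expert.append(task)
            budget -= expert_cost
        else:
            break
    return expert, ordered[len(expert):]

scores = {"t1": 0.9, "t2": 0.7, "t3": 0.2, "t4": 0.1}
print(route_tasks(scores, budget=10, expert_cost=4, crowd_cost=1))
# (['t1', 't2'], ['t3', 't4'])
```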
Stopping methods for technology assisted reviews based on point processes
Technology Assisted Review (TAR), which aims to reduce the effort required to screen collections of documents for relevance, is used to develop systematic reviews of medical evidence and identify documents that must be disclosed in response to legal proceedings. Stopping methods are algorithms which determine when to stop screening documents during the TAR process, helping to ensure that workload is minimised while still achieving a high level of recall. This paper proposes a novel stopping method based on point processes, which are statistical models that can be used to represent the occurrence of random events. The approach uses rate functions to model the occurrence of relevant documents in the ranking and compares four candidates, including one that has not previously been used for this purpose (hyperbolic). Evaluation is carried out using standard datasets (CLEF e-Health, TREC Total Recall, TREC Legal), and this work is the first to explore stopping method robustness by reporting performance on a range of rankings of varying effectiveness. Results show that the proposed method achieves the desired level of recall without requiring an excessive number of documents to be examined in the majority of cases and also compares well against multiple alternative approaches.
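As a greatly simplified sketch of the rate-function idea (the fitting procedure and constants are illustrative, not the paper's point-process estimators): assume relevant documents occur down the ranking with a hyperbolic rate a/(1+i), so the expected count up to rank n is roughly a*ln(1+n); fit a on an initial prefix of judgments and stop once the observed count covers the target share of the predicted total:

```python
import math

# Toy hyperbolic-rate stopping rule for screening a ranked list.
def stop_rank(judgments, total_docs, fit_depth=50, target_recall=0.95):
    """judgments: 0/1 labels for the screened prefix of the ranking."""
    # fit the rate constant on the first fit_depth judgments
    a = sum(judgments[:fit_depth]) / math.log(1 + fit_depth)
    # expected relevant documents in the whole collection of rankings
    predicted_total = a * math.log(1 + total_docs)
    found = 0
    for n, rel in enumerate(judgments, start=1):
        found += rel
        if n >= fit_depth and found >= target_recall * predicted_total:
            return n
    return len(judgments)
```

A dense prefix followed by sparse relevance lets the rule stop early; if the target is never reached, screening continues to the end of the available judgments.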
On-line Metasearch, Pooling, and System Evaluation
This thesis presents a unified method for simultaneous solution of three problems in Information Retrieval: metasearch (the fusion of ranked lists returned by retrieval systems to elicit improved performance), efficient system evaluation (the accurate evaluation of retrieval systems with small numbers of relevance judgments), and pooling or "active sample selection" (the selection of documents for manual judgment in order to develop sample pools of high precision or pools suitable for assessing system quality). The thesis establishes a unified theoretical framework for addressing these three problems and naturally generalizes their solution to the on-line context by incorporating feedback in the form of relevance judgments. The algorithm, Rankhedge for on-line retrieval, metasearch and system evaluation, is the first to address these three problems simultaneously and also to generalize their solution to the on-line context. Optimality of the Rankhedge algorithm is developed via Bayesian and maximum entropy interpretations. Results of the algorithm prove to be significantly superior to previous methods when tested over a range of TREC (Text REtrieval Conference) data. In the absence of feedback, the technique equals or exceeds the performance of benchmark metasearch algorithms such as CombMNZ and Condorcet. The technique then dramatically improves on this performance during the on-line metasearch process. In addition, the technique generates pools of documents which include more relevant documents and produce more accurate system evaluations than previous techniques. The thesis includes an information-theoretic examination of the original Hedge algorithm as well as its adaptation to the context of ranked lists. The work also addresses the concept of information-theoretic similarity within the Rankhedge context and presents a method for decorrelating the predictor set to improve worst case performance.
Finally, an information-theoretically optimal method for probabilistic "active sampling" is presented with possible application to a broad range of practical and theoretical contexts.
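The multiplicative-weights core behind Hedge can be sketched as follows (a toy simplification, not the thesis's Rankhedge, whose losses and selection rule are more principled): systems act as experts, the next document to judge is chosen by weighted preference, and each system's weight decays by beta raised to its loss after each judgment:

```python
# Multiplicative-weights sketch in the spirit of Hedge/Rankhedge
# (simplified; the loss and selection rule here are illustrative).
def hedge_judging(rankings, qrels, beta=0.7, rounds=5):
    """rankings: dict system -> ranked doc list; qrels: doc -> 0/1."""
    weights = {s: 1.0 for s in rankings}
    judged = set()
    for _ in range(rounds):
        # score unjudged docs by weighted reciprocal rank
        scores = {}
        for s, ranking in rankings.items():
            for r, doc in enumerate(ranking):
                if doc not in judged:
                    scores[doc] = scores.get(doc, 0.0) + weights[s] / (r + 1)
        if not scores:
            break
        doc = max(scores, key=scores.get)  # next document to judge
        judged.add(doc)
        rel = qrels.get(doc, 0)
        for s, ranking in rankings.items():
            # loss: ranking a non-relevant doc highly, or a relevant one lowly
            rr = 1.0 / (ranking.index(doc) + 1) if doc in ranking else 0.0
            loss = rr if rel == 0 else 1.0 - rr
            weights[s] *= beta ** loss
    return weights, judged
```

Systems that consistently put relevant documents near the top keep high weights, so the pool increasingly reflects the better systems' rankings, which is the intuition behind using the same machinery for metasearch, pooling, and evaluation.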
Department of Computer Science Activity 1998-2004
This report summarizes much of the research and teaching activity of the Department of Computer Science at Dartmouth College between late 1998 and late 2004. The material for this report was collected as part of the final report for NSF Institutional Infrastructure award EIA-9802068, which funded equipment and technical staff during that six-year period. This equipment and staff supported essentially all of the department's research activity during that period.
Filtering News from Document Streams: Evaluation Aspects and Modeled Stream Utility
Events like hurricanes, earthquakes, or accidents can impact a large number of people. Not only are people in the immediate vicinity of the event affected, but concerns about their well-being are shared by the local government and well-wishers across the world. The latest information about news events could be of use to government and aid agencies in order to make informed decisions on providing necessary support, security, and relief. The general public avails of news updates via dedicated news feeds or broadcasts and, lately, via social media services like Facebook or Twitter. Retrieving the latest information about newsworthy events from the world-wide web is thus of importance to a large section of society.
As new content on a multitude of topics is continuously being published on the web, specific event-related information needs to be filtered from the resulting stream of documents. We present in this thesis a user-centric evaluation measure for evaluating systems that filter news-related information from document streams. Our proposed evaluation measure, Modeled Stream Utility (MSU), models users accessing information from a stream of sentences produced by a news update filtering system. The user model allows for simulating a large number of users with different characteristic stream browsing behaviors. Through simulation, MSU estimates the utility of a system for an average user browsing a stream of sentences. Our results show that system performance is sensitive to a user population's stream browsing behavior and that existing evaluation metrics correspond to very specific types of user behavior.
To evaluate systems that filter sentences from a document stream, we need a set of judged sentences. This judged set is a subset of all the sentences returned by all systems and is typically constructed by pooling together the highest-quality sentences, as determined by the respective system-assigned scores for each sentence. Sentences in the pool are manually assessed, and the resulting set of judged sentences is then used to compute system performance metrics. In this thesis, we investigate the effect of including duplicates of judged sentences in the judged set on system performance evaluation. We also develop an alternative pooling methodology that, given the MSU user model, selects sentences for pooling based on the probability of a sentence being read by modeled users.
Our research lays the foundation for interesting future work on utilizing user models in different aspects of evaluation of stream filtering systems. The MSU measure enables the incorporation of different user models. Furthermore, the applicability of MSU could be extended through calibration based on user behavior.
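The user-simulation idea can be sketched as follows (the reading probability, cost, and utility definition are toy assumptions, not MSU's actual parameters): each simulated user reads sentences in stream order with some probability, gaining utility for relevant sentences read and paying a small cost per sentence; averaging over many users estimates the system's utility for an average user.

```python
import random

# Illustrative user-simulation sketch in the spirit of MSU: simulate
# many users browsing a stream of sentences and average their utility.
def simulate_utility(sentences, p_read=0.7, read_cost=0.1,
                     n_users=1000, seed=0):
    """sentences: list of 0/1 relevance labels in stream order."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_users):
        utility = 0.0
        for rel in sentences:
            if rng.random() < p_read:      # user reads this sentence
                utility += rel - read_cost  # gain minus reading effort
        total += utility
    return total / n_users

stream = [1, 0, 1, 1, 0, 0]
print(simulate_utility(stream))  # close to 0.7 * (3 - 6 * 0.1) = 1.68
```

Varying p_read and read_cost across the simulated population is one way to model users with different characteristic stream browsing behaviors.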
Evaluation Methodologies for Visual Information Retrieval and Annotation
Performance assessment plays a major role in research on Information Retrieval (IR) systems. Starting with the Cranfield experiments in the early 1960s, methodologies for system-based performance assessment emerged and established themselves, resulting in an active research field with a number of successful benchmarking activities. With the rise of the digital age, procedures for text retrieval evaluation were often transferred to multimedia retrieval evaluation without questioning their direct applicability. This thesis investigates the problem of system-based performance assessment of annotation approaches in generic image collections. It addresses three important parts of annotation evaluation, namely user requirements for the retrieval of annotated visual media, performance measures for multi-label evaluation, and visual test collections. Using the example of multi-label image annotation evaluation, I discuss which concepts to employ for indexing, how to obtain a reliable ground truth at moderate cost, and which evaluation measures are appropriate. This is accompanied by a thorough analysis of related work on system-based performance assessment in Visual Information Retrieval (VIR). Traditional performance measures are classified into four dimensions and investigated according to their appropriateness for visual annotation evaluation. Evaluation measures usually assign binary costs to correct and incorrect annotations; this assumption, however, is at odds with the nature of image concepts, since the predicted concepts and the set of true indexed concepts interrelate with each other. This work shows how semantic similarities between visual concepts can be estimated automatically and brought into the evaluation process for a fine-grained evaluation scenario. Outcomes of this thesis include a user model for concept-based image retrieval, a fully assessed image annotation test collection, and a number of novel performance measures for image annotation evaluation.
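The semantic-similarity idea can be sketched as a similarity-weighted precision (the similarity values and the function are toy assumptions, not the thesis's measures): each predicted concept earns its best similarity to any true concept instead of a strict 0/1 match:

```python
# Sketch of similarity-aware annotation scoring (toy similarities):
# a predicted concept gets partial credit for being semantically
# related to a true concept, rather than binary right/wrong cost.
def soft_precision(predicted, truth, sim):
    """sim: dict of frozenset({a, b}) -> similarity in [0, 1]."""
    def s(a, b):
        return 1.0 if a == b else sim.get(frozenset((a, b)), 0.0)
    if not predicted:
        return 0.0
    return sum(max(s(p, t) for t in truth) for p in predicted) / len(predicted)

sim = {frozenset(("cat", "animal")): 0.8}
print(soft_precision(["cat", "car"], ["animal"], sim))  # 0.4
```

Here "cat" earns 0.8 for its relatedness to "animal" while "car" earns nothing, so the annotation is scored between a full hit and a full miss, which is the fine-grained behavior binary measures cannot express.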