
    Reproducibility in Machine Learning-Driven Research

    Research is facing a reproducibility crisis, in which the results and findings of many studies are difficult or even impossible to reproduce. This is also the case in machine learning (ML) and artificial intelligence (AI) research, often because data and/or source code are unpublished and because results are sensitive to ML training conditions. Although different solutions to address this issue are discussed in the research community, such as using ML platforms, the level of reproducibility in ML-driven research is not increasing substantially. Therefore, in this mini survey, we review the literature on reproducibility in ML-driven research with three main aims: (i) reflect on the current situation of ML reproducibility in various research fields, (ii) identify reproducibility issues and barriers that exist in these research fields applying ML, and (iii) identify potential drivers such as tools, practices, and interventions that support ML reproducibility. With this, we hope to contribute to decisions on the viability of different solutions for supporting ML reproducibility. Comment: This research is supported by the Horizon Europe project TIER2 under grant agreement No 10109481

    GuruFinder

    Final projects of the Master in Foundations of Data Science, Faculty of Mathematics, Universitat de Barcelona, Year: 2018, Tutors: José Mena and Jordi Vitrià i Marca. Imagine you are reading a newspaper, a blog, a scientific publication or a forum and you have become interested in a certain topic. After reading that site you may want to know more about it, so you click a button and GuruFinder proposes a list of Twitter users to follow who are experts on that topic. The concept may be simple, but the technology required underneath implies the combination of different disciplines, including natural language processing, recommender systems, text mining, knowledge management systems and big data processing. GuruFinder aims to explore state-of-the-art techniques to build a prototype able to handle and process large volumes of tweets and offer real-time responses to the users.
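    Purely as an illustration of the kind of pipeline the abstract alludes to, the following is a minimal sketch of topic-based expert recommendation using TF-IDF and cosine similarity; the Twitter handles, the tweet corpus, and the scoring scheme are hypothetical and far simpler than the system described in the thesis.

```python
# Minimal sketch of topic-based expert recommendation (illustrative only;
# the actual GuruFinder pipeline combines NLP, recommender systems, and
# big-data processing far beyond this example).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical data: the article the reader is viewing, and the
# concatenated recent tweets of each candidate Twitter user.
article = "Deep learning methods for protein structure prediction ..."
user_tweets = {
    "@bio_ml_lab":   "new paper on protein folding with transformers ...",
    "@data_viz_guy": "ten tips for better dashboards ...",
    "@nlp_prof":     "attention mechanisms, language models, embeddings ...",
}

# Represent the article and every user's tweet history in the same
# TF-IDF space, then rank users by cosine similarity to the article.
handles = list(user_tweets)
vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform([article] + [user_tweets[h] for h in handles])
scores = cosine_similarity(matrix[0], matrix[1:]).ravel()

for handle, score in sorted(zip(handles, scores), key=lambda x: x[1], reverse=True):
    print(f"{handle}\t{score:.3f}")
```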

    ir_metadata: An Extensible Metadata Schema for IR Experiments

    The information retrieval (IR) community has a strong tradition of making computational artifacts and resources available for future reuse, allowing the validation of experimental results. Besides the actual test collections, the underlying run files are often hosted in data archives as part of conferences like TREC, CLEF, or NTCIR. Unfortunately, the run data itself does not provide much information about the underlying experiment. For instance, a single run file is not of much use without the context of the shared task's website or the run data archive. In other domains, like the social sciences, it is good practice to annotate research data with metadata. In this work, we introduce ir_metadata, an extensible metadata schema for TREC run files based on the PRIMAD model. We propose to align the metadata annotations to PRIMAD, which considers components of computational experiments that can affect reproducibility. Furthermore, we outline important components and information that should be reported in the metadata and give evidence from the literature. To demonstrate the usefulness of these metadata annotations, we implement new features in repro_eval that support the outlined metadata schema for the use case of reproducibility studies. Additionally, we curate a dataset with run files derived from experiments with different instantiations of PRIMAD components and annotate these with the corresponding metadata. In the experiments, we cover reproducibility experiments that are identified by the metadata and classified by PRIMAD. With this work, we enable IR researchers to annotate TREC run files and improve the reuse value of experimental artifacts even further. Comment: Resource paper
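    As a rough illustration of what PRIMAD-aligned annotation of a run file might look like, the sketch below prepends commented metadata to a TREC run file; the field values are placeholders and the keys only mirror the six PRIMAD components (Platform, Research goal, Implementation, Method, Actor, Data), not the exact published ir_metadata schema.

```python
# Illustrative sketch: prepend PRIMAD-aligned metadata (as comment lines)
# to a TREC run file. Field names and values are placeholders; they do not
# reproduce the published ir_metadata schema exactly.
metadata = {
    "platform":       {"hardware": "1x GPU node", "os": "Ubuntu 22.04"},
    "research goal":  "reproduce baseline BM25 run",
    "implementation": {"toolkit": "Anserini", "version": "x.y.z"},
    "method":         {"ranking": "BM25", "k1": 0.9, "b": 0.4},
    "actor":          "Jane Doe",
    "data":           {"collection": "robust04"},
}

def annotate_run(run_path: str, out_path: str) -> None:
    """Write the metadata as commented key-value lines above the run data."""
    with open(run_path) as src, open(out_path, "w") as dst:
        for component, value in metadata.items():
            dst.write(f"# {component}: {value}\n")
        dst.write(src.read())

# Example (paths are hypothetical):
# annotate_run("bm25.run", "bm25.annotated.run")
```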

    Cross-Domain Sentence Modeling for Relevance Transfer with BERT

    Standard bag-of-words term-matching techniques in document retrieval fail to exploit rich semantic information embedded in the document texts. One promising recent trend in facilitating context-aware semantic matching has been the development of massively pretrained deep transformer models, culminating in BERT as their most popular example today. In this work, we propose adapting BERT as a neural re-ranker for document retrieval to achieve large improvements on news articles. Two fundamental issues arise in applying BERT to "ad hoc" document retrieval on newswire collections: relevance judgments in existing test collections are provided only at the document level, and documents often exceed the length that BERT was designed to handle. To overcome these challenges, we compute and aggregate sentence-level evidence to rank documents. The lack of appropriate relevance judgments in test collections is addressed by leveraging sentence-level and passage-level relevance judgments fortuitously available in collections from other domains to capture cross-domain notions of relevance. Our experiments demonstrate that models of relevance can be transferred across domains. By leveraging semantic cues learned across various domains, we propose a model that achieves state-of-the-art results on three standard TREC newswire collections. We explore the effects of cross-domain relevance transfer, and trade-offs between using document and sentence scores for document ranking. We also present an end-to-end document retrieval system that integrates the open-source Anserini information retrieval toolkit, discussing the related technical challenges and design decisions.
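    To make the sentence-level evidence idea concrete, here is a hedged sketch of re-ranking by interpolating a first-stage document score with the strongest sentence scores from a cross-encoder. It assumes the sentence-transformers CrossEncoder utility and a public MS MARCO checkpoint rather than the paper's own fine-tuned BERT models, and the weighting of sentence scores is simplified.

```python
# Hedged sketch of sentence-level evidence aggregation for document
# re-ranking, in the spirit of the approach described above but NOT the
# authors' exact implementation.
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank_score(query: str, doc_sentences: list[str],
                 doc_score: float, alpha: float = 0.5, top_k: int = 3) -> float:
    """Interpolate the first-stage (e.g. BM25) document score with the
    top-k sentence scores produced by the cross-encoder."""
    sent_scores = model.predict([(query, s) for s in doc_sentences])
    top = sorted(sent_scores, reverse=True)[:top_k]
    # Uniform weighting of the strongest sentences; the paper explores
    # different weighting schemes and interpolation parameters.
    sentence_evidence = sum(top) / len(top)
    return alpha * doc_score + (1 - alpha) * sentence_evidence

# Example (hypothetical inputs):
# rerank_score("airport security measures", sentences, doc_score=12.3)
```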

    PrimeQA: The Prime Repository for State-of-the-Art Multilingual Question Answering Research and Development

    The field of Question Answering (QA) has made remarkable progress in recent years, thanks to the advent of large pre-trained language models, newer realistic benchmark datasets with leaderboards, and novel algorithms for key components such as retrievers and readers. In this paper, we introduce PRIMEQA: a one-stop and open-source QA repository with an aim to democratize QA research and facilitate easy replication of state-of-the-art (SOTA) QA methods. PRIMEQA supports core QA functionalities like retrieval and reading comprehension as well as auxiliary capabilities such as question generation. It has been designed as an end-to-end toolkit for various use cases: building front-end applications, replicating SOTA methods on public benchmarks, and expanding pre-existing methods. PRIMEQA is available at https://github.com/primeqa
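    The retriever-and-reader split mentioned above can be illustrated with a generic sketch. Note that this uses a toy TF-IDF retriever and the Hugging Face transformers pipeline for brevity, not PRIMEQA's own component classes, which are documented in the repository.

```python
# Generic retriever + reader sketch to illustrate the two core QA stages
# the abstract mentions; this is NOT PrimeQA's API.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import pipeline

passages = [
    "The Amazon rainforest spans nine countries in South America.",
    "Mount Everest is the highest mountain above sea level.",
]
question = "Which mountain is the highest above sea level?"

# Retrieval stage: pick the passage most similar to the question.
vec = TfidfVectorizer().fit(passages + [question])
sims = cosine_similarity(vec.transform([question]), vec.transform(passages)).ravel()
best_passage = passages[sims.argmax()]

# Reading-comprehension stage: extract an answer span from that passage.
reader = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
print(reader(question=question, context=best_passage)["answer"])
```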

    Improving OCR Post Processing with Machine Learning Tools

    Optical Character Recognition (OCR) post processing involves data cleaning steps for documents that were digitized, such as a book or a newspaper article. One step in this process is the identification and correction of spelling and grammar errors generated due to flaws in the OCR system. This work is a report on our efforts to enhance the post processing for large repositories of documents. The main contributions of this work are:
    • Development of tools and methodologies to build both OCR and ground truth text correspondence for training and testing of the proposed techniques in our experiments. In particular, we explain the alignment problem and tackle it with our de novo algorithm, which has shown a high success rate.
    • Exploration of the Google Web 1T corpus to correct errors using context. We show that over half of the errors in the OCR text can be detected and corrected.
    • Applications of machine learning tools to generalize past ad hoc approaches to OCR error correction. As an example, we investigate the use of logistic regression to select the correct replacement for misspellings in the OCR text (a brief sketch follows this abstract).
    • Use of container technology to address the state of reproducible research in OCR and Computer Science as a whole. Many of the past experiments in the field of OCR are not considered reproducible research, raising the question of whether the original results were outliers or finessed.
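    The logistic regression idea from the third bullet can be sketched as follows; the features (edit distance, candidate frequency, context score) and the toy training data are hypothetical and stand in for the statistics the dissertation derives from OCR/ground-truth alignments and the Google Web 1T corpus.

```python
# Illustrative sketch: use logistic regression to decide which candidate
# correction to accept for an OCR misspelling. Features and training data
# are hypothetical, not the dissertation's implementation.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row describes one (misspelling, candidate) pair:
# [edit_distance, log_candidate_frequency, context_score]
X_train = np.array([
    [1, 8.2, 0.9],   # close match, frequent word, fits context -> correct
    [3, 2.1, 0.1],   # distant, rare, poor context              -> incorrect
    [1, 6.5, 0.7],
    [2, 1.0, 0.2],
])
y_train = np.array([1, 0, 1, 0])  # 1 = candidate is the right replacement

clf = LogisticRegression().fit(X_train, y_train)

# Rank candidate replacements for a new OCR error by predicted probability.
candidates = {"hoase": [("house", [1, 7.9, 0.8]), ("hose", [1, 6.0, 0.3])]}
for error, cands in candidates.items():
    scored = [(word, clf.predict_proba([feats])[0, 1]) for word, feats in cands]
    print(error, "->", max(scored, key=lambda x: x[1])[0])
```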