Reproducibility in Machine Learning-Driven Research
Research is facing a reproducibility crisis, in which the results and
findings of many studies are difficult or even impossible to reproduce. This is
also the case in machine learning (ML) and artificial intelligence (AI)
research. Often, this is due to unpublished data and/or source code, and to
the sensitivity of ML experiments to training conditions. Although different
solutions to address this issue are discussed in the research community, such
as using ML platforms, the level of reproducibility in ML-driven research is
not increasing substantially. Therefore, in this mini survey, we review the
literature on reproducibility in ML-driven research with three main aims: (i)
reflect on the current state of ML reproducibility in various research fields,
(ii) identify reproducibility issues and barriers in research fields that
apply ML, and (iii) identify potential drivers such as tools, practices, and
interventions that support ML reproducibility. With this, we hope to inform
decisions on the viability of different solutions for supporting ML
reproducibility.

Comment: This research is supported by the Horizon Europe project TIER2 under
grant agreement No 10109481
GuruFinder
Final projects of the Master in Fundamentals of Data Science, Faculty of Mathematics, Universitat de Barcelona. Year: 2018. Tutors: José Mena and Jordi Vitrià i Marca. Imagine you are reading a newspaper, a blog, a scientific publication, or a forum, and you have become interested in a certain topic. After reading that site you may want to know more about it, so you click a button and GuruFinder proposes a list of Twitter users to follow who are experts on that topic. The concept may be simple, but the technology required underneath combines different disciplines, including natural language processing, recommender systems, text mining, knowledge management systems, and big data processing. GuruFinder aims to explore state-of-the-art techniques to build a prototype able to handle and process large volumes of tweets and offer real-time responses to its users.
ir_metadata: An Extensible Metadata Schema for IR Experiments
The information retrieval (IR) community has a strong tradition of making the
computational artifacts and resources available for future reuse, allowing the
validation of experimental results. Besides the actual test collections, the
underlying run files are often hosted in data archives as part of conferences
like TREC, CLEF, or NTCIR. Unfortunately, the run data itself does not provide
much information about the underlying experiment. For instance, a single run
file is not of much use without the context of the shared task's website or the
run data archive. In other domains, like the social sciences, it is good
practice to annotate research data with metadata. In this work, we introduce
ir_metadata, an extensible metadata schema for TREC run files based on the
PRIMAD model. We propose to align the metadata annotations to PRIMAD, which
considers components of computational experiments that can affect
reproducibility. Furthermore, we outline important components and information
that should be reported in the metadata and give evidence from the literature.
To demonstrate the usefulness of these metadata annotations, we implement new
features in repro_eval that support the outlined metadata schema for the use
case of reproducibility studies. Additionally, we curate a dataset with run
files derived from experiments with different instantiations of PRIMAD
components and annotate these with the corresponding metadata. The
experiments cover reproducibility studies that are identified by the metadata
and classified by PRIMAD. With this work, we enable IR researchers to
annotate TREC run files and improve the reuse value of experimental artifacts
even further.

Comment: Resource paper
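To illustrate what consuming such annotations could look like, below is a minimal Python sketch that separates a commented metadata header from the ranked results of a run file. The '#'-prefixed YAML layout and the field names are assumptions modeled on the PRIMAD components, not the authoritative ir_metadata specification.

# Illustrative sketch, not the authoritative ir_metadata layout: reading a
# PRIMAD-style metadata header from a TREC run file. The '#'-prefixed YAML
# header and the field names below are assumptions modeled on the PRIMAD
# components (Platform, Research goal, Implementation, Method, Actor, Data).
import io
import yaml  # pip install pyyaml

EXAMPLE_RUN = """\
# platform:
#   hardware: 1x NVIDIA V100, 64 GB RAM
# method:
#   retrieval_model: BM25
# data:
#   test_collection: robust04
301 Q0 FT921-7107 1 27.73 my_run
301 Q0 FT934-5418 2 25.15 my_run
"""

def split_run(lines):
    """Separate the commented metadata header from the ranked result lines."""
    header, results = [], []
    for line in lines:
        if line.startswith("#"):
            header.append(line[2:])  # drop the leading '# '
        else:
            results.append(line.rstrip("\n"))
    return yaml.safe_load("".join(header)), results

metadata, run_lines = split_run(io.StringIO(EXAMPLE_RUN))
print(metadata["method"]["retrieval_model"])  # -> BM25
print(len(run_lines), "result lines")

One appeal of comment-style headers is that the ranked results remain plain text, while metadata-aware tools such as repro_eval can exploit the annotations.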
Cross-Domain Sentence Modeling for Relevance Transfer with BERT
Standard bag-of-words term-matching techniques in document retrieval fail to exploit rich semantic information embedded in the document texts. One promising recent trend in facilitating context-aware semantic matching has been the development of massively pretrained deep transformer models, culminating in BERT as their most popular example today. In this work, we propose adapting BERT as a neural re-ranker for document retrieval to achieve large improvements on news articles. Two fundamental issues arise in applying BERT to "ad hoc" document retrieval on newswire collections: relevance judgments in existing test collections are provided only at the document level, and documents often exceed the length that BERT was designed to handle. To overcome these challenges, we compute and aggregate sentence-level evidence to rank documents. The lack of appropriate relevance judgments in test collections is addressed by leveraging sentence-level and passage-level relevance judgments fortuitously available in collections from other domains to capture cross-domain notions of relevance. Our experiments demonstrate that models of relevance can be transferred across domains. By leveraging semantic cues learned across various domains, we propose a model that achieves state-of-the-art results on three standard TREC newswire collections. We explore the effects of cross-domain relevance transfer, and trade-offs between using document and sentence scores for document ranking. We also present an end-to-end document retrieval system that integrates the open-source Anserini information retrieval toolkit, discussing the related technical challenges and design decisions.
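To make the aggregation step concrete, here is a minimal sketch of interpolating a first-stage document score with the top-k sentence scores. The interpolation weight alpha, k, the per-rank weights, and the stand-in scorer are illustrative assumptions, not the paper's exact configuration; in the approach described above, a BERT cross-encoder would score each (query, sentence) pair.

# Minimal sketch: top-k sentence-evidence aggregation for document re-ranking.
# alpha, k, the decaying per-rank weights, and the stand-in scorer are
# illustrative assumptions; in the approach above, a BERT cross-encoder
# would play the role of the scorer.
from typing import Callable, List, Sequence

def rerank_score(
    doc_score: float,                     # first-stage score (e.g. BM25)
    query: str,
    sentences: List[str],                 # sentences of the candidate document
    scorer: Callable[[str, str], float],  # (query, sentence) -> relevance
    k: int = 3,
    alpha: float = 0.5,                   # document vs. sentence evidence
    weights: Sequence[float] = (1.0, 0.5, 0.25),
) -> float:
    """Interpolate the document score with the k highest sentence scores."""
    top = sorted((scorer(query, s) for s in sentences), reverse=True)[:k]
    evidence = sum(w * s for w, s in zip(weights, top))
    return alpha * doc_score + (1 - alpha) * evidence

# Toy usage with word overlap standing in for a BERT cross-encoder:
overlap = lambda q, s: len(set(q.split()) & set(s.split()))
print(rerank_score(12.3, "stock markets",
                   ["the cat sat", "stock markets fell sharply"], overlap))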
PrimeQA: The Prime Repository for State-of-the-Art Multilingual Question Answering Research and Development
The field of Question Answering (QA) has made remarkable progress in recent
years, thanks to the advent of large pre-trained language models, newer
realistic benchmark datasets with leaderboards, and novel algorithms for key
components such as retrievers and readers. In this paper, we introduce PRIMEQA:
a one-stop and open-source QA repository with an aim to democratize QA
research and facilitate easy replication of state-of-the-art (SOTA) QA
methods. PRIMEQA supports core QA functionalities like retrieval and reading
comprehension, as well as auxiliary capabilities such as question generation.
It has been designed as an end-to-end toolkit for various use cases: building
front-end applications, replicating SOTA methods on public benchmarks, and
expanding pre-existing methods. PRIMEQA is available at:
https://github.com/primeqa
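The abstract does not spell out PRIMEQA's own API, so as a generic stand-in (plainly not PRIMEQA's interface), the following sketch shows the kind of extractive reading-comprehension call such a toolkit wraps, using the Hugging Face transformers pipeline:

# Generic stand-in, NOT PrimeQA's own API (which the abstract does not
# describe): a minimal extractive reading-comprehension call via the
# Hugging Face `transformers` pipeline, illustrating the kind of reader
# component such a QA toolkit wraps.
from transformers import pipeline

reader = pipeline("question-answering")  # downloads a default extractive model
result = reader(
    question="What does PRIMEQA aim to democratize?",
    context="PRIMEQA is a one-stop and open-source QA repository with an "
            "aim to democratize QA research and facilitate easy replication "
            "of state-of-the-art QA methods.",
)
print(result["answer"], result["score"])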
Improving OCR Post Processing with Machine Learning Tools
Optical Character Recognition (OCR) post processing involves data cleaning steps for digitized documents, such as books or newspaper articles. One step in this process is the identification and correction of spelling and grammar errors introduced by flaws in the OCR system. This work reports on our efforts to enhance post processing for large repositories of documents.
The main contributions of this work are:
• Development of tools and methodologies to build correspondence between OCR output and ground truth text for training and testing the techniques proposed in our experiments. In particular, we explain the alignment problem and tackle it with our de novo algorithm, which has shown a high success rate.
• Exploration of the Google Web 1T corpus to correct errors using context. We show that over half of the errors in the OCR text can be detected and corrected.
• Application of machine learning tools to generalize past ad hoc approaches to OCR error correction. As an example, we investigate the use of logistic regression to select the correct replacement for misspellings in the OCR text; a sketch of this idea follows the list.
• Use of container technology to address the state of reproducible research in OCR and in Computer Science as a whole. Many past experiments in the field of OCR are not considered reproducible research, raising the question of whether the original results were outliers or finessed.
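As referenced in the list above, here is a minimal sketch of the logistic-regression idea: each (error, candidate) pair gets a feature vector, a classifier is trained on labeled pairs, and the candidate with the highest predicted probability of being correct wins. The two features and the toy counts standing in for Google Web 1T frequencies are illustrative assumptions only.

# Illustrative sketch of selecting the best correction for an OCR misspelling
# with logistic regression. The two features and the toy counts standing in
# for Google Web 1T frequencies are assumptions for demonstration only.
from sklearn.linear_model import LogisticRegression

def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

# Toy unigram counts standing in for Google Web 1T frequencies (assumed values).
FREQ = {"the": 23_000_000_000, "them": 1_500_000_000, "then": 1_000_000_000}

def featurize(error: str, candidate: str):
    # Two features: how far the candidate is from the OCR token, and how
    # common the candidate is in the (stand-in) corpus.
    return [edit_distance(error, candidate), FREQ.get(candidate, 0) / 1e9]

# Toy training pairs: (OCR error, candidate replacement, is_correct).
train = [("tlie", "the", 1), ("tlie", "them", 0), ("tlie", "then", 0),
         ("thc", "the", 1), ("thc", "then", 0), ("thc", "them", 0)]
clf = LogisticRegression().fit(
    [featurize(e, c) for e, c, _ in train],
    [label for _, _, label in train],
)

def best_correction(error: str, candidates):
    """Pick the candidate with the highest predicted probability of correctness."""
    return max(candidates,
               key=lambda c: clf.predict_proba([featurize(error, c)])[0, 1])

print(best_correction("tbe", ["the", "them", "then"]))  # likely 'the'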