33 research outputs found

A Quantitative Study of Java Software Buildability

Author: Beller M.
Informatik Schloss
Smith P.
Spolsky J.
Sulír M.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 04/12/2017

Researchers, students and practitioners often encounter a situation in which the build process of a third-party software system fails. In this paper, we aim to confirm this observation, which has so far been present mainly as anecdotal evidence. Using a virtual environment simulating a programmer's own, we try to fully automatically build target archives from the source code of over 7,200 open source Java projects. We found that more than 38% of builds ended in failure. Build log analysis reveals that the largest portion of errors is dependency-related. We also conduct an association study of factors affecting build success

arXiv.org e-Print Archive

Crossref

Improving OCR Post Processing with Machine Learning Tools

Author: Fonseca Cacho Jorge Ramon
Publication venue: Digital Scholarship@UNLV
Publication date: 01/08/2019

Optical Character Recognition (OCR) Post Processing involves data cleaning steps for documents that were digitized, such as a book or a newspaper article. One step in this process is the identification and correction of spelling and grammar errors generated due to flaws in the OCR system. This work is a report on our efforts to enhance the post processing for large repositories of documents. The main contributions of this work are:
• Development of tools and methodologies to build both OCR and ground truth text correspondence for training and testing of proposed techniques in our experiments. In particular, we will explain the alignment problem and tackle it with our de novo algorithm that has shown a high success rate.
• Exploration of the Google Web 1T corpus to correct errors using context. We show that over half of the errors in the OCR text can be detected and corrected.
• Applications of machine learning tools to generalize the past ad hoc approaches to OCR error corrections. As an example, we investigate the use of logistic regression to select the correct replacement for misspellings in the OCR text.
• Use of container technology to address the state of reproducible research in OCR and Computer Science as a whole. Many of the past experiments in the field of OCR are not considered reproducible research, raising the question of whether the original results were outliers or finessed

University of Nevada, Las Vegas Repository

Designing an open-source cloud-native MLOps pipeline

Author: Mäkinen Sasu
Publication venue: Helsingfors universitet
Publication date: 01/01/2021

Deploying machine learning models is found to be a massive issue in the field. DevOps and Continuous Integration and Continuous Delivery (CI/CD) have proven to streamline and accelerate deployments in the field of software development. Creating CI/CD pipelines in software that includes elements of Machine Learning (MLOps) has unique problems, and trailblazers in the field solve them with the use of proprietary tooling, often offered by cloud providers. In this thesis, we describe the elements of MLOps. We study what the requirements are to automate the CI/CD of Machine Learning systems in the MLOps methodology. We study whether it is feasible to create a state-of-the-art MLOps pipeline with existing open-source and cloud-native tooling in a cloud provider agnostic way. We designed an extendable and cloud-native pipeline covering most of the CI/CD needs of a Machine Learning system. We motivated why Machine Learning systems should be included in the DevOps methodology. We studied what unique challenges machine learning brings to CI/CD pipelines, production environments and monitoring. We analyzed the pipeline’s design, architecture, and implementation details and its applicability and value to Machine Learning projects. We evaluate our solution as a promising MLOps pipeline that manages to solve many issues of automating a reproducible Machine Learning project and its delivery to production. We designed it as a fully open-source solution that is relatively cloud provider agnostic. Configuring the pipeline to fit the client's needs uses easy-to-use declarative configuration languages (YAML, JSON) that require minimal learning overhead

Helsingin yliopiston digitaalinen arkisto

Automated Driver Management for Selenium WebDriver

Author: Alario-Hoyos Carlos
Delgado Kloos Carlos
García Gutiérrez Boni
Muñoz Organero Mario
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/09/2021

Selenium WebDriver is a framework used to control web browsers automatically. It provides a cross-browser Application Programming Interface (API) for different languages (e.g., Java, Python, or JavaScript) that allows automatic navigation, user impersonation, and verification of web applications. Internally, Selenium WebDriver makes use of the native automation support of each browser. Hence, a platform-dependent binary file (the so-called driver) must be placed between the Selenium WebDriver script and the browser to support this native communication. The management (i.e., download, setup, and maintenance) of these drivers is cumbersome for practitioners. This paper provides a complete methodology to automate this management process. Particularly, we present WebDriverManager, the reference tool implementing this methodology. WebDriverManager provides different execution methods: as a Java dependency, as a Command-Line Interface (CLI) tool, as a server, as a Docker container, and as a Java agent. To provide empirical validation of the proposed approach, we surveyed the WebDriverManager users. The aim of this study is twofold. First, we assessed the extent to which WebDriverManager is adopted and used. Second, we evaluated the WebDriverManager API following Clarke’s usability dimensions. A total of 148 participants worldwide completed this survey in 2020. The results show a remarkable assessment of the automation capabilities and API usability of WebDriverManager by Java users, but scarce adoption for other languages. This work has been supported in part by the "Análisis en tiempo Real de sensores sociALes y EStimación de recursos para transporte multimodal basada en aprendizaje profundo" project (MaGIST-RALES), funded by the Spanish Agencia Estatal de Investigación (AEI, doi 10.13039/501100011033) under grant PID2019-105221RB-C44.
This work also received partial support from FEDER/Ministerio de Ciencia, Innovación y Universidades - Agencia Estatal de Investigación through project Smartlet (TIN2017-85179-C3-1-R), and from the eMadrid Network, which is funded by the Madrid Regional Government (Comunidad de Madrid) with grant No. S2018/TCS-4307

Universidad Carlos III de Madrid e-Archivo

Science Forum: Consensus-based guidance for conducting and reporting multi-analyst studies

Author: Aczel Balazs
Akker Olmo R. van den
Albers Casper J.
Alm van Assen Marcel
Bastiaansen Jojanneke A.
Benjamin Daniel
Boehm Udo
Botvinik-Nezer Rotem
Bringmann Laura F.
Busch Niko A.
Caruyer Emmanuel
Cataldo Andrea M.
Cowan Nelson
Delios Andrew
Dongen Noah N. N. van
Donkin Chris
Doorn Johnny B. van
Dreber Anna
Dutilh Gilles
Egan Gary F.
Gernsbacher Morton Ann
Hoekstra Rink
Hoffmann Sabine
Holzmeister Felix
Huber Jürgen
Johannesson Magnus
Jonas Kai J.
Kindel Alexander T.
Kirchler Michael
Kunkels Yoram K.
Lindsay D. Stephen
Mangin Jean-Francois
Matzke Dora
Munafo Marcus R.
Newell Ben R.
Nilsonne Gustav
Nosek Brian A.
Poldrack Russell A.
Ravenzwaaij Don van
Rieskamp Jorg
Salganik Matthew J.
Sarafoglou Alexandra
Schonberg Tom
Schweinsberg Martin
Shanks David
Silberzahn Raphael
Simons Daniel J.
Spellman Barbara A.
St.-Jean Samuel
Starns Jeffrey J.
Szaszi Barnabas
Uhlmann Eric Luis
Wagenmakers Eric-Jan
Wicherts Jelte
Publication venue: Ludwig-Maximilians-Universität München
Publication date: 01/01/2021

Any large dataset can be analyzed in a number of ways, and it is possible that the use of different analysis strategies will lead to different results and conclusions. One way to assess whether the results obtained depend on the analysis strategy chosen is to employ multiple analysts and leave each of them free to follow their own approach. Here, we present consensus-based guidance for conducting and reporting such multi-analyst studies, and we discuss how broader adoption of the multi-analyst approach has the potential to strengthen the robustness of results and conclusions obtained from analyses of datasets in basic and applied research

Open Access LMU

Consensus-based guidance for conducting and reporting multi-analyst studies

Author: Aczel Balazs
Albers Casper J
Bastiaansen Jojanneke A
Benjamin Daniel
Boehm Udo
Botvinik-Nezer Rotem
Bringmann Laura F
Busch Niko A
Caruyer Emmanuel
Cataldo Andrea M
Cowan Nelson
Delios Andrew
Donkin Chris
Dreber Anna
Dutilh Gilles
Egan Gary F
Gernsbacher Morton Ann
Hoekstra Rink
Hoffmann Sabine
Holzmeister Felix
Huber Juergen
Johannesson Magnus
Jonas Kai J
Kindel Alexander T
Kirchler Michael
Kunkels Yoram K
Lindsay D Stephen
Mangin Jean-Francois
Matzke Dora
Munafò Marcus R
Newell Ben R
Nilsonne Gustav
Nosek Brian A
Poldrack Russell A
Rieskamp Jörg
Salganik Matthew J
Sarafoglou Alexandra
Schonberg Tom
Schweinsberg Martin
Shanks David
Silberzahn Raphael
Simons Daniel J
Spellman Barbara A
St-Jean Samuel
Starns Jeffrey J
Szaszi Barnabas
Uhlmann Eric Luis
van Assen Marcel Alm
van den Akker Olmo R
van Dongen Noah N N
van Doorn Johnny B
van Ravenzwaaij Don
Wagenmakers Eric-Jan
Wicherts Jelte
Publication venue: 'eLife Sciences Publications, Ltd'
Publication date: 01/01/2021

Any large dataset can be analyzed in a number of ways, and it is possible that the use of different analysis strategies will lead to different results and conclusions. One way to assess whether the results obtained depend on the analysis strategy chosen is to employ multiple analysts and leave each of them free to follow their own approach. Here, we present consensus-based guidance for conducting and reporting such multi-analyst studies, and we discuss how broader adoption of the multi-analyst approach has the potential to strengthen the robustness of results and conclusions obtained from analyses of datasets in basic and applied research

Improving pipelining tools for pre-processing data

Author: Lage Yeray
Laza Fidalgo Rosalía
Méndez Reboredo José Ramón
Novo Lourés María
Pavón Rial Maria Reyes
Ruano Ordás David Alfonso
Publication venue: Sistemas Informáticos de Nova Xeración
Publication date: 04/12/2023

The last several years have seen the emergence of data mining and its transformation into a powerful tool that adds value to business and research. Data mining makes it possible to explore and find unseen connections between variables and facts observed in different domains, helping us to better understand reality. The programming methods and frameworks used to analyse data have evolved over time. Currently, the use of pipelining schemes is the most reliable way of analysing data and, because of this, several important companies currently offer this kind of service. Moreover, several frameworks compatible with different programming languages are available for the development of computational pipelines, and many research studies have addressed the optimization of data processing speed. However, as this study shows, the presence of early error detection techniques and developer support mechanisms is very limited in these frameworks. In this context, this study introduces different improvements, such as the design of different types of constraints for the early detection of errors, the creation of functions to facilitate debugging of concrete tasks included in a pipeline, the invalidation of erroneous instances and/or the introduction of the burst-processing scheme. Adding these functionalities, we developed Big Data Pipelining for Java (BDP4J, https://github.com/sing-group/bdp4j), a fully functional new pipelining framework that shows the potential of these features. Agencia Estatal de Investigación | Ref. TIN2017-84658-C2-1-R; Xunta de Galicia | Ref. ED481D-2021/024; Xunta de Galicia | Ref. ED431C2018/55-GR

Investigo