A Quantitative Study of Java Software Buildability
Researchers, students and practitioners often encounter situations in which the
build process of a third-party software system fails. In this paper, we aim to
confirm this observation, which so far has been supported mainly by anecdotal
evidence. Using a virtual environment that simulates a programmer's setup, we
attempt to build, fully automatically, target archives from the source code of
over 7,200 open source Java projects. We found that more than 38% of builds
ended in failure. Build log analysis reveals that the largest portion of errors
is dependency-related. We also conduct an association study of factors affecting
build success.
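The build log analysis mentioned above can be pictured with a minimal sketch. It assumes a simple first-match classification over log text; the category names, the regular expressions, and the sample log lines are illustrative assumptions and do not reproduce the taxonomy or tooling used in the study.

```python
import re
from collections import Counter

# Hypothetical categories and patterns, chosen for illustration only.
PATTERNS = [
    ("dependency", re.compile(r"could not resolve dependencies|could not find artifact", re.I)),
    ("compilation", re.compile(r"cannot find symbol|compilation error", re.I)),
]


def classify_log(log_text):
    """Return the first category whose pattern matches the log, else 'other'."""
    for name, pattern in PATTERNS:
        if pattern.search(log_text):
            return name
    return "other"


def tally(logs):
    """Count how many failed builds fall into each category."""
    return Counter(classify_log(log) for log in logs)


# Made-up log excerpts, one per failed build.
sample_logs = [
    "[ERROR] Could not resolve dependencies for project a:b:jar:1.0",
    "[ERROR] COMPILATION ERROR : Foo.java:[3,8] cannot find symbol",
    "[ERROR] Could not find artifact org.example:lib:jar:2.1 in central",
    "BUILD FAILED: out of memory",
]
counts = tally(sample_logs)
```

On real logs, a single build can show several error types, so a first-match rule like this one is a simplification.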
Improving OCR Post Processing with Machine Learning Tools
Optical Character Recognition (OCR) Post Processing involves data cleaning steps for documents that were digitized, such as a book or a newspaper article. One step in this process is the identification and correction of spelling and grammar errors generated due to the flaws in the OCR system. This work is a report on our efforts to enhance the post processing for large repositories of documents.
The main contributions of this work are:
• Development of tools and methodologies to build both OCR and ground truth text correspondence for training and testing of proposed techniques in our experiments. In particular, we will explain the alignment problem and tackle it with our de novo algorithm that has shown a high success rate.
• Exploration of the Google Web 1T corpus to correct errors using context. We show that over half of the errors in the OCR text can be detected and corrected.
• Applications of machine learning tools to generalize the past ad hoc approaches to OCR error corrections. As an example, we investigate the use of logistic regression to select the correct replacement for misspellings in the OCR text.
• Use of container technology to address the state of reproducible research in OCR and Computer Science as a whole. Many past experiments in the field of OCR are not considered reproducible research, which raises the question of whether the original results were outliers or finessed.
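The idea of correcting errors using context, listed among the contributions above, can be sketched as follows. This is only an illustration under stated assumptions: the tiny vocabulary and trigram counts are made up and stand in for a large n-gram corpus, and the function names (`edit_distance`, `correct`) are hypothetical rather than taken from the work itself.

```python
# Made-up trigram counts standing in for a large n-gram corpus.
TRIGRAM_COUNTS = {
    ("the", "quick", "brown"): 900,
    ("the", "quack", "brown"): 2,
    ("the", "quirk", "brown"): 1,
}

# Made-up vocabulary of known words.
VOCAB = {"the", "quick", "quack", "quirk", "brown"}


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def correct(left, word, right, max_dist=2):
    """Replace an unknown word by the nearby candidate best supported by context."""
    if word in VOCAB:
        return word
    candidates = [w for w in VOCAB if edit_distance(word, w) <= max_dist]
    best = max(candidates,
               key=lambda w: TRIGRAM_COUNTS.get((left, w, right), 0),
               default=word)
    # Keep the original token when no candidate has any contextual support.
    if TRIGRAM_COUNTS.get((left, best, right), 0) == 0:
        return word
    return best


fixed = correct("the", "qu1ck", "brown")
```

Here the OCR token "qu1ck" has three candidates within the distance limit, and the surrounding words decide between them. A learned model such as logistic regression could replace the raw count comparison by combining several such features.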
Designing an open-source cloud-native MLOps pipeline
Deploying machine learning models is found to be a major issue in the field. DevOps and
Continuous Integration and Continuous Delivery (CI/CD) have proven to streamline and accelerate
deployments in the field of software development. Creating CI/CD pipelines for software
that includes elements of Machine Learning (MLOps) poses unique problems, and trailblazers in
the field solve them with proprietary tooling, often offered by cloud providers.
In this thesis, we describe the elements of MLOps. We study what the requirements are for automating
the CI/CD of Machine Learning systems in the MLOps methodology. We study whether it is feasible
to create a state-of-the-art MLOps pipeline with existing open-source and cloud-native tooling
in a cloud-provider-agnostic way.
We designed an extensible, cloud-native pipeline covering most of the CI/CD needs of
Machine Learning systems. We motivated why Machine Learning systems should be included
in the DevOps methodology. We studied what unique challenges machine learning brings to
CI/CD pipelines, production environments and monitoring. We analyzed the pipeline's design,
architecture, and implementation details, as well as its applicability and value to Machine Learning
projects.
We evaluate our solution as a promising MLOps pipeline that manages to solve many issues
of automating a reproducible Machine Learning project and its delivery to production. We
designed it as a fully open-source solution that is relatively cloud-provider agnostic. The pipeline
is configured to fit client needs through easy-to-use declarative configuration languages (YAML,
JSON) that require minimal learning overhead.
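To make the notion of declarative pipeline configuration concrete, here is a minimal sketch. It assumes a JSON document that lists stages and their prerequisites; the schema (`stages`, `name`, `needs`) and the stage names are hypothetical and are not the configuration format of the pipeline described in the thesis.

```python
import json

# Hypothetical declarative configuration: stages and what each one depends on.
CONFIG = """
{
  "pipeline": "demo",
  "stages": [
    {"name": "train", "needs": ["validate-data"]},
    {"name": "validate-data", "needs": []},
    {"name": "deploy", "needs": ["train"]}
  ]
}
"""


def execution_order(config_text):
    """Derive a valid run order from the declared dependencies."""
    stages = {s["name"]: s["needs"] for s in json.loads(config_text)["stages"]}
    order, done = [], set()
    while len(order) < len(stages):
        # A stage is ready once all of its prerequisites have been scheduled.
        ready = [name for name, needs in stages.items()
                 if name not in done and all(dep in done for dep in needs)]
        if not ready:
            raise ValueError("cyclic or unsatisfiable dependency in configuration")
        for name in ready:
            order.append(name)
            done.add(name)
    return order


order = execution_order(CONFIG)
```

The user only declares what each stage needs, and the run order follows from that declaration rather than from the order in which stages are written.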
Automated Driver Management for Selenium WebDriver
Selenium WebDriver is a framework used to control web browsers automatically. It provides a cross-browser Application Programming Interface (API) for different languages (e.g., Java, Python, or JavaScript) that allows automatic navigation, user impersonation, and verification of web applications. Internally, Selenium WebDriver makes use of the native automation support of each browser. Hence, a platform-dependent binary file (the so-called driver) must be placed between the Selenium WebDriver script and the browser to support this native communication. The management (i.e., download, setup, and maintenance) of these drivers is cumbersome for practitioners. This paper provides a complete methodology to automate this management process. In particular, we present WebDriverManager, the reference tool implementing this methodology. WebDriverManager provides different execution methods: as a Java dependency, as a Command-Line Interface (CLI) tool, as a server, as a Docker container, and as a Java agent. To provide empirical validation of the proposed approach, we surveyed WebDriverManager users. The aim of this study is twofold. First, we assessed the extent to which WebDriverManager is adopted and used. Second, we evaluated the WebDriverManager API following Clarke's usability dimensions. A total of 148 participants worldwide completed this survey in 2020. The results show that Java users rate the automation capabilities and API usability of WebDriverManager highly, but that adoption for other languages is scarce.
This work has been supported in part by the "Análisis en tiempo Real de sensores sociALes y EStimación de recursos para transporte multimodal basada en aprendizaje profundo" project (MaGIST-RALES), funded by the Spanish Agencia Estatal de Investigación (AEI, doi 10.13039/501100011033) under grant PID2019-105221RB-C44. This work also received partial support from FEDER/Ministerio de Ciencia, Innovación y Universidades - Agencia Estatal de Investigación through project Smartlet (TIN2017-85179-C3-1-R), and from the eMadrid Network, which is funded by the Madrid Regional Government (Comunidad de Madrid) with grant No. S2018/TCS-4307.
Science Forum: Consensus-based guidance for conducting and reporting multi-analyst studies
Any large dataset can be analyzed in a number of ways, and it is possible that the use of different analysis strategies will lead to different results and conclusions. One way to assess whether the results obtained depend on the analysis strategy chosen is to employ multiple analysts and leave each of them free to follow their own approach. Here, we present consensus-based guidance for conducting and reporting such multi-analyst studies, and we discuss how broader adoption of the multi-analyst approach has the potential to strengthen the robustness of results and conclusions obtained from analyses of datasets in basic and applied research
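The dependence of results on the analysis strategy can be illustrated with a toy sketch. The dataset and the three "analysts" are invented for this example: each applies a different but defensible summary to the same measurements.

```python
import statistics

# Made-up measurements containing one extreme value.
data = [2.1, 2.4, 2.2, 2.5, 2.3, 9.8]


def analyst_mean(values):
    """Analyst A: plain arithmetic mean of all observations."""
    return statistics.mean(values)


def analyst_median(values):
    """Analyst B: median, which is robust to the extreme value."""
    return statistics.median(values)


def analyst_trimmed(values):
    """Analyst C: mean after dropping the smallest and largest observation."""
    ordered = sorted(values)
    return statistics.mean(ordered[1:-1])


strategies = {"mean": analyst_mean, "median": analyst_median, "trimmed": analyst_trimmed}
estimates = {name: fn(data) for name, fn in strategies.items()}

# Range of the estimates across analysts: a crude indicator of strategy dependence.
spread = max(estimates.values()) - min(estimates.values())
```

Reporting all estimates together with their spread, rather than a single one, is the kind of transparency a multi-analyst design is meant to provide.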
Improving pipelining tools for pre-processing data
The last several years have seen the emergence of data mining and its transformation into a powerful tool that adds value to business and research. Data mining makes it possible to explore and find unseen connections between variables and facts observed in different domains, helping us to better understand reality. The programming methods and frameworks used to analyse data have evolved over time. Currently, the use of pipelining schemes is the most reliable way of analysing data and, as a result, several important companies now offer this kind of service. Moreover, several frameworks compatible with different programming languages are available for the development of computational pipelines, and many research studies have addressed the optimization of data processing speed. However, as this study shows, the presence of early error detection techniques and developer support mechanisms is very limited in these frameworks. In this context, this study introduces several improvements, such as the design of different types of constraints for the early detection of errors, the creation of functions to facilitate debugging of concrete tasks included in a pipeline, the invalidation of erroneous instances and/or the introduction of the burst-processing scheme. Adding these functionalities, we developed Big Data Pipelining for Java (BDP4J, https://github.com/sing-group/bdp4j), a fully functional new pipelining framework that shows the potential of these features.
Funding: Agencia Estatal de Investigación | Ref. TIN2017-84658-C2-1-R; Xunta de Galicia | Ref. ED481D-2021/024; Xunta de Galicia | Ref. ED431C2018/55-GR
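Two of the improvements named above, constraints for early error detection and invalidation of erroneous instances, can be sketched roughly as follows. This is not the BDP4J API: the `Task` and `Pipeline` classes and all names below are hypothetical, and the sketch assumes a simple type-compatibility constraint between consecutive tasks.

```python
class Task:
    """One pipeline step with declared input and output types (hypothetical)."""

    def __init__(self, name, input_type, output_type, fn):
        self.name = name
        self.input_type = input_type
        self.output_type = output_type
        self.fn = fn


class Pipeline:
    def __init__(self, tasks):
        self.tasks = tasks

    def check(self):
        """Early error detection: verify type compatibility before any data is processed."""
        for prev, nxt in zip(self.tasks, self.tasks[1:]):
            if prev.output_type is not nxt.input_type:
                raise TypeError(f"{prev.name} outputs {prev.output_type.__name__}, "
                                f"but {nxt.name} expects {nxt.input_type.__name__}")

    def run(self, instances):
        """Process every instance; a failing one is invalidated instead of aborting the run."""
        self.check()
        valid, invalid = [], []
        for inst in instances:
            value = inst
            try:
                for task in self.tasks:
                    value = task.fn(value)
            except Exception:
                # Record the instance and the task at which it failed.
                invalid.append((inst, task.name))
                continue
            valid.append(value)
        return valid, invalid


strip = Task("strip", str, str, str.strip)
to_int = Task("to_int", str, int, int)
double = Task("double", int, int, lambda n: n * 2)

valid, invalid = Pipeline([strip, to_int, double]).run([" 4 ", "x", "10"])
```

With this design, a wrongly ordered pipeline such as `Pipeline([to_int, strip])` is rejected by `check()` before any instance is touched, while a malformed instance such as `"x"` is set aside without stopping the remaining ones.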