
    A Quantitative Study of Java Software Buildability

    Researchers, students and practitioners often encounter situations in which the build process of a third-party software system fails. In this paper, we aim to confirm this observation, which has so far been supported mainly by anecdotal evidence. Using a virtual environment that simulates a typical programmer's setup, we try to fully automatically build target archives from the source code of over 7,200 open source Java projects. We found that more than 38% of builds ended in failure. Build log analysis reveals that the largest portion of errors is dependency-related. We also conduct an association study of factors affecting build success.
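    The following minimal sketch illustrates the kind of automated build harness the study describes: check out each project, run its build non-interactively, capture the log, and classify failures. It is a hypothetical reconstruction, not the authors' harness; it assumes Maven projects, and the workspace paths and the log pattern used to flag dependency errors are illustrative.

        import java.io.IOException;
        import java.nio.file.Files;
        import java.nio.file.Path;
        import java.util.List;

        public class BuildHarness {
            public static void main(String[] args) throws IOException, InterruptedException {
                // Illustrative checkout directories; the study covered over 7,200 projects.
                List<Path> projects = List.of(Path.of("workspace/project-a"),
                                              Path.of("workspace/project-b"));
                int failures = 0;
                int dependencyErrors = 0;
                for (Path project : projects) {
                    Path log = project.resolve("build.log");
                    // Run the build in batch mode and capture its full output to a log file.
                    Process build = new ProcessBuilder("mvn", "-B", "package")
                            .directory(project.toFile())
                            .redirectErrorStream(true)
                            .redirectOutput(log.toFile())
                            .start();
                    if (build.waitFor() != 0) {
                        failures++;
                        // Crude log-based classification: Maven reports unresolved
                        // artifacts with messages such as "Could not resolve dependencies".
                        if (Files.readString(log).contains("Could not resolve dependencies")) {
                            dependencyErrors++;
                        }
                    }
                }
                System.out.printf("%d failed builds, %d dependency-related%n",
                                  failures, dependencyErrors);
            }
        }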

    Improving OCR Post Processing with Machine Learning Tools

    Optical Character Recognition (OCR) post processing involves data cleaning steps for digitized documents, such as books or newspaper articles. One step in this process is the identification and correction of spelling and grammar errors generated by flaws in the OCR system. This work reports on our efforts to enhance post processing for large repositories of documents. The main contributions of this work are:
    • Development of tools and methodologies to build both OCR and ground truth text correspondence for training and testing the techniques proposed in our experiments. In particular, we explain the alignment problem and tackle it with our de novo algorithm, which has shown a high success rate.
    • Exploration of the Google Web 1T corpus to correct errors using context. We show that over half of the errors in the OCR text can be detected and corrected (see the sketch after this list).
    • Application of machine learning tools to generalize past ad hoc approaches to OCR error correction. As an example, we investigate the use of logistic regression to select the correct replacement for misspellings in the OCR text.
    • Use of container technology to address the state of reproducible research in OCR and computer science as a whole. Many past experiments in the field of OCR are not reproducible, raising the question of whether the original results were outliers or finessed.
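    The sketch below illustrates the context-based correction idea: rank candidate replacements for a suspect token by how often the surrounding trigram occurs in a Web 1T-style n-gram corpus. It is an illustration under stated assumptions, not the paper's implementation; the in-memory count map stands in for queries against an indexed copy of the corpus, and the tokens and counts are toy values.

        import java.util.List;
        import java.util.Map;

        public class ContextCorrector {
            // Stand-in for a Google Web 1T lookup; a real system would query an
            // indexed copy of the corpus rather than an in-memory map.
            private final Map<String, Long> trigramCounts;

            public ContextCorrector(Map<String, Long> trigramCounts) {
                this.trigramCounts = trigramCounts;
            }

            // Keep the suspect token unless some candidate's trigram is seen
            // more often in context.
            public String correct(String left, String suspect, String right,
                                  List<String> candidates) {
                String best = suspect;
                long bestCount = trigramCounts.getOrDefault(
                        left + " " + suspect + " " + right, 0L);
                for (String candidate : candidates) {
                    long count = trigramCounts.getOrDefault(
                            left + " " + candidate + " " + right, 0L);
                    if (count > bestCount) {
                        bestCount = count;
                        best = candidate;
                    }
                }
                return best;
            }

            public static void main(String[] args) {
                ContextCorrector corrector = new ContextCorrector(Map.of(
                        "the quick brown", 12345L,  // toy counts for illustration
                        "the qu1ck brown", 0L));
                // "qu1ck" is a typical OCR confusion of 'i' with '1'; prints "quick".
                System.out.println(corrector.correct("the", "qu1ck", "brown",
                        List.of("quick", "quack")));
            }
        }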

    Designing an open-source cloud-native MLOps pipeline

    Deploying machine learning models has proven to be a major challenge in the field. DevOps and Continuous Integration and Continuous Delivery (CI/CD) have proven to streamline and accelerate deployments in software development. Creating CI/CD pipelines for software that includes elements of Machine Learning (MLOps) poses unique problems, which trail-blazers in the field solve with proprietary tooling, often offered by cloud providers. In this thesis, we describe the elements of MLOps. We study the requirements for automating the CI/CD of Machine Learning systems under the MLOps methodology, and whether it is feasible to create a state-of-the-art MLOps pipeline with existing open-source and cloud-native tooling in a cloud-provider-agnostic way. We designed an extendable and cloud-native pipeline covering most of the CI/CD needs of a Machine Learning system. We motivated why Machine Learning systems should be included in the DevOps methodology, and studied the unique challenges machine learning brings to CI/CD pipelines, production environments and monitoring. We analyzed the pipeline’s design, architecture, and implementation details, as well as its applicability and value to Machine Learning projects. We evaluate our solution as a promising MLOps pipeline that manages to solve many issues of automating a reproducible Machine Learning project and its delivery to production. We designed it as a fully open-source solution that is relatively cloud-provider agnostic. Configuring the pipeline to fit client needs relies on easy-to-use declarative configuration languages (YAML, JSON) that require minimal learning overhead.

    Automated Driver Management for Selenium WebDriver

    Selenium WebDriver is a framework used to control web browsers automatically. It provides a cross-browser Application Programming Interface (API) for different languages (e.g., Java, Python, or JavaScript) that allows automatic navigation, user impersonation, and verification of web applications. Internally, Selenium WebDriver makes use of the native automation support of each browser. Hence, a platform-dependent binary file (the so-called driver) must be placed between the Selenium WebDriver script and the browser to support this native communication. The management (i.e., download, setup, and maintenance) of these drivers is cumbersome for practitioners. This paper provides a complete methodology to automate this management process. In particular, we present WebDriverManager, the reference tool implementing this methodology. WebDriverManager provides different execution methods: as a Java dependency, as a Command-Line Interface (CLI) tool, as a server, as a Docker container, and as a Java agent. To provide empirical validation of the proposed approach, we surveyed WebDriverManager users. The aim of this study is twofold. First, we assessed the extent to which WebDriverManager is adopted and used. Second, we evaluated the WebDriverManager API following Clarke’s usability dimensions. A total of 148 participants worldwide completed this survey in 2020. The results show a remarkable assessment of the automation capabilities and API usability of WebDriverManager by Java users, but scarce adoption for other languages. This work has been supported in part by the "Análisis en tiempo Real de sensores sociALes y EStimación de recursos para transporte multimodal basada en aprendizaje profundo" project (MaGIST-RALES), funded by the Spanish Agencia Estatal de Investigación (AEI, doi 10.13039/501100011033) under grant PID2019-105221RB-C44. This work also received partial support from FEDER/Ministerio de Ciencia, Innovación y Universidades - Agencia Estatal de Investigación through project Smartlet (TIN2017-85179-C3-1-R), and from the eMadrid Network, which is funded by the Madrid Regional Government (Comunidad de Madrid) with grant No. S2018/TCS-4307.
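    When consumed as a Java dependency, the typical flow is to let WebDriverManager resolve the driver binary before instantiating the browser. A minimal sketch of that flow follows; the target URL is illustrative, and details may vary across WebDriverManager and Selenium versions.

        import io.github.bonigarcia.wdm.WebDriverManager;
        import org.openqa.selenium.WebDriver;
        import org.openqa.selenium.chrome.ChromeDriver;

        public class WdmExample {
            public static void main(String[] args) {
                // Resolve, download, and cache a chromedriver binary matching the
                // locally installed Chrome, so Selenium can find it at startup.
                WebDriverManager.chromedriver().setup();

                WebDriver driver = new ChromeDriver();
                try {
                    driver.get("https://example.org"); // illustrative target page
                    System.out.println(driver.getTitle());
                } finally {
                    driver.quit();
                }
            }
        }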

    Science Forum: Consensus-based guidance for conducting and reporting multi-analyst studies

    Any large dataset can be analyzed in a number of ways, and it is possible that the use of different analysis strategies will lead to different results and conclusions. One way to assess whether the results obtained depend on the analysis strategy chosen is to employ multiple analysts and leave each of them free to follow their own approach. Here, we present consensus-based guidance for conducting and reporting such multi-analyst studies, and we discuss how broader adoption of the multi-analyst approach has the potential to strengthen the robustness of results and conclusions obtained from analyses of datasets in basic and applied research.

    Improving pipelining tools for pre-processing data

    The last several years have seen the emergence of data mining and its transformation into a powerful tool that adds value to business and research. Data mining makes it possible to explore and find unseen connections between variables and facts observed in different domains, helping us to better understand reality. The programming methods and frameworks used to analyse data have evolved over time. Currently, the use of pipelining schemes is the most reliable way of analysing data, and several important companies offer this kind of service. Moreover, several frameworks compatible with different programming languages are available for the development of computational pipelines, and many research studies have addressed the optimization of data processing speed. However, as this study shows, the presence of early error detection techniques and developer support mechanisms in these frameworks is very limited. In this context, this study introduces several improvements: the design of different types of constraints for the early detection of errors, functions to facilitate the debugging of concrete tasks included in a pipeline, the invalidation of erroneous instances, and the introduction of a burst-processing scheme. Adding these functionalities, we developed Big Data Pipelining for Java (BDP4J, https://github.com/sing-group/bdp4j), a fully functional new pipelining framework that shows the potential of these features. Agencia Estatal de Investigación | Ref. TIN2017-84658-C2-1-R; Xunta de Galicia | Ref. ED481D-2021/024; Xunta de Galicia | Ref. ED431C2018/55-GR
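    A minimal sketch of the early error detection idea described above: each task declares its input and output types, and the pipeline validates type compatibility when it is assembled, before any data flows, turning a latent mid-run crash into an immediate, localized error. The interfaces here are hypothetical stand-ins for illustration, not BDP4J's actual API.

        import java.util.List;

        // Each task declares what it consumes and what it produces.
        interface Task<I, O> {
            Class<I> inputType();
            Class<O> outputType();
            O run(I input);
        }

        class Pipeline {
            private final List<Task<?, ?>> tasks;

            Pipeline(List<Task<?, ?>> tasks) {
                // Early constraint check: fail at assembly time if consecutive
                // tasks disagree on types, instead of failing mid-run.
                for (int i = 1; i < tasks.size(); i++) {
                    Class<?> produced = tasks.get(i - 1).outputType();
                    Class<?> consumed = tasks.get(i).inputType();
                    if (!consumed.isAssignableFrom(produced)) {
                        throw new IllegalStateException(
                                "Task " + i + " expects " + consumed.getName()
                                + " but receives " + produced.getName());
                    }
                }
                this.tasks = tasks;
            }
        }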
