
    Improving OCR Post Processing with Machine Learning Tools

    Optical Character Recognition (OCR) post-processing involves data-cleaning steps for digitized documents, such as books or newspaper articles. One step in this process is the identification and correction of spelling and grammar errors introduced by flaws in the OCR system. This work reports on our efforts to enhance post-processing for large repositories of documents. The main contributions of this work are:
    • Development of tools and methodologies to build OCR and ground-truth text correspondences for training and testing the techniques proposed in our experiments. In particular, we explain the alignment problem and tackle it with our de novo algorithm, which has shown a high success rate.
    • Exploration of the Google Web 1T corpus to correct errors using context. We show that over half of the errors in the OCR text can be detected and corrected.
    • Application of machine learning tools to generalize past ad hoc approaches to OCR error correction. As an example, we investigate the use of logistic regression to select the correct replacement for misspellings in the OCR text (see the sketch after this list).
    • Use of container technology to address the state of reproducible research in OCR and in Computer Science as a whole. Many past experiments in the field of OCR are not considered reproducible research, raising the question of whether the original results were outliers or finessed.
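    As a rough illustration of the logistic regression idea above, the following sketch ranks candidate replacements for a single OCR misspelling. It assumes scikit-learn and NumPy; the features (edit distance and corpus frequencies) and the tiny training set are hypothetical stand-ins, not the actual data or features used in this work.

```python
# Illustrative sketch (not this work's actual pipeline): scoring candidate
# corrections for an OCR misspelling with logistic regression.
from sklearn.linear_model import LogisticRegression
import numpy as np

# Each row describes one (misspelling, candidate) pair:
# [edit_distance, candidate_unigram_frequency, context_bigram_frequency]
X_train = np.array([
    [1, 0.8, 0.6],   # good candidate: close edit, frequent in context
    [3, 0.1, 0.0],   # poor candidate: distant edit, rare
    [2, 0.5, 0.4],
    [4, 0.05, 0.0],
])
y_train = np.array([1, 0, 1, 0])  # 1 = candidate is the correct replacement

model = LogisticRegression()
model.fit(X_train, y_train)

# Score candidates for a new misspelling and pick the most probable one.
candidates = ["their", "there", "then"]
X_new = np.array([
    [1, 0.9, 0.7],
    [1, 0.7, 0.2],
    [2, 0.6, 0.1],
])
probs = model.predict_proba(X_new)[:, 1]
best = candidates[int(np.argmax(probs))]
print(f"Best replacement: {best} (p={probs.max():.2f})")
```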

    Designing an open-source cloud-native MLOps pipeline

    Deploying machine learning models has proven to be a major challenge in the field. DevOps and Continuous Integration and Continuous Delivery (CI/CD) have been shown to streamline and accelerate deployments in software development. Creating CI/CD pipelines for software that includes elements of Machine Learning (MLOps) poses unique problems, and trail-blazers in the field solve them with proprietary tooling, often offered by cloud providers. In this thesis, we describe the elements of MLOps. We study what is required to automate the CI/CD of Machine Learning systems under the MLOps methodology, and whether it is feasible to create a state-of-the-art MLOps pipeline with existing open-source and cloud-native tooling in a cloud-provider-agnostic way. We designed an extendable and cloud-native pipeline covering most of the CI/CD needs of a Machine Learning system. We motivate why Machine Learning systems should be included in the DevOps methodology, and study the unique challenges machine learning brings to CI/CD pipelines, production environments, and monitoring. We analyze the pipeline's design, architecture, and implementation details, and its applicability and value to Machine Learning projects. We evaluate our solution as a promising MLOps pipeline that manages to solve many issues in automating a reproducible Machine Learning project and its delivery to production. We designed it as a fully open-source solution that is relatively cloud-provider agnostic. Configuring the pipeline to fit client needs relies on easy-to-use declarative configuration languages (YAML, JSON) that require minimal learning overhead, as illustrated in the sketch below.
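    To make the declarative-configuration point concrete, here is a minimal sketch of loading and validating a pipeline description written in YAML. The stage schema and names are hypothetical and not the thesis's actual configuration format; it only assumes the PyYAML library.

```python
# Minimal sketch: parse a declarative (YAML) pipeline configuration and check
# that every stage is well-formed before handing it to an orchestrator.
import yaml  # PyYAML

CONFIG = """
pipeline:
  name: demo-ml-pipeline
  stages:
    - name: preprocess
      image: python:3.11
    - name: train
      image: python:3.11
    - name: deploy
      image: python:3.11
"""

REQUIRED_STAGE_KEYS = {"name", "image"}

def load_pipeline(text: str) -> dict:
    """Parse the YAML document and validate its stages."""
    doc = yaml.safe_load(text)
    for stage in doc["pipeline"]["stages"]:
        missing = REQUIRED_STAGE_KEYS - stage.keys()
        if missing:
            raise ValueError(f"stage {stage} is missing keys: {missing}")
    return doc["pipeline"]

pipeline = load_pipeline(CONFIG)
print([stage["name"] for stage in pipeline["stages"]])
```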

    Consensus-based guidance for conducting and reporting multi-analyst studies

    Any large dataset can be analyzed in a number of ways, and it is possible that the use of different analysis strategies will lead to different results and conclusions. One way to assess whether the results obtained depend on the analysis strategy chosen is to employ multiple analysts and leave each of them free to follow their own approach. Here, we present consensus-based guidance for conducting and reporting such multi-analyst studies, and we discuss how broader adoption of the multi-analyst approach has the potential to strengthen the robustness of results and conclusions obtained from analyses of datasets in basic and applied research.

    Towards using intelligent techniques to assist software specialists in their tasks

    Automation and intelligence constitute a major preoccupation in the field of software engineering. With the great evolution of Artificial Intelligence, researchers and industry have turned to Machine Learning and Deep Learning models to optimize tasks, automate pipelines, and build intelligent systems. The capabilities of Artificial Intelligence make it possible to imitate and, in some cases, even outperform human intelligence, as well as to automate manual tasks while raising accuracy, quality, and efficiency. In fact, accomplishing software-related tasks requires specific knowledge and skills. Thanks to the powerful capabilities of Artificial Intelligence, we can infer that expertise from historical experience using machine learning techniques. This would alleviate the burden on software specialists and allow them to focus on more valuable tasks. In particular, Model-Driven Engineering is an evolving field that aims to raise the abstraction level of languages and to focus more on domain specificities. It allows shifting the effort put into implementation and low-level programming to a higher point of view focused on design, architecture, and decision making, thereby increasing the efficiency and productivity of creating applications. For its part, the design of metamodels is a substantial task in Model-Driven Engineering. Accordingly, it is important to maintain high-quality metamodels because they constitute a primary and fundamental artifact. However, bad design choices, as well as repetitive design modifications due to the evolution of requirements, can deteriorate the quality of a metamodel. The accumulation of bad design choices and quality degradation can imply negative outcomes in the long term. Thus, refactoring metamodels is an important task: it aims to improve and maintain good quality characteristics of metamodels such as maintainability, reusability, and extendibility. Moreover, the refactoring of metamodels is complex, especially when dealing with large designs. Therefore, automating this task and assisting architects with it is advantageous, since they can then focus on more valuable tasks that require human intuition. In this thesis, we propose a cartography of the tasks that could be automated or improved using Artificial Intelligence techniques. We then select the metamodeling task and tackle the problem of metamodel refactoring. We suggest two different approaches: a first approach that uses a genetic algorithm to optimize a set of quality attributes and recommend candidate metamodel refactoring solutions (a sketch of this idea follows below), and a second approach, based on mathematical logic, that defines the specification of an input metamodel, encodes the quality attributes and the absence of design smells as a set of constraints, and finally satisfies these constraints using Alloy.
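    As a rough sketch of the first (search-based) approach, the snippet below runs a tiny genetic algorithm over sequences of refactoring operations. The operation catalogue, the fitness weights, and the GA parameters are hypothetical placeholders; the thesis optimizes real metamodel quality attributes rather than this toy score.

```python
# Toy genetic algorithm searching for a sequence of refactoring operations
# that maximizes a placeholder quality score (not the thesis's real fitness).
import random

OPERATIONS = ["extract_superclass", "pull_up_attribute", "merge_classes",
              "flatten_hierarchy", "remove_unused_class"]

def random_solution(length=4):
    return [random.choice(OPERATIONS) for _ in range(length)]

def fitness(solution):
    # Placeholder score: some operations help more, duplicates are penalized.
    weights = {"extract_superclass": 3, "pull_up_attribute": 2,
               "merge_classes": 1, "flatten_hierarchy": 1,
               "remove_unused_class": 2}
    duplicates = len(solution) - len(set(solution))
    return sum(weights[op] for op in solution) - 2 * duplicates

def crossover(a, b):
    cut = random.randint(1, len(a) - 1)
    return a[:cut] + b[cut:]

def mutate(solution, rate=0.2):
    return [random.choice(OPERATIONS) if random.random() < rate else op
            for op in solution]

def evolve(generations=50, population_size=20):
    population = [random_solution() for _ in range(population_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: population_size // 2]
        children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                    for _ in range(population_size - len(parents))]
        population = parents + children
    return max(population, key=fitness)

print(evolve())  # best refactoring sequence found for the toy score
```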

    A microservice architecture for the processing of large geospatial data in the Cloud

    With the growing number of devices that can collect spatiotemporal information, as well as the improving quality of sensors, the volume of geospatial data increases constantly. Before the raw collected data can be used, it has to be processed. Currently, expert users still rely on desktop-based Geographic Information Systems to perform processing workflows. However, the volume of geospatial data and the complexity of processing algorithms exceed the capacities of their workstations. There is a paradigm shift from desktop solutions towards the Cloud, which offers virtually unlimited storage space and computational power, but developers of processing algorithms often have no background in computer science and hence no expertise in Cloud Computing. Our research hypothesis is that a microservice architecture and Domain-Specific Languages can be used to orchestrate existing geospatial processing algorithms, and to compose and execute geospatial workflows in a Cloud environment for efficient application development and an enhanced stakeholder experience. We present a software architecture that contains extension points for processing algorithms (or microservices), a workflow management component for distributed service orchestration, and a workflow editor based on a Domain-Specific Language. The main aim is to provide both users and developers with the means to leverage the possibilities of the Cloud, without requiring them to have deep knowledge of distributed computing. In order to conduct our research, we follow the Design Science Research Methodology. We perform an analysis of the problem domain and collect requirements as well as quality attributes for our architecture. To meet our research objectives, we design the architecture and develop approaches to workflow management and workflow modelling. We demonstrate the utility of our solution by applying it to two real-world use cases and evaluate the quality of our architecture based on defined scenarios. Finally, we critically discuss our results. Our contributions to the scientific community can be classified into three pillars. We present a scalable and modifiable microservice architecture for geospatial processing that supports distributed development and offers high availability. Further, we present novel approaches to service integration and orchestration in the Cloud as well as rule-based and dynamic workflow management without a priori design-time knowledge. For workflow modelling, we create a Domain-Specific Language that is based on a novel language design method.
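    The following toy sketch hints at how a small workflow description could be parsed and its steps dispatched to processing microservices. The mini-DSL syntax, the service endpoints, and the dry-run orchestrator are invented for illustration and are not the architecture's actual Domain-Specific Language or workflow manager.

```python
# Illustrative sketch only: a toy workflow text parsed into steps and mapped
# to (hypothetical) processing microservices; the "orchestration" is a dry run.
from dataclasses import dataclass

WORKFLOW_TEXT = """
resample input.tif resolution=10m
classify resampled.tif model=landcover
publish classified.tif layer=results
"""

# Map each verb of the toy DSL to a hypothetical microservice endpoint.
SERVICES = {
    "resample": "http://resampler.example/jobs",
    "classify": "http://classifier.example/jobs",
    "publish":  "http://catalogue.example/layers",
}

@dataclass
class Step:
    verb: str
    target: str
    params: dict

def parse(text: str) -> list[Step]:
    steps = []
    for line in text.strip().splitlines():
        verb, target, *opts = line.split()
        params = dict(opt.split("=", 1) for opt in opts)
        steps.append(Step(verb, target, params))
    return steps

def orchestrate(steps: list[Step]) -> None:
    """Dry-run orchestration: report which service each step would call."""
    for step in steps:
        endpoint = SERVICES[step.verb]
        print(f"POST {endpoint} target={step.target} params={step.params}")

orchestrate(parse(WORKFLOW_TEXT))
```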

    Gathering solutions and providing APIs for their orchestration to implement continuous software delivery

    In traditional IT environments, it is common for software updates and new releases to take up to several weeks or even months to eventually become available to end users. Therefore, many IT vendors and providers of software products and services face the challenge of delivering updates considerably more frequently. This is because users, customers, and other stakeholders expect accelerated feedback loops and significantly faster responses to changing demands and issues that arise. Thus, taking this challenge seriously is of utmost economic importance for IT organizations if they wish to remain competitive. Continuous software delivery is an emerging paradigm adopted by an increasing number of organizations in order to address this challenge. It aims to drastically shorten release cycles while ensuring the delivery of high-quality software. Adopting continuous delivery essentially means making it economical to constantly deliver changes in small batches. Infrequent high-risk releases with lots of accumulated changes are thereby replaced by a continuous stream of small and low-risk updates. To gain the benefits of continuous delivery, a high degree of automation is required. This is technically achieved by implementing continuous delivery pipelines consisting of different application-specific stages (build, test, production, etc.) to automate most parts of the application delivery process. Each stage relies on a corresponding application environment such as a build environment or production environment. This work presents concepts and approaches to implement continuous delivery pipelines based on systematically gathered solutions to be used and orchestrated as building blocks of application environments. Initially, the presented Gather'n'Deliver method is centered around a shared knowledge base to provide the foundation for gathering, utilizing, and orchestrating diverse solutions such as deployment scripts, configuration definitions, and Cloud services. Several classification dimensions and taxonomies are discussed in order to facilitate a systematic categorization of solutions, in addition to expressing application environment requirements that are satisfied by those solutions. The presented GatherBase framework enables the collaborative and automated gathering of solutions through solution repositories. These repositories are the foundation for building diverse knowledge base variants that provide fine-grained query mechanisms to find and retrieve solutions, for example, to be used as building blocks of specific application environments. Combining and integrating diverse solutions at runtime is achieved by orchestrating their APIs. Since some solutions, such as lower-level executable artifacts (deployment scripts, configuration definitions, etc.), do not immediately provide their functionality through APIs, additional APIs need to be supplied. This issue is addressed by different approaches, such as the presented Any2API framework, which is intended to generate individual APIs for such artifacts. An integrated architecture in conjunction with corresponding prototype implementations aims to demonstrate the technical feasibility of the presented approaches. Finally, various validation scenarios evaluate the approaches within the scope of continuous delivery and application environments and even beyond.
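    As a hedged illustration of supplying an API for a lower-level executable artifact, the sketch below exposes a hypothetical deployment script through a small HTTP endpoint using Flask. The route, script name, and parameter convention are assumptions made for the example; the Any2API framework described above generates such wrappers rather than hand-writing them.

```python
# Sketch: wrap a deployment script behind a minimal HTTP API so it can be
# orchestrated like any other API-providing solution (illustrative only).
import os
import subprocess
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/deploy", methods=["POST"])
def deploy():
    # Pass JSON parameters to the wrapped script as environment variables.
    params = request.get_json(silent=True) or {}
    env = {**os.environ,
           **{f"PARAM_{k.upper()}": str(v) for k, v in params.items()}}
    result = subprocess.run(
        ["bash", "deploy.sh"],  # hypothetical lower-level deployment artifact
        capture_output=True, text=True, env=env,
    )
    return jsonify(returncode=result.returncode, output=result.stdout)

if __name__ == "__main__":
    app.run(port=8080)
```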

    Software evolution: hypergraph based model of solution space and meta-search

    A hypergraph-based model of software evolution is proposed. The model uses software assets, and any other higher-order patterns, as reusable components. We use software product lines and software factories as the engineering state-of-the-art framework to model evolution. Using those concepts, the solution space is sliced into sub-spaces using equivalence classes and their corresponding isomorphisms. Any valid graph expansion is required to retain information by being sub-graph isomorphic, forming a chain to a solution. We are also able to traverse the resulting modelled space. A characteristic set of operators and operands is used to find solutions that would be compatible. The result is a structured way to explore the combinatorial solution space, classifying solutions as parts of family hierarchies. Using a software engineering interpretation, a viable prototype implementation of the model has been created. It uses configuration files as design-time instruments analogous to software factory schemas. These form configuration layers we call fragments, which convert to graph node metadata to later allow complex graph queries. A profusion of examples of the modelling and its visualisation options is provided for better understanding. An example of automated generation of a configuration, using current Google Cloud assets, has been created and added to the prototype. It illustrates automation possibilities by harvesting web data and later creating a custom isomorphic relation as a configuration. The feasibility of the model is thus demonstrated. The formalisation adds the rigour needed to further facilitate the automation of software craftsmanship. Based on the model's operation, we propose a concept of organic growth based on evolution. Evolution events are modelled as incremental change messages; this is communication-efficient and is shown to adhere to the Representational State Transfer architectural style. Finally, The Cloud is presented as an evolved solution, part of a family descending from the original concept of The Web.
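    A minimal sketch of the information-retention idea, under assumptions: software assets are modelled as nodes of an ordinary graph (rather than a full hypergraph), and a candidate expansion is accepted only if the original configuration remains recoverable as a subgraph. It uses the networkx library; the asset names are hypothetical and the check considers structure only, not node labels.

```python
# Sketch: accept a graph expansion only if the base configuration is still
# sub-graph isomorphic to it, i.e. no information is lost by the expansion.
import networkx as nx
from networkx.algorithms import isomorphism

# Original configuration: a small dependency graph of assets.
original = nx.Graph()
original.add_edges_from([("app", "database"), ("app", "cache")])

# Candidate expansion: adds a message queue and a worker.
expansion = nx.Graph(original)
expansion.add_edges_from([("app", "queue"), ("queue", "worker")])

def is_valid_expansion(base: nx.Graph, candidate: nx.Graph) -> bool:
    """Valid if the base graph is sub-graph isomorphic to the candidate."""
    matcher = isomorphism.GraphMatcher(candidate, base)
    return matcher.subgraph_is_isomorphic()

print(is_valid_expansion(original, expansion))  # True: information retained
```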