5,086 research outputs found

    Cost-Sensitive Learning-based Methods for Imbalanced Classification Problems with Applications

    Analysis and predictive modeling of massive datasets is an extremely significant problem that arises in many practical applications. The task of predictive modeling becomes even more challenging when data are imperfect or uncertain. The real data are frequently affected by outliers, uncertain labels, and uneven distribution of classes (imbalanced data). Such uncertainties create bias and make predictive modeling an even more difficult task. In the present work, we introduce a cost-sensitive learning method (CSL) to deal with the classification of imperfect data. Typically, most traditional approaches for classification demonstrate poor performance in an environment with imperfect data. We propose the use of CSL with Support Vector Machine, which is a well-known data mining algorithm. The results reveal that the proposed algorithm produces more accurate classifiers and is more robust with respect to imperfect data. Furthermore, we explore the best performance measures to tackle imperfect data along with addressing real problems in quality control and business analytics

    The Role of Synthetic Data in Improving Supervised Learning Methods: The Case of Land Use/Land Cover Classification

    A thesis submitted in partial fulfillment of the requirements for the degree of Doctor in Information Management. In remote sensing, Land Use/Land Cover (LULC) maps constitute important assets for various applications, promoting environmental sustainability and good resource management. However, their production remains a challenging task. Various factors contribute to the difficulty of generating accurate, timely updated LULC maps, whether via automatic or photo-interpreted LULC mapping. Data preprocessing, being a crucial step for any Machine Learning task, is particularly important in the remote sensing domain due to the overwhelming amount of raw, unlabeled data continuously gathered from multiple remote sensing missions. However, a significant part of the state-of-the-art focuses on scenarios with full access to labeled training data with relatively balanced class distributions. This thesis focuses on the challenges found in automatic LULC classification tasks, specifically in data preprocessing tasks. We focus on the development of novel Active Learning (AL) and imbalanced learning techniques, to improve ML performance in situations with limited training data and/or the existence of rare classes. We also show that many of the contributions presented are successful not only in remote sensing problems, but also in various other multidisciplinary classification problems. The work presented in this thesis used open access datasets to test the contributions made in imbalanced learning and AL. All the data pulling, preprocessing and experiments are made available at https://github.com/joaopfonseca/publications. The algorithmic implementations are made available in the Python package ml-research at https://github.com/joaopfonseca/ml-research

    The importance of Quality Assurance as a Data Scientist: Common pitfalls, examples and solutions found while validating and developing supervised binary classification models

    Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics. In today’s information era, where Data galvanizes change, companies are aiming towards competitive advantage by mining this important resource to achieve actionable insights, knowledge, and wisdom. However, to minimize bias and obtain robust long-term solutions, the methodologies that are devised from Data Science and Machine Learning approaches benefit from being carefully validated by a Quality Assurance Data Scientist, who not only understands both business rules and analytics tasks, but also understands and recommends Quality Assurance guidelines and validations. Through my experience as a Data Scientist at EDP Distribuição, I identify and systematically report on seven key Quality Assurance guidelines that helped achieve more reliable products, and I provide three practical examples where validation was key in discerning improvements

    Towards Data-centric Graph Machine Learning: Review and Outlook

    Data-centric AI, with its primary focus on the collection, management, and utilization of data to drive AI models and applications, has attracted increasing attention in recent years. In this article, we conduct an in-depth and comprehensive review, offering a forward-looking outlook on the current efforts in data-centric AI pertaining to graph data, the fundamental data structure for representing and capturing intricate dependencies among massive and diverse real-life entities. We introduce a systematic framework, Data-centric Graph Machine Learning (DC-GML), that encompasses all stages of the graph data lifecycle, including graph data collection, exploration, improvement, exploitation, and maintenance. A thorough taxonomy of each stage is presented to answer three critical graph-centric questions: (1) how to enhance graph data availability and quality; (2) how to learn from graph data with limited availability and low quality; (3) how to build graph MLOps systems from the graph data-centric view. Lastly, we pinpoint the future prospects of the DC-GML domain, providing insights to navigate its advancements and applications. Comment: 42 pages, 9 figures

    Automated anomaly recognition in real time data streams for oil and gas industry.

    There is a growing demand for computer-assisted real-time anomaly detection - from the identification of suspicious activities in cyber security, to the monitoring of engineering data for various applications across the oil and gas, automotive and other engineering industries. To reduce the reliance on field experts' knowledge for identification of these anomalies, this thesis proposes a deep-learning anomaly-detection framework that can help to create an effective real-time condition-monitoring framework. The aim of this research is to develop a real-time and re-trainable generic anomaly-detection framework, which is capable of predicting and identifying anomalies with a high level of accuracy - even when a specific anomalous event has no precedent. Machine-based condition monitoring is preferable in many practical situations where fast data analysis is required, and where there are harsh climates or otherwise life-threatening environments. For example, automated condition-monitoring systems are ideal in deep sea exploration studies, offshore installations and space exploration. This thesis first reviews studies about anomaly detection using machine learning. It then adopts the best practices from those studies in order to propose a multi-tiered framework for anomaly detection with heterogeneous input sources, which can deal with unseen anomalies in a real-time dynamic problem environment. The thesis then applies the developed generic multi-tiered framework to two fields of engineering: data analysis and malicious cyber attack detection. Finally, the framework is further refined based on the outcomes of those case studies and is used to develop a secure cross-platform API, capable of re-training and data classification on a real-time data feed

    Machine Learning and Data Mining Applications in Power Systems

    This Special Issue was intended as a forum to advance research and apply machine-learning and data-mining methods to facilitate the development of modern electric power systems, grids and devices, and smart grids and protection devices, as well as to develop tools for more accurate and efficient power system analysis. Conventional signal processing is no longer adequate to extract all the relevant information from distorted signals through filtering, estimation, and detection to facilitate decision-making and control actions. Machine learning algorithms, optimization techniques and efficient numerical algorithms, distributed signal processing, machine learning, data-mining statistical signal detection, and estimation may help to solve contemporary challenges in modern power systems. The increased use of digital information and control technology can improve the grid’s reliability, security, and efficiency; the dynamic optimization of grid operations; demand response; the incorporation of demand-side resources and integration of energy-efficient resources; distribution automation; and the integration of smart appliances and consumer devices. Signal processing offers the tools needed to convert measurement data to information, and to transform information into actionable intelligence. This Special Issue includes fifteen articles, authored by international research teams from several countries

    Automating Fault Detection and Quality Control in PCBs: A Machine Learning Approach to Handle Imbalanced Data

    Printed Circuit Boards (PCBs) are fundamental to the operation of a wide array of electronic devices, from consumer electronics to sophisticated industrial machinery. Given this pivotal role, quality control and fault detection are especially significant, as they are essential for ensuring the devices' long-term reliability and efficiency. To address this, the thesis explores advancements in fault detection and quality control methods for PCBs, with a focus on Machine Learning (ML) and Deep Learning (DL) techniques. The study begins with an in-depth review of traditional approaches like visual and X-ray inspections, then delves into modern, data-driven methods, such as automated anomaly detection in PCB manufacturing using tabular datasets. The core of the thesis is divided into three specific tasks: firstly, applying ML and DL models for anomaly detection in PCBs, particularly focusing on solder-pasting issues and the challenges posed by imbalanced datasets; secondly, predicting human inspection labels through specially designed tabular models like TabNet; and thirdly, implementing multi-classification methods to automate repair labeling on PCBs. The study is structured to offer a comprehensive view, beginning with background information, followed by the methodology and results of each task, and concluding with a summary and directions for future research. Through this systematic approach, the research not only provides new insights into the capabilities and limitations of existing fault detection techniques but also sets the stage for more intelligent and efficient systems in PCB manufacturing and quality control

    Recent Trends in Computational Intelligence

    Traditional models struggle to cope with complexity, noise, and changing environments, while Computational Intelligence (CI) offers solutions to complicated problems as well as reverse problems. The main feature of CI is adaptability, spanning the fields of machine learning and computational neuroscience. CI also comprises biologically-inspired technologies such as swarm intelligence as part of evolutionary computation, and encompasses wider areas such as image processing, data collection, and natural language processing. This book aims to discuss the use of CI for optimally solving various applications, proving its wide reach and relevance. Combining optimization methods and data mining strategies makes for a strong and reliable prediction tool for handling real-life applications

    A Feasibility Study of Azure Machine Learning for Sheet Metal Fabrication

    The research demonstrated that sheet metal fabrication machines can utilize machine learning to gain competitive advantage. Of the various possible applications of machine learning, it was decided to focus on predictive maintenance. The predictive service was implemented with Microsoft Azure Machine Learning. The aim was to demonstrate the potential of machine learning to the stakeholders at the case company. It was found that although machine learning technologies are founded on sophisticated algorithms and mathematics, they can still be utilized and bring benefits with moderate effort. The significance of this study lies in demonstrating the potential of machine learning to improve operations management, especially for sheet metal fabrication machines.