9 research outputs found

    Oversampling for Imbalanced Learning Based on K-Means and SMOTE

    Learning from class-imbalanced data continues to be a common and challenging problem in supervised learning, as standard classification algorithms are designed to handle balanced class distributions. While different strategies exist to tackle this problem, methods that generate artificial data to achieve a balanced class distribution are more versatile than modifications to the classification algorithm. Such techniques, called oversamplers, modify the training data, allowing any classifier to be used with class-imbalanced datasets. Many algorithms have been proposed for this task, but most are complex and tend to generate unnecessary noise. This work presents a simple and effective oversampling method based on k-means clustering and SMOTE oversampling, which avoids the generation of noise and effectively overcomes imbalances between and within classes. Empirical results of extensive experiments with 71 datasets show that training data oversampled with the proposed method improves classification results. Moreover, k-means SMOTE consistently outperforms other popular oversampling methods. An implementation is made available in the Python programming language. Comment: 19 pages, 8 figures
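
As a rough illustration of the SMOTE step that the abstract builds on, the following minimal sketch generates synthetic minority points by interpolating between a sample and one of its nearest minority neighbours. It is not the paper's implementation: it omits the k-means clustering and the per-cluster allocation of samples, and the function name `smote_sketch`, its parameters, and the toy data are all hypothetical.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Plain SMOTE-style interpolation (hypothetical helper, not the paper's code).

    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority class in two dimensions (made-up values for illustration).
minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3), (1.1, 1.1)]
new_points = smote_sketch(minority, n_new=6)
```

In a k-means variant, one would presumably first cluster the whole training set and then run an interpolation like this only inside clusters dominated by the minority class; how many samples each cluster receives is a design choice this sketch leaves out.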

    Data Science as a New Frontier for Design

    The purpose of this paper is to contribute to the challenge of transferring know-how, theories and methods from design research to the design processes in information science and technologies. More specifically, we consider a domain, namely data science, that is rapidly becoming a globally invested research and development axis with strong imperatives for innovation, given the data deluge we are currently facing. We argue that, in order to rise to the data-related challenges that society is facing, data-science initiatives should ensure a renewal of traditional research methodologies, which are still largely based on trial-and-error processes that depend on the talent and insights of a single researcher (or a restricted group of researchers). It is our claim that design theories and methods can provide, at least to some extent, the much-needed framework. We use a worldwide data-science challenge organized to study a technical problem in physics, namely the detection of the Higgs boson, as a use case to demonstrate some of the ways in which design theory and methods can help in analyzing and shaping the innovation dynamics in such projects. Comment: International Conference on Engineering Design, Jul 2015, Milan, Italy

    Improved adaptive semi-unsupervised weighted oversampling (IA-SUWO) using sparsity factor for imbalanced datasets

    The imbalanced data problem is common in data mining because real-world data are often skewed, which negatively affects classification in machine learning. As a preprocessing step, oversampling techniques have significantly benefited the imbalanced domain: artificial data are generated in the minority class to increase the number of samples and balance the class distribution. However, existing oversampling techniques suffer from overfitting and over-generalization, which lessen classifier performance. Although many clustering-based oversampling techniques largely overcome these problems, most of them cannot produce an appropriate number of synthetic samples in minority clusters. This study proposes an improved Adaptive Semi-unsupervised Weighted Oversampling (IA-SUWO) technique that uses a sparsity factor to identify the sparse minority samples in each minority cluster. The technique considers sparse minority samples that are far from the decision boundary. These samples also carry important information for learning the minority class; if they are also considered for oversampling, the imbalance ratio is reduced further and the learnability of the classifiers may improve. The proposed approach was compared with existing oversampling techniques such as SMOTE, Borderline-SMOTE, Safe-level SMOTE, and the standard A-SUWO technique in terms of accuracy. The comparative analysis showed that the proposed approach improved average accuracy by 5 percentage points, from 85% to 90%, over the existing techniques

    Self-paced balance learning for clinical skin disease recognition

    Class imbalance is a challenging problem in many classification tasks. It induces biased classification results for minority classes that contain fewer training samples than others. Most existing approaches aim to remedy the imbalanced number of instances among categories by resampling the majority and minority classes accordingly. However, the imbalance in the difficulty of recognizing different categories is also crucial, especially when distinguishing samples across many classes. For example, in the task of clinical skin disease recognition, several rare diseases have a small number of training samples, but they are easy to diagnose because of their distinct visual properties. On the other hand, some common skin diseases, e.g., eczema, are hard to recognize due to the lack of special symptoms. To address this problem, we propose a self-paced balance learning (SPBL) algorithm in this paper. Specifically, we introduce a comprehensive metric termed the complexity of image category that is a combination of both sample number and recognition difficulty. First, the complexity is initialized using the model of the first pace, where the pace indicates one iteration in the self-paced learning paradigm. We then assign each class a penalty weight that is larger for more complex categories and smaller for easier ones, after which the curriculum is reconstructed by rearranging the training samples. Consequently, the model can iteratively learn discriminative representations via balancing the complexity in each pace. Experimental results on the SD-198 and SD-260 benchmark data sets demonstrate that the proposed SPBL algorithm performs favorably against the state-of-the-art methods. We also demonstrate the effectiveness of the SPBL algorithm's generalization capacity on various tasks, such as indoor scene image recognition and object classification

    Improving software defect prediction using machine learning algorithms (Makine öğrenme algoritmaları kullanılarak yazılım hata kestiriminin iyileştirilmesi)

    Made available in full text pursuant to the “Law on Amendments to the Higher Education Law and Certain Laws and Decree-Laws”, published in Official Gazette No. 30352 dated 06.03.2018, and the “Directive on the Collection, Organization and Opening to Access of Graduate Theses in Electronic Form” dated 18.06.2018. It is not accessible in the YÖK thesis catalogue

    A machine learning-based investigation of cloud service attacks

    In this thesis, the security challenges of cloud computing are investigated in the Infrastructure as a Service (IaaS) layer, as security is one of the major concerns related to Cloud services. As IaaS consists of different security terms, the research has been further narrowed down to focus on Network Layer Security. Review of existing research revealed that several types of attacks and threats can affect cloud security. Therefore, there is a need for intrusion defence implementations to protect cloud services. Intrusion Detection (ID) is one of the most effective solutions for reacting to cloud network attacks. [Continues.]

    Essays in financial technology: banking efficiency and application of machine learning models in Supply Chain Finance and credit risk assessment

    The financial landscape is undergoing a significant transformation, driven by technological innovations that are reshaping traditional banking practices. This thesis examines the evolving relationship between financial technology (FinTech) and banking, specifically addressing the credit risk aspects within the domains of Supply Chain Finance (SCF) and peer-to-peer (P2P) lending. FinTech has experienced rapid growth and innovation over the past decade. It encompasses a wide range of technologies and services that aim to enhance and streamline financial processes, disrupt traditional banking models, and offer new solutions to consumers and businesses. The status of FinTech and banking is assessed through an extensive review of the current literature and empirical data. FinTech development has significantly impacted the financial landscape, driving innovation, competition, and customer expectations. While it has exposed inefficiencies within traditional banking, it has also compelled banks to evolve and embrace technological advancements. The thesis aims to explore the impact of FinTech on traditional banking models, customer behaviours, and market competition. This investigation highlights the challenges and opportunities that arise as FinTech disrupts and reshapes the banking sector, emphasizing its potential to enhance efficiency, accessibility, and customer experiences. Chapter 3 presents an empirical analysis of the impact of FinTech on the operating efficiency of commercial banks in China. Further, in the context of credit risk, the thesis focuses on SCF and P2P lending, two prominent areas influenced by FinTech innovation. SCF has witnessed substantial transformation with the infusion of FinTech solutions. Digital platforms have streamlined the flow of funds within complex supply networks, enhancing the liquidity of suppliers and optimizing working capital for buyers. However, this transformation introduces new credit risk challenges. 
As suppliers' financial data becomes more accessible, the need for accurate risk assessment and predictive modelling becomes paramount. The integration of big data analytics, machine learning, and artificial intelligence (AI) holds the promise of refining credit risk evaluation by offering real-time insights into supplier financial health, thereby improving lending decisions and reducing defaults. Similarly, P2P lending has redefined the borrowing and lending landscape, enabling direct connections between individual borrowers and lenders. While P2P lending platforms offer speed, convenience, and access to credit for previously underserved segments, they also grapple with credit risk concerns. Evaluating the creditworthiness of individual borrowers without sufficient credit history demands innovative risk assessment methodologies. Data issues, such as class imbalance, feature selection, and data processing, present challenges in building accurate credit risk profiles for P2P lending participants. FinTech solutions play a pivotal role in creating and implementing these alternative risk assessment models. Few studies in the literature benchmark advanced methods for credit risk assessment in emerging financial services. This thesis aims to address this research gap by evaluating the effectiveness of credit risk assessment models in these FinTech-driven contexts, considering both traditional methodologies and novel data-driven approaches. Chapter 4 investigates credit risk assessment in Digital Supply Chain Finance (DSCF) using a machine learning approach, and Chapter 5 addresses the issue of data imbalance in credit risk assessment for P2P lending. By addressing these gaps and issues, this thesis aims to contribute to the broader discourse on FinTech's role in shaping the future of banking. 
The findings have implications for financial institutions, policymakers, and regulators seeking to harness the benefits of FinTech while mitigating associated risks. Ultimately, this study offers insights into navigating the evolving landscape of credit risk in SCF and P2P lending within the context of an increasingly technology-driven financial ecosystem