630 research outputs found

    Dealing with imbalanced and weakly labelled data in machine learning using fuzzy and rough set methods

    Get PDF

    Machine Learning and Integrative Analysis of Biomedical Big Data.

    Get PDF
    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbates the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues

    Fuzzy-rough-learn 0.1 : a Python library for machine learning with fuzzy rough sets

    Get PDF
    We present fuzzy-rough-learn, the first Python library of fuzzy rough set machine learning algorithms. It contains three algorithms previously implemented in R and Java, as well as two new algorithms from the recent literature. We briefly discuss the use cases of fuzzy-rough-learn and the design philosophy guiding its development, before providing an overview of the included algorithms and their parameters

    Three-way Imbalanced Learning based on Fuzzy Twin SVM

    Full text link
    Three-way decision (3WD) is a powerful tool for granular computing to deal with uncertain data, commonly used in information systems, decision-making, and medical care. Three-way decision gets much research in traditional rough set models. However, three-way decision is rarely combined with the currently popular field of machine learning to expand its research. In this paper, three-way decision is connected with SVM, a standard binary classification model in machine learning, for solving imbalanced classification problems that SVM needs to improve. A new three-way fuzzy membership function and a new fuzzy twin support vector machine with three-way membership (TWFTSVM) are proposed. The new three-way fuzzy membership function is defined to increase the certainty of uncertain data in both input space and feature space, which assigns higher fuzzy membership to minority samples compared with majority samples. To evaluate the effectiveness of the proposed model, comparative experiments are designed for forty-seven different datasets with varying imbalance ratios. In addition, datasets with different imbalance ratios are derived from the same dataset to further assess the proposed model's performance. The results show that the proposed model significantly outperforms other traditional SVM-based methods

    A Short Survey on Data Clustering Algorithms

    Full text link
    With rapidly increasing data, clustering algorithms are important tools for data analytics in modern research. They have been successfully applied to a wide range of domains; for instance, bioinformatics, speech recognition, and financial analysis. Formally speaking, given a set of data instances, a clustering algorithm is expected to divide the set of data instances into the subsets which maximize the intra-subset similarity and inter-subset dissimilarity, where a similarity measure is defined beforehand. In this work, the state-of-the-arts clustering algorithms are reviewed from design concept to methodology; Different clustering paradigms are discussed. Advanced clustering algorithms are also discussed. After that, the existing clustering evaluation metrics are reviewed. A summary with future insights is provided at the end

    Machine learning based data pre-processing for the purpose of medical data mining and decision support

    Get PDF
    Building an accurate and reliable model for prediction for different application domains, is one of the most significant challenges in knowledge discovery and data mining. Sometimes, improved data quality is itself the goal of the analysis, usually to improve processes in a production database and the designing of decision support. As medicine moves forward there is a need for sophisticated decision support systems that make use of data mining to support more orthodox knowledge engineering and Health Informatics practice. However, the real-life medical data rarely complies with the requirements of various data mining tools. It is often inconsistent, noisy, containing redundant attributes, in an unsuitable format, containing missing values and imbalanced with regards to the outcome class label.Many real-life data sets are incomplete, with missing values. In medical data mining the problem with missing values has become a challenging issue. In many clinical trials, the medical report pro-forma allow some attributes to be left blank, because they are inappropriate for some class of illness or the person providing the information feels that it is not appropriate to record the values for some attributes. The research reported in this thesis has explored the use of machine learning techniques as missing value imputation methods. The thesis also proposed a new way of imputing missing value by supervised learning. A classifier was used to learn the data patterns from a complete data sub-set and the model was later used to predict the missing values for the full dataset. The proposed machine learning based missing value imputation was applied on the thesis data and the results are compared with traditional Mean/Mode imputation. Experimental results show that all the machine learning methods which we explored outperformed the statistical method (Mean/Mode).The class imbalance problem has been found to hinder the performance of learning systems. In fact, most of the medical datasets are found to be highly imbalance in their class label. The solution to this problem is to reduce the gap between the minority class samples and the majority class samples. Over-sampling can be applied to increase the number of minority class sample to balance the data. The alternative to over-sampling is under-sampling where the size of majority class sample is reduced. The thesis proposed one cluster based under-sampling technique to reduce the gap between the majority and minority samples. Different under-sampling and over-sampling techniques were explored as ways to balance the data. The experimental results show that for the thesis data the new proposed modified cluster based under-sampling technique performed better than other class balancing techniques.In further research it is found that the class imbalance problem not only affects the classification performance but also has an adverse effect on feature selection. The thesis proposed a new framework for feature selection for class imbalanced datasets. The research found that, using the proposed framework the classifier needs less attributes to show high accuracy, and more attributes are needed if the data is highly imbalanced.The research described in the thesis contains the flowing four novel main contributions.a) Improved data mining methodology for mining medical datab) Machine learning based missing value imputation methodc) Cluster Based semi-supervised class balancing methodd) Feature selection framework for class imbalance datasetsThe performance analysis and comparative study show that the use of proposed method of missing value imputation, class balancing and feature selection framework can provide an effective approach to data preparation for building medical decision support

    On the class overlap problem in imbalanced data classification.

    Get PDF
    Class imbalance is an active research area in the machine learning community. However, existing and recent literature showed that class overlap had a higher negative impact on the performance of learning algorithms. This paper provides detailed critical discussion and objective evaluation of class overlap in the context of imbalanced data and its impact on classification accuracy. First, we present a thorough experimental comparison of class overlap and class imbalance. Unlike previous work, our experiment was carried out on the full scale of class overlap and an extreme range of class imbalance degrees. Second, we provide an in-depth critical technical review of existing approaches to handle imbalanced datasets. Existing solutions from selective literature are critically reviewed and categorised as class distribution-based and class overlap-based methods. Emerging techniques and the latest development in this area are also discussed in detail. Experimental results in this paper are consistent with existing literature and show clearly that the performance of the learning algorithm deteriorates across varying degrees of class overlap whereas class imbalance does not always have an effect. The review emphasises the need for further research towards handling class overlap in imbalanced datasets to effectively improve learning algorithms’ performance

    Uncertainty-wise software anti-patterns detection: A possibilistic evolutionary machine learning approach

    Get PDF
    Context: Code smells (a.k.a. anti-patterns) are manifestations of poor design solutions that can deteriorate software maintainability and evolution. Research gap: Existing works did not take into account the issue of uncertain class labels, which is an important inherent characteristic of the smells detection problem. More precisely, two human experts may have different degrees of uncertainty about the smelliness of a particular software class not only for the smell detection task but also for the smell type identification one. Unluckily, existing approaches usually reject and/or ignore uncertain data that correspond to software classes (i.e. dataset instances) with uncertain labels. Throwing away and/or disregarding the uncertainty factor could considerably degrade the detection/identification process effectiveness. From a solution approach viewpoint, there is no work in the literature that proposed a method that is able to detect and/or identify code smells while preserving the uncertainty aspect. Objective: The main goal of our research work is to handle the uncertainty factor, issued from human experts, in detecting and/or identifying code smells by proposing an evolutionary approach that is able to deal with anti-patterns classification with uncertain labels. Method: We suggest Bi-ADIPOK, as an effective search-based tool that is capable to tackle the previously mentioned challenge for both detection and identification cases. The proposed method corresponds to an EA (Evolutionary Algorithm) that optimizes a set of detectors encoded as PK-NNs (Possibilistic K-nearest neighbors) based on a bi-level hierarchy, in which the upper level role consists on finding the optimal PK-NNs parameters, while the lower level one is to generate the PK-NNs. A newly fitness function has been proposed fitness function PomAURPC-OVA_dist (Possibilistic modified Area Under Recall Precision Curve One-Versus-All_distance, abbreviated PAURPC_d in this paper). Bi-ADIPOK is able to deal with label uncertainty using some concepts stemming from the Possibility Theory. Furthermore, the PomAURPC-OVA_dist is capable to process the uncertainty issue even with imbalanced data. We notice that Bi-ADIPOK is first built and then validated using a possibilistic base of smell examples that simulates and mimics the subjectivity of software engineers opinions. Results: The statistical analysis of the obtained results on a set of comparative experiments with respect to four relevant state-of-the-art methods shows the merits of our proposal. The obtained detection results demonstrate that, for the uncertain environment, the PomAURPC-OVA_dist of Bi-ADIPOK ranges between 0.902 and 0.932 and its IAC lies between 0.9108 and 0.9407, while for the certain environment, the PomAURPC-OVA_dist lies between 0.928 and 0.955 and the IAC ranges between 0.9477 and 0.9622. Similarly, the identification results, for the uncertain environment, indicate that the PomAURPC-OVA_dist of Bi-ADIPOK varies between 0.8576 and 0.9273 and its IAC is between 0.8693 and 0.9318. For the certain environment, the PomAURPC-OVA_dist lies between 0.8613 and 0.9351 and the IAC values are between 0.8672 and 0.9476. With uncertain data, Bi-ADIPOK can find 35% more code smells than the second best approach (i.e., BLOP). Furthermore, Bi-ADIPOK has succeeded to reduce the number of false alarms (i.e., misclassified smelly instances) by 12%. In addition, our proposed approach can identify 43% more smell types than BLOP and reduces the number of false alarms by 32%. The same results have been obtained for the certain environment, demonstrating Bi-ADIPOK's ability to deal with such environment
    • …
    corecore