Machine Learning and Integrative Analysis of Biomedical Big Data.
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues.
On the class overlap problem in imbalanced data classification.
Class imbalance is an active research area in the machine learning community. However, existing and recent literature showed that class overlap had a higher negative impact on the performance of learning algorithms. This paper provides detailed critical discussion and objective evaluation of class overlap in the context of imbalanced data and its impact on classification accuracy. First, we present a thorough experimental comparison of class overlap and class imbalance. Unlike previous work, our experiment was carried out on the full scale of class overlap and an extreme range of class imbalance degrees. Second, we provide an in-depth critical technical review of existing approaches to handle imbalanced datasets. Existing solutions from selected literature are critically reviewed and categorised as class distribution-based and class overlap-based methods. Emerging techniques and the latest developments in this area are also discussed in detail. Experimental results in this paper are consistent with existing literature and show clearly that the performance of the learning algorithm deteriorates across varying degrees of class overlap whereas class imbalance does not always have an effect. The review emphasises the need for further research towards handling class overlap in imbalanced datasets to effectively improve learning algorithms’ performance.
Learning from class-imbalanced data: overlap-driven resampling for imbalanced data classification.
Classification of imbalanced datasets has attracted substantial research interest over the past years. This is because imbalanced datasets are common in several domains such as health, finance and security, but learning algorithms are generally not designed to handle them. Many existing solutions focus mainly on the class distribution problem. However, a number of reports showed that class overlap had a higher negative impact on the learning process than class imbalance. This thesis thoroughly explores the impact of class overlap on the learning algorithm and demonstrates how elimination of class overlap can effectively improve the classification of imbalanced datasets. Novel undersampling approaches were developed with the main objective of enhancing the presence of minority class instances in the overlapping region. This is achieved by identifying and removing majority class instances potentially residing in such a region. Seven methods under the two different approaches were designed for the task. Extensive experiments were carried out to evaluate the methods on simulated and well-known real-world datasets. Results showed that substantial improvement in the classification accuracy of the minority class was obtained with favourable trade-offs with the majority class accuracy. Moreover, successful application of the methods in predictive diagnostics of diseases with imbalanced records is presented. These novel overlap-based approaches have several advantages over other common resampling methods. First, the undersampling amount is independent of class imbalance and proportional to the degree of overlap. This could effectively address the problem of class overlap while reducing the effect of class imbalance. Second, information loss is minimised as instance elimination is contained within the problematic region. Third, adaptive parameters enable the methods to be generalised across different problems. 
It is also worth pointing out that these methods provide different trade-offs, which offer more alternatives to real-world users in selecting the best-fit solution to the problem.
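The idea of removing majority class instances from the overlapping region can be illustrated with a minimal sketch. This is not one of the seven methods from the thesis, whose details the abstract does not give; it assumes a simple neighbourhood heuristic in which a majority instance counts as "overlapping" when at least one of its k nearest neighbours belongs to the minority class. The function name `overlap_undersample` and the parameter `k` are hypothetical.

```python
import numpy as np

def overlap_undersample(X, y, minority_label, k=5):
    """Drop majority instances that have a minority instance among
    their k nearest neighbours (an assumed overlap heuristic)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # Pairwise Euclidean distances between all instances.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour

    # Indices of the k nearest neighbours of every instance.
    nn = np.argsort(dist, axis=1)[:, :k]

    # True where at least one neighbour is a minority instance.
    near_minority = (y[nn] == minority_label).any(axis=1)

    # Keep every minority instance; keep majority instances only
    # when they lie outside the assumed overlapping region.
    keep = (y == minority_label) | ~near_minority
    return X[keep], y[keep], keep
```

In this sketch the number of removed instances depends only on how many majority points sit next to minority points, not on the class ratio, which mirrors the property described above that the undersampling amount follows the degree of overlap rather than the degree of imbalance.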
An efficiency curve for evaluating imbalanced classifiers considering intrinsic data characteristics: Experimental analysis
Balancing the accuracy rates of the majority and minority classes is challenging in imbalanced classification. Furthermore, data characteristics have a significant impact on the performance of imbalanced classifiers, which are generally neglected by existing evaluation methods. The objective of this study is to introduce a new criterion to comprehensively evaluate imbalanced classifiers. Specifically, we introduce an efficiency curve that is established using data envelopment analysis without explicit inputs (DEA-WEI), to determine the trade-off between the benefits of improved minority class accuracy and the cost of reduced majority class accuracy. In sequence, we analyze the impact of the imbalanced ratio and typical imbalanced data characteristics on the efficiency of the classifiers. Empirical analyses using 68 imbalanced datasets reveal that traditional classifiers such as C4.5 and the k-nearest neighbor are more effective on disjunct data, whereas ensemble and undersampling techniques are more effective for overlapping and noisy data. The efficiency of cost-sensitive classifiers decreases dramatically when the imbalanced ratio increases. Finally, we investigate the reasons for the different efficiencies of classifiers on imbalanced data and recommend steps to select appropriate classifiers for imbalanced data based on data characteristics.
National Natural Science Foundation of China (NSFC) 71874023, 71725001, 71771037, 7197104