Machine Learning and Integrative Analysis of Biomedical Big Data.

Author: Choi Howard
Chung Neo Christopher
Mirza Bilal
Ping Peipei
Wang Jie
Wang Wei
Publication venue: eScholarship, University of California
Publication date: 01/01/2019

Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues.

Multidisciplinary Digital Publishing Institute

Ezid

Directory of Open Access Journals

eScholarship - University of California

Coupling different methods for overcoming the class imbalance problem

Author: Fantozzi Carlo
N. Lazzarini
Nanni Loris
Publication venue: Elsevier BV
Publication date: 01/01/2015

Many classification problems must deal with imbalanced datasets where one class, the majority class, outnumbers the other classes. Standard classification methods do not provide accurate predictions in this setting since classification is generally biased towards the majority class. The minority classes are oftentimes the ones of interest (e.g., when they are associated with pathological conditions in patients), so methods for handling imbalanced datasets are critical. Using several different datasets, this paper evaluates the performance of state-of-the-art classification methods for handling the imbalance problem in both binary and multi-class datasets. Different strategies are considered, including the one-class and dimension reduction approaches, as well as their fusions. Moreover, some ensembles of classifiers are tested, in addition to stand-alone classifiers, to assess the effectiveness of ensembles in the presence of imbalance. Finally, a novel ensemble of ensembles is designed specifically to tackle the problem of class imbalance: the proposed ensemble does not need to be tuned separately for each dataset and outperforms all the other tested approaches. To validate our classifiers we resort to the KEEL-dataset repository, whose data partitions (training/test) are publicly available and have already been used in the open literature: as a consequence, it is possible to report a fair comparison among different approaches in the literature. Our best approach (MATLAB code and datasets not easily accessible elsewhere) will be available at https://www.dei.unipd.it/node/2357

Crossref

Newcastle University E-Prints

Archivio istituzionale della ricerca - Università di Padova

Mining heterogeneous enterprise data

Author: Jiang Xinxin
Publication date: 01/01/2018

University of Technology Sydney. Faculty of Engineering and Information Technology. Heterogeneity is becoming one of the key characteristics of enterprise data, because the current nature of globalization and competition stresses the importance of leveraging huge amounts of accumulated enterprise data, according to various organizational processes, resources and standards. Effectively deriving meaningful insights from complex, large-scale heterogeneous enterprise data poses an interesting but critical challenge. The aim of this thesis is to investigate the theoretical foundations of mining heterogeneous enterprise data in light of the above challenges and to develop new algorithms and frameworks that are able to effectively and efficiently consider heterogeneity in four elements of the data: objects, events, context, and domains. Objects describe a variety of business roles and instruments involved in business systems. Object heterogeneity means that object information at both the data and structural level is heterogeneous. The proposed cost-sensitive hybrid neural network (Cs-HNN) leverages parallel network architectures and an algorithm specifically designed for minority classification to generate a robust model for learning heterogeneous objects. Events trace an object’s behaviours or activities. Event heterogeneity reflects the level of variety in business events and is normally expressed in the type and format of features. The approach proposed in this thesis focuses on fleet tracking as a practical example of an application with a high degree of event heterogeneity. Context describes the environment and circumstances surrounding objects and events. Context heterogeneity reflects the degree of diversity in contextual features. The coupled collaborative filtering (CCF) approach proposed in this thesis is able to provide context-aware recommendations by measuring the non-independent and identically distributed (non-IID) relationships across diverse contexts. Domains are the sources of information and reflect the nature of the business or function that has generated the data. The cross-domain deep learning (Cd-DLA) proposed in this thesis provides a potential avenue to overcome the complexity and nonlinearity of heterogeneous domains. Each of the approaches, algorithms, and frameworks for heterogeneous enterprise data mining presented in this thesis outperforms the state-of-the-art methods in a range of backgrounds and scenarios, as evidenced by a theoretical analysis, an empirical study, or both. All outcomes derived from this research have been published or accepted for publication, and the follow-up work has also been recognised, which demonstrates scholarly interest in mining heterogeneous enterprise data as a research topic. However, despite this interest, heterogeneous data mining still holds increasingly attractive opportunities for further exploration and development in both academia and industry.

OPUS - University of Technology Sydney

Class-Imbalanced Learning on Graphs: A Survey

Author: Chawla Nitesh V.
Ma Yihong
Moniz Nuno
Tian Yijun
Publication date: 09/04/2023

The rapid advancement in data-driven research has increased the demand for effective graph data analysis. However, real-world data often exhibits class imbalance, leading to poor performance of machine learning models. To overcome this challenge, class-imbalanced learning on graphs (CILG) has emerged as a promising solution that combines the strengths of graph representation learning and class-imbalanced learning. In recent years, significant progress has been made in CILG. Anticipating that such a trend will continue, this survey aims to offer a comprehensive understanding of the current state-of-the-art in CILG and provide insights for future research directions. Concerning the former, we introduce the first taxonomy of existing work and its connection to existing imbalanced learning literature. Concerning the latter, we critically analyze recent work in CILG and discuss urgent lines of inquiry within the topic. Moreover, we provide a continuously maintained reading list of papers and code at https://github.com/yihongma/CILG-Papers. Comment: submitted to ACM Computing Survey (CSUR)

arXiv.org e-Print Archive

Handling minority class problem in threats detection based on heterogeneous ensemble learning approach.

Author: Ahriz Hatem
Eke Hope
Petrovski Andrei
Publication venue: IGI Global
Publication date: 31/07/2020

Multiclass problems, such as detecting the multi-step behaviour of Advanced Persistent Threats (APTs), have been a major global challenge due to the ability of APTs to navigate around defenses and evade detection for a prolonged period of time. Targeted APT attacks present an increasing concern for both cyber security and business continuity. Detecting the rare attack is a classification problem with data imbalance. This paper explores the application of data resampling techniques, together with a heterogeneous ensemble approach, for dealing with data imbalance caused by unevenly distributed data elements among classes, with a focus on capturing the rare attack. The suggested algorithms are shown not only to provide detection capability but also to classify malicious data traffic corresponding to rare APT attacks.

Open Access Institutional Repository at Robert Gordon University

Evolutionary deep belief networks with bootstrap sampling for imbalanced class datasets

Author: Amri A’inur A’fifah
Ismail Amelia Ritahani
Mohammad Omar Abdelaziz
Publication venue: Universitas Ahmad Dahlan, Kampus 3
Publication date: 01/07/2019

Imbalanced class data is a common issue in classification tasks. The Deep Belief Network (DBN) is a promising deep learning algorithm for learning from complex feature input. However, like other machine learning algorithms, DBN performs poorly when handling imbalanced class data. In this paper, the genetic algorithm (GA) and bootstrap sampling are incorporated into DBN to lessen the drawbacks that occur when imbalanced class datasets are used. The performance of the proposed algorithm is compared with DBN and is evaluated using performance metrics. The results show an improvement in performance when Evolutionary DBN with bootstrap sampling is used to handle imbalanced class datasets.
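
The abstract does not say how bootstrap sampling is applied, so the following is only a minimal sketch of one common reading: each class is resampled with replacement until all classes have the same size. The function name `balanced_bootstrap`, its parameters, and the toy data are hypothetical and are not taken from the paper; the DBN and the genetic algorithm are not shown.

```python
import random
from collections import Counter, defaultdict


def balanced_bootstrap(X, y, n_per_class=None, seed=0):
    """Resample every class with replacement to the same size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    if n_per_class is None:
        # Assumption: match the size of the largest class.
        n_per_class = max(len(rows) for rows in by_class.values())
    X_boot, y_boot = [], []
    for label, rows in by_class.items():
        # random.choices draws with replacement, i.e. a bootstrap sample.
        X_boot.extend(rng.choices(rows, k=n_per_class))
        y_boot.extend([label] * n_per_class)
    return X_boot, y_boot


# Hypothetical toy data: 8 majority rows (label 0), 2 minority rows (label 1).
X = [[float(i)] for i in range(10)]
y = [0] * 8 + [1] * 2

X_boot, y_boot = balanced_bootstrap(X, y)
print(Counter(y_boot))  # both classes now have 8 rows
```

Minority rows necessarily repeat in the resampled set, which is one reason such schemes are usually paired with a model that is robust to duplicated examples.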

International Journal of Advances in Intelligent Informatics

The International Islamic University Malaysia Repository

International Journal of Advances in Intelligent Informatics (IJAIN)

Machine Learning Approach for Classifying Power Outage in Secondary Electric Distribution Network

Author: Maziku Hellen
Mgaya Stephan
Publication venue: College of Engineering and Technology, University of Dar es Salaam.
Publication date: 15/07/2022

Power outages hinder social and economic development, especially in developing countries like Tanzania. Frequent power outages damage electric equipment and negatively affect industrial production. Power outages cannot be completely eradicated because of uncontrollable causes such as natural calamities, but technical challenges can be managed, thereby reducing outages. The existing manual methods used to locate power outages, such as customer calls, are inefficient and time consuming. On the other hand, modern methods like the Advanced Metering Infrastructure (AMI) still face a challenge in effectively classifying power line outages due to the imbalanced nature of the datasets. Therefore, there is a need to develop a Machine Learning (ML) model to accurately classify power line outages. In this study, machine learning models are constructed from ensemble algorithms and tested using AMI outage data from 2012 to 2019 recorded at 2-hour intervals. We propose the following ensemble-based machine learning approach to enhance classification: data sampling, algorithm weighting and finally ensembling. Results show that the Hybrid Stacking Ensemble Classifier (HSEC) model outperforms the others with a G-mean of 0.981, followed by Extra Tree with a G-mean of 0.964. This model can be used for power line outage classification in any Secondary Electrical Distribution Network (SEDN). This study can be extended to locate power outages at the household level.
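
The results above are reported as G-mean, which is distinct from accuracy. As a rough illustration, assuming the usual definition (the geometric mean of the per-class recalls), the metric can be sketched as follows. The function name and the toy labels are hypothetical and do not come from the study.

```python
from math import prod


def g_mean(y_true, y_pred):
    """Geometric mean of per-class recall (assumed definition of G-mean)."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, label in enumerate(y_true) if label == c]
        hits = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(hits / len(idx))
    return prod(recalls) ** (1 / len(recalls))


# Hypothetical toy labels: 8 "no outage" records (0) and 2 "outage" records (1).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                # 0.9
print(g_mean(y_true, y_pred))  # sqrt(1.0 * 0.5), roughly 0.707
```

A classifier that always predicts the majority class would score 0.8 accuracy on this toy data but a G-mean of 0, which is why G-mean is often preferred on imbalanced data.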

AJOL - African Journals Online

An Effective Data Sampling Procedure for Imbalanced Data Learning on Health Insurance Fraud Detection

Author: Kotekani Shamitha S.
Velchamy Ilango
Publication venue: Faculty of Electrical Engineering and Computing, Univ. of Zagreb
Publication date: 01/01/2020

Fraud detection has received considerable attention from academic research and industry worldwide due to its increasing popularity. Insurance datasets are enormous, with skewed distributions and high dimensionality. Skewed class distribution and data volume are considered significant problems when analyzing insurance datasets, as these issues increase misclassification rates. Although sampling approaches such as random oversampling and SMOTE can help balance the data, they can also increase computational complexity and degrade the model's performance. More sophisticated techniques are therefore needed to balance the skewed classes efficiently. This research focuses on optimizing the learner for fraud detection by applying a Fused Resampling and Cleaning Ensemble (FusedRCE) for effective sampling in health insurance fraud detection. We hypothesized that meticulous oversampling followed by guided data cleaning would improve prediction performance and the learner's understanding of the minority fraudulent classes compared to other sampling techniques. The proposed model works in three steps. In the first step, PCA is applied to extract the necessary features and reduce the dimensionality of the data. In the second step, a hybrid combination of k-means clustering and SMOTE oversampling is used to resample the imbalanced data. Oversampling introduces considerable noise into the data, so in the third step the balanced data is cleaned with the Tomek Link algorithm to remove the noisy samples generated during oversampling. The Tomek Link algorithm clears the boundary between minority and majority class samples, leaving the data cleaner and less noisy. The resulting dataset is used to train four classification algorithms (Logistic Regression, Decision Tree Classifier, k-Nearest Neighbors, and Neural Networks), evaluated with repeated 5-fold cross-validation. Compared to other classifiers, Neural Networks with FusedRCE had the highest average prediction rate of 98.9%. The results were also measured using F1 score, Precision, Recall and AUC. The results show that the proposed method performed significantly better than any other fraud detection approach in health insurance, predicting more fraudulent data with greater accuracy and a 3x increase in training speed.
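
The oversampling and cleaning steps can be illustrated with a small pure-Python sketch. It is not the authors' FusedRCE implementation: the PCA and k-means stages are omitted, plain SMOTE-style interpolation stands in for the hybrid k-means/SMOTE step, and dropping both members of each Tomek link is an assumption, since the abstract does not say which member is removed. All function names and the toy points are hypothetical.

```python
import math
import random


def smote(minority, n_new, k=2, seed=0):
    """Interpolate between a minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        others = [q for j, q in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda q: math.dist(minority[i], q))[:k]
        q = rng.choice(neighbours)
        t = rng.random()  # position along the segment between the two points
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(minority[i], q)))
    return synthetic


def tomek_links(X, y):
    """Return index pairs that are mutual nearest neighbours with different labels."""
    def nearest(i):
        return min((j for j in range(len(X)) if j != i),
                   key=lambda j: math.dist(X[i], X[j]))
    nn = [nearest(i) for i in range(len(X))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]


def remove_tomek_links(X, y):
    """Drop both members of every Tomek link (an assumption, not the paper's stated rule)."""
    drop = {i for pair in tomek_links(X, y) for i in pair}
    keep = [i for i in range(len(X)) if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]


# Hypothetical toy data: 5 majority points (label 0), 3 minority points (label 1).
majority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.9, 3.0)]
minority = [(3.0, 3.0), (4.0, 3.0), (3.0, 4.0)]
X = majority + minority
y = [0] * len(majority) + [1] * len(minority)

# On the raw data, the stray majority point (2.9, 3.0) forms a Tomek link
# with its minority neighbour (3.0, 3.0).
links = tomek_links(X, y)

# Oversample the minority class up to the majority size, then clean the result.
new_points = smote(minority, n_new=len(majority) - len(minority))
X_bal, y_bal = remove_tomek_links(X + new_points, y + [1] * len(new_points))
```

In practice a maintained library implementation would normally be preferred over hand-rolled code like this; the sketch is only meant to make the two operations concrete.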

HRČAK - Portal of Croatian Scientific and Professional Journals

Hrčak - Portal of scientific journals of Croatia

Software Defect Prediction Using AWEIG+ADACOST Bayesian Algorithm for Handling High Dimensional Data and Class Imbalance Problem

Author: Christanto Febrian Wahyu
Indriyawati Henny
Suntoro Joko
Publication venue: Universitas Kristen Satya Wacana
Publication date: 31/10/2018

The most important part of software engineering is software defect prediction. Software defect prediction is defined as the process of predicting errors, failures, and system errors in software. Researchers use machine learning methods to predict software defects, including estimation, association, classification, clustering, and dataset analysis. The NASA Metrics Data Program (NASA MDP) datasets are among the software metrics datasets that researchers use to predict software defects. NASA MDP datasets contain unbalanced classes and high-dimensional data, which lower the classification evaluation results. In this research, the unbalanced classes are handled with the AdaCost method and the high-dimensional data with the Average Weight Information Gain (AWEIG) method, while the classification method used is the Naïve Bayes algorithm. The proposed method is named AWEIG + AdaCost Bayesian. In this experiment, the AWEIG + AdaCost Bayesian algorithm is compared to the Naïve Bayes algorithm. The results show that the AWEIG + AdaCost Bayesian algorithm yields a better mean Area Under the Curve (AUC) than the Naïve Bayes algorithm alone, with mean AUC values of 0.752 and 0.696, respectively.
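
Since the comparison above rests on AUC, a short sketch of how that metric can be computed may help. It uses the standard pairwise (rank-based) formulation: the fraction of positive/negative pairs in which the positive example gets the higher score, with ties counted as half. The function name, scores and labels are hypothetical and unrelated to the NASA MDP data.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, label in zip(scores, labels) if label == 1]
    neg = [s for s, label in zip(scores, labels) if label == 0]
    # A positive scored above a negative counts 1, a tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical defect-probability scores for five modules (1 = defective).
scores = [0.9, 0.4, 0.35, 0.8, 0.1]
labels = [1, 1, 0, 0, 0]

print(auc(scores, labels))  # 5 of 6 positive/negative pairs are ranked correctly
```

Because AUC depends only on how examples are ranked, it is less sensitive to class imbalance than plain accuracy, which is presumably why it is used here.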

Portal Jurnal Elektronik Universitas Kristen Satya Wacana (UKSW)