Machine Learning and Integrative Analysis of Biomedical Big Data.

Author: Choi Howard
Chung Neo Christopher
Mirza Bilal
Ping Peipei
Wang Jie
Wang Wei
Publication venue: eScholarship, University of California
Publication date: 01/01/2019

Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability issues.

Directory of Open Access Journals

Robust classification of high dimensional unbalanced single and multi-label datasets

Author: Braytee Ali
Publication date: 01/01/2018

University of Technology Sydney. Faculty of Engineering and Information Technology.

Single and multi-label classification are arguably two of the most important topics within the field of machine learning. Single-label classification refers to the case where each sample is assigned to one class, and multi-label classification is where instances are associated with multiple labels simultaneously. Research to build robust single and multi-label classification models is still ongoing in the data analytics community because of the emerging complexities in real-world data and the increasing research interest in the use of data analytics techniques in many fields, including biomedicine, finance, text mining, text categorization, and images. Real-world datasets contain complexities that degrade the performance of classifiers. These complexities, or open challenges, are: imbalanced data, low numbers of samples, high dimensionality, highly correlated features, label correlations, and missing labels in multi-label space.

Several research gaps are identified and motivate this thesis. Class imbalance occurs when the distribution of classes is not uniform among samples. Feature extraction is used to reduce the dimensionality of data. However, the presence of highly imbalanced data in single-label classification misleads existing unsupervised and supervised feature extraction techniques: it produces features biased towards classification of the majority class, and results in poor classification performance, especially for the minority class. Furthermore, imbalanced multi-labeled data is more ubiquitous than single-labeled data because of several issues including label correlation, incomplete multi-label matrices, and noisy and irrelevant features. High-dimensional, highly correlated data exist in several domains such as genomics. Many feature selection techniques consider correlated features redundant and therefore remove them. Several studies investigate the interpretation of the correlated features in domains such as genomics, but investigating the classification capabilities of the correlated feature groups in single-labeled data is a point of interest in several domains. Moreover, high-dimensional multi-labeled data is more challenging than single-labeled data. Relatively few feature selection methods have been proposed to select the discriminative features among multiple labels, due to issues including interdependent labels, different instances sharing different label correlations, correlated features, and missing and noisy labels.

This thesis proposes a series of novel machine learning algorithms to handle the negative effects of the above-mentioned problems and improve the performance of classifiers on single and multi-labeled data. There are seven contributions in this thesis.
Contribution 1 proposes novel cost-sensitive principal component analysis (CSPCA) and cost-sensitive non-negative matrix factorization (CSNMF) methods for handling feature extraction of imbalanced single-labeled data.
Contribution 2 extends a standard non-negative matrix factorization to a balanced supervised non-negative matrix factorization (BSNMF) to handle the class imbalance problem in supervised non-negative matrix factorization.
Contribution 3 introduces an ABC-Sampling algorithm for balancing imbalanced datasets based on the Artificial Bee Colony algorithm.
Contribution 4 develops a novel supervised feature selection algorithm (SCANMF) by jointly integrating correlation network and structural analysis of the balanced supervised non-negative matrix factorization to handle high-dimensional, highly correlated single-labeled data.
Contribution 5 proposes an ensemble feature ranking method using co-expression networks to select optimal features for classification.
Contribution 6 proposes a Correlated- and Multi-label Feature Selection method (CMFS), based on NMF, for simultaneously performing multi-label feature selection and addressing the following challenges: interdependent labels, different instances sharing different label correlations, correlated features, and missing and flawed labels.
Contribution 7 presents an integrated multi-label approach (ML-CIB) for simultaneously training the multi-label classification model and addressing the following challenges: class imbalance, label correlation, incomplete multi-label matrices, and noisy and irrelevant features.

The performance of all novel algorithms in this thesis is evaluated in terms of single and multi-label classification accuracy. The proposed algorithms are evaluated on a childhood leukaemia dataset from The Children's Hospital at Westmead, and on public datasets from online repositories covering fields including genomics, finance, text mining, images, and others. Moreover, all the results of the proposed algorithms in this thesis are compared to state-of-the-art methods. The experimental results indicate that the proposed algorithms outperform the state-of-the-art methods. Further, several statistical tests, including the t-test and the Friedman test, are applied to the results to demonstrate the statistical significance of the proposed methods.
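As a rough illustration of the class-imbalance problem this abstract describes, the sketch below computes inverse-frequency class weights, a common generic heuristic for making minority-class errors count more during training. It is not the thesis's CSPCA, CSNMF, or ABC-Sampling method; the function name `inverse_frequency_weights` and the toy labels are assumptions made only for this example.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return one weight per class: n_samples / (n_classes * class_count).

    Generic heuristic (assumption for illustration), not the thesis's method.
    Rare classes receive larger weights than frequent ones.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical toy labels: 8 majority-class samples and 2 minority-class samples.
labels = ["majority"] * 8 + ["minority"] * 2
weights = inverse_frequency_weights(labels)
print(weights)
```

With these toy labels the minority class is weighted four times as heavily as the majority class, which is one simple way to counter the majority-class bias the abstract mentions.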

OPUS - University of Technology Sydney

Predicting Pancreatic Cancer Using Support Vector Machine

Author: Bodkhe Akshay
Publication venue: SJSU ScholarWorks
Publication date: 26/05/2017

This report presents an approach to predicting pancreatic cancer using the Support Vector Machine classification algorithm. The research objective of this project is to predict pancreatic cancer using genomic data alone, clinical data alone, and a combination of genomic and clinical data. We used real genomic data having 22,763 samples and 154 features per sample. We also created synthetic clinical data having 400 samples and 7 features per sample in order to assess prediction accuracy on clinical data alone. To validate the hypothesis, we combined the synthetic clinical data with a subset of features from the real genomic data. In our results, prediction accuracy, precision, and recall with genomic data alone are 80.77%, 20%, and 4%, respectively. With synthetic clinical data alone they are 93.33%, 95%, and 30%, while for the combination of real genomic and synthetic clinical data they are 90.83%, 10%, and 5%. The combination of real genomic and synthetic clinical data decreased the accuracy relative to synthetic clinical data alone, since the genomic data is weakly correlated. Thus we conclude that the combination of genomic and clinical data does not improve pancreatic cancer prediction accuracy. A dataset with more significant genomic features might help to predict pancreatic cancer more accurately.
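The gap between high accuracy and low recall reported above is typical when positive cases are rare. The sketch below computes the three metrics from confusion-matrix counts using their standard definitions. The counts are hypothetical and chosen only to show the effect; they are not taken from the report, and the function name `classification_metrics` is an assumption for this example.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, and recall from confusion-matrix counts.

    tp/fp/fn/tn are true positives, false positives, false negatives,
    and true negatives. Undefined ratios (zero denominator) return 0.0.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical counts: 100 samples, only 10 of them positive.
# The classifier flags 2 samples as positive and is right about 1.
accuracy, precision, recall = classification_metrics(tp=1, fp=1, fn=9, tn=89)
print(f"accuracy={accuracy:.2%} precision={precision:.2%} recall={recall:.2%}")
```

Here most of the accuracy comes from correctly labelled negatives, so accuracy alone says little about how well the rare positive class is detected.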
