

Machine Learning and Integrative Analysis of Biomedical Big Data.

Author: Choi Howard
Chung Neo Christopher
Mirza Bilal
Ping Peipei
Wang Jie
Wang Wei
Publication venue: eScholarship, University of California
Publication date: 01/01/2019

Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability issues.

Directory of Open Access Journals

eScholarship - University of California

Examining applying high performance genetic data feature selection and classification algorithms for colon cancer diagnosis

Author: Al-Rajab Murad
Lu Joan
Qiang Xu
Publication venue: 'Elsevier BV'
Publication date: 01/07/2017

Background and Objectives: This paper examines the accuracy and efficiency (time complexity) of high performance genetic data feature selection and classification algorithms for colon cancer diagnosis. The need for this research derives from the urgent and increasing need for accurate and efficient algorithms. Colon cancer is a leading cause of death worldwide, hence it is vitally important for the cancer tissues to be expertly identified and classified in a rapid and timely manner, both to assure a fast detection of the disease and to expedite the drug discovery process. Methods: In this research, a three-phase approach was proposed and implemented: Phases One and Two examined the feature selection algorithms and classification algorithms employed separately, and Phase Three examined the performance of the combination of these. Results: It was found from Phase One that the Particle Swarm Optimization (PSO) algorithm performed best with the colon dataset as a feature selection method (29 genes selected) and from Phase Two that the Support Vector Machine (SVM) algorithm outperformed other classifiers, with an accuracy of almost 86%. It was also found from Phase Three that the combined use of PSO and SVM surpassed other algorithms in accuracy and performance, and was faster in terms of time analysis (94%). Conclusions: It is concluded that applying feature selection algorithms prior to classification algorithms results in better accuracy than when the latter are applied alone. This conclusion is important and significant to industry and society.
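
The "select features first, then classify" pipeline described in this abstract can be sketched in a few lines of Python. This is only an illustration of the two-stage structure, not the paper's method: a simple mean-gap filter stands in for PSO, a nearest-centroid rule stands in for the SVM, and the tiny expression matrix and all function names are hypothetical.

```python
from statistics import mean

def rank_features(X, y):
    """Rank feature indices by the absolute gap between the two class means.

    This simple filter score is an assumption standing in for PSO.
    """
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(mean(pos) - mean(neg)))
    # Most discriminative feature first.
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)

def select(X, keep):
    """Keep only the chosen feature columns."""
    return [[row[j] for j in keep] for row in X]

def fit_centroids(X, y):
    """Nearest-centroid 'training': one mean vector per class (stand-in for SVM)."""
    centroids = {}
    for label in set(y):
        rows = [row for row, lab in zip(X, y) if lab == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids

def predict(centroids, row):
    """Return the label of the closest class centroid."""
    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(centroids[label], row))

# Hypothetical toy data: feature 0 separates the classes, features 1 and 2 are noise.
X = [[5.0, 1.0, 9.0], [5.2, 9.0, 1.0], [4.8, 5.0, 5.0],
     [1.0, 1.2, 8.8], [1.1, 8.9, 1.1], [0.9, 5.1, 4.9]]
y = [1, 1, 1, 0, 0, 0]

keep = rank_features(X, y)[:1]                      # stage 1: feature selection
centroids = fit_centroids(select(X, keep), y)       # stage 2: classification
prediction = predict(centroids, select([[4.9, 9.0, 9.0]], keep)[0])
```

On this toy input the filter keeps only feature 0, so the noisy values in the other two columns of the new sample do not influence the prediction.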

University of Huddersfield Repository

Huddersfield Research Portal

One-Class Classification: Taxonomy of Study and Review of Techniques

Author: Khan Shehroz S.
Madden Michael G.
Publication venue: 'Cambridge University Press (CUP)'
Publication date: 29/11/2013

One-class classification (OCC) algorithms aim to build classification models when the negative class is either absent, poorly sampled or not well defined. This unique situation constrains the learning of efficient classifiers by defining the class boundary just with the knowledge of the positive class. The OCC problem has been considered and applied under many research themes, such as outlier/novelty detection and concept learning. In this paper we present a unified view of the general problem of OCC by presenting a taxonomy of study for OCC problems, which is based on the availability of training data, algorithms used and the application domains applied. We further delve into each of the categories of the proposed taxonomy and present a comprehensive literature review of the OCC algorithms, techniques and methodologies with a focus on their significance, limitations and applications. We conclude our paper by discussing some open research problems in the field of OCC and present our vision for future research. Comment: 24 pages + 11 pages of references, 8 figures
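
To make the idea of learning a boundary from the positive class alone concrete, here is a minimal Python sketch. It is not one of the techniques surveyed in the paper: it assumes a plain centroid-plus-radius rule, and the function names, the `quantile` parameter and the sample points are all hypothetical.

```python
import math

def fit_one_class(positives, quantile=1.0):
    """Fit a boundary from positive examples only.

    The model is the centroid of the positives plus a radius taken from the
    sorted training distances (quantile=1.0 uses the farthest training point).
    """
    dim = len(positives[0])
    center = [sum(p[i] for p in positives) / len(positives) for i in range(dim)]
    dists = sorted(math.dist(p, center) for p in positives)
    idx = min(len(dists) - 1, int(quantile * len(dists)))
    return center, dists[idx]

def is_positive(model, point):
    """A point is accepted if it lies within the learned radius of the centroid."""
    center, radius = model
    return math.dist(point, center) <= radius

# Hypothetical positive-class samples; no negative examples are used for training.
positives = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.2), (0.9, 0.8)]
model = fit_one_class(positives)

near = is_positive(model, (1.05, 0.95))   # close to the training cloud
far = is_positive(model, (3.0, 3.0))      # far away, treated as an outlier
```

Lowering `quantile` shrinks the radius, which trades a tighter boundary for rejecting some genuine positives.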

arXiv.org e-Print Archive

Access to Research at National University of Ireland, Galway

A novel optimized deep learning method for protein-protein prediction in bioinformatics

Author: Chillar Rajender Singh
Thareja Preeti
Publication venue: Institute of Advanced Engineering and Science
Publication date: 01/02/2024

Proteins have been shown to perform critical activities in cellular processes and are required for the organism's existence and proliferation. On complicated protein-protein interaction (PPI) networks, conventional centrality approaches perform poorly. Machine learning algorithms based on enormous amounts of data do not make use of biological information's temporal and spatial dimensions. As a result, we developed a sequence-dependent PPI prediction model using an Aquila and shark noses-based hybrid prediction technique. This model operates in two stages: feature extraction and prediction. The features are acquired using the semantic similarity technique for good results. The acquired features are utilized to predict the PPI using hybrid deep networks: long short-term memory (LSTM) networks and restricted Boltzmann machines (RBMs). The weighting parameters of these neural networks (NNs) were changed using a novel hybrid optimization approach, aquila and shark noses (ASN), and the results revealed that our proposed ASN-based PPI prediction is more accurate and efficient than other existing techniques.

Institute of Advanced Engineering and Science

Evaluation of machine learning approaches for prediction of protein coding genes in prokaryotic DNA sequences

Author: Sandvik Yva Jacob
Publication venue: Norwegian University of Life Sciences, Ås
Publication date: 01/01/2022

According to the National Human Genome Research Institute, the amount of genomic data generated on a yearly basis is constantly increasing. This rapid growth in genomic data has led to a subsequent surge in the demand for efficient analysis and handling of said data. Gene prediction involves identifying the areas of a DNA sequence that code for proteins, also called protein coding genes. This task falls within the scope of bioinformatics, and there has been surprisingly little development in this field of study over the past years. Despite there being sufficient state-of-the-art gene prediction tools, there is still room for improvement in terms of efficiency and accuracy. Advances made within the field of gene prediction can, among other things, aid the medical and pharmaceutical industry, as well as impact environmental and anthropological research. Machine learning techniques such as Random Forest classifiers and Artificial Neural Networks (ANN) have proved successful at the task of gene prediction. In this thesis one deep learning model and two other machine learning models were tested. The first model implemented was the established Random Forest classifier. When it comes to the use of ensemble methods, such as the Random Forest classifier, feature engineering is critical for the success of such models. The exploration of different feature selection and extraction techniques underpinned its relevance. It also showed that feature importance varies greatly among genomes, and revealed possibilities that can be further explored in future work. The second model tested was the ensemble method Extreme Gradient Boosting (XGBoost), which served as a good competitor to the Random Forest classifier. Finally, a Recurrent Neural Network (RNN) was implemented. RNNs are known to handle sequential data well, so an RNN seemed like a good candidate for gene prediction.
The 15 prokaryotic genomes used to train the models were extracted from the NCBI genome database. Each model was tasked with classifying sub-sequences of the genomes, called open reading frames (ORFs), as either protein coding ORFs, or non-coding ORFs. One challenge when preparing these datasets was that the number of protein coding ORFs was very small compared to the number of non-coding ORFs. Another problem encountered in the dataset was that protein coding ORFs in general are longer than non-coding ORFs, which can bias the models to simply classify long ORFs as protein coding, and short ORFs as non-coding. For these reasons, two datasets for each genome were created, taking each imbalance into account. The models were trained, tuned and tested on both datasets for all genomes, and a combination of genomes. The models were evaluated with regard to accuracy, precision and recall. The results show that all three methods have potential and attained somewhat similar performance scores. Despite the fact that both time and data were limited during model development, they still yielded promising results. Considering there are several parameters that have not yet been tuned in all models, many possibilities for further research remain. The fact that a relatively simple RNN architecture performed so well, and has no requirement for feature engineering, shows great promise for further applications in gene prediction, and possibly other fields in bioinformatics.
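
Two ideas from this abstract lend themselves to a short Python sketch: extracting candidate ORFs from a sequence, and seeing why accuracy alone can mislead when coding ORFs are rare. The ORF definition used here (an `ATG` start up to the first in-frame stop codon, forward strand only) is an assumption, and the thesis may define ORFs differently. The function names, the `min_codons` parameter and the toy sequence and labels are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=2):
    """Return (start, end) slices of ATG...stop ORFs in the three forward frames.

    Assumption: an ORF runs from the first ATG to the next in-frame stop codon.
    The reverse strand is ignored to keep the sketch short.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Count codons before the stop, including the start codon.
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs

def scores(y_true, y_pred):
    """Return (accuracy, precision, recall) for binary labels, 1 = coding."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Hypothetical toy sequence with one ORF in frame 2.
orfs = find_orfs("CCATGAAATTTTAGGG")

# Hypothetical imbalanced labels: 1 coding ORF among 10, model predicts all non-coding.
y_true = [1] + [0] * 9
y_pred = [0] * 10
accuracy, precision, recall = scores(y_true, y_pred)
```

In the toy labels, a model that never predicts "coding" still reaches 0.9 accuracy while its recall is 0.0, which is why precision and recall are reported alongside accuracy for imbalanced ORF datasets.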