2,865 research outputs found
Machine Learning and Integrative Analysis of Biomedical Big Data.
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) are analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues.
Representability of algebraic topology for biomolecules in machine learning based scoring and virtual screening
This work introduces a number of algebraic topology approaches, such as
multicomponent persistent homology, multi-level persistent homology and
electrostatic persistence for the representation, characterization, and
description of small molecules and biomolecular complexes. Multicomponent
persistent homology retains critical chemical and biological information during
the topological simplification of biomolecular geometric complexity.
Multi-level persistent homology enables a tailored topological description of
inter- and/or intra-molecular interactions of interest. Electrostatic
persistence incorporates partial charge information into topological
invariants. These topological methods are paired with Wasserstein distance to
characterize similarities between molecules and are further integrated with a
variety of machine learning algorithms, including k-nearest neighbors, ensemble
of trees, and deep convolutional neural networks, to manifest their descriptive
and predictive powers for chemical and biological problems. Extensive numerical
experiments involving more than 4,000 protein-ligand complexes from the PDBBind
database and nearly 100,000 ligands and decoys in the DUD database are performed
to test, respectively, the scoring power and the virtual screening power of the
proposed topological approaches. It is demonstrated that the present approaches
outperform the modern machine learning based methods in protein-ligand binding
affinity predictions and ligand-decoy discrimination.
Classification and Scoring of Protein Complexes
Protein interactions mediate all biological systems in a cell; understanding these interactions
means understanding the processes responsible for human life. Protein structures can
be obtained experimentally, but experimental methods frequently fail to determine the structures
of protein complexes. To address this issue, computational methods have been developed
that attempt to predict the structure of a protein complex using information about its constituents.
These methods, known as docking, generate thousands of possible poses for
each complex, and require effective and reliable ways to quickly discriminate the correct
pose from the set of incorrect ones. In this thesis, a new scoring function was developed
that uses machine learning techniques and features extracted from the structure of the
interacting proteins to correctly classify and rank the putative poses. The developed
function has been shown to be competitive with current state-of-the-art solutions.
One-Class Classification: Taxonomy of Study and Review of Techniques
One-class classification (OCC) algorithms aim to build classification models
when the negative class is either absent, poorly sampled or not well defined.
This unique situation constrains the learning of efficient classifiers, since
the class boundary must be defined using knowledge of the positive class only. The OCC
problem has been considered and applied under many research themes, such as
outlier/novelty detection and concept learning. In this paper we present a
unified view of the general problem of OCC by presenting a taxonomy of study
for OCC problems based on the availability of training data, the
algorithms used and the application domains. We further delve into each
of the categories of the proposed taxonomy and present a comprehensive
literature review of the OCC algorithms, techniques and methodologies with a
focus on their significance, limitations and applications. We conclude our
paper by discussing some open research problems in the field of OCC and present
our vision for future research. Comment: 24 pages + 11 pages of references, 8 figures
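To make the OCC setting above concrete, here is a minimal sketch of a one-class classifier that learns a boundary from positive samples only. It is not a method taken from the review: the centroid-plus-radius model, the function names `fit_one_class` and `predict`, and the `quantile` parameter are all assumptions chosen for illustration.

```python
import math


def fit_one_class(positives, quantile=0.95):
    """Fit a centroid-and-radius boundary from positive samples only.

    Illustrative assumption: the boundary is a sphere around the centroid
    whose radius is a high quantile of the training distances.
    """
    dim = len(positives[0])
    n = len(positives)
    centroid = [sum(p[i] for p in positives) / n for i in range(dim)]
    # Distances of the training points to the centroid, in ascending order.
    dists = sorted(math.dist(p, centroid) for p in positives)
    # Index of the chosen quantile, clamped to the last element.
    idx = min(n - 1, int(quantile * n))
    return centroid, dists[idx]


def predict(model, x):
    """Return True if x falls inside the learned boundary (positive class)."""
    centroid, radius = model
    return math.dist(x, centroid) <= radius


# Usage: train on positives only, then test a nearby and a distant point.
model = fit_one_class([(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5)])
inside = predict(model, (0.6, 0.4))
outside = predict(model, (5, 5))
```

No negative examples are used at any point; points outside the radius are simply treated as outliers, which is the situation the abstract describes.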
A Proposed Approach for Predicting Liver Disease
One of the main challenges is to exploit recent technologies in a way that helps preserve human life. The liver is one of the largest and most important organs of the human body, and liver disease has a great impact on human life, as shown by the massive number of deaths it causes. It is therefore important to predict liver disease as accurately as possible; current approaches suffer from weak prediction accuracy and do not predict the severity of the disease. In this paper, we aim to improve the prediction of liver disease, predict its severity, and build a recommender system that recommends appropriate medical advice according to the patient's condition, using machine learning algorithms and tools such as GridSearchCV. The Indian Liver Patient Dataset (ILPD) and the hepatitis C virus (HCV) dataset are our training datasets. On the ILPD dataset, the proposed solution reached prediction accuracies of 80% and 77% with the extra trees and KNN algorithms, respectively. On the HCV dataset, the gradient boosting and logistic regression algorithms reached 96% accuracy for the liver disease prediction, disease severity, and patient recommendation system models.
DTi2Vec: Drug-target interaction prediction using network embedding and ensemble learning.
Drug-target interaction (DTI) prediction is a crucial step in drug discovery and repositioning, as it reduces experimental validation costs if done right. Thus, developing in-silico methods to predict potential DTIs has become a competitive research niche, with one of its main focuses being improved prediction accuracy. Using machine learning (ML) models for this task, specifically network-based approaches, is effective and has shown great advantages over other computational methods. However, ML model development involves upstream hand-crafted feature extraction and other processes that impact prediction accuracy. Thus, network-based representation learning techniques that provide automated feature extraction, combined with traditional ML classifiers that handle the downstream link prediction task, may be better-suited paradigms. Here, we present such a method, DTi2Vec, which identifies DTIs using network representation learning and ensemble learning techniques. DTi2Vec constructs a heterogeneous network and then automatically generates features for each drug and target using a node embedding technique. DTi2Vec demonstrated its ability in drug-target link prediction compared to several state-of-the-art network-based methods, using four benchmark datasets and large-scale data compiled from DrugBank. DTi2Vec showed a statistically significant increase in prediction performance in terms of AUPR. We verified the novel predicted DTIs using several databases and the scientific literature. DTi2Vec is a simple yet effective method that provides high DTI prediction performance while being scalable and computationally efficient, translating into a powerful drug repositioning tool.
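The abstract above reports performance in terms of AUPR (area under the precision-recall curve). As a rough illustration of what that metric measures, the sketch below computes average precision, a common step-wise estimate of AUPR. The function name `average_precision` and the toy labels and scores are hypothetical, and this is not the evaluation code used by the DTi2Vec authors.

```python
def average_precision(labels, scores):
    """Step-wise estimate of the area under the precision-recall curve.

    labels: 1 for a true interaction, 0 otherwise.
    scores: predicted scores, higher meaning more likely to interact.
    Assumption: tied scores are ranked in input order (no special handling).
    """
    # Rank candidate pairs from the highest to the lowest predicted score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("at least one positive label is required")
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            # Precision at the rank where this positive is retrieved.
            precision_sum += hits / rank
    return precision_sum / total_positives


# Toy example: two true interactions among four candidate pairs.
ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
```

Because the score depends only on how highly the true interactions are ranked, it is often preferred over accuracy when, as in link prediction, known positives are far rarer than candidate pairs.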
- …