Search CORE

3,945 research outputs found

Machine Learning and Integrative Analysis of Biomedical Big Data.

Author: Choi Howard
Chung Neo Christopher
Mirza Bilal
Ping Peipei
Wang Jie
Wang Wei
Publication venue: eScholarship, University of California
Publication date: 01/01/2019
Field of study

Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbates the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues

Multidisciplinary Digital Publishing Institute

Ezid

Directory of Open Access Journals

eScholarship - University of California

One Decade of Development and Evolution of MicroRNA Target Prediction Algorithms

Author: Alexiou
Altuvia
Baek
Bandyopadhyay
Barreau
Bartel
Bartel
Betel
Betel
Cai
Chandra
Chi
Didiano
Elisa Ficarra
Enright
Friedman
Friedman
Gaidatzis
Garcia
Griffiths-Jones
Grimson
Guo
Hafner
Hsu
Jacobsen
Ji
Jin
John
Kertesz
Khan
Kim
Kiriakidou
Kruger
Kumar
Lagos-Quintana
Lall
Lee
Lee
Lewis
Lim
Lund
Maragkakis
Mendes
Min
Miranda
Muckstein
Pandey
Papadopoulos
Paula H. Reyes∼Herrera
Reyes∼Herrera
Saetrom
Sandberg
Schmidt
Selbach
Stefani
Sturm
Tan
Thomas
Thomson
Vergoulis
Wang
Watanabe
Wen
Witkos
Xiao
Yan
Yang
Yang
Yousef
Zhao
Publication venue: Elsevier BV:PO Box 211, 1000 AE Amsterdam Netherlands:011 31 20 4853757, 011 31 20 4853642, 011 31 20 4853641, EMAIL: [email protected], INTERNET: http://www.elsevier.nl, Fax: 011 31 20 4853598
Publication date: 01/01/2012
Field of study

Nearly two decades have passed since the publication of the first study reporting the discovery of microRNAs (miRNAs). The key role of miRNAs in post-transcriptional gene regulation led to the performance of an increasing number of studies focusing on origins, mechanisms of action and functionality of miRNAs. In order to associate each miRNA to a specific functionality it is essential to unveil the rules that govern miRNA action. Despite the fact that there has been significant improvement exposing structural characteristics of the miRNA-mRNA interaction, the entire physical mechanism is not yet fully understood. In this respect, the development of computational algorithms for miRNA target prediction becomes increasingly important. This manuscript summarizes the research done on miRNA target prediction. It describes the experimental data currently available and used in the field and presents three lines of computational approaches for target prediction. Finally, the authors put forward a number of considerations regarding current challenges and future direction

Elsevier - Publisher Connector

Crossref

PubMed Central

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

Archivio istituzionale della ricerca - Università di Modena e Reggio Emilia

PORTO Publications Open Repository TOrino

Transcription Factor-DNA Binding Via Machine Learning Ensembles

Author: DeLisi Charles
Fan Yue
Kon Mark
Publication venue
Publication date: 09/05/2018
Field of study

We present ensemble methods in a machine learning (ML) framework combining predictions from five known motif/binding site exploration algorithms. For a given TF the ensemble starts with position weight matrices (PWM's) for the motif, collected from the component algorithms. Using dimension reduction, we identify significant PWM-based subspaces for analysis. Within each subspace a machine classifier is built for identifying the TF's gene (promoter) targets (Problem 1). These PWM-based subspaces form an ML-based sequence analysis tool. Problem 2 (finding binding motifs) is solved by agglomerating k-mer (string) feature PWM-based subspaces that stand out in identifying gene targets. We approach Problem 3 (binding sites) with a novel machine learning approach that uses promoter string features and ML importance scores in a classification algorithm locating binding sites across the genome. For target gene identification this method improves performance (measured by the F1 score) by about 10 percentage points over the (a) motif scanning method and (b) the coexpression-based association method. Top motif outperformed 5 component algorithms as well as two other common algorithms (BEST and DEME). For identifying individual binding sites on a benchmark cross species database (Tompa et al., 2005) we match the best performer without much human intervention. It also improved the performance on mammalian TFs. The ensemble can integrate orthogonal information from different weak learners (potentially using entirely different types of features) into a machine learner that can perform consistently better for more TFs. The TF gene target identification component (problem 1 above) is useful in constructing a transcriptional regulatory network from known TF-target associations. The ensemble is easily extendable to include more tools as well as future PWM-based information.Comment: 33 page

arXiv.org e-Print Archive

Boston University Institutional Repository (OpenBU)

Classifying pairs with trees for supervised biological network inference

Author: Babu M. Madan
Geurts Pierre
Schrynemackers Marie
Wehenkel Louis
Publication venue
Publication date: 24/04/2014
Field of study

Networks are ubiquitous in biology and computational approaches have been largely investigated for their inference. In particular, supervised machine learning methods can be used to complete a partially known network by integrating various measurements. Two main supervised frameworks have been proposed: the local approach, which trains a separate model for each network node, and the global approach, which trains a single model over pairs of nodes. Here, we systematically investigate, theoretically and empirically, the exploitation of tree-based ensemble methods in the context of these two approaches for biological network inference. We first formalize the problem of network inference as classification of pairs, unifying in the process homogeneous and bipartite graphs and discussing two main sampling schemes. We then present the global and the local approaches, extending the later for the prediction of interactions between two unseen network nodes, and discuss their specializations to tree-based ensemble methods, highlighting their interpretability and drawing links with clustering techniques. Extensive computational experiments are carried out with these methods on various biological networks that clearly highlight that these methods are competitive with existing methods.Comment: 22 page

arXiv.org e-Print Archive

CiteSeerX

An Integrative Multi-Network and Multi-Classifier Approach to Predict Genetic Interactions

Author: Aaron N. Chang
AH Tong
AH Tong
AP Gasch
AP Jarvinen
Bin Zhang
C Boone
C Rodrigues-Pousada
Chad L. Myers
CL Myers
D Lin
DS McNabb
EA Winzeler
Eric E. Schadt
G Pandey
G Weiss
Gary D. Bader
Gaurav Pandey
IH Witten
J Zhu
JS Edwards
Jun Zhu
K Tan
KC Chipman
L Fernandes
L Royer
M Costanzo
M Kanehisa
PT Spellman
PW Lord
R Kelley
RB Brem
RO Duda
S Mnaimneh
S Onami
SF Altschul
SL Wong
SR Collins
SR Paladugu
SV Date
T Mitchell
T Nevitt
TG Dietterich
TR Hughes
Vipin Kumar
W Zhong
Y Qi
Y Tao
Z Hu
Publication venue: Public Library of Science
Publication date: 09/09/2010
Field of study

Genetic interactions occur when a combination of mutations results in a surprising phenotype. These interactions capture functional redundancy, and thus are important for predicting function, dissecting protein complexes into functional pathways, and exploring the mechanistic underpinnings of common human diseases. Synthetic sickness and lethality are the most studied types of genetic interactions in yeast. However, even in yeast, only a small proportion of gene pairs have been tested for genetic interactions due to the large number of possible combinations of gene pairs. To expand the set of known synthetic lethal (SL) interactions, we have devised an integrative, multi-network approach for predicting these interactions that significantly improves upon the existing approaches. First, we defined a large number of features for characterizing the relationships between pairs of genes from various data sources. In particular, these features are independent of the known SL interactions, in contrast to some previous approaches. Using these features, we developed a non-parametric multi-classifier system for predicting SL interactions that enabled the simultaneous use of multiple classification procedures. Several comprehensive experiments demonstrated that the SL-independent features in conjunction with the advanced classification scheme led to an improved performance when compared to the current state of the art method. Using this approach, we derived the first yeast transcription factor genetic interaction network, part of which was well supported by literature. We also used this approach to predict SL interactions between all non-essential gene pairs in yeast (http://sage.fhcrc.org/downloads/downloads/predicted_yeast_genetic_interactions.zip). This integrative approach is expected to be more effective and robust in uncovering new genetic interactions from the tens of millions of unknown gene pairs in yeast and from the hundreds of millions of gene pairs in higher organisms like mouse and human, in which very few genetic interactions have been identified to date

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Prediction of aptamer-protein interacting pairs using an ensemble classifier in combination with various protein sequence attributes

Author: Chengjin Zhang
Lina Zhang
Qing Song
Rui Gao
Runtao Yang
Publication venue: Springer Nature
Publication date: 01/01/2016
Field of study

The ranked feature list given by the Relief algorithm. Within the list, a feature with a smaller index indicates that it is more important for aptamer-protein interacting pair prediction. Such a list of ranked features are used to establish the optimal feature set in the IFS procedure. (XLS 56.5 kb

Springer - Publisher Connector

FigShare

Prediction of DNA-Binding Proteins and their Binding Sites

Author: Pokhrel Pujan
Publication venue: ScholarWorks@UNO
Publication date: 01/05/2018
Field of study

DNA-binding proteins play an important role in various essential biological processes such as DNA replication, recombination, repair, gene transcription, and expression. The identification of DNA-binding proteins and the residues involved in the contacts is important for understanding the DNA-binding mechanism in proteins. Moreover, it has been reported in the literature that the mutations of some DNA-binding residues on proteins are associated with some diseases. The identification of these proteins and their binding mechanism generally require experimental techniques, which makes large scale study extremely difficult. Thus, the prediction of DNA-binding proteins and their binding sites from sequences alone is one of the most challenging problems in the field of genome annotation. Since the start of the human genome project, many attempts have been made to solve the problem with different approaches, but the accuracy of these methods is still not suitable to do large scale annotation of proteins. Rather than relying solely on the existing machine learning techniques, I sought to combine those using novel “stacking technique” and used the problem-specific architectures to solve the problem with better accuracy than the existing methods. This thesis presents a possible solution to the DNA-binding proteins prediction problem which performs better than the state-of-the-art approaches

University of New Orleans

Pretata: predicting TATA binding proteins with novel features and dimensionality reduction strategy

Author: Ju Ying
Tang Jijun
Wan Shixiang
Zeng Xiangxiang
Zou Quan
Publication venue
Publication date: 07/03/2017
Field of study

Background: It is necessary and essential to discovery protein function from the novel primary sequences. Wet lab experimental procedures are not only time-consuming, but also costly, so predicting protein structure and function reliably based only on amino acid sequence has significant value. TATA-binding protein (TBP) is a kind of DNA binding protein, which plays a key role in the transcription regulation. Our study proposed an automatic approach for identifying TATA-binding proteins efficiently, accurately, and conveniently. This method would guide for the special protein identification with computational intelligence strategies. Results: Firstly, we proposed novel fingerprint features for TBP based on pseudo amino acid composition, physicochemical properties, and secondary structure. Secondly, hierarchical features dimensionality reduction strategies were employed to improve the performance furthermore. Currently, Pretata achieves 92.92% TATA- binding protein prediction accuracy, which is better than all other existing methods. Conclusions: The experiments demonstrate that our method could greatly improve the prediction accuracy and speed, thus allowing large-scale NGS data prediction to be practical. A web server is developed to facilitate the other researchers, which can be accessed at http://server.malab.cn/preTata/

arXiv.org e-Print Archive

Springer - Publisher Connector