26,829 research outputs found
Pairwise alignment incorporating dipeptide covariation
Motivation: Standard algorithms for pairwise protein sequence alignment make
the simplifying assumption that amino acid substitutions at neighboring sites
are uncorrelated. This assumption allows implementation of fast algorithms for
pairwise sequence alignment, but it ignores information that could conceivably
increase the power of remote homolog detection. We examine the validity of this
assumption by constructing extended substitution matrixes that encapsulate the
observed correlations between neighboring sites, by developing an efficient and
rigorous algorithm for pairwise protein sequence alignment that incorporates
these local substitution correlations, and by assessing the ability of this
algorithm to detect remote homologies. Results: Our analysis indicates that
local correlations between substitutions are not strong on the average.
Furthermore, incorporating local substitution correlations into pairwise
alignment did not lead to a statistically significant improvement in remote
homology detection. Therefore, the standard assumption that individual residues
within protein sequences evolve independently of neighboring positions appears
to be an efficient and appropriate approximation
Transcription Factor-DNA Binding Via Machine Learning Ensembles
We present ensemble methods in a machine learning (ML) framework combining
predictions from five known motif/binding site exploration algorithms. For a
given TF the ensemble starts with position weight matrices (PWM's) for the
motif, collected from the component algorithms. Using dimension reduction, we
identify significant PWM-based subspaces for analysis. Within each subspace a
machine classifier is built for identifying the TF's gene (promoter) targets
(Problem 1). These PWM-based subspaces form an ML-based sequence analysis tool.
Problem 2 (finding binding motifs) is solved by agglomerating k-mer (string)
feature PWM-based subspaces that stand out in identifying gene targets. We
approach Problem 3 (binding sites) with a novel machine learning approach that
uses promoter string features and ML importance scores in a classification
algorithm locating binding sites across the genome. For target gene
identification this method improves performance (measured by the F1 score) by
about 10 percentage points over the (a) motif scanning method and (b) the
coexpression-based association method. Top motif outperformed 5 component
algorithms as well as two other common algorithms (BEST and DEME). For
identifying individual binding sites on a benchmark cross species database
(Tompa et al., 2005) we match the best performer without much human
intervention. It also improved the performance on mammalian TFs.
The ensemble can integrate orthogonal information from different weak
learners (potentially using entirely different types of features) into a
machine learner that can perform consistently better for more TFs. The TF gene
target identification component (problem 1 above) is useful in constructing a
transcriptional regulatory network from known TF-target associations. The
ensemble is easily extendable to include more tools as well as future PWM-based
information.Comment: 33 page
ExplainIt! -- A declarative root-cause analysis engine for time series data (extended version)
We present ExplainIt!, a declarative, unsupervised root-cause analysis engine
that uses time series monitoring data from large complex systems such as data
centres. ExplainIt! empowers operators to succinctly specify a large number of
causal hypotheses to search for causes of interesting events. ExplainIt! then
ranks these hypotheses, reducing the number of causal dependencies from
hundreds of thousands to a handful for human understanding. We show how a
declarative language, such as SQL, can be effective in declaratively
enumerating hypotheses that probe the structure of an unknown probabilistic
graphical causal model of the underlying system. Our thesis is that databases
are in a unique position to enable users to rapidly explore the possible causal
mechanisms in data collected from diverse sources. We empirically demonstrate
how ExplainIt! had helped us resolve over 30 performance issues in a commercial
product since late 2014, of which we discuss a few cases in detail.Comment: SIGMOD Industry Track 201
A MOSAIC of methods: Improving ortholog detection through integration of algorithmic diversity
Ortholog detection (OD) is a critical step for comparative genomic analysis
of protein-coding sequences. In this paper, we begin with a comprehensive
comparison of four popular, methodologically diverse OD methods: MultiParanoid,
Blat, Multiz, and OMA. In head-to-head comparisons, these methods are shown to
significantly outperform one another 12-30% of the time. This high
complementarity motivates the presentation of the first tool for integrating
methodologically diverse OD methods. We term this program MOSAIC, or Multiple
Orthologous Sequence Analysis and Integration by Cluster optimization. Relative
to component and competing methods, we demonstrate that MOSAIC more than
quintuples the number of alignments for which all species are present, while
simultaneously maintaining or improving functional-, phylogenetic-, and
sequence identity-based measures of ortholog quality. Further, we demonstrate
that this improvement in alignment quality yields 40-280% more confidently
aligned sites. Combined, these factors translate to higher estimated levels of
overall conservation, while at the same time allowing for the detection of up
to 180% more positively selected sites. MOSAIC is available as python package.
MOSAIC alignments, source code, and full documentation are available at
http://pythonhosted.org/bio-MOSAIC
Dissecting the Specificity of Protein-Protein Interaction in Bacterial Two-Component Signaling: Orphans and Crosstalks
Predictive understanding of the myriads of signal transduction pathways in a
cell is an outstanding challenge of systems biology. Such pathways are
primarily mediated by specific but transient protein-protein interactions,
which are difficult to study experimentally. In this study, we dissect the
specificity of protein-protein interactions governing two-component signaling
(TCS) systems ubiquitously used in bacteria. Exploiting the large number of
sequenced bacterial genomes and an operon structure which packages many pairs
of interacting TCS proteins together, we developed a computational approach to
extract a molecular interaction code capturing the preferences of a small but
critical number of directly interacting residue pairs. This code is found to
reflect physical interaction mechanisms, with the strongest signal coming from
charged amino acids. It is used to predict the specificity of TCS interaction:
Our results compare favorably to most available experimental results, including
the prediction of 7 (out of 8 known) interaction partners of orphan signaling
proteins in Caulobacter crescentus. Surveying among the available bacterial
genomes, our results suggest 15~25% of the TCS proteins could participate in
out-of-operon "crosstalks". Additionally, we predict clusters of crosstalking
candidates, expanding from the anecdotally known examples in model organisms.
The tools and results presented here can be used to guide experimental studies
towards a system-level understanding of two-component signaling.Comment: Supplementary information available on
http://www.plosone.org/article/info:doi/10.1371/journal.pone.001972
Patient-specific data fusion for cancer stratification and personalised treatment
According to Cancer Research UK, cancer is a leading cause of death accounting for more than one in four of all deaths in 2011. The recent advances in experimental technologies in cancer research have resulted in the accumulation of large amounts of patient-specific datasets, which provide complementary information on the same cancer type. We introduce a versatile data fusion (integration) framework that can effectively integrate somatic mutation data, molecular interactions and drug chemical data to address three key challenges in cancer research: stratification of patients into groups having different clinical outcomes, prediction of driver genes whose mutations trigger the onset and development of cancers, and repurposing of drugs treating particular cancer patient groups. Our new framework is based on graph-regularised non-negative matrix tri-factorization, a machine learning technique for co-clustering heterogeneous datasets. We apply our framework on ovarian cancer data to simultaneously cluster patients, genes and drugs by utilising all datasets.We demonstrate superior performance of our method over the state-of-the-art method, Network-based Stratification, in identifying three patient subgroups that have significant differences in survival outcomes and that are in good agreement with other clinical data. Also, we identify potential new driver genes that we obtain by analysing the gene clusters enriched in known drivers of ovarian cancer progression. We validated the top scoring genes identified as new drivers through database search and biomedical literature curation. Finally, we identify potential candidate drugs for repurposing that could be used in treatment of the identified patient subgroups by targeting their mutated gene products. We validated a large percentage of our drug-target predictions by using other databases and through literature curation
- …