Beyond similarity: A network approach for identifying and delimiting biogeographical regions
Biogeographical regions (geographically distinct assemblages of species and
communities) constitute a cornerstone for ecology, biogeography, evolution and
conservation biology. Species turnover measures are often used to quantify
biodiversity patterns, but algorithms based on similarity and clustering are
highly sensitive to common biases and intricacies of species distribution data.
Here we apply a community detection approach from network theory that
incorporates complex, higher-order presence-absence patterns. We demonstrate
the performance of the method by applying it to all amphibian species in the
world (c. 6,100 species), all vascular plant species of the USA (c. 17,600),
and a hypothetical dataset containing a zone of biotic transition. In
comparison with current methods, our approach tackles the challenges posed by
transition zones and succeeds in identifying a larger number of commonly
recognised biogeographical regions. This method constitutes an important
advance towards objective, data-derived identification and delimitation of the
world's biogeographical regions.
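The network approach can be illustrated in miniature. The sketch below is not the authors' algorithm (they apply a different community detection method to the same kind of network); here simple label propagation with deterministic tie-breaking stands in for it, on a bipartite site-species graph built from presence-absence records. All site and species names are invented.

```python
# Toy presence-absence records: which species occur at which sites.
# Hypothetical data, for illustration only.
presence = {
    "site_A": {"frog1", "frog2"},
    "site_B": {"frog1", "frog2", "frog3"},
    "site_C": {"toad1", "toad2"},
    "site_D": {"toad1", "toad2", "toad3"},
}

# Bipartite adjacency: sites connect only to the species recorded in them.
adj = {}
for site, species in presence.items():
    for sp in species:
        adj.setdefault(site, set()).add(sp)
        adj.setdefault(sp, set()).add(site)

def label_propagation(adj, iters=10):
    """Simple community detection: each node repeatedly adopts the most
    common label among its neighbours (ties broken by lexicographic min,
    so the run is deterministic)."""
    labels = {n: n for n in adj}  # every node starts in its own community
    for _ in range(iters):
        for n in sorted(adj):
            counts = {}
            for nb in adj[n]:
                counts[labels[nb]] = counts.get(labels[nb], 0) + 1
            best = max(counts.values())
            labels[n] = min(l for l, c in counts.items() if c == best)
    return labels

labels = label_propagation(adj)

# Sites sharing a final label form one candidate bioregion.
regions = {}
for site in presence:
    regions.setdefault(labels[site], []).append(site)
print(sorted(regions.values()))  # [['site_A', 'site_B'], ['site_C', 'site_D']]
```

Because communities are read off the whole bipartite network rather than from pairwise site similarities, shared species pull sites together through higher-order paths, which is the property the abstract highlights for transition zones.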
BICEPP: an example-based statistical text mining method for predicting the binary characteristics of drugs
Background: The identification of drug characteristics is a clinically important task, but it requires much expert knowledge and consumes substantial resources. We have developed a statistical text-mining approach (BInary Characteristics Extractor and biomedical Properties Predictor: BICEPP) to help experts screen drugs that may have important clinical characteristics of interest.
Results: BICEPP first retrieves MEDLINE abstracts containing drug names, then selects tokens that best predict the list of drugs which represents the characteristic of interest. Machine learning is then used to classify drugs using a document frequency-based measure. Evaluation experiments were performed to validate BICEPP's performance on 484 characteristics of 857 drugs, identified from the Australian Medicines Handbook (AMH) and the PharmacoKinetic Interaction Screening (PKIS) database. Stratified cross-validations revealed that BICEPP was able to classify drugs into all 20 major therapeutic classes (100%) and 157 (of 197) minor drug classes (80%) with areas under the receiver operating characteristic curve (AUC) > 0.80. Similarly, AUC > 0.80 could be obtained in the classification of 173 (of 238) adverse events (73%), up to 12 (of 15) groups of clinically significant cytochrome P450 enzyme (CYP) inducers or inhibitors (80%), and up to 11 (of 14) groups of narrow therapeutic index drugs (79%). Interestingly, it was observed that the keywords used to describe a drug characteristic were not necessarily the most predictive ones for the classification task.
Conclusions: BICEPP has sufficient classification power to automatically distinguish a wide range of clinical properties of drugs. This may be used in pharmacovigilance applications to assist with rapid screening of large drug databases to identify important characteristics for further evaluation.
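As an illustration of the kind of document-frequency signal BICEPP exploits (this is not the published implementation; the drugs, abstracts, and labels below are invented), one can score tokens by the gap in their per-drug document frequency between drugs with and without a characteristic:

```python
from collections import Counter

# Toy corpora: abstracts pooled per drug; labels are hypothetical.
abstracts = {
    "drugA": ["sedation reported in elderly patients", "sedation and dizziness"],
    "drugB": ["marked sedation observed"],
    "drugC": ["no adverse events reported"],
    "drugD": ["well tolerated in trials"],
}
has_characteristic = {"drugA": True, "drugB": True, "drugC": False, "drugD": False}

def doc_freq(drugs):
    """Fraction of drugs whose pooled abstracts contain each token."""
    df = Counter()
    for d in drugs:
        df.update(set(" ".join(abstracts[d]).split()))
    return {t: c / len(drugs) for t, c in df.items()}

pos = [d for d, y in has_characteristic.items() if y]
neg = [d for d, y in has_characteristic.items() if not y]
df_pos, df_neg = doc_freq(pos), doc_freq(neg)

# Rank tokens by the document-frequency gap between the two groups.
scores = {t: df_pos.get(t, 0) - df_neg.get(t, 0)
          for t in set(df_pos) | set(df_neg)}
top = max(scores, key=scores.get)
print(top)  # "sedation": present for every positive drug, no negative one
```

Counting each token once per drug (a document-frequency measure) rather than by raw occurrence keeps a single verbose abstract from dominating the score, which matches the abstract's emphasis on a frequency-based measure over keyword matching.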
Bayesian nonparametric clusterings in relational and high-dimensional settings with applications in bioinformatics.
Recent advances in high throughput methodologies offer researchers the ability to understand complex systems via high dimensional and multi-relational data. One example is the realm of molecular biology where disparate data (such as gene sequence, gene expression, and interaction information) are available for various snapshots of biological systems. This type of high dimensional and multirelational data allows for unprecedented detailed analysis, but also presents challenges in accounting for all the variability. High dimensional data often has a multitude of underlying relationships, each represented by a separate clustering structure, where the number of structures is typically unknown a priori. To address the challenges faced by traditional clustering methods on high dimensional and multirelational data, we developed three feature selection and cross-clustering methods: 1) infinite relational model with feature selection (FIRM) which incorporates the rich information of multirelational data; 2) Bayesian Hierarchical Cross-Clustering (BHCC), a deterministic approximation to Cross Dirichlet Process mixture (CDPM) and to cross-clustering; and 3) randomized approximation (RBHCC), based on a truncated hierarchy. An extension of BHCC, Bayesian Congruence Measuring (BCM), is proposed to measure incongruence between genes and to identify sets of congruent loci with identical evolutionary histories. We adapt our BHCC algorithm to the inference of BCM, where the intended structure of each view (congruent loci) represents consistent evolutionary processes. We consider an application of FIRM on categorizing mRNA and microRNA. The model uses latent structures to encode the expression pattern and the gene ontology annotations. 
We also apply FIRM to recover the categories of ligands and proteins, and to predict unknown drug-target interactions, where the latent categorization structure encodes drug-target interaction, chemical compound similarity, and amino acid sequence similarity. BHCC and RBHCC are shown to have improved predictive performance (both in terms of cluster membership and missing-value prediction) compared to traditional clustering methods. Our results suggest that these novel approaches to integrating multi-relational information have a promising future in the biological sciences, where incorporating data related to varying features is often regarded as a daunting task.
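The models above share a nonparametric prior in which the number of clusters is unbounded rather than fixed in advance. As a minimal illustration of that idea (not of the thesis's models themselves), the Chinese Restaurant Process is the standard sequential construction of a Dirichlet process partition; the concentration parameter alpha controls the tendency to open new clusters:

```python
import random

def crp_partition(n_items, alpha, seed=0):
    """Draw a random partition of n_items from the Chinese Restaurant
    Process with concentration alpha."""
    rng = random.Random(seed)
    assignments = []    # cluster index assigned to each item
    cluster_sizes = []  # current size of each existing cluster
    for i in range(n_items):
        # Item i joins existing cluster k with probability size_k / (i + alpha),
        # or opens a new cluster with probability alpha / (i + alpha).
        weights = cluster_sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(cluster_sizes):
            cluster_sizes.append(1)  # a new cluster is born
        else:
            cluster_sizes[k] += 1
        assignments.append(k)
    return assignments

print(crp_partition(10, alpha=1.0))
```

Larger alpha yields more, smaller clusters on average; the "rich get richer" weighting is what lets the data, rather than a preset K, determine how many structures appear.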
Challenges in the Analysis of Mass-Throughput Data: A Technical Commentary from the Statistical Machine Learning Perspective
Sound data analysis is critical to the success of modern molecular medicine research that involves the collection and interpretation of mass-throughput data. The novel nature and high dimensionality of such datasets pose a series of nontrivial data analysis problems. This technical commentary discusses the problems of over-fitting, error estimation, the curse of dimensionality, causal versus predictive modeling, integration of heterogeneous types of data, and the lack of standard protocols for data analysis. We attempt to shed light on the nature and causes of these problems and to outline viable methodological approaches to overcome them.
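The over-fitting and error-estimation problems can be made concrete in a few lines: a classifier that memorizes its training data reports zero training error even on pure-noise labels, while cross-validation exposes the chance-level true error. A self-contained sketch with toy data and a 1-nearest-neighbour "memorizer":

```python
import random

rng = random.Random(0)
# Labels are pure noise: no classifier can beat 50% on unseen data.
data = [([rng.random() for _ in range(5)], rng.choice([0, 1]))
        for _ in range(100)]

def predict_1nn(train, x):
    # Return the label of the closest training point (squared Euclidean distance).
    return min(train, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x)))[1]

# Resubstitution error: every point finds itself, so the error is exactly 0.
train_error = sum(predict_1nn(data, x) != y for x, y in data) / len(data)

# 5-fold cross-validation: evaluate each fold using the other folds as training set.
folds = [data[i::5] for i in range(5)]
cv_errors = []
for i in range(5):
    train = [p for j, f in enumerate(folds) if j != i for p in f]
    cv_errors.append(sum(predict_1nn(train, x) != y for x, y in folds[i])
                     / len(folds[i]))
cv_error = sum(cv_errors) / 5

print(train_error, cv_error)  # 0.0 versus roughly chance level (~0.5)
```

The gap between the two numbers is the over-fitting the commentary warns about, and the cross-validated estimate is the kind of honest error estimate it advocates.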
Simulation Intelligence: Towards a New Generation of Scientific Methods
The original "Seven Motifs" set forth a roadmap of essential methods for the
field of scientific computing, where a motif is an algorithmic method that
captures a pattern of computation and data movement. We present the "Nine
Motifs of Simulation Intelligence", a roadmap for the development and
integration of the essential algorithms necessary for a merger of scientific
computing, scientific simulation, and artificial intelligence. We call this
merger simulation intelligence (SI), for short. We argue the motifs of
simulation intelligence are interconnected and interdependent, much like the
components within the layers of an operating system. Using this metaphor, we
explore the nature of each layer of the simulation intelligence operating
system stack (SI-stack) and the motifs therein: (1) Multi-physics and
multi-scale modeling; (2) Surrogate modeling and emulation; (3)
Simulation-based inference; (4) Causal modeling and inference; (5) Agent-based
modeling; (6) Probabilistic programming; (7) Differentiable programming; (8)
Open-ended optimization; (9) Machine programming. We believe coordinated
efforts between motifs offer immense opportunity to accelerate scientific
discovery, from solving inverse problems in synthetic biology and climate
science, to directing nuclear energy experiments and predicting emergent
behavior in socioeconomic settings. We elaborate on each layer of the SI-stack,
detailing the state-of-the-art methods, presenting examples to highlight challenges
and opportunities, and advocating for specific ways to advance the motifs and
the synergies from their combinations. Advancing and integrating these
technologies can enable a robust and efficient hypothesis-simulation-analysis
type of scientific method, which we introduce with several use-cases for
human-machine teaming and automated science.
Computational studies of drug-binding kinetics
Drug-receptor binding kinetics are defined by the rates at which a given drug associates with and dissociates from its binding site on its macromolecular receptor. The lead optimization stage of drug discovery programs usually emphasizes optimizing the affinity (as described by the equilibrium dissociation constant, Kd) of a drug, which depends on the strength of its binding to a specific target. Since affinity is optimized under equilibrium conditions, it does not always ensure higher potency in vivo. There has been a growing consensus that, in addition to Kd, kinetic parameters (kon and koff) should be optimized to improve the chances of a good clinical outcome. However, current understanding of the physicochemical features that contribute to differences in binding kinetics is limited. Experimental methods used to determine kinetic parameters for drug binding and unbinding are often time-consuming and labor-intensive. Therefore, robust, high-throughput in silico methods are needed to predict binding kinetic parameters and to explore the mechanistic determinants of drug-protein binding. As the experimental data on drug-binding kinetics are continuously growing and the number of crystallographic structures of ligand-receptor complexes is also increasing, methods to compute three-dimensional (3D) quantitative structure-kinetics relationships (QSKRs) offer great potential for predicting kinetic rate constants for new compounds. COMparative BINding Energy (COMBINE) analysis is one example of such an approach; it was developed to derive target-specific scoring functions based on molecular mechanics calculations, and has been used extensively to predict properties such as binding affinity, target selectivity, and substrate specificity. In this thesis, I made the first application of COMBINE analysis to derive QSKRs for dissociation rates. I obtained models for koff of inhibitors of HIV-1 protease and heat shock protein 90 (HSP90) with very good predictive power and identified the key ligand-receptor interactions that contribute to the variance in binding kinetics.
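The kinetic quantities above are linked by two elementary relations, which is why optimizing Kd alone underdetermines in vivo behaviour: Kd = koff / kon, and the residence time of the drug-receptor complex is tau = 1 / koff. A minimal numerical sketch (all rate constants below are invented, order-of-magnitude values, not data from the thesis):

```python
import math

# Hypothetical, order-of-magnitude rate constants (not data from the thesis).
kon = 1.0e6     # association rate constant, M^-1 s^-1
koff = 1.0e-2   # dissociation rate constant, s^-1

Kd = koff / kon    # equilibrium dissociation constant, M
tau = 1.0 / koff   # residence time of the drug-receptor complex, s
print(f"Kd = {Kd:.1e} M, residence time = {tau:.0f} s")

# Two compounds can share the same Kd yet have very different kinetics:
kon2, koff2 = 1.0e4, 1.0e-4            # 100x slower on AND off the target
assert math.isclose(koff2 / kon2, Kd)  # same affinity...
assert 1.0 / koff2 > tau               # ...but a 100x longer residence time
```

The second compound illustrates the "growing consensus" point: equal affinity under equilibrium conditions, but a far longer residence time, which can translate into different in vivo potency.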
With technological and methodological advances, all-atom unbiased Molecular Dynamics (MD) simulations can now sample up to the millisecond timescale and allow investigation of the kinetic profile of drug binding and unbinding to a receptor. However, the residence times of drug-receptor complexes are usually longer than the timescales that are feasible to simulate using conventional molecular dynamics techniques. Enhanced sampling methods allow faster sampling of protein and ligand dynamics, thereby extending the application of MD techniques
to study longer timescale processes. I have evaluated the application of Tau-Random Acceleration Molecular Dynamics (Tau-RAMD), an enhanced sampling method based on MD, to compute the relative residence times of a series of compounds binding to Haspin kinase. A good correlation (R2 = 0.86) was observed between the computed residence times and the experimental residence times of these compounds. I also performed interaction energy calculations, both at the quantum chemical level and at the molecular mechanics level, to explain the experimental observation that the residence
times of kinase inhibitors can be prolonged by introducing halogen-aromatic pi interactions between halogen atoms of inhibitors and aromatic residues at the binding site of kinases. I determined different energetic contributions to this highly polar and directional halogen-bonding interaction by partitioning the total interaction energy calculated at the quantum-chemical level into its constituent energy components. It was observed that the major contribution to this interaction energy comes from the correlation energy which describes second-order intermolecular dispersion interactions and the correlation corrections to the Hartree-Fock energy.
In addition, a protocol to determine diffusional kon rates of low-molecular-weight compounds from Brownian Dynamics (BD) simulations of protein-ligand association was established using the SDA 7 software. The widely studied test case of benzamidine binding to trypsin was used to evaluate a set of parameters, and a robust set of optimal parameters was determined that should be generally applicable for computing the diffusional association rate constants of a wide range of protein-ligand binding pairs. I validated this protocol on inhibitors of several targets of varying complexity, such as Human Coagulation Factor Xa, Haspin kinase and N1 Neuraminidase, and compared the computed diffusional association rate constants with experiment. I also contributed to the development of a toolbox of computational methods, KBbox (http://kbbox.h-its.org/toolbox/), which provides information about computational methods for studying molecular binding kinetics and the tools that implement them. KBbox was developed to guide researchers in choosing among the computational and simulation approaches available to compute the kinetic parameters of drug-protein binding.
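Diffusional association rates of the kind computed with BD have a classical back-of-envelope reference point: the Smoluchowski diffusion-limited rate for two freely diffusing spheres, k = 4*pi*(D1 + D2)*R. The sketch below uses generic order-of-magnitude parameter values (assumptions, not numbers from the thesis) to recover the familiar ~10^9-10^10 M^-1 s^-1 diffusion limit:

```python
import math

N_A = 6.02214076e23   # Avogadro's number, mol^-1
D_protein = 1.0e-10   # diffusion coefficient, m^2/s (typical small protein)
D_ligand = 5.0e-10    # diffusion coefficient, m^2/s (typical small molecule)
R_contact = 5.0e-10   # reaction-contact distance, m (~5 angstrom)

# Smoluchowski rate per molecule pair, then converted to molar units.
k_si = 4.0 * math.pi * (D_protein + D_ligand) * R_contact  # m^3 s^-1
k_on = k_si * N_A * 1000.0  # M^-1 s^-1 (1 m^3 = 1000 L)
print(f"diffusion-limited kon ~ {k_on:.1e} M^-1 s^-1")
```

BD simulations refine this spherically symmetric estimate by adding molecular shape, electrostatic steering, and the orientational criteria for reaching the bound state, which is why computed kon values typically fall below this limit.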
Place cell physiology in a transgenic mouse model of Alzheimer's disease
Alzheimer's disease (AD) is a multifactorial neurodegenerative disorder characterized by progressive cognitive impairments (Selkoe, 2001). Hippocampal place cells are a well-understood candidate for the neural basis of one type of memory in rodents: these cells signal the animal's location in an environment and are crucial for spatial memory and navigation. This PhD project aims to clarify the mechanisms responsible for the cognitive deficits in AD at the hippocampal network level by examining place cell physiology in a transgenic mouse model of AD. I have recorded place cells in tg2576 mice and found that aged (16 months) but not young (3 months) transgenic mice show degraded neuronal representations of the environment. The level of place cell degradation correlates with the animals' (poorer) spatial memory, as tested in a forced-choice spatial alternation T-maze task, and with hippocampal, but not neocortical, amyloid plaque burden. Additionally, pilot data show that physiological changes in the hippocampus of tg2576 mice seem to start as early as 3 months, when no pathological or behavioural deficits are present. However, these changes are not obvious at the level of individual neurons, but only at the level of the hippocampal network, as reflected in hippocampal responses to environmental changes. Place cell recording thus provides a sensitive assay for measuring the amount and rate of functional deterioration in animal models of dementia, as well as a quantifiable physiological indication of the beneficial effects of potential therapies.
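A standard way to quantify how "degraded" a spatial representation is, widely used in the place-cell literature, is the spatial information content of Skaggs and colleagues: I = sum_i p_i * (r_i / r_mean) * log2(r_i / r_mean), where p_i is the occupancy probability of spatial bin i and r_i the cell's firing rate there. The toy rate maps below are invented for illustration, not recorded data:

```python
import math

def spatial_information(occupancy, rates):
    """Skaggs spatial information in bits per spike.
    occupancy: probability of the animal being in each spatial bin (sums to 1).
    rates: the cell's mean firing rate in each bin."""
    r_mean = sum(p * r for p, r in zip(occupancy, rates))
    info = 0.0
    for p, r in zip(occupancy, rates):
        if r > 0:  # bins where the cell is silent contribute nothing
            info += p * (r / r_mean) * math.log2(r / r_mean)
    return info

occupancy = [0.25, 0.25, 0.25, 0.25]  # uniform occupancy of 4 bins
sharp_field = [8.0, 0.0, 0.0, 0.0]    # crisp place field: fires in one bin only
degraded = [3.0, 2.0, 2.0, 1.0]       # diffuse, degraded firing

print(spatial_information(occupancy, sharp_field))  # 2.0 bits/spike
print(spatial_information(occupancy, degraded))     # much lower
```

A cell firing exclusively in one of four equally visited bins carries exactly 2 bits per spike about location; the diffuse map carries far less, which is the kind of quantifiable degradation the abstract describes in aged transgenic mice.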