
    Beyond similarity: A network approach for identifying and delimiting biogeographical regions

    Biogeographical regions (geographically distinct assemblages of species and communities) constitute a cornerstone for ecology, biogeography, evolution and conservation biology. Species turnover measures are often used to quantify biodiversity patterns, but algorithms based on similarity and clustering are highly sensitive to common biases and intricacies of species distribution data. Here we apply a community detection approach from network theory that incorporates complex, higher-order presence-absence patterns. We demonstrate the performance of the method by applying it to all amphibian species in the world (c. 6,100 species), all vascular plant species of the USA (c. 17,600), and a hypothetical dataset containing a zone of biotic transition. In comparison with current methods, our approach tackles the challenges posed by transition zones and succeeds in identifying a larger number of commonly recognised biogeographical regions. This method constitutes an important advance towards objective, data-derived identification and delimitation of the world's biogeographical regions. Comment: 5 figures and 1 supporting figure
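    The abstract does not name the specific community detection algorithm, so the sketch below uses NetworkX's greedy modularity maximisation as a stand-in, on a tiny hypothetical presence-absence dataset (all site and species names are invented). It shows the core idea: build a bipartite site-species occurrence network and let the community structure delimit candidate bioregions, rather than clustering a pairwise similarity matrix.

```python
# Toy sketch: delimiting bioregions via community detection on a bipartite
# site-species network. Data and algorithm choice are illustrative only.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Hypothetical presence-absence records: site -> species observed there.
records = {
    "site_A": {"frog1", "frog2"},
    "site_B": {"frog1", "frog2", "frog3"},
    "site_C": {"frog7", "frog8"},
    "site_D": {"frog7", "frog8", "frog9"},
}

# Build the bipartite occurrence network: an edge links a site to each
# species recorded there.
G = nx.Graph()
for site, species in records.items():
    for sp in species:
        G.add_edge(site, sp)

# Modularity-based communities group sites and species jointly; each
# community is a candidate biogeographical region with its characteristic
# species assemblage.
regions = [sorted(c) for c in greedy_modularity_communities(G)]
print(regions)
```

Because sites and species fall into the same communities, each region comes with its own species list for free, which is part of what a similarity-matrix approach discards.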

    BICEPP: an example-based statistical text mining method for predicting the binary characteristics of drugs

    Background: The identification of drug characteristics is a clinically important task, but it requires much expert knowledge and consumes substantial resources. We have developed a statistical text-mining approach (BInary Characteristics Extractor and biomedical Properties Predictor: BICEPP) to help experts screen drugs that may have important clinical characteristics of interest. Results: BICEPP first retrieves MEDLINE abstracts containing drug names, then selects tokens that best predict the list of drugs which represents the characteristic of interest. Machine learning is then used to classify drugs using a document frequency-based measure. Evaluation experiments were performed to validate BICEPP's performance on 484 characteristics of 857 drugs, identified from the Australian Medicines Handbook (AMH) and the PharmacoKinetic Interaction Screening (PKIS) database. Stratified cross-validations revealed that BICEPP was able to classify drugs into all 20 major therapeutic classes (100%) and 157 (of 197) minor drug classes (80%) with areas under the receiver operating characteristic curve (AUC) > 0.80. Similarly, AUC > 0.80 could be obtained in the classification of 173 (of 238) adverse events (73%), up to 12 (of 15) groups of clinically significant cytochrome P450 enzyme (CYP) inducers or inhibitors (80%), and up to 11 (of 14) groups of narrow therapeutic index drugs (79%). Interestingly, it was observed that the keywords used to describe a drug characteristic were not necessarily the most predictive ones for the classification task. Conclusions: BICEPP has sufficient classification power to automatically distinguish a wide range of clinical properties of drugs. This may be used in pharmacovigilance applications to assist with rapid screening of large drug databases to identify important characteristics for further evaluation.
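    The "document frequency-based measure" can be made concrete with a minimal sketch. Everything below is hypothetical (invented drug names and abstract snippets, a single hand-picked token); the real system selects predictive tokens automatically and feeds such frequencies into a trained classifier.

```python
# Minimal sketch of a document-frequency feature for a binary drug
# characteristic. All drugs, abstracts, and the token are hypothetical.

# Hypothetical MEDLINE-style abstract snippets per drug.
abstracts = {
    "drugA": ["sedation reported in elderly patients",
              "marked sedation and dizziness"],
    "drugB": ["sedation was common",
              "drowsiness and sedation observed"],
    "drugC": ["rash reported after one week", "well tolerated"],
    "drugD": ["headache reported", "no central effects"],
}

def doc_freq(drug, token):
    """Fraction of a drug's abstracts that mention the token."""
    docs = abstracts[drug]
    return sum(token in d.split() for d in docs) / len(docs)

# Score each drug by the document frequency of one candidate token;
# a classifier would combine many such token frequencies.
scores = {drug: doc_freq(drug, "sedation") for drug in abstracts}
print(scores)
```

Drugs with the characteristic ("drugA", "drugB" here) score high on the token; ranking drugs by such scores is what the AUC figures in the abstract evaluate.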

    Bayesian nonparametric clusterings in relational and high-dimensional settings with applications in bioinformatics.

    Recent advances in high-throughput methodologies offer researchers the ability to understand complex systems via high-dimensional and multi-relational data. One example is the realm of molecular biology, where disparate data (such as gene sequence, gene expression, and interaction information) are available for various snapshots of biological systems. This type of high-dimensional and multi-relational data allows for unprecedentedly detailed analysis, but also presents challenges in accounting for all the variability. High-dimensional data often has a multitude of underlying relationships, each represented by a separate clustering structure, where the number of structures is typically unknown a priori. To address the challenges faced by traditional clustering methods on high-dimensional and multi-relational data, we developed three feature selection and cross-clustering methods: 1) the infinite relational model with feature selection (FIRM), which incorporates the rich information of multi-relational data; 2) Bayesian Hierarchical Cross-Clustering (BHCC), a deterministic approximation to the Cross Dirichlet Process mixture (CDPM) and to cross-clustering; and 3) a randomized approximation (RBHCC), based on a truncated hierarchy. An extension of BHCC, Bayesian Congruence Measuring (BCM), is proposed to measure incongruence between genes and to identify sets of congruent loci with identical evolutionary histories. We adapt our BHCC algorithm to the inference of BCM, where the intended structure of each view (congruent loci) represents consistent evolutionary processes. We consider an application of FIRM to categorizing mRNA and microRNA. The model uses latent structures to encode the expression pattern and the gene ontology annotations.
    We also apply FIRM to recover the categories of ligands and proteins, and to predict unknown drug-target interactions, where the latent categorization structure encodes drug-target interaction, chemical compound similarity, and amino acid sequence similarity. BHCC and RBHCC are shown to have improved predictive performance (both in terms of cluster membership and missing-value prediction) compared to traditional clustering methods. Our results suggest that these novel approaches to integrating multi-relational information have a promising future in the biological sciences, where incorporating data related to varying features is often regarded as a daunting task.
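    The phrase "the number of structures is typically unknown a priori" is exactly what Dirichlet process priors formalise. A minimal sketch of the Chinese restaurant process (the prior over partitions underlying such mixtures) makes this concrete; the concentration parameter and the toy size are illustrative choices, not values from the thesis.

```python
# Sketch of the Chinese restaurant process (CRP), the prior over
# partitions behind Dirichlet process mixtures: the number of clusters
# is not fixed in advance but grows with the data.
import random

def crp_assignments(n_items, alpha, rng):
    """Sample cluster assignments for n_items under a CRP(alpha) prior."""
    assignments = []
    counts = []  # items per existing cluster
    for i in range(n_items):
        # Item i joins existing cluster k with probability counts[k] / (i + alpha)
        # and opens a new cluster with probability alpha / (i + alpha).
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)   # a new cluster is born
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments

draw = crp_assignments(10, alpha=1.0, rng=random.Random(0))
print(draw)
```

Larger `alpha` favours more clusters; posterior inference (as in BHCC/CDPM) combines this prior with the data likelihood, so the number of clusters is learned rather than preset.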

    Challenges in the Analysis of Mass-Throughput Data: A Technical Commentary from the Statistical Machine Learning Perspective

    Sound data analysis is critical to the success of modern molecular medicine research that involves collection and interpretation of mass-throughput data. The novel nature and high dimensionality of such datasets pose a series of nontrivial data analysis problems. This technical commentary discusses the problems of over-fitting, error estimation, the curse of dimensionality, causal versus predictive modeling, integration of heterogeneous types of data, and lack of standard protocols for data analysis. We attempt to shed light on the nature and causes of these problems and to outline viable methodological approaches to overcome them.
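    The over-fitting and error-estimation problems the commentary names are easy to demonstrate. The sketch below (synthetic data, sizes and seed chosen for illustration) commits a classic mass-throughput mistake: selecting the most discriminative feature on the full dataset before cross-validating. Even with labels that carry no signal at all, the estimated accuracy comes out well above the true 50%.

```python
# Demonstration of selection bias: feature selection performed outside
# the cross-validation loop inflates the accuracy estimate.
import random
import statistics

rng = random.Random(0)
n, p = 40, 2000                       # few samples, many features
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [i % 2 for i in range(n)]         # labels independent of X: truth is 50%

def class_separation(j, rows):
    """Crude filter statistic: gap between class means on feature j."""
    a = [X[i][j] for i in rows if y[i] == 0]
    b = [X[i][j] for i in rows if y[i] == 1]
    return abs(statistics.mean(a) - statistics.mean(b))

# WRONG: the feature is chosen using ALL samples, including the ones
# that will later serve as test cases.
best = max(range(p), key=lambda j: class_separation(j, range(n)))

def loo_accuracy(j):
    """Leave-one-out accuracy of a nearest-class-mean rule on feature j."""
    correct = 0
    for t in range(n):
        train = [i for i in range(n) if i != t]
        m0 = statistics.mean(X[i][j] for i in train if y[i] == 0)
        m1 = statistics.mean(X[i][j] for i in train if y[i] == 1)
        pred = 0 if abs(X[t][j] - m0) < abs(X[t][j] - m1) else 1
        correct += pred == y[t]
    return correct / n

acc = loo_accuracy(best)
print(f"LOO accuracy on pure-noise labels: {acc:.2f}")
```

The fix is to nest the feature selection inside each cross-validation fold, so the test sample never influences which feature is chosen.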

    Simulation Intelligence: Towards a New Generation of Scientific Methods

    The original "Seven Motifs" set forth a roadmap of essential methods for the field of scientific computing, where a motif is an algorithmic method that captures a pattern of computation and data movement. We present the "Nine Motifs of Simulation Intelligence", a roadmap for the development and integration of the essential algorithms necessary for a merger of scientific computing, scientific simulation, and artificial intelligence. We call this merger simulation intelligence (SI), for short. We argue that the motifs of simulation intelligence are interconnected and interdependent, much like the components within the layers of an operating system. Using this metaphor, we explore the nature of each layer of the simulation intelligence operating system stack (SI-stack) and the motifs therein: (1) Multi-physics and multi-scale modeling; (2) Surrogate modeling and emulation; (3) Simulation-based inference; (4) Causal modeling and inference; (5) Agent-based modeling; (6) Probabilistic programming; (7) Differentiable programming; (8) Open-ended optimization; (9) Machine programming. We believe coordinated efforts between motifs offer immense opportunity to accelerate scientific discovery, from solving inverse problems in synthetic biology and climate science, to directing nuclear energy experiments and predicting emergent behavior in socioeconomic settings. We elaborate on each layer of the SI-stack, detailing the state-of-the-art methods, presenting examples to highlight challenges and opportunities, and advocating for specific ways to advance the motifs and the synergies from their combinations. Advancing and integrating these technologies can enable a robust and efficient hypothesis-simulation-analysis type of scientific method, which we introduce with several use-cases for human-machine teaming and automated science.
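    One of the listed motifs, surrogate modeling and emulation, can be shown in miniature: replace an expensive simulator with a cheap model fitted to a handful of its outputs. The "simulator" below is a stand-in closed-form function and the piecewise-linear emulator is the simplest possible choice; real surrogates are typically Gaussian processes or neural networks.

```python
# Sketch of the surrogate-modeling motif: a cheap interpolating emulator
# trained on a few evaluations of an (imagined) expensive simulator.
import math
from bisect import bisect_left

def expensive_simulator(x):
    # Stand-in for a simulation that might take minutes or hours per call.
    return math.sin(3 * x) * math.exp(-x)

# Offline phase: evaluate the simulator at a small set of design points.
design = [i / 10 for i in range(11)]          # 0.0, 0.1, ..., 1.0
table = [(x, expensive_simulator(x)) for x in design]

def surrogate(x):
    """Piecewise-linear emulator: an O(log n) lookup replaces a full run."""
    i = min(max(bisect_left(design, x), 1), len(design) - 1)
    (x0, y0), (x1, y1) = table[i - 1], table[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

# Online phase: query the surrogate and check it against the simulator.
err = max(abs(surrogate(x) - expensive_simulator(x))
          for x in [0.05, 0.37, 0.81])
print(f"max abs error at test points: {err:.4f}")
```

The trade-off shown here is the motif's essence: a bounded approximation error in exchange for queries that are orders of magnitude cheaper, which is what makes outer loops like optimization and inverse problems tractable.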

    Computational studies of drug-binding kinetics

    Drug-receptor binding kinetics are defined by the rates at which a given drug associates with and dissociates from its binding site on its macromolecular receptor. The lead optimization stage of drug discovery programs usually emphasizes optimizing the affinity (as described by the equilibrium dissociation constant, Kd) of a drug, which depends on the strength of its binding to a specific target. Since affinity is optimized under equilibrium conditions, it does not always ensure higher potency in vivo. There has been a growing consensus that, in addition to Kd, the kinetic parameters (kon and koff) should be optimized to improve the chances of a good clinical outcome. However, current understanding of the physicochemical features that contribute to differences in binding kinetics is limited. Experimental methods that are used to determine kinetic parameters for drug binding and unbinding are often time-consuming and labor-intensive. Therefore, robust, high-throughput in silico methods are needed to predict binding kinetic parameters and to explore the mechanistic determinants of drug-protein binding. As the experimental data on drug-binding kinetics are continuously growing and the number of crystallographic structures of ligand-receptor complexes is also increasing, methods to compute three-dimensional (3D) Quantitative Structure-Kinetics Relationships (QSKRs) offer great potential for predicting kinetic rate constants for new compounds. COMparative BINding Energy (COMBINE) analysis is one example of such an approach; it was developed to derive target-specific scoring functions based on molecular mechanics calculations, and has been used extensively to predict properties such as binding affinity, target selectivity, and substrate specificity. In this thesis, I made the first application of COMBINE analysis to derive QSKRs for dissociation rates.
    I obtained models for koff of inhibitors of HIV-1 protease and heat shock protein 90 (HSP90) with very good predictive power, and identified the key ligand-receptor interactions that contribute to the variance in binding kinetics. With technological and methodological advances, all-atom unbiased Molecular Dynamics (MD) simulations can now sample up to the millisecond timescale, allowing investigation of the kinetic profile of drug binding and unbinding to a receptor. However, the residence times of drug-receptor complexes are usually longer than the timescales that are feasible to simulate with conventional molecular dynamics techniques. Enhanced sampling methods allow faster sampling of protein and ligand dynamics, extending MD techniques to longer-timescale processes. I evaluated the application of Tau-Random Acceleration Molecular Dynamics (Tau-RAMD), an enhanced sampling method based on MD, to compute the relative residence times of a series of compounds binding to Haspin kinase. A good correlation (R2 = 0.86) was observed between the computed and experimental residence times of these compounds. I also performed interaction energy calculations, both at the quantum chemical level and at the molecular mechanics level, to explain the experimental observation that the residence times of kinase inhibitors can be prolonged by introducing halogen-aromatic pi interactions between halogen atoms of inhibitors and aromatic residues at the binding site of kinases. I determined the different energetic contributions to this highly polar and directional halogen-bonding interaction by partitioning the total interaction energy, calculated at the quantum-chemical level, into its constituent energy components.
    It was observed that the major contribution to this interaction energy comes from the correlation energy, which describes second-order intermolecular dispersion interactions and the correlation corrections to the Hartree-Fock energy. In addition, a protocol to determine diffusional kon rate constants of low-molecular-weight compounds from Brownian Dynamics (BD) simulations of protein-ligand association was established using the SDA 7 software. The widely studied test case of benzamidine binding to trypsin was used to evaluate a set of parameters, and a robust set of optimal parameters was determined that should be generally applicable for computing the diffusional association rate constants of a wide range of protein-ligand binding pairs. I validated this protocol on inhibitors of several targets of varying complexity, such as Human Coagulation Factor Xa, Haspin kinase and N1 Neuraminidase, and the computed diffusional association rate constants were compared with experimental values. I also contributed to the development of a toolbox of computational methods, KBbox (http://kbbox.h-its.org/toolbox/), which provides information about the various computational methods used to study molecular binding kinetics and the computational tools that employ them. It was developed to guide researchers on the use of the different computational and simulation approaches available to compute the kinetic parameters of drug-protein binding.
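    The affinity-versus-kinetics distinction the thesis builds on rests on two identities: Kd = koff / kon and mean residence time tau = 1 / koff. The arithmetic below uses two hypothetical inhibitors (rate constants invented for illustration, not taken from the thesis) to show how drugs with identical affinity can differ a hundredfold in residence time.

```python
# Two hypothetical inhibitors with the same Kd but different kinetics.
# Identities used: Kd = koff / kon,  residence time tau = 1 / koff.
inhibitors = {
    # name: (kon [1/(M*s)], koff [1/s]) -- illustrative numbers only
    "fast": (1e7, 1e-1),
    "slow": (1e5, 1e-3),
}

results = {}
for name, (kon, koff) in inhibitors.items():
    kd = koff / kon          # equilibrium dissociation constant [M]
    tau = 1.0 / koff         # mean residence time [s]
    results[name] = (kd, tau)
    print(f"{name}: Kd = {kd:.1e} M, residence time = {tau:.0f} s")
```

Both compounds have Kd = 1e-8 M, so an equilibrium assay cannot tell them apart, yet the "slow" one stays bound roughly a hundred times longer, which is why optimizing koff alongside Kd matters for in vivo potency.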

    Place cell physiology in a transgenic mouse model of Alzheimer's disease

    Alzheimer's disease (AD) is a multifactorial neurodegenerative disorder characterized by progressive cognitive impairments (Selkoe, 2001). Hippocampal place cells are a well-understood candidate for the neural basis of one type of memory in rodents; these cells identify the animal's location in an environment and are crucial for spatial memory and navigation. This PhD project aims to clarify the mechanisms responsible for the cognitive deficits in AD at the hippocampal network level, by examining place cell physiology in a transgenic mouse model of AD. I recorded place cells in tg2576 mice and found that aged (16 months) but not young (3 months) transgenic mice show degraded neuronal representations of the environment. The level of place cell degradation correlates with the animals' (poorer) spatial memory, as tested in a forced-choice spatial alternation T-maze task, and with hippocampal, but not neocortical, amyloid plaque burden. Additionally, pilot data show that physiological changes in the hippocampus of tg2576 mice seem to start as early as 3 months, when no pathological or behavioural deficits are present. However, these changes are not obvious at the level of individual neurons, but only at the level of the hippocampal network's responses to environmental change. Place cell recording thus provides a sensitive assay for measuring the amount and rate of functional deterioration in animal models of dementia, as well as a quantifiable physiological indication of the beneficial effects of potential therapies.
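    The basic quantity behind place-cell analyses like these is the occupancy-normalised firing-rate map: spikes per spatial bin divided by time spent in that bin. The sketch below builds one from an entirely synthetic 1-D trajectory with a simulated place field (all numbers are illustrative, not from the thesis); degradation measures compare how sharp and stable such maps are across conditions.

```python
# Sketch of a place-cell firing-rate map from synthetic position and
# spike data: rate per bin = spike count / occupancy time.
import random

rng = random.Random(1)
n_bins = 10                  # a 1-D track divided into 10 spatial bins
occupancy = [0.0] * n_bins   # seconds spent in each bin
spikes = [0] * n_bins        # spikes observed in each bin
field_centre = 3             # simulated "place field" around bin 3

for _ in range(5000):        # position samples along the trajectory
    pos = rng.randrange(n_bins)
    occupancy[pos] += 0.02   # each sample covers 20 ms
    # The cell fires strongly near its field centre, weakly elsewhere.
    rate_hz = 20.0 if abs(pos - field_centre) <= 1 else 0.5
    if rng.random() < rate_hz * 0.02:
        spikes[pos] += 1

rate_map = [s / t if t > 0 else 0.0 for s, t in zip(spikes, occupancy)]
peak_bin = max(range(n_bins), key=lambda b: rate_map[b])
print(peak_bin, [round(r, 1) for r in rate_map])
```

A healthy place cell yields a single sharp peak like this one; the "degraded representations" reported in aged transgenic mice correspond to flatter, noisier, or unstable maps.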