7 research outputs found

    GPU-accelerated Chemical Similarity Assessment for Large Scale Databases

    Get PDF
    The assessment of chemical similarity between molecules is a basic operation in chemoinformatics, a computational field concerned with the manipulation of chemical structural information. Comparing molecules is the basis for a wide range of applications such as searching chemical databases, training prediction models for virtual screening, or aggregating clusters of similar compounds. However, currently available multimillion-compound databases represent a challenge for conventional chemoinformatics algorithms, raising the need for faster similarity methods. In this paper, we extensively analyze the advantages of using many-core architectures for calculating commonly used chemical similarity coefficients such as Tanimoto, Dice, or Cosine. Our aim is to provide a broad proof of concept regarding the usefulness of GPU architectures for chemoinformatics, a class of computing problems still largely uncovered. We present a general GPU algorithm for all-to-all chemical comparisons, considering both binary fingerprints and floating-point descriptors as molecule representations. We then adopt optimization techniques to minimize global memory accesses and further improve efficiency. We test the proposed algorithm on different experimental setups: a laptop with a low-end GPU and a desktop with a higher-performance GPU. In the former case, we obtain a 4-to-6-fold speed-up over a single-core implementation for fingerprints and a 4-to-7-fold speed-up for descriptors. In the latter case, we obtain a 195-to-206-fold and a 100-to-328-fold speed-up, respectively. National Institutes of Health (U.S.) (grant GM079804); National Institutes of Health (U.S.) (grant GM086145)
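    The three coefficients named in the abstract can be sketched on binary fingerprints represented as Python integers (bit vectors). This is a minimal illustrative sketch, not the paper's GPU implementation; function names and the toy 8-bit fingerprints are assumptions.

```python
def popcount(x: int) -> int:
    """Number of set bits (on-bits) in a fingerprint."""
    return bin(x).count("1")

def tanimoto(a: int, b: int) -> float:
    """Tanimoto (Jaccard) coefficient: |A and B| / |A or B|."""
    c = popcount(a & b)
    return c / (popcount(a) + popcount(b) - c)

def dice(a: int, b: int) -> float:
    """Dice coefficient: 2|A and B| / (|A| + |B|)."""
    return 2 * popcount(a & b) / (popcount(a) + popcount(b))

def cosine(a: int, b: int) -> float:
    """Cosine coefficient: |A and B| / sqrt(|A| * |B|)."""
    return popcount(a & b) / (popcount(a) * popcount(b)) ** 0.5

fp1 = 0b10110010  # toy 8-bit fingerprints; real ones are typically 1024+ bits
fp2 = 0b10010110
print(tanimoto(fp1, fp2))  # 0.6: 3 common bits, 5 bits in the union
```

    An all-to-all comparison is then a doubly nested loop over these kernels, which is exactly the embarrassingly parallel structure that maps well onto a GPU.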

    To Calibrate & Validate an Agent-Based Simulation Model - An Application of the Combination Framework of BI solution & Multi-agent platform

    Get PDF
    Integrated environmental modeling approaches, especially agent-based modeling, are increasingly used in large-scale decision support systems. A major consequence of this trend is the manipulation and generation of huge amounts of data in simulations, which must be managed efficiently. Furthermore, calibration and validation are also challenges for Agent-Based Modelling and Simulation (ABMS) approaches when the model has to work with integrated systems involving high volumes of input/output data. In this paper, we propose a calibration and validation approach for an agent-based model, using a Combination Framework of Business intelligence solution and Multi-agent platform (CFBM). The CFBM is a logical framework dedicated to the management of the input and output data of simulations, as well as the corresponding empirical datasets, in an integrated way. The calibration and validation of a Brown Plant Hopper prediction model are presented and used throughout the paper as a case study to illustrate the way CFBM manages the data used and generated during the life cycle of simulation and validation.

    Combining social-based data mining techniques to extract collective trends from twitter

    Full text link
    Social Networks have become an important environment for the extraction of collective trends. The interactions amongst users provide information about their preferences and relationships. This information can be used to measure the influence of ideas or opinions and how they spread within the network. Currently, one of the most relevant and popular Social Networks is Twitter. This Social Network was created to share comments and opinions. The information provided by users is especially useful in different fields and research areas such as marketing. This data is presented as short text strings containing different ideas expressed by real people. With this representation, different Data Mining techniques (such as classification or clustering) can be used for knowledge extraction to distinguish the meaning of the opinions. Complex Network techniques are also helpful for discovering influential actors and studying information propagation inside the Social Network. This work focuses on how clustering and classification techniques can be combined to extract collective knowledge from Twitter. In an initial phase, clustering techniques are applied to extract the main topics from the user opinions. The collective knowledge extracted is then used to relabel the dataset according to the clusters obtained, in order to improve the classification results. Finally, these results are compared against a dataset manually labelled by human experts to analyse the accuracy of the proposed method. The preparation of this manuscript has been supported by the Spanish Ministry of Science and Innovation under the following projects: TIN2010-19872 and ECO2011-30105 (National Plan for Research, Development and Innovation), as well as the Multidisciplinary Project of Universidad Autónoma de Madrid (CEMU2012-034). The authors thank Ana M. Díaz-Martín and Mercedes Rozano for the manual classification of the Tweets.
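    The cluster-then-relabel step described above can be sketched in miniature: documents are clustered into topics, and each document's cluster id becomes its new label for the subsequent classification stage. The toy 2-D "document vectors" and the naive k-means below are illustrative stand-ins, not the paper's actual pipeline.

```python
def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, centroids, iters=10):
    """Naive k-means: returns final centroids and cluster assignments."""
    labels = [0] * len(points)
    for _ in range(iters):
        # assign each point to its nearest centroid
        labels = [min(range(len(centroids)),
                      key=lambda c: dist2(p, centroids[c])) for p in points]
        # recompute each centroid as the mean of its cluster members
        for c in range(len(centroids)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = [sum(xs) / len(members) for xs in zip(*members)]
    return centroids, labels

# toy "document vectors": two clearly separated topics
docs = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]]
_, topics = kmeans(docs, centroids=[[0.0, 0.0], [5.0, 5.0]])

# relabel: each document now carries its collective-topic id,
# ready to train a classifier against the new labels
relabelled = list(zip(docs, topics))
print(topics)  # [0, 0, 1, 1]
```

    In the paper's setting, the vectors would come from the tweet text and the relabelled dataset would then be fed to a classifier, whose output is compared against the expert-labelled reference set.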

    Communication-Efficient Jaccard Similarity for High-Performance Distributed Genome Comparisons

    Full text link
    The Jaccard similarity index is an important measure of the overlap of two sets, widely used in machine learning, computational genomics, information retrieval, and many other areas. We design and implement SimilarityAtScale, the first communication-efficient distributed algorithm for computing the Jaccard similarity among pairs of large datasets. Our algorithm provides an efficient encoding of this problem as a multiplication of sparse matrices. Both the encoding and the sparse matrix product are performed in a way that minimizes data movement in terms of communication and synchronization costs. We apply our algorithm to obtain the similarity among all pairs of a set of large genome samples. This task is a key part of modern metagenomics analysis and an ever-growing need due to the increasing availability of high-throughput DNA sequencing data. The resulting scheme is the first to enable accurate Jaccard distance derivations for massive datasets, using large-scale distributed-memory systems. We package our routines in a tool called GenomeAtScale, which combines the proposed algorithm with tools for processing input sequences. Our evaluation on real data illustrates that one can use GenomeAtScale to effectively employ tens of thousands of processors to reach new frontiers in large-scale genomic and metagenomic analysis. While GenomeAtScale can be used to foster DNA research, the more general underlying SimilarityAtScale algorithm may be used for high-performance distributed similarity computations in other data analytics application domains.
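    The core encoding, stripped of all distribution and sparsity machinery, works like this: each set becomes a binary row vector over the element universe, and the Gram matrix A·Aᵀ then holds every pairwise intersection size, from which Jaccard follows. This is a sketch of the mathematical idea only; dense Python lists stand in for the sparse distributed matrices the paper actually uses.

```python
def gram(a):
    """A · A-transpose for a binary matrix given as a list of rows."""
    return [[sum(x * y for x, y in zip(r, s)) for s in a] for r in a]

def jaccard_all_pairs(a):
    """All-pairs Jaccard from one matrix product:
    J(i,j) = |Si ∩ Sj| / (|Si| + |Sj| - |Si ∩ Sj|)."""
    inter = gram(a)                   # inter[i][j] = |Si ∩ Sj|
    sizes = [sum(row) for row in a]   # |Si| (the diagonal of the Gram matrix)
    return [[inter[i][j] / (sizes[i] + sizes[j] - inter[i][j])
             for j in range(len(a))] for i in range(len(a))]

# three toy sets over a 5-element universe
A = [[1, 1, 0, 1, 0],   # S0 = {0, 1, 3}
     [1, 0, 0, 1, 1],   # S1 = {0, 3, 4}
     [0, 1, 1, 0, 0]]   # S2 = {1, 2}
J = jaccard_all_pairs(A)
print(J[0][1])  # 0.5  (|S0 ∩ S1| = 2, |S0 ∪ S1| = 4)
```

    Because everything reduces to one sparse matrix product, communication-avoiding distributed matrix-multiplication techniques apply directly, which is what makes the scheme scale.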

    NOVEL ALGORITHMS AND TOOLS FOR LIGAND-BASED DRUG DESIGN

    Get PDF
    Computer-aided drug design (CADD) has become an indispensable component of modern drug discovery projects. Predicting the physicochemical and pharmacological properties of candidate compounds effectively increases the probability that drug candidates will pass later phases of clinical trials. Ligand-based virtual screening exhibits advantages over structure-based drug design in terms of its wide applicability and high computational efficiency. Established chemical repositories and reported bioassays form a gigantic knowledge base from which to derive quantitative structure-activity relationships (QSAR) and structure-property relationships (QSPR). In addition, the rapid advance of machine learning techniques suggests new solutions for mining huge compound databases. In this thesis, a novel ligand classification algorithm, Ligand Classifier of Adaptively Boosting Ensemble Decision Stumps (LiCABEDS), is reported for the prediction of diverse categorical pharmacological properties. LiCABEDS was successfully applied to model 5-HT1A ligand functionality, ligand selectivity of cannabinoid receptor subtypes, and blood-brain-barrier (BBB) passage. LiCABEDS was implemented and integrated with a graphical user interface, data import/export, automated model training/prediction, and project management. In addition, a non-linear ligand classifier was proposed, using a novel Topomer kernel function in a support vector machine. With an emphasis on green high-performance computing, graphics processing units are alternative platforms for computationally expensive tasks. A novel GPU algorithm was designed and implemented to accelerate the calculation of chemical similarities with dense-format molecular fingerprints. Finally, a compound acquisition algorithm was reported for constructing a structurally diverse screening library in order to enhance hit rates in high-throughput screening.
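    The general idea behind boosting decision stumps over fingerprint bits can be sketched as plain AdaBoost, where each weak learner tests a single fingerprint bit. This is generic AdaBoost for illustration, not the published LiCABEDS implementation; the tiny 2-bit fingerprints and labels are invented.

```python
import math

def train_adaboost(X, y, rounds=5):
    """X: list of 0/1 fingerprint bit-lists; y: class labels in {-1, +1}."""
    n = len(X)
    w = [1.0 / n] * n                        # per-example weights
    ensemble = []                            # (alpha, bit, polarity) stumps
    for _ in range(rounds):
        # pick the stump (bit, polarity) with the lowest weighted error
        best = None
        for bit in range(len(X[0])):
            for pol in (1, -1):
                err = sum(wi for wi, x, yi in zip(w, X, y)
                          if pol * (2 * x[bit] - 1) != yi)
                if best is None or err < best[0]:
                    best = (err, bit, pol)
        err, bit, pol = best
        err = max(err, 1e-10)                # avoid log(0) on a perfect stump
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, bit, pol))
        # re-weight: emphasize the examples this stump got wrong
        w = [wi * math.exp(-alpha * yi * pol * (2 * x[bit] - 1))
             for wi, x, yi in zip(w, X, y)]
        s = sum(w)
        w = [wi / s for wi in w]
    return ensemble

def predict(ensemble, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(a * p * (2 * x[bit] - 1) for a, bit, p in ensemble)
    return 1 if score >= 0 else -1

# toy data: bit 0 is predictive of the (hypothetical) pharmacological class
X = [[1, 0], [1, 1], [0, 1], [0, 0]]
y = [1, 1, -1, -1]
model = train_adaboost(X, y)
print([predict(model, x) for x in X])  # [1, 1, -1, -1]
```

    With categorical labels such as agonist/antagonist or BBB+/BBB-, this is the kind of classifier the thesis builds an ensemble of, one stump per boosting round.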

    To Develop a Database Management Tool for Multi-Agent Simulation Platform

    Get PDF
    Recently, there has been a shift from a model-driven approach to a data-driven approach in Agent-Based Modeling and Simulation (ABMS). This trend towards the use of data-driven approaches in simulation aims at feeding more and more of the data available from observation systems into simulation models (Edmonds and Moss, 2005; Hassan, 2009). In a data-driven approach, the empirical data collected from the target system are used not only for the design of the simulation models but also for the initialization, calibration, and evaluation of the output of the simulation platform, as in, e.g., the water resource management and assessment system of the French Adour-Garonne Basin (Gaudou et al., 2013) and the invasion of the Brown Plant Hopper on the rice fields of the Mekong River Delta region in Vietnam (Nguyen et al., 2012d). That raises the question of how to manage empirical data and simulation data in such agent-based simulation platforms. The basic observation we can make is that, while the design and simulation of models have benefited from advances in computer science through the popularized use of simulation platforms like Netlogo (Wilensky, 1999) or GAMA (Taillandier et al., 2012), this is not yet the case for the management of data, which is still often handled in an ad hoc manner. Data management in ABM is one of the limitations of current agent-based simulation platforms. In other words, such a data management tool is needed for building agent-based simulation systems, and the management of the corresponding databases is also an important issue in these systems. In this thesis, I first propose a logical framework for data management in multi-agent based simulation platforms. The proposed framework, called CFBM (Combination Framework of Business intelligence and Multi-agent based platform), combines a Business Intelligence solution with a multi-agent platform and serves several purposes: (1) model and execute multi-agent simulations, (2) manage input and output data of simulations, (3) integrate data from different sources, and (4) analyze high volumes of data. Secondly, I fulfill the need for data management in ABM through an implementation of CFBM in the GAMA platform. This implementation also demonstrates a software architecture that combines Data Warehouse (DWH) and Online Analytical Processing (OLAP) technologies in a multi-agent based simulation system. Finally, I evaluate CFBM for data management in the GAMA platform via the development of Brown Plant Hopper Surveillance Models (BSMs), where CFBM is used not only to manage and integrate the empirical data collected from the target system and the data produced by the simulation model, but also to calibrate and validate the models. The contribution of CFBM lies not only in remedying the limitations of agent-based modeling and simulation with regard to data management, but also in supporting the development of complex simulation systems with large amounts of input and output data, following a data-driven approach.
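    A hypothetical miniature of the idea described above: simulation outputs go into a relational store so they can be aggregated OLAP-style for calibration and validation against empirical data. Here sqlite3 stands in for a real data warehouse; the table, columns, and toy "simulation" are illustrative, not CFBM's actual schema.

```python
import sqlite3

# in-memory database standing in for the data warehouse
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sim_output (run INTEGER, step INTEGER, density REAL)")

# toy "simulation": two runs producing a density value at each of three steps
for run in range(2):
    for step in range(3):
        conn.execute("INSERT INTO sim_output VALUES (?, ?, ?)",
                     (run, step, run + 0.5 * step))

# OLAP-style aggregation used for calibration: mean density per run,
# to be compared against the empirical (observed) dataset
rows = conn.execute(
    "SELECT run, AVG(density) FROM sim_output GROUP BY run").fetchall()
print(rows)  # [(0, 0.5), (1, 1.5)]
```

    Keeping both simulated and empirical data in the same store is what lets a single query drive the calibration loop: rerun the model with new parameters, re-aggregate, and compare.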
