5,933 research outputs found
Recommended from our members
Fast training of self organizing maps for the visual exploration of molecular compounds
Visual exploration of scientific data in the life sciences
is a growing research field due to the large amount of
available data. Kohonen’s Self-Organizing Map (SOM) is
a widely used tool for visualization of multidimensional data.
In this paper we present a fast learning algorithm for SOMs
that uses a simulated annealing method to adapt the learning
parameters. The algorithm has been adopted in a data analysis
framework for the generation of similarity maps. Such maps
provide an effective tool for the visual exploration of large and
multi-dimensional input spaces. The approach has been applied
to data generated during the High Throughput Screening
of molecular compounds; the generated maps allow a visual
exploration of molecules with similar topological properties.
The experimental analysis on real world data from the
National Cancer Institute shows the speed-up of the proposed
SOM training process in comparison to a traditional approach.
The resulting visual landscape groups molecules with similar
chemical properties in densely connected regions
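
For readers unfamiliar with SOMs, the following is a minimal sketch of the classic online Kohonen training rule in plain Python. It is not the fast learning algorithm from the abstract above: the simulated-annealing adaptation of the learning parameters is replaced here by a simple exponential decay, which is an assumption made for illustration. The function names `train_som` and `best_matching_unit`, the decay constant, and all default parameters are hypothetical.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid position whose weight vector is closest to x."""
    return min(
        weights,
        key=lambda pos: sum((wi - xi) ** 2 for wi, xi in zip(weights[pos], x)),
    )


def train_som(data, rows, cols, epochs=20, lr0=0.5, sigma0=None, seed=0):
    """Train a small SOM with the classic online update rule.

    Returns a dict mapping each grid position (r, c) to its weight vector.
    """
    rng = random.Random(seed)
    dim = len(data[0])
    if sigma0 is None:
        sigma0 = max(rows, cols) / 2.0
    # Initialise each unit by copying a randomly chosen input sample.
    weights = {
        (r, c): list(rng.choice(data))
        for r in range(rows)
        for c in range(cols)
    }
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        samples = list(data)
        rng.shuffle(samples)
        for x in samples:
            # Assumed schedule: exponential decay of both the learning rate
            # and the neighbourhood radius (not simulated annealing).
            frac = step / total_steps
            lr = lr0 * math.exp(-3.0 * frac)
            sigma = max(sigma0 * math.exp(-3.0 * frac), 0.3)
            bmu = best_matching_unit(weights, x)
            for pos, w in weights.items():
                # Gaussian neighbourhood measured on the 2D grid.
                grid_d2 = (pos[0] - bmu[0]) ** 2 + (pos[1] - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2.0 * sigma * sigma))
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
            step += 1
    return weights
```

After training, calling `best_matching_unit(weights, x)` for each input vector gives its cell on the 2D map, which is the basis of a similarity map: inputs that land on the same or neighbouring cells are close in the original feature space.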
Context-aware visual exploration of molecular databases
Facilitating the visual exploration of scientific data has
received increasing attention in the past decade or so. Especially
in life science related application areas the amount
of available data has grown at a breathtaking pace. In this
paper we describe an approach that allows for visual inspection
of large collections of molecular compounds. In
contrast to classical visualizations of such spaces we incorporate
a specific focus of analysis, for example the outcome
of a biological experiment such as high-throughput
screening results. The presented method uses this experimental
data to select molecular fragments of the underlying
molecules that have interesting properties and uses the
resulting space to generate a two dimensional map based
on a singular value decomposition algorithm and a self organizing
map. Experiments on real datasets show that
the resulting visual landscape groups molecules of similar
chemical properties in densely connected regions
The BioDICE Taverna plugin for clustering and visualization of biological data: a workflow for molecular compounds exploration
Background: In many experimental pipelines, clustering of multidimensional biological datasets is used to detect
hidden structures in unlabelled input data. Taverna is a popular workflow management system that is used to design
and execute scientific workflows and aid in silico experimentation. The availability of fast unsupervised methods for clustering and visualization in the Taverna platform is important to support a data-driven scientific discovery in complex and explorative bioinformatics applications.
Results: This work presents a Taverna plugin, the Biological Data Interactive Clustering Explorer (BioDICE), that performs clustering of high-dimensional biological data and provides a nonlinear, topology preserving projection for the visualization of the input data and their similarities. The core algorithm in the BioDICE plugin is Fast Learning Self Organizing Map (FLSOM), which is an improved variant of the Self Organizing Map (SOM) algorithm. The plugin generates an interactive 2D map that allows the visual exploration of multidimensional data and the identification of groups of similar objects. The effectiveness of the plugin is demonstrated on a case study related to chemical
compounds.
Conclusions: The number and variety of available tools and its extensibility have made Taverna a popular choice for the development of scientific data workflows. This work presents a novel plugin, BioDICE, which adds a data-driven knowledge discovery component to Taverna. BioDICE provides an effective and powerful clustering tool, which can be adopted for the explorative analysis of biological datasets
Visual data mining: integrating machine learning with information visualization
Today, the data available to tackle many scientific challenges is vast in quantity and diverse in nature. The exploration of heterogeneous information spaces requires suitable mining algorithms as well as effective visual interfaces. Most existing systems concentrate either on mining algorithms or on visualization techniques. Though visual methods developed in information visualization have been helpful, for improved understanding of a complex large high-dimensional dataset, there is a need for an effective projection of such a dataset onto a lower-dimension (2D or 3D) manifold. This paper introduces a flexible visual data mining framework which combines advanced projection algorithms developed in the machine learning domain and visual techniques developed in the information visualization domain. The framework follows Shneiderman’s mantra to provide an effective user interface. The advantage of such an interface is that the user is directly involved in the data mining process. We integrate principled projection methods, such as Generative Topographic Mapping (GTM) and Hierarchical GTM (HGTM), with powerful visual techniques, such as magnification factors, directional curvatures, parallel coordinates, billboarding, and user interaction facilities, to provide an integrated visual data mining framework. Results on a real life high-dimensional dataset from the chemoinformatics domain are also reported and discussed. Projection results of GTM are analytically compared with the projection results from other traditional projection methods, and it is also shown that the HGTM algorithm provides additional value for large datasets. The computational complexity of these algorithms is discussed to demonstrate their suitability for the visual data mining framework
Multi-tier framework for the inferential measurement and data-driven modeling
A framework for inferential measurement and data-driven modeling has been proposed and assessed in several real-world application domains. The architecture of the framework has been structured in multiple tiers to facilitate extensibility and the integration of new components. Each of the four proposed tiers has been assessed in an uncoupled way to verify its suitability. The first tier, dealing with exploratory data analysis, has been assessed with the characterization of the chemical space related to the biodegradation of organic chemicals. This analysis has established relationships between physicochemical variables and biodegradation rates that have been used for model development. At the preprocessing level, a novel method for feature selection based on dissimilarity measures between Self-Organizing Maps (SOM) has been developed and assessed. The proposed method selected more features than others published in the literature but led to models with improved predictive power. Single and multiple data imputation techniques based on the SOM have also been used to recover missing data in a Waste Water Treatment Plant benchmark. A new dynamic method to adjust the centers and widths in Radial Basis Function networks has been proposed to predict water quality. The proposed method outperformed other neural networks. The proposed modeling components have also been assessed in the development of prediction and classification models for biodegradation rates in different media. The results obtained proved the suitability of this approach to develop data-driven models when the complex dynamics of the process prevents the formulation of mechanistic models. The use of rule generation algorithms and Bayesian dependency models has been preliminarily screened to provide the framework with interpretation capabilities.
Preliminary results obtained from the classification of Modes of Toxic Action (MOA) indicate that using MOAs as proxy indicators of the human health effects of chemicals could be a promising approach. Finally, the complete framework has been applied to three different modeling scenarios. A virtual sensor system, capable of inferring product quality indices from primary process variables, has been developed and assessed. The system was integrated with the control system in a real chemical plant, outperforming the multi-linear correlation models usually adopted by chemical manufacturers. A model to predict carcinogenicity from molecular structure for a set of aromatic compounds has been developed and tested. The application of the SOM-dissimilarity feature selection method yielded better results than models published in the literature. Finally, the framework has been used to facilitate a new approach for environmental modeling and risk management within geographical information systems (GIS). The SOM has been successfully used to characterize exposure scenarios and to provide estimations of missing data through geographic interpolation. The combination of SOM and Gaussian Mixture models facilitated the formulation of a new probabilistic risk assessment approach.
VANTED: A system for advanced data analysis and visualization in the context of biological networks
BACKGROUND: Recent advances with high-throughput methods in life-science research have increased the need for automated data analysis and visual exploration techniques. Sophisticated bioinformatics tools are essential to deduce biologically meaningful interpretations from the large amount of experimental data, and help to understand biological processes. RESULTS: We present VANTED, a tool for the visualization and analysis of networks with related experimental data. Data from large-scale biochemical experiments is uploaded into the software via a Microsoft Excel-based form. Then it can be mapped on a network that is either drawn with the tool itself, downloaded from the KEGG Pathway database, or imported using standard network exchange formats. Transcript, enzyme, and metabolite data can be presented in the context of their underlying networks, e.g. metabolic pathways or classification hierarchies. Visualization and navigation methods support the visual exploration of the data-enriched networks. Statistical methods allow analysis and comparison of multiple data sets such as different developmental stages or genetically different lines. Correlation networks can be automatically generated from the data and substances can be clustered according to similar behavior over time. As examples, metabolite profiling and enzyme activity data sets have been visualized in different metabolic maps, correlation networks have been generated and similar time patterns detected. Some relationships between different metabolites were discovered which are in close accordance with the literature. CONCLUSION: VANTED greatly helps researchers in the analysis and interpretation of biochemical data, and thus is a useful tool for modern biological research. VANTED as a Java Web Start Application including a user guide and example data sets is available free of charge at
Environmental risk assessment in the mediterranean region using artificial neural networks
Self-organizing maps have proven to be an appropriate tool for the classification and visualization of complex data sets. Neural networks, such as self-organizing maps (SOM) or fuzzy ARTMAP networks (FAM), are used in this study to assess cumulative environmental impact in different media (groundwater, air, and human health). SOMs are also used to generate maps of contaminant concentrations in groundwater, simulating geostatistical interpolation techniques such as kriging and cokriging. To assess the reliability of the methodologies developed in this thesis, reference procedures are used as points of comparison: the DRASTIC methodology for the study of groundwater vulnerability and the spatio-temporal interpolation method known as Bayesian Maximum Entropy (BME) for air quality analysis.
This thesis contributes to demonstrating the capabilities of neural networks in the development of new methodologies and models that explicitly allow the temporal and spatial dimensions of cumulative risks to be assessed
Data analysis and navigation in high-dimensional chemical and biological spaces
The goal of this master's thesis is to develop and validate a visual data-mining
approach suitable for the screening of chemicals in the context of REACH [Registration, Evaluation, Authorization and
Restriction of Chemicals]. The
proposed approach will facilitate the development and validation of non-testing
methods via the exploration of environmental endpoints and their relationship with
the chemical structure and physicochemical properties of chemicals.
The use of an interactive chemical space data exploration tool using 3D visualization
and navigation will enrich the information available with additional variables like
size, texture and color of the objects of the scene (compounds). The features that
distinguish this approach and make it unique are (i) the integration of multiple data
sources allowing the recovery in real time of complementary information of the
studied compounds, (ii) the integration of several algorithms for the data analysis
(dimensional reduction, generation of composite variables and clustering) and (iii)
direct user interaction with the data through the virtual navigation mechanism. All
this is achieved without the need for specialized hardware or the use of specific
devices and high-cost virtual reality and mixed reality
- …