87 research outputs found

    Adaptive matrix metrics for molecular descriptor assessment in QSPR classification

    QSPR methods are a useful tool in the drug discovery process, since they allow biological or physicochemical properties of a candidate drug to be predicted in advance. To this end, the QSPR method must be as accurate as possible so that its predictions are reliable. Moreover, selecting the molecular descriptors is an important step in building QSPR prediction models that are of low complexity and still accurate. In this work, a matrix-based method is used to transform the original data space of chemical compounds into an alternative space in which compounds with different target properties can be better separated. To use this approach, QSPR is treated as a classification problem. The advantage of adaptive matrix metrics is twofold: they can be used to identify important molecular descriptors and, at the same time, they improve classification accuracy. A recently proposed method based on this concept is extended to multi-class data. The new method is related to linear discriminant analysis and shows better results, although at higher computational cost. An application relating chemical descriptors to hydrophobicity shows promising results.
    Affiliation: Soto, Axel Juan. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Bahía Blanca. Planta Piloto de Ingeniería Química. Universidad Nacional del Sur. Planta Piloto de Ingeniería Química; Argentina
    Affiliation: Strickert, Marc. Leibniz Institute of Plant Genetics and Crop Plant Research; Germany
    Affiliation: Vazquez, Gustavo Esteban. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Bahía Blanca. Planta Piloto de Ingeniería Química. Universidad Nacional del Sur. Planta Piloto de Ingeniería Química; Argentina
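    The idea of an adaptive matrix metric can be illustrated with a minimal sketch. It assumes the common parameterization d(x, y) = (x - y)^T Λ (x - y) with Λ = Ω^T Ω, which the abstract does not spell out; the matrix `omega`, the function names, and the toy vectors are hypothetical and are not taken from the paper. In practice Ω would be learned from labeled compounds, whereas here it is fixed by hand.

```python
def transform(omega, x):
    """Map a descriptor vector x into the alternative space (omega times x)."""
    return [sum(w * a for w, a in zip(row, x)) for row in omega]


def matrix_distance(omega, x, y):
    """Squared distance (x - y)^T (omega^T omega) (x - y)."""
    diff = [a - b for a, b in zip(x, y)]
    projected = transform(omega, diff)
    return sum(p * p for p in projected)


def descriptor_relevance(omega):
    """Diagonal of omega^T omega: one non-negative weight per descriptor."""
    n_descriptors = len(omega[0])
    return [sum(row[j] ** 2 for row in omega) for j in range(n_descriptors)]


# Hand-picked (not learned) matrix: descriptor 0 matters most,
# descriptor 1 only slightly, descriptor 2 is ignored entirely.
omega = [[1.0, 0.0, 0.0],
         [0.0, 0.1, 0.0]]
x = [1.0, 5.0, 3.0]
y = [2.0, 1.0, -4.0]

distance = matrix_distance(omega, x, y)
relevance = descriptor_relevance(omega)
```

    Under this assumption, a descriptor whose diagonal weight is close to zero contributes little to the distance, which is one way such a metric can be read as a descriptor ranking.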

    Correlation-based Data Representation

    The Dagstuhl Seminar 'Similarity-based Clustering and its Application to Medicine and Biology' (07131), held on March 25–30, 2007, provided an excellent atmosphere for in-depth discussions about the research frontier of computational methods for relevant applications of biomedical clustering and beyond. We address some highlighted issues about correlation-based data analysis in this seminar postribution. First, some prominent correlation measures are briefly revisited. Then, the focus is put on Pearson correlation, because of its widespread use in biomedical sciences and because of its analytic accessibility. A connection to the Euclidean distance of z-score transformed data is outlined. Cost function optimization of correlation-based data representation is discussed, for which, finally, applications to visualization and clustering of gene expression data are given.
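    The link between Pearson correlation and Euclidean distance mentioned in this abstract can be checked numerically. The sketch below relies on the standard identity that, for two vectors of length n z-scored with the population standard deviation, the squared Euclidean distance equals 2n(1 - r). The function names and sample values are illustrative assumptions, not code from the paper.

```python
import math


def zscore(values):
    """Center to zero mean and scale to unit population standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]


def pearson(x, y):
    """Pearson correlation as the mean product of z-scores."""
    zx, zy = zscore(x), zscore(y)
    return sum(a * b for a, b in zip(zx, zy)) / len(x)


def squared_euclidean(x, y):
    """Sum of squared coordinate differences."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


# Arbitrary example profiles (hypothetical data).
x = [1.0, 2.0, 4.0, 7.0, 11.0]
y = [2.0, 1.0, 5.0, 6.0, 12.0]

r = pearson(x, y)
d2 = squared_euclidean(zscore(x), zscore(y))
expected = 2 * len(x) * (1 - r)  # should match d2 up to rounding
```

    Because the two quantities differ only by this fixed affine relation, ranking pairs by distance between z-scored profiles gives the reverse of ranking them by correlation.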

    Visualization of Processes in Self-Learning Systems

    One aspect of self-organizing systems is their desired ability to be self-learning, i.e., to adapt dynamically to conditions in their environment. This quality is problematic, especially when it comes to applications in security- or safety-sensitive areas. Here, a step towards more trustworthy systems could be taken by making the processes of a system transparent. An important means of giving feedback to an operator is the visualization of the internal processes of a system. In this position paper we address the problem of visualizing dynamic processes, especially in self-learning systems. We take an existing self-learning system from the field of computer vision as an example from which we derive questions of general interest, such as possible options for visualizing the flow of information in a dynamic learning system or the visualization of symbolic data. As a side effect, the visualization of learning processes may provide a better understanding of the underlying principles of learning in general, i.e., also in biological systems. That may also facilitate improved designs of future self-learning systems.

    Correlation-maximizing surrogate gene space for visual mining of gene expression patterns in developing barley endosperm tissue

    Background: Micro- and macroarray technologies help acquire thousands of gene expression patterns covering important biological processes during plant ontogeny. Particularly, faithful visualization methods are beneficial for revealing interesting gene expression patterns and functional relationships of coexpressed genes. Such screening helps to gain deeper insights into regulatory behavior and cellular responses, as will be discussed for expression data of developing barley endosperm tissue. For that purpose, high-throughput multidimensional scaling (HiT-MDS), a recent method for similarity-preserving data embedding, is substantially refined and used for (a) assessing the quality and reliability of centroid gene expression patterns, and for (b) derivation of functional relationships of coexpressed genes of endosperm tissue during barley grain development (0–26 days after flowering).
    Results: Temporal expression profiles of 4824 genes at 14 time points are faithfully embedded into two-dimensional displays. Thereby, similar shapes of coexpressed genes get closely grouped by a correlation-based similarity measure. As a main result, by using power transformation of correlation terms, a characteristic cloud of points with bipolar sandglass shape is obtained that is inherently connected to expression patterns of pre-storage, intermediate and storage phase of endosperm development.
    Conclusion: The new HiT-MDS-2 method helps to create global views of expression patterns and to validate centroids obtained from clustering programs. Furthermore, functional gene annotation for developing endosperm barley tissue is successfully mapped to the visualization, making easy localization of major centroids of enriched functional categories possible.

    Genotyping by sequencing and a newly developed mRNA-GBS approach to link population genetic and transcriptome analyses reveal pattern differences between sites and treatments in red clover (Trifolium pratense L.)

    The important worldwide forage crop red clover (Trifolium pratense L.) is widely cultivated as cattle feed and for soil improvement. Wild populations and landraces have great natural diversity that could be used to improve cultivated red clover. However, to date, there is still insufficient knowledge about the natural genetic and phenotypic diversity of the species. Here, we developed a low-cost, complexity-reduced mRNA analysis (mRNA-GBS) and compared the results with population genetic (GBS) and previously published mRNA-Seq data to assess whether intraspecific variation within and between populations and transcriptome responses can be analyzed simultaneously. The mRNA-GBS approach was successful. SNP analyses from the mRNA-GBS approach revealed patterns comparable to the GBS results, but due to site-specific multifactorial influences of environmental responses as well as conceptual and methodological limitations of mRNA-GBS, it was not possible to link transcriptome analyses with reduced complexity and sequencing depth to previously published greenhouse and field expression studies. Nevertheless, using short sequences upstream of the poly(A) tail of mRNA to reduce complexity is a promising approach that combines population genetics and expression profiling to analyze many individuals with trait differences simultaneously and cost-effectively, even in non-model species. Our study design across different regions in Germany was also challenging. The use of reduced-complexity differential expression analyses most likely overlays site-specific patterns due to highly complex plant responses under natural conditions.

    Comprehensive transcriptome analysis unravels the existence of crucial genes regulating primary metabolism during adventitious root formation in Petunia hybrida

    To identify specific genes determining the initiation and formation of adventitious roots (AR), a microarray-based transcriptome analysis in the stem base of cuttings of Petunia hybrida (line W115) was conducted. A microarray carrying 24,816 unique, non-redundant annotated sequences was hybridized to probes derived from different stages of AR formation. After exclusion of wound-responsive and root-regulated genes, 1,354 genes were identified that were significantly and specifically induced during various phases of AR formation. Based on a recent physiological model distinguishing three metabolic phases in AR formation, the present paper focuses on the response of genes related to particular metabolic pathways. Key genes involved in primary carbohydrate metabolism, such as those mediating apoplastic sucrose unloading, were induced at the early sink establishment phase of AR formation. Transcriptome changes also pointed to a possible role of trehalose metabolism and SnRK1 (sucrose non-fermenting 1-related protein kinase) in sugar sensing during this early step of AR formation. Symplastic sucrose unloading and nucleotide biosynthesis were the major processes induced during the later recovery and maintenance phases. Moreover, transcripts involved in peroxisomal beta-oxidation were up-regulated during different phases of AR formation. In addition to metabolic pathways, the analysis revealed the activation of cell division at the two later phases and in particular the induction of G1-specific genes in the maintenance phase. Furthermore, results point towards a specific demand for certain mineral nutrients starting in the recovery phase.

    Unifying generative and discriminative learning principles

    Background: The recognition of functional binding sites in genomic DNA remains one of the fundamental challenges of genome research. During the last decades, a plethora of different and well-adapted models has been developed, but little attention has been paid to the development of different and similarly well-adapted learning principles. Only recently was it noticed that discriminative learning principles can be superior to generative ones in diverse bioinformatics applications, too.
    Results: Here, we propose a generalization of generative and discriminative learning principles containing the maximum likelihood, maximum a posteriori, maximum conditional likelihood, maximum supervised posterior, generative-discriminative trade-off, and penalized generative-discriminative trade-off learning principles as special cases, and we illustrate its efficacy for the recognition of vertebrate transcription factor binding sites.
    Conclusions: We find that the proposed learning principle helps to improve the recognition of transcription factor binding sites, enabling better computational approaches for extracting as much information as possible from valuable wet-lab data. We make all implementations available in the open-source library Jstacs so that this learning principle can be easily applied to other classification problems in the field of genome and epigenome analysis.

    De-Novo Discovery of Differentially Abundant Transcription Factor Binding Sites Including Their Positional Preference

    Transcription factors are a main component of gene regulation as they activate or repress gene expression by binding to specific binding sites in promoters. The de-novo discovery of transcription factor binding sites in target regions obtained by wet-lab experiments is a challenging problem in computational biology, which has not been fully solved yet. Here, we present a de-novo motif discovery tool called Dispom for finding differentially abundant transcription factor binding sites that models existing positional preferences of binding sites and adjusts the length of the motif in the learning process. Evaluating Dispom, we find that its prediction performance is superior to existing tools for de-novo motif discovery for 18 benchmark data sets with planted binding sites, and for a metazoan compendium based on experimental data from micro-array, ChIP-chip, ChIP-DSL, and DamID as well as Gene Ontology data. Finally, we apply Dispom to find binding sites differentially abundant in promoters of auxin-responsive genes extracted from Arabidopsis thaliana microarray data, and we find a motif that can be interpreted as a refined auxin-responsive element predominantly positioned in the 250-bp region upstream of the transcription start site. Using an independent data set of auxin-responsive genes, we find in genome-wide predictions that the refined motif is more specific for auxin-responsive genes than the canonical auxin-responsive element. In general, Dispom can be used to find differentially abundant motifs in sequences of any origin. However, the positional distribution learned by Dispom is especially beneficial if all sequences are aligned to some anchor point, such as the transcription start site in the case of promoter sequences. We demonstrate that the combination of searching for differentially abundant motifs and inferring a position distribution from the data is beneficial for de-novo motif discovery. Hence, we make the tool freely available as a component of the open-source Java framework Jstacs and as a stand-alone application at http://www.jstacs.de/index.php/Dispom

    Parsimonious Higher-Order Hidden Markov Models for Improved Array-CGH Analysis with Applications to Arabidopsis thaliana

    Array-based comparative genomic hybridization (Array-CGH) is an important technology in molecular biology for the detection of DNA copy number polymorphisms between closely related genomes. Hidden Markov Models (HMMs) are popular tools for the analysis of Array-CGH data, but current methods are based only on first-order HMMs, which have limited ability to model spatial dependencies between measurements of closely adjacent chromosomal regions. Here, we develop parsimonious higher-order HMMs enabling the interpolation between a mixture model ignoring spatial dependencies and a higher-order HMM exhaustively modeling spatial dependencies. We apply parsimonious higher-order HMMs to the analysis of Array-CGH data of the accessions C24 and Col-0 of the model plant Arabidopsis thaliana. We compare these models against first-order HMMs and other existing methods using a reference of known deletions and sequence deviations. We find that parsimonious higher-order HMMs clearly improve the identification of these polymorphisms. Moreover, we perform a functional analysis of identified polymorphisms, revealing novel details of genomic differences between C24 and Col-0. Additional model evaluations are done on widely considered Array-CGH data of human cell lines, indicating that parsimonious HMMs are also well-suited for the analysis of non-plant-specific data. All these results indicate that parsimonious higher-order HMMs are useful for Array-CGH analyses. An implementation of parsimonious higher-order HMMs is available as part of the open-source Java library Jstacs (www.jstacs.de/index.php/PHHMM).

    Self-Organizing Neural Networks for Sequence Processing

    This work investigates the self-organizing representation of temporal data in prototype-based neural networks. Extensions of the supervised learning vector quantization (LVQ) and the unsupervised self-organizing map (SOM) are considered in detail. The principle of Hebbian learning through prototypes yields compact data models that can be easily interpreted by similarity reasoning. In order to obtain a robust prototype dynamic, LVQ is extended by neighborhood cooperation between neurons to prevent a strong dependence on the initial prototype locations. Additionally, implementations of more general, adaptive metrics are studied with a particular focus on the built-in detection of data attributes relevant to a given classification task. For unsupervised sequence processing, two modifications of SOM are pursued: the SOM for structured data (SOMSD), realizing an efficient back-reference to the previous best matching neuron in a triangular low-dimensional neural lattice, and the merge SOM (MSOM), expressing the temporal context as a fractal combination of the previously most active neuron and its context. The first extension, SOMSD, tackles data dimension reduction and planar visualization; the second, MSOM, is designed for obtaining higher quantization accuracy. The supplied experiments underline the data modeling quality of the presented methods.
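    For orientation, the basic prototype update that LVQ builds on can be sketched as follows. This is the textbook LVQ1 rule with squared Euclidean distance, shown under the assumption that it is a reasonable baseline; it does not include the neighborhood cooperation, adaptive metrics, or sequence-processing extensions described in the abstract. The function names, learning rate, and toy data are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equally long vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def lvq1_step(prototypes, labels, x, y, learning_rate=0.1):
    """Apply one LVQ1 update in place and return the winner's index.

    The closest prototype is pulled towards x if its label matches y
    and pushed away from x otherwise.
    """
    winner = min(range(len(prototypes)),
                 key=lambda i: squared_distance(prototypes[i], x))
    sign = 1.0 if labels[winner] == y else -1.0
    prototypes[winner] = [w + sign * learning_rate * (a - w)
                          for w, a in zip(prototypes[winner], x)]
    return winner


# Two prototypes, one per class (toy values).
prototypes = [[0.0, 0.0], [1.0, 1.0]]
labels = [0, 1]

# A class-0 sample near prototype 0 attracts it.
first_winner = lvq1_step(prototypes, labels, [0.2, 0.0], 0)
# A class-0 sample near prototype 1 (class 1) repels it.
second_winner = lvq1_step(prototypes, labels, [0.9, 1.0], 0)
```

    Since only the single winning prototype moves in this baseline, the result depends on where the prototypes start, which is the sensitivity the neighborhood cooperation mentioned in the abstract is meant to reduce.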