591 research outputs found

    Ensembles of Probabilistic Principal Surfaces and Competitive Evolution on Data: Two Different Approaches to Data Classification

    Get PDF
    Probabilistic Principal Surfaces (PPS) offer very powerful visualization and classification capabilities and overcome most of the shortcomings of other neural tools such as SOM, GTM, etc. More specifically, PPS build a probability density function for a given data set of patterns lying in a D-dimensional space (with D ≫ 3) which can be expressed in terms of a limited number of latent variables lying in a Q-dimensional space (Q is usually 2-3) and which can be used to visualize the data in the latent space. PPS may also be arranged in ensembles to tackle very complex classification tasks. Competitive Evolution on Data (CED), by contrast, is an evolutionary system in which the possible solutions (cluster centroids) compete to conquer the largest possible number of resources (data) and thus partition the input data set into clusters. We discuss the application of Spherical-PPS to two data sets coming, respectively, from astronomy (Great Observatories Origins Deep Survey) and from genetics (microarray data from the yeast genome), and of CED to the genetics data only.
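
    The CED idea sketched above, candidate centroids competing to capture data points, can be illustrated with a toy competitive-learning clusterer. The following is a minimal Python sketch under loose assumptions, not the authors' actual evolutionary algorithm: here each point is simply "won" by its nearest centroid, which then moves toward it.

        import numpy as np

        rng = np.random.default_rng(0)

        def competitive_clustering(X, k=3, epochs=20, lr=0.1):
            # Toy stand-in for CED: centroids compete for data points
            # (resources); the winner moves toward each point it captures.
            centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
            for _ in range(epochs):
                for x in X[rng.permutation(len(X))]:
                    winner = np.argmin(np.linalg.norm(centroids - x, axis=1))
                    centroids[winner] += lr * (x - centroids[winner])
                lr *= 0.9  # decaying learning rate
            # final partition: each point belongs to its nearest centroid
            labels = np.argmin(np.linalg.norm(X[:, None] - centroids[None], axis=2), axis=1)
            return centroids, labels

        # three well-separated 2-D blobs as toy "resources" to be conquered
        X = np.vstack([rng.normal(loc=c, size=(100, 2)) for c in ([0, 0], [5, 5], [0, 5])])
        centroids, labels = competitive_clustering(X)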

    Carbon Sequestration in Synechococcus Sp.: From Molecular Machines to Hierarchical Modeling

    Full text link
    The U.S. Department of Energy recently announced the first five grants for the Genomes to Life (GTL) Program. The goal of this program is to "achieve the most far-reaching of all biological goals: a fundamental, comprehensive, and systematic understanding of life." While more information about the program can be found at the GTL website (www.doegenomestolife.org), this paper provides an overview of one of the five GTL projects funded, "Carbon Sequestration in Synechococcus Sp.: From Molecular Machines to Hierarchical Modeling." This project is a combined experimental and computational effort emphasizing the development, prototyping, and application of new computational tools and methods to elucidate the biochemical mechanisms of carbon sequestration in Synechococcus Sp., an abundant marine cyanobacterium known to play an important role in the global carbon cycle. Understanding, predicting, and perhaps manipulating carbon fixation in the oceans has long been a major focus of biological oceanography and has more recently been of interest to a broader audience of scientists and policy makers. It is clear that the oceanic sinks and sources of CO2 are important terms in the global environmental response to anthropogenic atmospheric inputs of CO2 and that oceanic microorganisms play a key role in this response. However, the relationship between this global phenomenon and the biochemical mechanisms of carbon fixation in these microorganisms is poorly understood. The project includes five subprojects: an experimental investigation, three computational biology efforts, and a fifth that addresses computational infrastructure challenges of relevance to this project and the Genomes to Life program as a whole. Our experimental effort is designed to provide biology and data to drive the computational efforts and includes significant investment in developing new experimental methods for uncovering protein partners, characterizing protein complexes, and identifying new binding domains. We will also develop and apply new data measurement and statistical methods for analyzing microarray experiments. Our computational efforts include coupling molecular simulation methods with knowledge discovery from diverse biological data sets for high-throughput discovery and characterization of protein-protein complexes, and developing a set of novel capabilities for inference of regulatory pathways in microbial genomes across multiple sources of information through the integration of computational and experimental technologies. These capabilities will be applied to Synechococcus regulatory pathways to characterize their interaction map and identify component proteins in these pathways. We will also investigate methods for combining experimental and computational results with visualization and natural language tools to accelerate discovery of regulatory pathways. Furthermore, given that the ultimate goal of this effort is to develop a systems-level understanding of how the Synechococcus genome affects carbon fixation at the global scale, we will develop and apply a set of tools for capturing the carbon fixation behavior of Synechococcus at different levels of resolution. Finally, because the explosion of data being produced by high-throughput experiments requires data analysis and models which are more computationally complex, more heterogeneous, and require coupling to ever-increasing amounts of experimentally obtained data in varying formats, we have also established a companion computational infrastructure to support this effort as well as the Genomes to Life program as a whole.
    Peer Reviewed. http://deepblue.lib.umich.edu/bitstream/2027.42/63164/1/153623102321112746.pd

    Text Mining and Gene Expression Analysis Towards Combined Interpretation of High Throughput Data

    Get PDF
    Microarrays can capture gene expression activity for thousands of genes simultaneously and thus make it possible to analyze cell physiology and disease processes on the molecular level. The interpretation of microarray gene expression experiments profits from knowledge about the analyzed genes and proteins and the biochemical networks in which they play a role. The trend is towards the development of data analysis methods that integrate diverse data types. Currently, the most comprehensive biomedical knowledge source is a large repository of free-text articles. Text mining makes it possible to automatically extract and use information from texts. This thesis addresses two key aspects, biomedical text mining and gene expression data analysis, with the focus on providing high-quality methods and data that contribute to the development of integrated analysis approaches. The work is structured in three parts; each part begins by providing the relevant background, and each chapter describes the developed methods as well as applications and results.

    Part I deals with biomedical text mining. Chapter 2 summarizes the relevant background of text mining; it describes text mining fundamentals, important text mining tasks, applications and particularities of text mining in the biomedical domain, and evaluation issues. In Chapter 3, a method for generating high-quality gene and protein name dictionaries is described. The analysis of the generated dictionaries revealed important properties of individual nomenclatures and the underlying databases (Fundel and Zimmer, 2006). The dictionaries are publicly available via a Wiki, a web service, and several client applications (Szugat et al., 2005). In Chapter 4, methods for the dictionary-based recognition of gene and protein names in texts and their mapping onto unique database identifiers are described. These methods make it possible to extract information from texts and to integrate text-derived information with data from other sources. Three named entity identification systems have been set up, two of them building upon the previously existing tool ProMiner (Hanisch et al., 2003). All of them have shown very good performance in the BioCreAtIvE challenges (Fundel et al., 2005a; Hanisch et al., 2005; Fundel and Zimmer, 2007). In Chapter 5, a new method for relation extraction (Fundel et al., 2007) is presented. It was applied to the largest collection of biomedical literature abstracts, and thus a comprehensive network of human gene and protein relations has been generated. A classification approach (Küffner et al., 2006) can be used to further specify relation types, e.g., as activating, direct physical, or gene-regulatory relations.

    Part II deals with gene expression data analysis. Gene expression data needs to be processed so that differentially expressed genes can be identified. Gene expression data processing consists of several sequential steps; two important steps are normalization, which aims at removing systematic variance between measurements, and quantification of differential expression by p-value and fold change determination. Numerous methods exist for these tasks. Chapter 6 describes the relevant background of gene expression data analysis; it presents the biological and technical principles of microarrays and gives an overview of the most relevant data processing steps. Finally, it provides a short introduction to osteoarthritis, which is the focus of the analyzed gene expression data sets. In Chapter 7, quality criteria for the selection of normalization methods are described, and a method for the identification of differentially expressed genes is proposed that is appropriate for data with large intensity variances between spots representing the same gene (Fundel et al., 2005b). Furthermore, a system is described that selects an appropriate combination of feature selection method and classifier, and thus identifies genes which lead to good classification results and show consistent behavior in different sample subgroups (Davis et al., 2006). The analysis of several gene expression data sets dealing with osteoarthritis is described in Chapter 8. This chapter contains the biomedical analysis of relevant disease processes and distinct disease stages (Aigner et al., 2006a), and a comparison of various microarray platforms and osteoarthritis models.

    Part III deals with integrated approaches and thus provides the connection between Parts I and II. Chapter 9 gives an overview of different types of integrated data analysis approaches, with a focus on approaches that integrate gene expression data with manually compiled data, large-scale networks, or text mining. In Chapter 10, a method for the identification of genes which are consistently regulated and have a coherent literature background (Küffner et al., 2005) is described. This method indicates how gene and protein name identification and gene expression data can be integrated to return clusters which contain genes that are relevant for the respective experiment, together with literature information that supports interpretation. Finally, Chapter 11 presents ideas on how the described methods can contribute to current research, as well as possible future directions.
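
    As an illustration of the dictionary-based name recognition described in Part I, the following minimal Python sketch matches dictionary synonyms in text and maps them onto unique database identifiers. The dictionary entries here are hypothetical examples; real systems such as ProMiner handle spelling variants, ambiguity, and far larger nomenclatures.

        import re

        # hypothetical toy dictionary: synonym -> unique database identifier
        GENE_DICT = {
            "TP53": "EntrezGene:7157",
            "p53": "EntrezGene:7157",
            "BRCA1": "EntrezGene:672",
        }

        def tag_genes(text):
            # exact, word-bounded matching of each synonym in the text
            hits = []
            for synonym, gene_id in GENE_DICT.items():
                for m in re.finditer(r"\b" + re.escape(synonym) + r"\b", text):
                    hits.append((m.start(), m.group(), gene_id))
            return sorted(hits)

        print(tag_genes("Mutations in TP53 stabilize p53, which binds BRCA1."))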
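
    Likewise, the "p-value and fold change" quantification step from Part II can be sketched in a few lines of Python. The snippet assumes log2-scale expression matrices for two sample groups and uses a per-gene t-test, which is one common choice rather than the thesis's specific method.

        import numpy as np
        from scipy import stats

        def differential_expression(group_a, group_b):
            # group_a, group_b: (samples, genes) arrays of log2 intensities
            log2_fc = group_a.mean(axis=0) - group_b.mean(axis=0)    # log2 fold change
            _, p_values = stats.ttest_ind(group_a, group_b, axis=0)  # per-gene t-test
            return log2_fc, p_values

        rng = np.random.default_rng(1)
        a = rng.normal(size=(5, 1000))
        b = rng.normal(size=(6, 1000))
        b[:, :50] += 2.0  # first 50 genes up-regulated in group b
        fc, p = differential_expression(a, b)
        hits = np.where((np.abs(fc) > 1) & (p < 0.01))[0]  # simple thresholding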

    Clustering Algorithms for Microarray Data Mining

    Get PDF
    This thesis presents a systems engineering model of modern drug discovery processes and related systems integration requirements. Some challenging problems include the integration of public information content with proprietary corporate content, support for different types of scientific analyses, and automated analysis tools motivated by diverse forms of biological data. To capture the requirements of the discovery system, we identify the processes, users, and scenarios to form a UML use case model. We then define the object-oriented system structure and attach behavioral elements. We also look at how object-relational database extensions can be applied to such analysis. The next portion of the thesis studies the performance of clustering algorithms based on LVQ, SVMs, and other machine learning algorithms on two types of analyses: functional and phenotypic classification. We found that LVQ initialized with the LBG codebook yields performance comparable to that of the optimal separating surfaces generated by related SVM kernels. We also describe a novel similarity measure, called the unnormalized symmetric Kullback-Leibler measure, based on unnormalized expression values. Since the Mercer criterion cannot be applied to this measure, we compared the performance of this similarity measure with the log-Euclidean distance in the LVQ algorithm. The two distance measures perform similarly on cDNA arrays, while the unnormalized symmetric Kullback-Leibler measure outperforms the log-Euclidean distance on certain phenotypic classification problems. Pre-filtering algorithms that find discriminating instances based on PCA, the Find Similar function, and IB3 were also investigated. The Find Similar method gives the best performance in terms of multiple criteria.
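
    The unnormalized symmetric Kullback-Leibler measure lends itself to a short Python sketch. The form below, the symmetrized KL sum applied directly to unnormalized positive expression vectors, is an assumption about the exact definition; the thesis's formulation may differ in detail.

        import numpy as np

        def sym_kl(x, y, eps=1e-12):
            # symmetric KL-style dissimilarity on unnormalized positive vectors:
            # sum_i (x_i - y_i) * log(x_i / y_i); symmetric in x, y and >= 0
            x = np.asarray(x, dtype=float) + eps
            y = np.asarray(y, dtype=float) + eps
            return float(np.sum((x - y) * np.log(x / y)))

        print(sym_kl([120.0, 30.0, 5.0], [100.0, 45.0, 8.0]))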

    Postgenomics: Proteomics and Bioinformatics in Cancer Research

    Get PDF
    Now that the human genome is completed, the characterization of the proteins encoded by the sequence remains a challenging task. The study of the complete protein complement of the genome, the "proteome," referred to as proteomics, will be essential if new therapeutic drugs and new disease biomarkers for early diagnosis are to be developed. Research efforts are already underway to develop the technology necessary to compare the specific protein profiles of diseased versus nondiseased states. These technologies provide a wealth of information and rapidly generate large quantities of data. Processing the large amounts of data will lead to useful predictive mathematical descriptions of biological systems which will permit rapid identification of novel therapeutic targets and of metabolic disorders. Here, we present an overview of the current status and future research approaches in defining the cancer cell's proteome in combination with different bioinformatics and computational biology tools toward a better understanding of health and disease.

    A Review of Principal Component Analysis Algorithm for Dimensionality Reduction

    Get PDF
    Big databases are increasingly widespread and therefore hard to understand. In exploratory biomedical science, big data in health research is particularly exciting because data-driven analyses can move faster than hypothesis-based research. Principal Component Analysis (PCA) is a method for reducing the dimensionality of such datasets: it improves interpretability while losing as little information as possible, which it achieves by creating new variables (the principal components) that are uncorrelated with each other. Finding these new variables reduces to solving an eigenvalue/eigenvector problem. PCA can be described as an adaptive data analysis technique because variants have been developed that adapt to different data types and structures. This review starts by introducing the basic ideas of PCA, describes related concepts, and discusses what PCA can do; it then reviews fifteen PCA articles published in the last three years.
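
    The eigenvalue/eigenvector formulation mentioned above can be written in a few lines of Python. This is a minimal NumPy sketch of plain PCA (center, eigendecompose the covariance matrix, project), not any of the adaptive variants the review covers.

        import numpy as np

        def pca(X, n_components=2):
            Xc = X - X.mean(axis=0)                 # center each variable
            cov = np.cov(Xc, rowvar=False)          # covariance matrix
            eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalue/eigenvector problem
            order = np.argsort(eigvals)[::-1]       # largest variance first
            components = eigvecs[:, order[:n_components]]
            return Xc @ components, eigvals[order]  # uncorrelated new variables

        X = np.random.default_rng(2).normal(size=(200, 10))
        scores, variances = pca(X, n_components=2)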

    A primer on molecular biology

    No full text
    Modern molecular biology provides a rich source of challenging machine learning problems. This tutorial chapter aims to provide the necessary biological background knowledge required to communicate with biologists and to understand and properly formalize a number of the most interesting problems in this application domain. The largest part of the chapter (its first section) is devoted to the cell as the basic unit of life. Four aspects of cells are reviewed in sequence: (1) the molecules that cells make use of (above all, proteins, RNA, and DNA); (2) the spatial organization of cells ("compartmentalization"); (3) the way cells produce proteins ("protein expression"); and (4) cellular communication and evolution (of cells and organisms). In the second section, an overview is provided of the most frequent measurement technologies, data types, and data sources. Finally, important open problems in the analysis of these data (bioinformatics challenges) are briefly outlined.

    Development of mathematical methods for modeling biological systems

    Get PDF
    • …