239 research outputs found

    Innovative Algorithms and Evaluation Methods for Biological Motif Finding

    Biological motifs are defined as overly recurring sub-patterns in biological systems. Sequence motifs and network motifs are examples of biological motifs. Because of their wide range of applications, many algorithms and computational tools have been developed to search for biological motifs efficiently. As a result, there are more computationally derived motifs than experimentally validated motifs, and how to validate the biological significance of the ‘candidate motifs’ becomes an important question. Some sequence motifs are verified by their structural similarities or their functional roles in DNA or protein sequences, and stored in databases. However, the biological role of network motifs has not yet been validated, and currently no databases exist for this purpose. In this thesis, we focus not only on computational efficiency but also on the biological meaning of the motifs. We provide an efficient way to incorporate biological information into clustering analysis methods: for example, a sparse nonnegative matrix factorization (SNMF) method is used with Chou-Fasman parameters for protein motif finding. Biological network motifs are searched by various clustering algorithms with Gene Ontology (GO) information. Experimental results show that the algorithms perform better than existing algorithms by producing a larger number of high-quality biological motifs. In addition, we apply biological network motifs to the discovery of essential proteins. Essential proteins are defined as a minimum set of proteins that are vital for development to a fertile adult and for cellular life in an organism. We design a new centrality algorithm with biological network motifs, named MCGO, and score proteins in a protein-protein interaction (PPI) network to find essential proteins. MCGO is also combined with other centrality measures to predict essential proteins using machine learning techniques. This thesis makes three contributions to the study of biological motifs: 1) clustering analysis is used efficiently, and biological information is easily integrated with the analysis; 2) we focus more on the biological meaning of motifs by adding biological knowledge to the algorithms and by suggesting biologically related evaluation methods; 3) biological network motifs are successfully applied to a practical problem, the prediction of essential proteins.

    A novel computational system for identification of biological processes from multi-dimensional high-throughput genomic data

    Identifying potential toxicity signaling pathways could guide future animal studies and support human risk assessment and intervention efforts. This thesis describes a novel computational approach for identifying biological processes and pathways that are significantly associated with a disease pathology from time-series, dose-response gene expression data. Our system employs a novel constrained non-negative matrix factorization algorithm and Monte Carlo Markov chain simulation to identify underlying patterns in mRNA gene expression data. Quantitative pathology can be used as a pattern constraint. The patterns found can be thought of as functions that influence a gene's expression. Using a database of curated gene sets, we can identify biological processes that are significantly related to a pathology. We also developed a computational model for integrating miRNA with mRNA time-series microarray data along with disease pathology. The dynamic temporal regulatory effects of miRNA are not well known, and a single miRNA may regulate many mRNAs. The integrated analysis includes identifying both mRNA and miRNA that are significantly similar to the quantitative pathology. Potential regulatory miRNA/mRNA target pairs are then identified through databases of both predicted and validated pairs. Finally, potential target pairs are filtered, keeping only pairs that demonstrate regulatory effects in the expression data. Multi-walled carbon nanotubes (MWCNT) are known for their transient inflammatory and progressive fibrotic pulmonary effects; however, the mechanisms underlying these pathologies are unknown. In this thesis, we used time-series microarray data of global lung mRNA and miRNA expression isolated from 160 C57BL/6J mice exposed by pharyngeal aspiration to vehicle or 10, 20, 40, or 80 µg MWCNT at 1, 7, 28, or 56 days post-exposure. Quantitative pathology patterns of MWCNT-induced inflammation (bronchoalveolar lavage score) and fibrosis (Sirius Red staining, quantitative morphometric analysis) were obtained from separate studies. Understanding the regulatory networks between mRNA and miRNA at different stages would be beneficial for understanding the complex path of disease development. The identified genes and pathways may be useful for determining biomarkers of MWCNT-induced lung inflammation and fibrosis for early detection of disease. Our computational approach detects biologically relevant processes with and without pathology information. The identified significant processes and genes are supported by evidence in the literature and by biological validation.

    Computational approaches to understanding infectious disease

    Infectious diseases derive from organisms such as viruses, bacteria, fungi and parasites that can be passed from person to person, transmitted via bites from insects or animals, or acquired through ingestion of contaminated food or water or environmental exposure. Infectious diseases cause roughly 20% of annual deaths worldwide, including many children under the age of five. In developing countries, these diseases remain a major public health problem. They can also cause societal and economic burdens through life-long disability. We need a better understanding of these diseases with a view towards the goals of prevention and cure. The advent of whole-genome transcriptional profiling technology and powerful computational resources has made it possible to study infectious diseases on a genome-wide scale. Such studies can lead to improvements in diagnostic tools as well as preventive measures such as vaccines. The work of this thesis focuses on a number of projects with the common thread of developing and applying computational methods to extract biological information from high-throughput transcriptional data related to infectious diseases. These include (1) the identification of gene signatures related to B-cell proliferation that predict an influenza vaccine-induced antibody response; (2) the study of the physiological state of the Plasmodium falciparum malaria parasite when sequestered in human tissue; (3) the identification of the similarities and differences in the response to five anti-viral vaccines. To achieve the scientific goals of these projects I developed two new computational methods that can be utilized more broadly for the downstream interpretation of results from enrichment analyses of whole transcriptome profiles. These are a combined visualization and annotation approach called the Constellation Map, and the Leading Edge Metagene Detector, which systematically consolidates functionally related genes from multiple sets representing highly enriched biological pathways and processes in the comparison of expression data of two biological phenotypes. The application of these computational approaches and tools in this dissertation enabled a better understanding of the biological mechanisms related to human vaccine response. The software packages developed are freely available for use by biological investigators across many fields.

    Integrative Analysis Methods for Biological Problems Using Data Reduction Approaches

    The "big data" revolution of the past decade has allowed researchers to procure or access biological data at an unprecedented scale, on the front of both volume (low-cost high-throughput technologies) and variety (multi-platform genomic profiling). This has fueled the development of new integrative methods, which combine and consolidate across multiple sources of data in order to gain generalizability, robustness, and a more comprehensive systems perspective. The key challenges faced by this new class of methods primarily relate to heterogeneity, whether it is across cohorts from independent studies or across the different levels of genomic regulation. While the different perspectives among data sources are invaluable in providing different snapshots of the global system, such diversity also brings forth many analytic difficulties, as each source introduces a distinctive element of noise. In recent years, many styles of data integration have appeared to tackle this problem, ranging from Bayesian frameworks to graphical models, a wide assortment as diverse as the biology they intend to explain. My focus in this work is dimensionality reduction-based methods of integration, which offer the advantages of efficiency in high dimensions (an asset among genomic datasets) and simplicity in allowing for elegant mathematical extensions. In the course of these chapters I will describe the biological motivations, the methodological directions, and the applications of three canonical reductionist approaches for relating information across multiple data groups. PhD, Statistics, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/138564/1/yangzi_1.pd

    Development of Biclustering Techniques for Gene Expression Data Modeling and Mining

    Next-generation sequencing technologies can generate large-scale biological data with higher resolution, better accuracy, and lower technical variation than their array-based counterparts. RNA sequencing (RNA-Seq) can generate genome-scale gene expression data in biological samples at a given moment, facilitating a better understanding of cell functions at the genetic and cellular levels. The abundance of gene expression datasets provides an opportunity to identify genes with similar expression patterns across multiple conditions, i.e., co-expression gene modules (CEMs). Genome-scale identification of CEMs can be modeled and solved by biclustering, a two-dimensional data mining technique that clusters the rows and columns of a gene expression matrix simultaneously. Compared with traditional clustering, which targets global patterns, biclustering can predict local patterns. This unique feature makes biclustering very useful when applied to big gene expression data, since genes that participate in a cellular process are only active in specific conditions and thus are usually co-expressed under a subset of all conditions. The combination of biclustering and large-scale gene expression data holds promising potential for condition-specific functional pathway/network analysis. However, existing biclustering tools do not perform satisfactorily on high-resolution RNA-Seq data, mainly because they lack (i) a consideration of the high sparsity of RNA-Seq data, especially scRNA-Seq data, and (ii) an understanding of the underlying transcriptional regulation signals of the observed gene expression values. QUBIC2, a novel biclustering algorithm, is designed for large-scale bulk RNA-Seq and single-cell RNA-Seq (scRNA-Seq) data analysis. Critical novelties of the algorithm include (i) a truncated model to handle the unreliable quantification of genes with low or moderate expression; (ii) a Gaussian mixture distribution and an information-divergence objective function to capture shared transcriptional regulation signals among a set of genes; (iii) a dual strategy to expand the core biclusters, aiming to save dropouts from the background; and (iv) a statistical framework to evaluate the significance of all the identified biclusters. Method validation on comprehensive data sets suggests that QUBIC2 had superior performance in functional module detection and cell type classification. Applications to temporal and spatial data demonstrated that QUBIC2 could derive meaningful biological information from scRNA-Seq data. Also presented in this dissertation is QUBICR. This R package is characterized by an average efficiency improvement of 82% compared to the source C code of QUBIC. It provides a comprehensive set of functions to facilitate biclustering-based biological studies, including the discretization of expression data, query-based biclustering, bicluster expansion, bicluster comparison, heatmap visualization of any identified biclusters, and co-expression network elucidation. Finally, a systematic summary is provided of the primary applications of biclustering for biological data and more advanced applications for biomedical data. It will help researchers analyze their big data effectively and generate valuable biological knowledge and novel insights with higher efficiency.

    Whole transcriptomic network analysis using Co-expression Differential Network Analysis (CoDiNA)

    Biological and medical sciences are increasingly acknowledging the significance of gene co-expression networks for investigating complex systems, phenotypes or diseases. Typically, complex phenotypes are investigated under varying conditions. While approaches for comparing nodes and links in two networks exist, almost no methods for the comparison of multiple networks are available and, to the best of our knowledge, no comparative method allows for whole transcriptomic network analysis. However, it is the aim of many studies to compare networks of different conditions, for example, tissues, diseases, treatments, time points, or species. Here we present a method for the systematic comparison of an unlimited number of networks, with an unlimited number of transcripts: Co-expression Differential Network Analysis (CoDiNA). In particular, CoDiNA detects links and nodes that are common, specific or different among the networks. We developed a statistical framework to normalize between these different categories of common or changed network links and nodes, resulting in a comprehensive network analysis method, more sophisticated than simply comparing the presence or absence of network nodes. Applying CoDiNA to a neurogenesis study, we identified candidate genes involved in neuronal differentiation. We experimentally validated one candidate, demonstrating that its overexpression resulted in a significant disturbance in the underlying gene regulatory network of neurogenesis. Using clinical studies, we compared whole transcriptome co-expression networks from individuals with or without HIV and active tuberculosis (TB) and detected signature genes specific to HIV. Furthermore, analyzing multiple cancer transcription factor (TF) networks, we identified common and distinct features for particular cancer types. These CoDiNA applications demonstrate the successful detection of genes associated with specific phenotypes. Moreover, CoDiNA can also be used for comparing other types of undirected networks, for example, metabolic, protein-protein interaction, ecological and psychometric networks. CoDiNA is publicly available as an R package in CRAN (https://CRAN.R-project.org/package=CoDiNA).
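
    The three link categories named in the CoDiNA abstract (common, specific, different) can be illustrated with a small sketch. This is not the CoDiNA package's implementation and it leaves out the statistical normalization the abstract mentions. The function name `classify_links`, the `threshold` cutoff, and the toy weights are assumptions made for illustration only. A link is treated here as "common" when it is present with the same sign in every network, "different" when it is present in every network with conflicting signs, and "specific" when it is present in only some networks.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def _normalize(network: Dict[Edge, float]) -> Dict[Edge, float]:
    """Store each undirected edge under a sorted (a, b) key."""
    return {tuple(sorted(edge)): weight for edge, weight in network.items()}


def classify_links(
    networks: List[Dict[Edge, float]], threshold: float = 0.3
) -> Dict[Edge, str]:
    """Label every link as 'common', 'different' or 'specific'.

    `threshold` is a hypothetical cutoff: weights whose absolute value is
    below it are treated as absent links.
    """

    def sign(weight: float) -> int:
        if abs(weight) < threshold:
            return 0
        return 1 if weight > 0 else -1

    normalized = [_normalize(net) for net in networks]
    all_edges = set()
    for net in normalized:
        all_edges.update(net)

    labels: Dict[Edge, str] = {}
    for edge in sorted(all_edges):
        signs = [sign(net.get(edge, 0.0)) for net in normalized]
        if all(s == 0 for s in signs):
            continue  # link is absent everywhere after thresholding
        if 0 in signs:
            labels[edge] = "specific"   # present in only some networks
        elif len(set(signs)) == 1:
            labels[edge] = "common"     # same sign in every network
        else:
            labels[edge] = "different"  # present everywhere, signs disagree
    return labels


# Toy co-expression weights for two conditions (made-up numbers).
healthy = {("A", "B"): 0.8, ("B", "C"): 0.6, ("A", "C"): 0.7}
disease = {("B", "A"): 0.9, ("B", "C"): -0.5, ("C", "D"): 0.1}
print(classify_links([healthy, disease]))
```

    The same function accepts any number of networks, which mirrors the abstract's emphasis on comparing more than two conditions at once.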

    Probabilistic analysis of the human transcriptome with side information

    Understanding the functional organization of genetic information is a major challenge in modern biology. Following the initial publication of the human genome sequence in 2001, advances in high-throughput measurement technologies and efficient sharing of research material through community databases have opened up new views on the study of living organisms and the structure of life. In this thesis, novel computational strategies have been developed to investigate a key functional layer of genetic information, the human transcriptome, which regulates the function of living cells through protein synthesis. The key contributions of the thesis are general exploratory tools for high-throughput data analysis that have provided new insights into cell-biological networks, cancer mechanisms and other aspects of genome function. A central challenge in functional genomics is that high-dimensional genomic observations are associated with high levels of complex and largely unknown sources of variation. By combining statistical evidence across multiple measurement sources and the wealth of background information in genomic data repositories, it has been possible to resolve some of the uncertainties associated with individual observations and to identify functional mechanisms that could not be detected based on individual measurement sources. Statistical learning and probabilistic models provide a natural framework for such modeling tasks. Open source implementations of the key methodological contributions have been released to facilitate further adoption of the developed methods by the research community. Comment: Doctoral thesis. 103 pages, 11 figures

    Deep learning models for modeling cellular transcription systems

    The cellular signal transduction system (CSTS) plays a fundamental role in maintaining the homeostasis of a cell by detecting changes in its environment and orchestrating a response. Perturbations of the CSTS lead to diseases such as cancers. Almost all CSTSs are involved in regulating the expression of certain genes and lead to signature changes in gene expression. Therefore, the gene expression profile of a cell is a readout of the state of its CSTS and could be used to infer the CSTS. However, a gene expression profile is a convoluted mixture of the responses to all active signaling pathways in cells, so it is difficult to find the genes associated with an individual pathway. An efficient way of de-convoluting the signals embedded in the gene expression profile is needed. At the beginning of the thesis, we applied Pearson correlation coefficient analysis to study cellular signals transduced from ceramide species (lipids) to genes. We found significant correlations between specific ceramide species or ceramide groups and gene expression. We showed that various dihydroceramide families regulated distinct subsets of target genes predicted to participate in distinct biologic processes. However, it’s well known that the signaling pathway structure is hierarchical. Useful information may not be fully detected if only linear models are used to study the CSTS. More complex non-linear models are needed to represent the hierarchical structure of the CSTS. This motivated us to investigate contemporary deep learning models (DLMs). Later, we applied various deep hierarchical models to learn a distributed representation of the statistical structures embedded in transcriptomic data. The models learn and represent the hierarchical organization of transcriptomic machinery. In addition, they provide an abstract representation of the statistical structure of transcriptomic data with flexibility and different degrees of granularity. We showed that deep hierarchical models were capable of learning biologically sensible representations of the data (e.g., the hidden units in the first hidden layer could represent transcription factors) and revealing novel insights regarding the machinery regulating gene expression. We also showed that the model outperformed state-of-the-art methods such as Elastic-Net Linear Regression, Support Vector Machine and Non-Negative Matrix Factorization.
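
    The Pearson correlation analysis mentioned at the start of this abstract can be sketched in a few lines. This is an illustration under stated assumptions, not the thesis's pipeline: the function names `pearson` and `correlate_genes`, the gene labels, and all numbers are hypothetical, and no significance testing or multiple-testing correction is shown.

```python
import math
from typing import Dict, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


def correlate_genes(
    lipid_levels: Sequence[float], expression: Dict[str, Sequence[float]]
) -> Dict[str, float]:
    """Correlate one lipid profile with each gene's expression profile."""
    return {gene: pearson(lipid_levels, values) for gene, values in expression.items()}


# Made-up measurements across four samples.
ceramide = [1.0, 2.0, 3.0, 4.0]
genes = {
    "geneA": [2.0, 4.0, 6.0, 8.0],  # rises with the lipid
    "geneB": [8.0, 6.0, 4.0, 2.0],  # falls as the lipid rises
    "geneC": [1.0, 3.0, 2.0, 4.0],  # noisy positive relation
}
for gene, r in correlate_genes(ceramide, genes).items():
    print(gene, round(r, 3))
```

    Because the coefficient only captures linear association, a sketch like this also shows the limitation the abstract raises as its motivation for non-linear, hierarchical models.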

    Cell Type-specific Analysis of Human Interactome and Transcriptome

    Cells are the fundamental building blocks of complex tissues in higher-order organisms. These cells take different forms and shapes to perform a broad range of functions. What makes a cell uniquely eligible to perform a task, however, is not well understood; neither is the defining characteristic that groups similar cells together to constitute a cell type. Even for known cell types, the underlying pathways that mediate cell type-specific functionality are not readily available. These functions, in turn, contribute to cell type-specific susceptibility in various disorders.