thesis

Towards the identification of regulatory networks using statistical and information theoretical methods on the mammalian transcriptome

Abstract

Our understanding of the genetic machinery that regulates the expression of thousands of genes controlling cell differentiation or responding to external signals is still highly incomplete. Recently discovered regulatory mechanisms, such as those mediated by microRNAs, expand our knowledge but also add a further layer of complexity. Since all genes are first transcribed into RNA, gene activity and differential gene expression can be estimated by measuring RNA expression. Several techniques for measuring gene expression on a large scale at the RNA level have been developed. In this work, data generated with microarray technology, one of the most commonly used methods, were analyzed with the aim of extracting novel biological regulatory structures, and several aspects of the analysis of such large gene expression datasets are discussed. Because this is nowadays a common task, much has been written about the various methods in great detail, but often from a technical or statistical point of view. However, the aim of a biologist planning and carrying out a microarray experiment lies in the acquisition of novel biological findings. There is still a gap between the experimentalists and the method-developing community: experimentalists are often not familiar with the latest methods based on modern statistics, for example those used in information theory, whereas method developers normally do not deal extensively with current biological questions. This work therefore tries to give an additional view on the field of microarray analysis and the applicability of diverse methods. The focus is on discussing commonly used methods with respect to their usage, their underlying biological assumptions, their possible interpretations, and their pros and cons.
Furthermore, beyond ordinary differential gene expression analyses, this work also concentrates on an unbiased search for hidden information in gene expression patterns. The first section of chapter 1 gives a general overview of the main biological principles. The term transcriptome and its composition of several RNA types are introduced, and the mechanisms controlling gene expression are presented. The chapter further explains the basic principles of microarray technology and discusses the advantages and limitations of this method. Finally, using two different biological models, commonly used analysis methods as well as a few more specialized and less popular ones are presented. Here, less emphasis is placed on a complete and detailed mathematical description and more on the general applicability and the biological outcome of these tools. Chapter 2 extensively discusses the use of a blind source separation technique, independent component analysis (ICA), on a two-class microarray dataset. Monocytes extracted from human donors were differentiated into macrophages using M-CSF (Macrophage Colony-Stimulating Factor). By applying ICA to the data, so-called \textit{expression modes} or \textit{sub-modes} could be extracted. Based on their associated biological annotations, these sub-modes were then combined into \textit{meta modes} and discussed in detail. In this way, several known biological signalling pathways as well as regulatory mechanisms involved in monocyte differentiation could be reconstructed. Furthermore, a novel biological finding, the remaining proliferative potential of macrophages, could be identified [Lutter et al., 2008]. In chapter 3, ICA was used again, but in this case applied to time-dependent microarray data, and the results were compared to those of a very common analysis method, hierarchical clustering. The time-dependent data were derived from human monocytes infected with the intracellular pathogen F. tularensis.
Using the clustering approach, groups of genes associated with distinct timepoints were identified, and the temporal behaviour of the genetic immune response could be reconstructed. In parallel, ICA was used to decompose the data into expression modes (analogously to chapter 2). These modes were then mapped onto the experimental time course. Compared to the clustering results, the immune response reconstructed with ICA was more detailed, and the temporal activity of distinct genes could be resolved more precisely [Lutter et al., 2009]. In chapter 4, three different microarray datasets were used to confirm a suggested regulatory mechanism. The observation that about 50% of all microRNAs in humans and mice are intronic, and therefore coupled with the expression of protein-coding genes, so-called host genes, allowed established large-scale gene expression measurement techniques to be used to approximate microRNA expression. Since a single microRNA can regulate up to dozens of protein-coding genes, the hypothesis that this expressional linkage includes an additional functional component was investigated. Using the standard hierarchical clustering algorithm and an approach based on gene annotations, this hypothesis could essentially be confirmed.
