309 research outputs found
A cross-species transcriptomics approach to identify genes involved in leaf development
Background: We have made use of publicly available gene expression data to identify transcription factors and transcriptional modules (regulons) associated with leaf development in Populus. Different tissue types were compared to identify genes informative in the discrimination of leaf and non-leaf tissues. Transcriptional modules within this set of genes were identified in a much wider set of microarray data collected from leaves in a number of developmental, biotic, abiotic and transgenic experiments.
Results: Transcription factors that were over represented in leaf EST libraries and that were useful for discriminating leaves from other tissues were identified, revealing that the C2C2-YABBY, CCAAT-HAP3 and 5, MYB, and ZF-HD families are particularly important in leaves. The expression of transcriptional modules and transcription factors was examined across a number of experiments to select those that were particularly active during the early stages of leaf development. Two transcription factors were found to collocate to previously published Quantitative Trait Loci (QTL) for leaf length. We also found that miRNA family 396 may be important in the control of leaf development, with three members of the family collocating with clusters of leaf development QTL.
Conclusion: This work provides a set of candidate genes involved in the control and processes of leaf development. This resource can be used for a wide variety of purposes such as informing the selection of candidate genes for association mapping or for the selection of targets for reverse genetics studies to further understanding of the genetic control of leaf size and shape.
Construction of a global map of human gene expression : the process, tools and analysis
This thesis studies human gene expression space using high-throughput gene expression data from DNA microarrays. In molecular biology, high-throughput techniques allow the expression of tens of thousands of genes to be measured numerically at the same time. In a single study, such data are traditionally obtained from a limited number of sample types with a small number of replicates. For organism-wide analysis, such data have been largely unavailable, and the global structure of the human transcriptome has remained unknown.
This thesis introduces a human transcriptome map of different biological entities and an analysis of its general structure. The map is constructed from gene expression data from the two largest public microarray data repositories, GEO and ArrayExpress. The creation of this map contributed to the development of ArrayExpress by identifying and retrofitting previously unusable and missing data and by improving access to its data. It also contributed to the creation of several new tools for microarray data manipulation and to the establishment of data exchange between GEO and ArrayExpress.
The data integration for the global map required the creation of a new large ontology of human cell types, disease states, organism parts and cell lines. The ontology was used in a new method, based on text mining and decision trees, for automatically converting human-readable free-text microarray annotations into a categorised format. Data comparability, and minimisation of the systematic measurement errors characteristic of each laboratory in this large cross-laboratory dataset, were ensured by computing a range of microarray data quality metrics and excluding incomparable data. The structure of the global map of human gene expression was then explored by principal component analysis and hierarchical clustering, using heuristics and help from another purpose-built sample ontology.
A preface and motivation for the construction and analysis of a global map of human gene expression is given by the analysis of two microarray datasets of human malignant melanoma. The analysis of these sets incorporates an indirect comparison of statistical methods for finding differentially expressed genes and points to the need to study gene expression on a global level.
All cells of a multicellular organism contain the same set of genes. The appearance and function of cells are determined by which combinations of genes are active. Gene expression in a cell can be measured with high-throughput molecular biology methods such as DNA microarrays. A typical DNA microarray experiment measures gene activity in a small number of different cell or tissue types. Studying gene expression with larger numbers of samples is often not possible, and differences in activity at the level of the whole organism remain unknown.
This thesis presents a map of hundreds of cell and tissue types for use in research on human gene activity, and examines its structure. The data were collected from more than 200 separate studies and contain information on gene expression in normal and diseased cell and tissue types originating from more than 160 laboratories. The map was created by combining data from the world's two largest DNA microarray databases (GEO and ArrayExpress). Creating this map contributed to the development of ArrayExpress by improving researchers' access to the data and by correcting known errors. It also contributed to the development of computational tools for manipulating DNA microarray data and to establishing data exchange between GEO and ArrayExpress.
Processing and analysing large amounts of data is possible only if the data are organised systematically. The descriptions of the biological samples attached to the gene expression map were systematised by replacing the original sample descriptions with a few highly informative keywords. These keywords were then arranged hierarchically. This hierarchy was used for automatic grouping of the samples in data visualisation and analysis.
It is known that the differences observed in the gene expression of a biological sample are larger when the measurements are made in two different laboratories than when the measurement is repeated in the same laboratory. Because the data used to create the comprehensive gene expression map came from many laboratories, it was important to ensure that this so-called laboratory effect would not bias the analysis results. For this reason, all data used to create the map were carefully checked for quality and comparability.
The original incentive for establishing a comprehensive map of human gene expression came from the analysis of two malignant skin cancer samples. The aim of the skin cancer study was to identify genes whose activity is linked to the malignant cell type. The search for these genes brought out the limitations of using small numbers of cells and tissues and the need for a more comprehensive study of gene expression.
Collective analysis of multiple high-throughput gene expression datasets
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University London.
Modern technologies have resulted in the production of numerous high-throughput biological datasets. However, the development of capable computational methods has not kept pace with the generation of new high-throughput datasets. Amongst the most popular biological high-throughput datasets are gene expression datasets (e.g. microarray datasets). This work targets this aspect by proposing a suite of computational methods which can analyse multiple gene expression datasets collectively. The focal method in this suite is the unification of clustering results from multiple datasets using external specifications (UNCLES). This method applies clustering separately to multiple heterogeneous datasets which measure the expression of the same set of genes, and then combines the resulting partitions in accordance with one of two types of external specifications: type A identifies the subsets of genes that are consistently co-expressed in all of the given datasets, while type B identifies the subsets of genes that are consistently co-expressed in one subset of datasets while being poorly co-expressed in another subset of datasets. This extends the types of questions that can be addressed by computational methods, because existing clustering, consensus clustering, and biclustering methods cannot address these objectives. Moreover, in order to assist in setting some of the parameters required by UNCLES, the M-N scatter plots technique is proposed. These methods, and less mature versions of them, have been validated and applied to numerous real datasets from the biological contexts of budding yeast, bacteria, human red blood cells, and malaria. In collaboration with biologists, these applications have led to various biological insights.
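As a rough illustration of the type A objective only, and not of the UNCLES algorithm itself, the following sketch intersects hard partitions: genes stay together only if they share a cluster in every dataset. The gene names, cluster labels, function name and `min_size` parameter are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def consistent_groups(
    partitions: Sequence[Dict[str, int]], min_size: int = 2
) -> List[List[str]]:
    """Return groups of genes that fall in the same cluster in every partition.

    Each partition maps gene name -> cluster label for one dataset. Labels are
    compared only within a dataset, so they need not match across datasets.
    """
    # Only genes measured in all datasets can be compared.
    shared = set(partitions[0])
    for part in partitions[1:]:
        shared &= set(part)

    # Genes with the same label signature were co-clustered in every dataset.
    by_signature: Dict[Tuple[int, ...], List[str]] = defaultdict(list)
    for gene in sorted(shared):
        signature = tuple(part[gene] for part in partitions)
        by_signature[signature].append(gene)

    groups = [genes for genes in by_signature.values() if len(genes) >= min_size]
    return sorted(groups)


# Hypothetical clustering results for the same genes in three datasets.
dataset_1 = {"g1": 0, "g2": 0, "g3": 1, "g4": 1, "g5": 0}
dataset_2 = {"g1": 2, "g2": 2, "g3": 0, "g4": 0, "g5": 1}
dataset_3 = {"g1": 1, "g2": 1, "g3": 0, "g4": 1, "g5": 1}

groups = consistent_groups([dataset_1, dataset_2, dataset_3])
print(groups)
```

Intersecting hard partitions is the strictest possible reading of "consistently co-expressed"; a single disagreeing dataset splits a group, which is why a method of this kind would normally relax the agreement criterion.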
In yeast, the role of the poorly understood gene CMR1 in the yeast cell cycle has been further elucidated. Also, a novel subset of poorly understood yeast genes has been discovered with an expression profile consistently negatively correlated with the well-known ribosome biogenesis genes. Bacterial data analysis has identified two clusters of negatively correlated genes. Analysis of data from human red blood cells has produced some hypotheses regarding the regulation of the pathways producing such cells. On the other hand, malarial data analysis is still at a preliminary stage. Taken together, this thesis provides an original integrative suite of computational methods which scrutinise multiple gene expression datasets collectively to address previously unresolved questions, and provides the results and findings of many applications of these methods to real biological datasets from multiple contexts.
National Institute for Health Research (NIHR) and the Brunel College of Engineering, Design and Physical Science
Analysis of large-scale molecular biological data using self-organizing maps
Modern high-throughput technologies such as microarrays, next generation sequencing and mass spectrometry provide huge amounts of data per measurement and challenge traditional analyses. New strategies of data processing, visualization and functional analysis are needed. This thesis presents an approach which applies a machine learning technique known as self-organizing maps (SOMs). SOMs enable a parallel sample- and feature-centered view of molecular phenotypes combined with strong visualization and second-level analysis capabilities.
We developed a comprehensive analysis and visualization pipeline based on SOMs. The unsupervised SOM mapping projects the initially high number of features, such as gene expression profiles, to meta-feature clusters of similar and hence potentially co-regulated single features. This reduction of dimension is attained by the re-weighting of primary information and, in contrast to simple filtering approaches, does not entail a loss of primary information. The meta-data provided by the SOM algorithm is visualized in terms of intuitive mosaic portraits. Sample-specific and common properties shared between samples emerge as a handful of localized spots in the portraits, collecting groups of co-regulated and co-expressed meta-features. These characteristic color patterns reflect the data landscape of each sample and promote immediate identification of (meta-)features of interest. It will be demonstrated that SOM portraits transform large and heterogeneous sets of molecular biological data into an atlas of sample-specific texture maps which can be directly compared in terms of similarities and dissimilarities. Spot-clusters of correlated meta-features can be extracted from the SOM portraits in a subsequent step of aggregation. This spot-clustering effectively reduces the dimensionality of the data in two subsequent steps towards a handful of signature modules in an unsupervised fashion.
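The mapping of single features onto meta-features can be illustrated with a minimal self-organizing map. This is a generic textbook-style sketch, not the pipeline developed in the thesis; the grid size, learning-rate and radius schedules, and the toy expression profiles are all assumptions chosen for illustration.

```python
import math
import random
from typing import Dict, List, Tuple

Unit = Tuple[int, int]  # (row, column) position on the SOM grid


def sq_dist(a: List[float], b: List[float]) -> float:
    """Squared Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_unit(weights: Dict[Unit, List[float]], profile: List[float]) -> Unit:
    """Return the grid position whose weight vector is closest to the profile."""
    return min(weights, key=lambda unit: sq_dist(weights[unit], profile))


def train_som(
    profiles: List[List[float]], rows: int, cols: int, epochs: int = 50, seed: int = 0
) -> Dict[Unit, List[float]]:
    """Train a small SOM; each unit's weight vector acts as a meta-feature."""
    rng = random.Random(seed)
    dim = len(profiles[0])
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs
        rate = 0.01 + 0.5 * decay  # assumed learning-rate schedule
        radius = 0.5 + 0.5 * max(rows, cols) * decay  # assumed neighbourhood schedule
        for profile in rng.sample(profiles, len(profiles)):  # shuffled pass
            win_r, win_c = best_unit(weights, profile)
            for (r, c), w in weights.items():
                grid_d2 = (r - win_r) ** 2 + (c - win_c) ** 2
                influence = math.exp(-grid_d2 / (2.0 * radius ** 2))
                # Pull the unit toward the profile, more strongly near the winner.
                for i in range(dim):
                    w[i] += rate * influence * (profile[i] - w[i])
    return weights


# Toy profiles over three samples: ten "rising" and ten "falling" genes.
data_rng = random.Random(1)
rising = [[data_rng.uniform(-0.05, 0.05), 0.5, 1.0 + data_rng.uniform(-0.05, 0.05)]
          for _ in range(10)]
falling = [[1.0 + data_rng.uniform(-0.05, 0.05), 0.5, data_rng.uniform(-0.05, 0.05)]
           for _ in range(10)]

som = train_som(rising + falling, rows=3, cols=3)
print("rising profile maps to unit", best_unit(som, [0.0, 0.5, 1.0]))
print("falling profile maps to unit", best_unit(som, [1.0, 0.5, 0.0]))
```

Each of the nine units ends up holding an averaged profile, so genes with similar behaviour share a unit while dissimilar genes land in different regions of the grid.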
Furthermore, we demonstrate that analysis techniques provide enhanced resolution if applied to the meta-features. The improved discrimination power of meta-features in downstream analyses such as hierarchical clustering, independent component analysis or pairwise correlation analysis is ascribed to essentially two facts: firstly, the set of meta-features better represents the diversity of patterns and modes inherent in the data, and secondly, it also possesses better signal-to-noise characteristics than a comparable collection of single features.
In addition to the pattern-driven feature selection in the SOM portraits, we apply statistical measures to detect significantly differential features between sample classes. Implementation of scoring measurements supplements the basal SOM algorithm. Further, two variants of functional enrichment analyses are introduced which link sample-specific patterns of the meta-feature landscape with biological knowledge and support functional interpretation of the data based on the ‘guilt by association’ principle.
Finally, case studies selected from different ‘OMIC’ realms are presented in this thesis. In particular, molecular phenotype data derived from expression microarrays (mRNA, miRNA), sequencing (DNA methylation, histone modification patterns) or mass spectrometry (proteome), and also genotype data (SNP-microarrays) is analyzed. It is shown that the SOM analysis pipeline implies strong application capabilities and covers a broad range of potential purposes ranging from time series and treatment-vs.-control experiments to discrimination of samples according to genotypic, phenotypic or taxonomic classifications
Towards a System Level Understanding of Non-Model Organisms Sampled from the Environment: A Network Biology Approach
The acquisition and analysis of datasets including multi-level omics and physiology from non-model species, sampled from field populations, is a formidable challenge, which so far has prevented the application of systems biology approaches. If successful, these could contribute enormously to improving our understanding of how populations of living organisms adapt to environmental stressors relating to, for example, pollution and climate. Here we describe the first application of a network inference approach integrating transcriptional, metabolic and phenotypic information representative of wild populations of the European flounder fish, sampled at seven estuarine locations in northern Europe with different degrees and profiles of chemical contaminants. We identified network modules, whose activity was predictive of environmental exposure and represented a link between molecular and morphometric indices. These sub-networks represented both known and candidate novel adverse outcome pathways representative of several aspects of human liver pathophysiology such as liver hyperplasia, fibrosis, and hepatocellular carcinoma. At the molecular level these pathways were linked to TNF alpha, TGF beta, PDGF, AGT and VEGF signalling. More generally, this pioneering study has important implications as it can be applied to model molecular mechanisms of compensatory adaptation to a wide range of scenarios in wild populations
Data Integration for Regulatory Module Discovery
Genomic data relating to the functioning of individual genes and their products are
rapidly being produced using many different and diverse experimental techniques.
Each piece of data provides information on a specific aspect of the cell regulation
process. Integration of these diverse types of data is essential in order to identify
biologically relevant regulatory modules. In this thesis, we address this challenge by
analyzing the nature of these datasets and propose new techniques of data integration.
Since microarray data is not available in quantities that are required for valid inference,
many researchers have taken the blind integrative approach where data from
diverse microarray experiments are merged. In order to understand the validity of
this approach, we start this thesis with studying the heterogeneity of microarray
datasets. We have used KL divergence between individual dataset distributions as
well as an empirical technique proposed by us to calculate functional similarity between
the datasets. Our results indicate that we should not use a blind integration
of datasets and much care should be taken to ensure that we mix only similar types
of data. We should also be careful about the choice of normalization method.
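A minimal sketch of the KL-divergence comparison, assuming each dataset is summarised as a histogram of its expression values. The binning, the add-one smoothing and the toy values are illustrative assumptions; the thesis's own empirical similarity measure is not reproduced here.

```python
import math
from typing import List, Sequence


def histogram(values: Sequence[float], lo: float, hi: float, bins: int) -> List[float]:
    """Return a smoothed probability histogram of the values over [lo, hi]."""
    counts = [1.0] * bins  # add-one smoothing keeps every bin non-zero
    width = (hi - lo) / bins
    for v in values:
        idx = int((v - lo) / width)
        counts[min(max(idx, 0), bins - 1)] += 1.0  # clamp out-of-range values
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats for two distributions over the same bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Toy expression values: b is a slight shift of a, c is a large shift.
dataset_a = [0.1 * i for i in range(100)]
dataset_b = [0.1 * i + 0.5 for i in range(100)]
dataset_c = [0.1 * i + 5.0 for i in range(100)]

p_a, p_b, p_c = (histogram(d, 0.0, 15.0, 15) for d in (dataset_a, dataset_b, dataset_c))
kl_ab = kl_divergence(p_a, p_b)
kl_ac = kl_divergence(p_a, p_c)
print(f"KL(a || b) = {kl_ab:.4f}, KL(a || c) = {kl_ac:.4f}")
```

A much larger divergence for the strongly shifted dataset is the kind of signal that would argue against merging it blindly with the others. KL divergence is asymmetric, so the direction of comparison has to be fixed or both directions averaged.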
Next, we propose a semi-supervised spectral clustering method which integrates two
diverse types of data for the task of gene regulatory module discovery. The technique
uses constraints derived from DNA-binding, PPI and TF-gene interactions datasets
to guide the clustering (spectral) of microarray experiments. Our results on yeast
stress and cell-cycle microarray data indicate that the integration leads to more
biologically significant results.
Finally, we propose a technique that integrates datasets under the principle of maximum
entropy. We argue that this is the most valid approach in an unsupervised
setting where we have no other evidence regarding the weights to be assigned to individual
datasets. Our experiments with yeast microarray, PPI, DNA-binding and
TF-gene interactions datasets show improved biological significance of results
Integrative Modeling of Transcriptional Regulation in Response to Autoimmune Disease Therapies
Rheumatoid arthritis (RA) and multiple sclerosis (MS) are generally classified as autoimmune diseases. Immunomodulatory drugs are used to treat them, such as TNF-alpha blockers (e.g. Etanercept) for RA and IFN-beta preparations (e.g. Betaferon and Avonex) for MS. To date, the molecular mechanisms of these therapies remain largely unknown. Moreover, their efficacy and tolerability are insufficient in some patients.
In this work, the transcriptional response in the blood of patients to each of these three therapies was investigated in order to better understand how these drugs act. Network inference methods were applied with the aim of reconstructing the gene regulatory networks (GRNs) of the genes whose expression changed. The starting point of each analysis was a gene expression dataset. First, genes that are up- or down-regulated after the start of therapy were filtered from it. Next, the regulatory regions of these genes were analysed for transcription factor binding sites (TFBS). Finally, to derive GRN models, a new network inference algorithm (TILAR) was used. TILAR distinguishes between genes and TFs and describes the regulatory effects between them by a system of linear equations. TILAR also allows prior knowledge about gene-TF and TF-gene interactions to be incorporated.
As a result, complex network structures were reconstructed that describe the regulatory relationships between the genes that are differentially expressed over the course of the therapies. For the Etanercept therapy, a subnetwork was found containing genes that show lower expression levels in RA patients who respond very well to the drug. The analysis of GRNs can thus contribute to a better understanding of therapy-associated processes and reveal transcriptional differences between patients