40 research outputs found

    Construction of a global map of human gene expression : the process, tools and analysis

    This thesis studies human gene expression space using high-throughput gene expression data from DNA microarrays. In molecular biology, high-throughput techniques allow numerical measurement of the expression of tens of thousands of genes simultaneously. In a single study, such data is traditionally obtained from a limited number of sample types with a small number of replicates. For organism-wide analysis, such data has been largely unavailable, and the global structure of the human transcriptome has remained unknown. This thesis introduces a human transcriptome map of different biological entities and an analysis of its general structure. The map is constructed from gene expression data from the two largest public microarray data repositories, GEO and ArrayExpress. The creation of this map contributed to the development of ArrayExpress by identifying and retrofitting previously unusable and missing data and by improving access to its data. It also contributed to the creation of several new tools for microarray data manipulation and to the establishment of data exchange between GEO and ArrayExpress. The data integration for the global map required the creation of a new large ontology of human cell types, disease states, organism parts and cell lines. The ontology was used in a new text-mining and decision-tree-based method for automatically converting human-readable free-text microarray data annotations into a categorised format. Data comparability, and minimisation of the systematic measurement errors characteristic of each laboratory in this large cross-laboratory integrated dataset, were ensured by computing a range of microarray data quality metrics and excluding incomparable data. The structure of the global map of human gene expression was then explored by principal component analysis and hierarchical clustering, using heuristics and help from another purpose-built sample ontology.
    A preface and motivation for the construction and analysis of a global map of human gene expression is given by the analysis of two microarray datasets of human malignant melanoma. The analysis of these sets incorporates an indirect comparison of statistical methods for finding differentially expressed genes and points to the need to study gene expression on a global level.
    All cells of a multicellular organism contain the same set of genes. The appearance and function of cells are determined by which combinations of genes are active. Gene expression in a cell can be measured with high-throughput molecular biology methods such as DNA microarrays. A typical DNA microarray experiment measures gene activity in a small number of different cell or tissue types. Studying gene expression using larger numbers of samples is often not possible, and differences in activity at the level of the organism remain unknown. This thesis presents a map of hundreds of cell and tissue types for use in studying human gene activity, and examines its structure. The data examined was collected from more than 200 separate studies and contains information on gene expression in normal and diseased cell and tissue types originating from more than 160 laboratories. The map was created by combining data from the world's two largest DNA microarray databases (GEO and ArrayExpress). Creating this map contributed to the development of ArrayExpress by improving the availability of data to researchers and by correcting known errors. It also contributed to the development of computational tools for manipulating DNA microarray data and to establishing data exchange between GEO and ArrayExpress. Processing and analysing large amounts of data is possible only if the data is systematically organised. The descriptions of the biological samples attached to the gene expression map were systematised by replacing the original sample descriptions with a few highly informative keywords. These keywords were further arranged hierarchically. This hierarchy was then used for automatic grouping of samples in data visualisation and analysis. It is known that the observed differences in the gene expression of a biological sample are larger when measurements are performed in two different laboratories than when the measurement is repeated in the same laboratory. Because the data used to create the comprehensive gene expression map came from many laboratories, it was important to ensure that this so-called laboratory effect would not bias the analysis results. For this reason, all data used to create the map was carefully checked for quality and comparability. The original impetus for establishing a comprehensive map of human gene expression came from the analysis of two malignant skin cancer samples. The aim of the skin cancer study was to identify genes whose activity would be linked to the malignant cell type. The search for these genes highlighted the limitations of using small numbers of cells and tissues and the need for a more comprehensive study of gene expression.

    Identification of Cancer Related Genes Using a Comprehensive Map of Human Gene Expression

    Rapid accumulation and availability of gene expression datasets in public repositories have enabled large-scale meta-analyses of combined data. The richness of cross-experiment data has provided new biological insights, including identification of new cancer genes. In this study, we compiled a human gene expression dataset from ∼40,000 publicly available Affymetrix HG-U133Plus2 arrays. After strict quality control and data normalisation, the data was quantified in an expression matrix of ∼20,000 genes and ∼28,000 samples. To enable different ways of sample grouping, existing annotations were subjected to systematic ontology-assisted categorisation and manual curation. Groups like normal tissues, neoplasmic tissues, cell lines, homoeotic cells and incompletely differentiated cells were created. Unsupervised analysis of the data confirmed a global structure of expression consistent with earlier analysis, but with more details revealed due to increased resolution. A suitable mixed-effects linear model was used to further investigate gene expression in solid tissue tumours, and to compare these with the respective healthy solid tissues. The analysis identified 1,285 genes with systematic expression change in cancer. The list is significantly enriched with known cancer genes from large, public, peer-reviewed databases, whereas the remaining ones are proposed as new cancer gene candidates. The compiled dataset is publicly available in the ArrayExpress Archive. It contains the most diverse collection of biological samples, making it the largest systematically annotated gene expression dataset of its kind in the public domain.

    Importing ArrayExpress datasets into R/Bioconductor

    Summary: ArrayExpress is one of the largest public repositories of microarray datasets. R/Bioconductor provides a comprehensive suite of microarray analysis and integrative bioinformatics software. However, easy ways of importing datasets from ArrayExpress into R/Bioconductor have been lacking. Here, we present such a tool that is suitable for both interactive and automated use.

    Latent regulatory potential of human-specific repetitive elements

    At least half of the human genome is derived from repetitive elements, which are often lineage specific and silenced by a variety of genetic and epigenetic mechanisms. Using a transchromosomic mouse strain that transmits an almost complete single copy of human chromosome 21 via the female germline, we show that a heterologous regulatory environment can transcriptionally activate transposon-derived human regulatory regions. In the mouse nucleus, hundreds of locations on human chromosome 21 newly associate with activating histone modifications in both somatic and germline tissues, and influence the gene expression of nearby transcripts. These regions are enriched with primate and human lineage-specific transposable elements, and their activation corresponds to changes in DNA methylation at CpG dinucleotides. This study reveals the latent regulatory potential of the repetitive human genome and illustrates the species specificity of mechanisms that control it.

    Aberrant methylation of tRNAs links cellular stress to neuro-developmental disorders.

    Mutations in the cytosine-5 RNA methyltransferase NSun2 cause microcephaly and other neurological abnormalities in mice and humans. How post-transcriptional methylation contributes to the human disease is currently unknown. By comparing gene expression data with global cytosine-5 RNA methylomes in patient fibroblasts and NSun2-deficient mice, we find that loss of cytosine-5 RNA methylation increases the angiogenin-mediated endonucleolytic cleavage of transfer RNAs (tRNA), leading to an accumulation of 5' tRNA-derived small RNA fragments. Accumulation of 5' tRNA fragments in the absence of NSun2 reduces protein translation rates and activates stress pathways, leading to reduced cell size and increased apoptosis of cortical, hippocampal and striatal neurons. Mechanistically, we demonstrate that angiogenin binds with higher affinity to tRNAs lacking site-specific NSun2-mediated methylation and that the presence of 5' tRNA fragments is sufficient and required to trigger cellular stress responses. Furthermore, the enhanced sensitivity of NSun2-deficient brains to oxidative stress can be rescued through inhibition of angiogenin during embryogenesis. In conclusion, failure in NSun2-mediated tRNA methylation contributes to human diseases via stress-induced RNA cleavage.

    Enhancer evolution across 20 mammalian species.

    The mammalian radiation has corresponded with rapid changes in noncoding regions of the genome, but we lack a comprehensive understanding of regulatory evolution in mammals. Here, we track the evolution of promoters and enhancers active in liver across 20 mammalian species from six diverse orders by profiling genomic enrichment of H3K27 acetylation and H3K4 trimethylation. We report that rapid evolution of enhancers is a universal feature of mammalian genomes. Most of the recently evolved enhancers arise from ancestral DNA exaptation, rather than lineage-specific expansions of repeat elements. In contrast, almost all liver promoters are partially or fully conserved across these species. Our data further reveal that recently evolved enhancers can be associated with genes under positive selection, demonstrating the power of this approach for annotating regulatory adaptations in genomic sequences. These results provide important insight into the functional genetics underpinning mammalian regulatory evolution.
    We thank Stephen Watt, Frances Connor, the CRUK-CI Genomics and Bioinformatics cores, Biological Resources Unit (Matthew Clayton), Margaret Brown (West Yorkshire bat hospital), Julie E. Horvath (North Carolina Central University), and Chris Dillingham (University of Cardiff) for technical assistance; Matthieu Muffato for assistance with whole-genome alignments; Claudia Kutter, Gordon Brown, Christine Feig, and Christina Ernst for useful comments and discussions, and the EBI systems team for management of computational resources. This research was supported by Cancer Research UK (D.V., D.T.O.), the European Molecular Biology Laboratory (C.B., P.F.), the Wellcome Trust (WT095908) (P.F.) and (WT098051) (P.F., D.T.O.), the European Research Council, EMBO Young Investigator Programme (D.T.O.), the National Science Foundation (0744979) (T.J.P.), NIH (P40 OD010965, R01 OD010980, R37 MH060233) (A.J.J.) and MRC (U117588498) (J.M.A.T.). Cetacean samples were collected by the UK Cetacean Strandings Investigation Programme, funded by Defra and the Governments of Scotland and Wales.
    This is the final version. It originally appeared at http://www.sciencedirect.com/science/article/pii/S0092867415000070

    Pervasive lesion segregation shapes cancer genome evolution

    Cancers arise through the acquisition of oncogenic mutations and grow through clonal expansion. Here we reveal that most mutagenic DNA lesions are not resolved as mutations within a single cell-cycle. Instead, DNA lesions segregate unrepaired into daughter cells for multiple cell generations, resulting in the chromosome-scale phasing of subsequent mutations. We characterise this process in mutagen-induced mouse liver tumours and show that DNA replication across persisting lesions can produce multiple alternative alleles in successive cell divisions, thereby generating both multi-allelic and combinatorial genetic diversity. The phasing of lesions enables the accurate measurement of strand biased repair processes, quantification of oncogenic selection, and fine mapping of sister chromatid exchange events. Finally, we demonstrate that lesion segregation is a unifying property of exogenous mutagens, including UV light and chemotherapy agents in human cells and tumours, which has profound implications for the evolution and adaptation of cancer genomes.
    This work was supported by: Cancer Research UK (20412, 22398), the European Research Council (615584, 682398), the Wellcome Trust (WT108749/Z/15/Z, WT106563/Z/14/A, WT202878/B/16/Z), the European Molecular Biology Laboratory, the MRC Human Genetics Unit core funding programme grants (MC_UU_00007/11, MC_UU_00007/16), and the ERDF/Spanish Ministry of Science, Innovation and Universities-Spanish State Research Agency/DamReMap Project (RTI2018-094095-B-I00).

    Assessing Affymetrix GeneChip microarray quality

    Background: Microarray technology has become a widely used tool in the biological sciences. Over the past decade, the number of users has grown exponentially, and with the number of applications and secondary data analyses rapidly increasing, we expect this rate to continue. Various initiatives such as the External RNA Control Consortium (ERCC) and the MicroArray Quality Control (MAQC) project have explored ways to provide standards for the technology. For microarrays to become generally accepted as a reliable technology, statistical methods for assessing quality will be an indispensable component; however, there remains a lack of consensus in both defining and measuring microarray quality.
    Results: We begin by providing a precise definition of microarray quality and reviewing existing Affymetrix GeneChip quality metrics in light of this definition. We show that the best-performing metrics require multiple arrays to be assessed simultaneously. While such multi-array quality metrics are adequate for bench science, as microarrays begin to be used in clinical settings, single-array quality metrics will be indispensable. To this end, we define a single-array version of one of the best multi-array quality metrics and show that this metric performs as well as the best multi-array metrics. We then use this new quality metric to assess the quality of microarray data available via the Gene Expression Omnibus (GEO) using more than 22,000 Affymetrix HGU133a and HGU133plus2 arrays from 809 studies.
    Conclusions: We find that approximately 10 percent of these publicly available arrays are of poor quality. Moreover, the quality of microarray measurements varies greatly from hybridization to hybridization, study to study, and lab to lab, with some experiments producing unusable data. Many of the concepts described here are applicable to other high-throughput technologies.