1,470 research outputs found

    A User's Guide to the Encyclopedia of DNA Elements (ENCODE)

    Get PDF
    The mission of the Encyclopedia of DNA Elements (ENCODE) Project is to enable the scientific and medical communities to interpret the human genome sequence and apply it to understand human biology and improve health. The ENCODE Consortium is integrating multiple technologies and approaches in a collective effort to discover and define the functional elements encoded in the human genome, including genes, transcripts, and transcriptional regulatory regions, together with their attendant chromatin states and DNA methylation patterns. In the process, standards to ensure high-quality data have been implemented, and novel algorithms have been developed to facilitate analysis. Data and derived results are made available through a freely accessible database. Here we provide an overview of the project and the resources it is generating and illustrate the application of ENCODE data to interpret the human genome

    Novel Bayes Factors That Capture Expert Uncertainty in Prior Density Specification in Genetic Association Studies.

    Get PDF
    Bayes factors (BFs) are becoming increasingly important tools in genetic association studies, partly because they provide a natural framework for including prior information. The Wakefield BF (WBF) approximation is easy to calculate and assumes a normal prior on the log odds ratio (logOR) with a mean of zero. However, the prior variance (W) must be specified. Because of the potentially high sensitivity of the WBF to the choice of W, we propose several new BF approximations with logOR ∼N(0,W), but allow W to take a probability distribution rather than a fixed value. We provide several prior distributions for W which lead to BFs that can be calculated easily in freely available software packages. These priors allow a wide range of densities for W and provide considerable flexibility. We examine some properties of the priors and BFs and show how to determine the most appropriate prior based on elicited quantiles of the prior odds ratio (OR). We show by simulation that our novel BFs have superior true-positive rates at low false-positive rates compared to those from both P-value and WBF analyses across a range of sample sizes and ORs. We give an example of utilizing our BFs to fine-map the CASP8 region using genotype data on approximately 46,000 breast cancer case and 43,000 healthy control samples from the Collaborative Oncological Gene-environment Study (COGS) Consortium, and compare the single-nucleotide polymorphism ranks to those obtained using WBFs and P-values from univariate logistic regression

    MSR1 repeats modulate gene expression and affect risk of breast and prostate cancer

    Get PDF
    [Background] MSR1 repeats are a 36–38 bp minisatellite element that have recently been implicated in the regulation of gene expression, through copy number variation (CNV).[Patients and methods] Bioinformatic and experimental methods were used to assess the distribution of MSR1 across the genome, evaluate the regulatory potential of such elements and explore the role of MSR1 elements in cancer, particularly non-familial breast cancer and prostate cancer.[Results] MSR1s are predominately located at chromosome 19 and are functionally enriched in regulatory regions of the genome, particularly regions implicated in short-range regulatory activities (H3K27ac, H3K4me1 and H3K4me3). MSR1-regulated genes were found to have specific molecular roles, such as serine-protease activity (P = 4.80 × 10−7) and ion channel activity (P = 2.7 × 10−4). The kallikrein locus was found to contain a large number of MSR1 clusters, and at least six of these showed CNV. An MSR1 cluster was identified within KLK14, with 9 and 11 copies being normal variants. A significant association with the 9-copy allele and non-familial breast cancer was found in two independent populations (P = 0.004; P = 0.03). In the white British population, the minor allele conferred an increased risk of 1.21–3.51 times for all non-familial disease, or 1.7–5.3 times in early-onset disease. The 9-copy allele was also found to be associated with increased risk of prostate cancer in an independent population (odds ratio = 1.27–1.56; P =0.009).[Conclusions] MSR1 repeats act as molecular switches that modulate gene expression. It is likely that CNV of MSR1 will affect risk of development of various forms of cancer, including that of breast and prostate. The MSR1 cluster at KLK14 represents the strongest risk factor identified to date in non-familial breast cancer and a significant risk factor for prostate cancer. Analysis of MSR1 genotype will allow development of precise stratification of disease risk and provide a novel target for therapeutic agents.Prostate cancer study is supported by an National Health and Medical Research Council (NHMRC) grant and Career Development Fellowship APP1090505 to JB. The Australian Prostate Cancer BioResource is supported by the NHMRC Enabling Grant APP614296 and by a grant from the Prostate Cancer Foundation, Australia.Peer reviewe

    Global Discriminative Learning for Higher-Accuracy Computational Gene Prediction

    Get PDF
    Most ab initio gene predictors use a probabilistic sequence model, typically a hidden Markov model, to combine separately trained models of genomic signals and content. By combining separate models of relevant genomic features, such gene predictors can exploit small training sets and incomplete annotations, and can be trained fairly efficiently. However, that type of piecewise training does not optimize prediction accuracy and has difficulty in accounting for statistical dependencies among different parts of the gene model. With genomic information being created at an ever-increasing rate, it is worth investigating alternative approaches in which many different types of genomic evidence, with complex statistical dependencies, can be integrated by discriminative learning to maximize annotation accuracy. Among discriminative learning methods, large-margin classifiers have become prominent because of the success of support vector machines (SVM) in many classification tasks. We describe CRAIG, a new program for ab initio gene prediction based on a conditional random field model with semi-Markov structure that is trained with an online large-margin algorithm related to multiclass SVMs. Our experiments on benchmark vertebrate datasets and on regions from the ENCODE project show significant improvements in prediction accuracy over published gene predictors that use intrinsic features only, particularly at the gene level and on genes with long introns

    Genome-wide associations of gene expression variation in humans

    Get PDF
    The exploration of quantitative variation in human populations has become one of the major priorities for medical genetics. The successful identification of variants that contribute to complex traits is highly dependent on reliable assays and genetic maps. We have performed a genome-wide quantitative trait analysis of 630 genes in 60 unrelated Utah residents with ancestry from Northern and Western Europe using the publicly available phase I data of the International HapMap project. The genes are located in regions of the human genome with elevated functional annotation and disease interest including the ENCODE regions spanning 1% of the genome, Chromosome 21 and Chromosome 20q12-13.2. We apply three different methods of multiple test correction, including Bonferroni, false discovery rate, and permutations. For the 374 expressed genes, we find many regions with statistically significant association of single nucleotide polymorphisms (SNPs) with expression variation in lymphoblastoid cell lines after correcting for multiple tests. Based on our analyses, the signal proximal (cis-) to the genes of interest is more abundant and more stable than distal and trans across statistical methodologies. Our results suggest that regulatory polymorphism is widespread in the human genome and show that the 5-kb (phase I) HapMap has sufficient density to enable linkage disequilibrium mapping in humans. Such studies will significantly enhance our ability to annotate the non-coding part of the genome and interpret functional variation. In addition, we demonstrate that the HapMap cell lines themselves may serve as a useful resource for quantitative measurements at the cellular level

    Identification and Analysis of Genes and Pseudogenes within Duplicated Regions in the Human and Mouse Genomes

    Get PDF
    The identification and classification of genes and pseudogenes in duplicated regions still constitutes a challenge for standard automated genome annotation procedures. Using an integrated homology and orthology analysis independent of current gene annotation, we have identified 9,484 and 9,017 gene duplicates in human and mouse, respectively. On the basis of the integrity of their coding regions, we have classified them into functional and inactive duplicates, allowing us to define the first consistent and comprehensive collection of 1,811 human and 1,581 mouse unprocessed pseudogenes. Furthermore, of the total of 14,172 human and mouse duplicates predicted to be functional genes, as many as 420 are not included in current reference gene databases and therefore correspond to likely novel mammalian genes. Some of these correspond to partial duplicates with less than half of the length of the original source genes, yet they are conserved and syntenic among different mammalian lineages. The genes and unprocessed pseudogenes obtained here will enable further studies on the mechanisms involved in gene duplication as well as of the fate of duplicated genes

    A Geometric Framework for Evaluating Rare Variant Tests of Association

    Get PDF
    The wave of next‐generation sequencing data has arrived. However, many questions still remain about how to best analyze sequence data, particularly the contribution of rare genetic variants to human disease. Numerous statistical methods have been proposed to aggregate association signals across multiple rare variant sites in an effort to increase statistical power; however, the precise relation between the tests is often not well understood. We present a geometric representation for rare variant data in which rare allele counts in case and control samples are treated as vectors in Euclidean space. The geometric framework facilitates a rigorous classification of existing rare variant tests into two broad categories: tests for a difference in the lengths of the case and control vectors, and joint tests for a difference in either the lengths or angles of the two vectors. We demonstrate that genetic architecture of a trait, including the number and frequency of risk alleles, directly relates to the behavior of the length and joint tests. Hence, the geometric framework allows prediction of which tests will perform best under different disease models. Furthermore, the structure of the geometric framework immediately suggests additional classes and types of rare variant tests. We consider two general classes of tests which show robustness to noncausal and protective variants. The geometric framework introduces a novel and unique method to assess current rare variant methodology and provides guidelines for both applied and theoretical researchers.Peer Reviewedhttp://deepblue.lib.umich.edu/bitstream/2027.42/97460/1/gepi21722.pd

    Mapping the <i>Shh</i> long-range regulatory domain

    Get PDF
    Coordinated gene expression controlled by long-distance enhancers is orchestrated by DNA regulatory sequences involving transcription factors and layers of control mechanisms. The Shh gene and well-established regulators are an example of genomic composition in which enhancers reside in a large desert extending into neighbouring genes to control the spatiotemporal pattern of expression. Exploiting the local hopping activity of the Sleeping Beauty transposon, the lacZ reporter gene was dispersed throughout the Shh region to systematically map the genomic features responsible for expression activity. We found that enhancer activities are retained inside a genomic region that corresponds to the topological associated domain (TAD) defined by Hi-C. This domain of approximately 900 kb is in an open conformation over its length and is generally susceptible to all Shh enhancers. Similar to the distal enhancers, an enhancer residing within the Shh second intron activates the reporter gene located at distances of hundreds of kilobases away, suggesting that both proximal and distal enhancers have the capacity to survey the Shh topological domain to recognise potential promoters. The widely expressed Rnf32 gene lying within the Shh domain evades enhancer activities by a process that may be common among other housekeeping genes that reside in large regulatory domains. Finally, the boundaries of the Shh TAD do not represent the absolute expression limits of enhancer activity, as expression activity is lost stepwise at a number of genomic positions at the verges of these domains

    GIVE: portable genome browsers for personal websites.

    Get PDF
    Growing popularity and diversity of genomic data demand portable and versatile genome browsers. Here, we present an open source programming library called GIVE that facilitates the creation of personalized genome browsers without requiring a system administrator. By inserting HTML tags, one can add to a personal webpage interactive visualization of multiple types of genomics data, including genome annotation, "linear" quantitative data, and genome interaction data. GIVE includes a graphical interface called HUG (HTML Universal Generator) that automatically generates HTML code for displaying user chosen data, which can be copy-pasted into user's personal website or saved and shared with collaborators. GIVE is available at: https://www.givengine.org/
    corecore