74 research outputs found

    PhyloCSF: a comparative genomics method to distinguish protein-coding and non-coding regions

    As high-throughput transcriptome sequencing provides evidence for novel transcripts in many species, there is a renewed need for accurate methods to classify small genomic regions as protein-coding or non-coding. We present PhyloCSF, a novel comparative genomics method that analyzes a multi-species nucleotide sequence alignment to determine whether it is likely to represent a conserved protein-coding region, based on a formal statistical comparison of phylogenetic codon models. We show that PhyloCSF's classification performance in 12-species _Drosophila_ genome alignments exceeds all other methods we compared in a previous study, and we provide a software implementation for use by the community. We anticipate that this method will be widely applicable as the transcriptomes of many additional species, tissues, and subcellular compartments are sequenced, particularly in the context of ENCODE and modENCODE

    A general framework for genome interpretation using evolutionary signatures

    Includes bibliographical references (p. 55-57). Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2008. In the post-genomic era, characterized by the availability of genome sequence data for many species, one of the biggest challenges is to identify the functional elements in our genome: the small subsequences containing units of biological function. Work has been done to computationally identify specific functional elements such as protein-coding genes [11], RNA genes [17], microRNA genes [16], regulatory motifs and individual binding sites for transcription factors and microRNAs [10]. This work has benefited from the use of evolutionary signatures obtained by observing the genomic changes across the sequence data of related species. We propose in this work a general framework to perform functional element identification using evolutionary signatures. We first design several metrics of evolutionary signatures that are meant to capture the different patterns of evolution expected from elements with different biological functions, as well as novel patterns capturing diverse properties of evolutionary change. We then compute these metrics for each of the elements in the human genome that are conserved across mammals and other vertebrate species in order to identify classes of functional elements. Based on these metrics, we first perform classification of specific known types of functional elements, such as protein-coding sequences, RNA-coding sequences and CpG-rich promoters. With success in this step, we go one step further and establish an unsupervised clustering framework for conserved elements based on these metrics. With this approach, we obtain clusters of known and unknown classes of functional elements. We find that some of these clusters correspond to known functional elements, while others are depleted for known functions yet show strong evidence of transcription and epigenetic modifications, suggesting that these may correspond to novel classes of functional clusters. This illustrates the power of this method to identify elements of known classes of functionality and to discover elements of novel classes of functionality. by Guilherme Issao Camarinha Fujiwara. M.Eng

    Extensive divergence of transcription factor binding in Drosophila embryos with highly conserved gene expression

    Comment: 7 figures, 20 supplementary figures, 6 supplementary tables. Paris M, Kaplan T, Li XY, Villalta JE, Lott SE, et al. (2013) Extensive Divergence of Transcription Factor Binding in Drosophila Embryos with Highly Conserved Gene Expression. PLoS Genet 9(9): e1003748. doi:10.1371/journal.pgen.100374

    A rebuttal to the comments on the genome order index and the Z-curve

    Background: Elhaik, Graur and Josic recently commented on the genome order index (S) and the Z-curve (Elhaik et al. Biol Direct 2010, 5: 10). S is a quantity defined as $S = a^2 + c^2 + g^2 + t^2$, where a, c, g and t denote the corresponding base frequencies. The Z-curve is a three-dimensional curve that represents a DNA sequence in such a way that each can be uniquely reconstructed given the other. Elhaik et al. made 4 major claims. 1) In the previous mapping system with the regular tetrahedron, calculation of the radius of the inscribed sphere is "a mathematical error". 2) S follows an exponential distribution and is narrowly distributed with a range of (0.25 - 0.33). 3) Based on Chargaff's second parity rule (PR2), "S is equivalent to H [Shannon entropy]" and they are derivable from each other. 4) The Z-curve "suffers from over dimensionality", because based on the analysis of 235 bacterial genomes, the x and y components contributed less than 1% of the variance and therefore "would be of little use".
    Results: 1) Elhaik et al. mistakenly neglected the parameter $4/\sqrt{3}$ when calculating the radius of the inscribed sphere. 2) The exponential distribution of S is a restatement of our previous conclusion, and the range of (0.25 - 0.33) only paraphrases the previously suggested S range (0.25 - 1/3). 3) Elhaik et al. incorrectly disregard deviations from PR2 by treating the deviations as 0 altogether, reduce S and H, both having 4 variables, a, c, g and t, into functions of one single variable, a only, and apply this treatment to all DNA sequences as the basis of their "demonstration", which is therefore invalid. 4) Elhaik et al. confuse numerical smallness with biological insignificance, and disregard the distributions of purine/pyrimidine and amino/keto bases (x and y components), the variations of which, although they can be smaller than that of GC content, contain rich information that is important and useful, such as in locating replication origins of bacterial and archaeal genomes, and in studies of gene recognition in various species.
    Conclusion: Elhaik et al. confuse S (a single number) with the Z-curve (a series of 3D coordinates), which are distinct. To use S as a case study of the Z-curve, by itself, is invalid. S and H are neither equivalent nor derivable from each other. The criticisms of Elhaik, Graur and Josic are wrong.
    Reviewers: This article was reviewed by Erik van Nimwegen.
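The quantities discussed in this abstract can be illustrated with a minimal sketch. The function names below are hypothetical. S follows the formula given in the abstract, and H is the usual Shannon entropy in bits. The abstract names only the x (purine/pyrimidine) and y (amino/keto) components of the Z-curve; the z component (weak/strong hydrogen bonding, i.e. A/T versus G/C) is taken from the standard Z-curve definition and is an assumption here. Skipping bases other than A, C, G and T is also a simplification of this sketch.

```python
import math
from collections import Counter


def base_frequencies(seq):
    """Return the relative frequencies of A, C, G and T in seq."""
    counts = Counter(b for b in seq.upper() if b in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A/C/G/T bases")
    return {b: counts[b] / total for b in "ACGT"}


def genome_order_index(seq):
    """S = a^2 + c^2 + g^2 + t^2, using the base frequencies of seq."""
    return sum(f * f for f in base_frequencies(seq).values())


def shannon_entropy(seq):
    """H = -sum(f * log2(f)) over bases with nonzero frequency, in bits."""
    freqs = base_frequencies(seq).values()
    return -sum(f * math.log2(f) for f in freqs if f > 0)


def z_curve(seq):
    """Return the cumulative Z-curve points (x_n, y_n, z_n) of seq.

    x: purines (A, G) minus pyrimidines (C, T)
    y: amino (A, C) minus keto (G, T)
    z: weak (A, T) minus strong (G, C) hydrogen bonding (assumed convention)
    Bases other than A/C/G/T are skipped.
    """
    step = {
        "A": (1, 1, 1),
        "C": (-1, 1, -1),
        "G": (1, -1, -1),
        "T": (-1, -1, 1),
    }
    x = y = z = 0
    points = []
    for base in seq.upper():
        if base not in step:
            continue
        dx, dy, dz = step[base]
        x, y, z = x + dx, y + dy, z + dz
        points.append((x, y, z))
    return points


if __name__ == "__main__":
    for s in ("ACGT", "AAAA", "ACGTTTGACA"):
        print(s, genome_order_index(s), shannon_entropy(s), z_curve(s)[-1])
```

S and H are each a single number per sequence, whereas the Z-curve is a list of one 3D point per base, which is the distinction the conclusion of the abstract draws.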

    Evidence of abundant stop codon readthrough in Drosophila and other Metazoa

    While translational stop codon readthrough is often used by viral genomes, it has been observed for only a handful of eukaryotic genes. We previously used comparative genomics evidence to recognize protein-coding regions in 12 species of Drosophila and showed that for 149 genes, the open reading frame following the stop codon has a protein-coding conservation signature, hinting that stop codon readthrough might be common in Drosophila. We return to this observation armed with deep RNA sequence data from the modENCODE project, an improved higher-resolution comparative genomics metric for detecting protein-coding regions, comparative sequence information from additional species, and directed experimental evidence. We report an expanded set of 283 readthrough candidates, including 16 double-readthrough candidates; these were manually curated to rule out alternatives such as A-to-I editing, alternative splicing, dicistronic translation, and selenocysteine incorporation. We report experimental evidence of translation using GFP tagging and mass spectrometry for several readthrough regions. We find that the set of readthrough candidates differs from other genes in length, composition, conservation, stop codon context, and in some cases, conserved stem–loops, providing clues about readthrough regulation and potential mechanisms. Lastly, we expand our studies beyond Drosophila and find evidence of abundant readthrough in several other insect species and one crustacean, and several readthrough candidates in nematode and human, suggesting that functionally important translational stop codon readthrough is significantly more prevalent in Metazoa than previously recognized. National Institutes of Health (U.S.) (U54 HG00455-01); National Science Foundation (U.S.) (CAREER 0644282); Alfred P. Sloan Foundatio

    Extensive and coordinated transcription of noncoding RNAs within cell-cycle promoters

    Transcription of long noncoding RNAs (lncRNAs) within gene regulatory elements can modulate gene activity in response to external stimuli, but the scope and functions of such activity are not known. Here we use an ultrahigh-density array that tiles the promoters of 56 cell-cycle genes to interrogate 108 samples representing diverse perturbations. We identify 216 transcribed regions that encode putative lncRNAs, many of which have RT-PCR–validated periodic expression during the cell cycle, show altered expression in human cancers and are regulated in expression by specific oncogenic stimuli, stem cell differentiation or DNA damage. DNA damage induces five lncRNAs from the CDKN1A promoter, and one such lncRNA, named PANDA, is induced in a p53-dependent manner. PANDA interacts with the transcription factor NF-YA to limit expression of pro-apoptotic genes; PANDA depletion markedly sensitized human fibroblasts to apoptosis by doxorubicin. These findings suggest potentially widespread roles for promoter lncRNAs in cell-growth control. National Institutes of Health (U.S.); National Institute of Arthritis and Musculoskeletal and Skin Diseases (U.S.) (NIAMS) (K08-AR054615); National Cancer Institute (U.S.) (NIH/NCI) (R01-CA118750); National Cancer Institute (U.S.) (NIH/NCI) (R01-CA130795); Juvenile Diabetes Research Foundation International; American Cancer Society; Howard Hughes Medical Institute (Early career scientist); Stanford University (Graduate Fellowship); National Science Foundation (U.S.) (Graduate Research Fellowship); United States. Dept. of Defense (National Defense Science and Engineering Graduate Fellowship

    Inferring Gene Regulatory Networks from Time Series Microarray Data

    Innovations and improvements in high-throughput genomic technologies, such as DNA microarrays, make it possible for biologists to measure dependencies and regulatory relationships among genes simultaneously on a genome-wide scale. An important objective of functional genomics is to understand the mechanisms that control the expression of these genes and to encode that knowledge in a gene regulatory network (GRN). To achieve this, computational and statistical algorithms are needed. Inference of GRNs is a very challenging task for computational biologists because the degrees of freedom of the parameters are redundant. Various computational approaches have been proposed for modeling gene regulatory networks, such as Boolean networks, differential equations and Bayesian networks. No single "gold standard" method gives the best performance on every data set. The research goal is to improve inference accuracy and reduce computational complexity. One of the problems in reconstructing a GRN is how to deal with high-dimensional, short time-course gene expression data. In this work, some existing inference algorithms are compared; their limitation is that they suffer from either low inference accuracy or high computational complexity. To overcome these difficulties, a new approach based on a state space model and the Expectation-Maximization (EM) algorithm is proposed to model the dynamic system of gene regulation and infer gene regulatory networks. In our model, the GRN is represented by a state space model that incorporates noise and can capture a wider range of biological aspects, such as hidden or missing variables. An EM algorithm is used to estimate the parameters based on the given state space functions; the gene interaction matrix is derived by decomposing the observation matrix using singular value decomposition and is then used to infer the GRN. The new model is validated using synthetic data sets before it is applied to real biological data sets. The results reveal that the developed model can infer gene regulatory networks from large-scale gene expression data and significantly reduce computational time complexity without losing much inference accuracy compared to dynamic Bayesian network
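As a rough illustration of the kind of model this abstract describes, the sketch below simulates a generic linear state space model with Gaussian noise: hidden states evolve as x[t+1] = A x[t] + w[t] and are observed as y[t] = C x[t] + v[t]. The matrix names A and C, the noise parameters, and the function names are all assumptions for illustration; the abstract does not give the model equations. The sketch covers only the generative model, not the EM parameter estimation or the singular value decomposition step.

```python
import random


def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def simulate_state_space(A, C, x0, steps, state_noise=0.1, obs_noise=0.1, seed=0):
    """Simulate x[t+1] = A x[t] + w[t], y[t+1] = C x[t+1] + v[t+1].

    A: state transition matrix (hidden regulators; hypothetical name)
    C: observation matrix mapping hidden states to measured expression
    state_noise, obs_noise: standard deviations of the Gaussian noise terms
    Returns (states, observations), each a list with `steps` entries.
    """
    rng = random.Random(seed)
    x = list(x0)
    states, observations = [], []
    for _ in range(steps):
        # Advance the hidden state and add process noise.
        x = [xi + rng.gauss(0.0, state_noise) for xi in mat_vec(A, x)]
        # Observe the new state through C and add measurement noise.
        y = [yi + rng.gauss(0.0, obs_noise) for yi in mat_vec(C, x)]
        states.append(x)
        observations.append(y)
    return states, observations


if __name__ == "__main__":
    # Two hidden states observed through three measured genes.
    A = [[0.9, 0.1], [-0.2, 0.8]]
    C = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
    states, observations = simulate_state_space(A, C, [1.0, 0.0], steps=5)
    for t, y in enumerate(observations, start=1):
        print(t, [round(v, 3) for v in y])
```

Synthetic series generated this way are the sort of data on which an inference method can be checked before it is applied to real expression measurements, since the true A and C are known.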