
    No wisdom in the crowd: genome annotation at the time of big data - current status and future prospects

    Science and engineering rely on the accumulation and dissemination of knowledge to make discoveries and create new designs. Discovery-driven genome research rests on knowledge passed on via gene annotations. In response to the deluge of sequencing big data, standard annotation practice employs automated procedures that rely on majority rules. We argue this hinders progress through the generation and propagation of errors, leading investigators into blind alleys. More subtly, this inductive process discourages the discovery of novelty, which remains essential in biological research and reflects the nature of biology itself. Annotation systems, rather than being repositories of facts, should be tools that support multiple modes of inference. By combining deduction, induction and abduction, investigators can generate hypotheses when accurate knowledge is extracted from model databases. A key stance is to depart from ‘the sequence tells the structure tells the function’ fallacy, placing function first. We illustrate our approach with examples of critical or unexpected pathways, using MicroScope to demonstrate how tools can be implemented following the principles we advocate. We end with a challenge to the reader

    METHODS FOR HIGH-THROUGHPUT COMPARATIVE GENOMICS AND DISTRIBUTED SEQUENCE ANALYSIS

    High-throughput sequencing has accelerated applications of genomics throughout the world. The increased production and decentralization of sequencing has also created bottlenecks in computational analysis. In this dissertation, I provide novel computational methods to improve analysis throughput in three areas: whole genome multiple alignment, pan-genome annotation, and bioinformatics workflows. To aid in the study of populations, tools are needed that can quickly compare multiple genome sequences, each millions of nucleotides in length. I present a new multiple alignment tool for whole genomes, named Mugsy, that implements a novel method for identifying syntenic regions. Mugsy is computationally efficient, does not require a reference genome, and is robust in identifying a rich complement of genetic variation including duplications, rearrangements, and large-scale gain and loss of sequence in mixtures of draft and completed genome data. Mugsy is evaluated on the alignment of several dozen bacterial chromosomes on a single computer and was the fastest program evaluated for the alignment of assembled human chromosome sequences from four individuals. A distributed version of the algorithm is also described and provides increased processing throughput using multiple CPUs. Numerous individual genomes are sequenced to study diversity and evolution and to classify pan-genomes. Pan-genome annotations contain inconsistencies and errors that hinder comparative analysis, even within a single species. I introduce a new tool, Mugsy-Annotator, that identifies orthologs and anomalous gene structure across a pan-genome using whole genome multiple alignments. Identified anomalies include inconsistently located translation initiation sites and genes disrupted by draft genome sequencing or pseudogenization. An evaluation of pan-genomes indicates that such anomalies are common and that alternative annotations suggested by the tool can improve annotation consistency and quality.
Finally, I describe the Cloud Virtual Resource, CloVR, a desktop application for automated sequence analysis that improves usability and accessibility of bioinformatics software and cloud computing resources. CloVR is installed on a personal computer as a virtual machine and requires minimal installation, addressing challenges in deploying bioinformatics workflows. CloVR also seamlessly accesses remote cloud computing resources for improved processing throughput. In a case study, I demonstrate the portability and scalability of CloVR and evaluate the costs and resources for microbial sequence analysis

    Genome-wide Determination Of Splicing Efficiency And Dynamics From RNA-Seq Data

    Eukaryotic genes are mostly composed of a series of exons intercalated by sequences with no coding potential called introns. These sequences are generally removed from primary transcripts to form mature RNA molecules in a post-transcriptional process called splicing. Efficient splicing of primary transcripts is an essential step in gene expression, and its misregulation is related to numerous human diseases. Thus, to better understand the dynamics of this process and the perturbations that might be caused by aberrant transcript processing, it is important to quantify splicing efficiency. In this thesis, I introduce SPLICE-q, a fast and user-friendly Python tool for genome-wide SPLICing Efficiency quantification. It supports studies focusing on the implications of splicing efficiency in transcript processing dynamics. SPLICE-q uses aligned reads from RNA-Seq to quantify splicing efficiency for each intron individually and allows the user to select different levels of restrictiveness concerning the introns’ overlap with other genomic elements, such as exons from other genes. I demonstrate SPLICE-q’s application using three use cases including two different species and methodologies. These analyses illustrate that SPLICE-q can detect a progressive increase of splicing efficiency throughout a time course of nascent RNA-Seq and that it might be useful for understanding cancer progression beyond mere gene expression levels. Furthermore, I provide an in-depth study of time course nascent BrU-Seq data to address questions concerning differences in the speed of splicing and the underlying biological features that might be associated with it. SPLICE-q and its documentation are publicly available at: https://github.com/vrmelo/SPLICE-q
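
The per-intron quantification described in this abstract can be illustrated with a minimal sketch. It assumes that splicing efficiency is taken as the fraction of split (spliced) reads among all reads covering an intron's two splice junctions. The names `JunctionCounts` and `splicing_efficiency`, and this exact ratio, are hypothetical choices made for illustration; they are not SPLICE-q's actual interface or necessarily its definition.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JunctionCounts:
    """Read counts at the two splice junctions of one intron (illustrative)."""
    split_5p: int    # reads spliced across the 5' junction
    unsplit_5p: int  # reads running unspliced through the 5' junction
    split_3p: int    # reads spliced across the 3' junction
    unsplit_3p: int  # reads running unspliced through the 3' junction


def splicing_efficiency(counts: JunctionCounts) -> Optional[float]:
    """Return the fraction of split reads among all junction reads.

    Returns None when no read covers either junction, because the
    ratio is undefined in that case.
    """
    split = counts.split_5p + counts.split_3p
    total = split + counts.unsplit_5p + counts.unsplit_3p
    if total == 0:
        return None
    return split / total


if __name__ == "__main__":
    # Hypothetical counts: 60 split reads out of 80 junction reads in total.
    example = JunctionCounts(split_5p=30, unsplit_5p=10, split_3p=30, unsplit_3p=10)
    print(splicing_efficiency(example))
```

Under this assumed definition, a value near 1 would indicate an intron that is almost completely spliced, and a value near 0 one that is mostly retained.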

    A new reference genome assembly for the microcrustacean Daphnia pulex

    Comparing genomes of closely related genotypes from populations with distinct demographic histories can help reveal the impact of effective population size on genome evolution. For this purpose, we present a high-quality genome assembly of Daphnia pulex (PA42), and compare this with the first sequenced genome of this species (TCO), which was derived from an isolate from a population with >90% reduction in nucleotide diversity. PA42 has numerous similarities to TCO at the gene level, with an average amino acid sequence identity of 98.8% and >60% of orthologous proteins identical. Nonetheless, there is a highly elevated number of genes in the TCO genome annotation, with ~7000 excess genes appearing to be false positives. This view is supported by the high GC content, lack of introns, and short length of these suspicious gene annotations. Consistent with the view that reduced effective population size can facilitate the accumulation of slightly deleterious genomic features, we observe more proliferation of transposable elements (TEs) and a higher frequency of gained introns in the TCO genome.

    Quantitative modeling and statistical analysis of protein-DNA binding sites


    A catalog of stability-associated sequence elements in 3' UTRs of yeast mRNAs

    BACKGROUND: In recent years, intensive computational efforts have been directed towards the discovery of promoter motifs that correlate with mRNA expression profiles. Nevertheless, it is still not always possible to predict steady-state mRNA expression levels based on promoter signals alone, suggesting that other factors may be involved. Other genic regions, in particular 3' UTRs, which are known to exert regulatory effects especially through controlling RNA stability and localization, were less comprehensively investigated, and deciphering regulatory motifs within them is thus crucial. RESULTS: By analyzing 3' UTR sequences and mRNA decay profiles of Saccharomyces cerevisiae genes, we derived a catalog of 53 sequence motifs that may be implicated in stabilization or destabilization of mRNAs. Some of the motifs correspond to known RNA-binding protein sites, and one of them may act in destabilization of ribosome biogenesis genes during stress response. In addition, we present for the first time a catalog of 23 motifs associated with subcellular localization. A significant proportion of the 3' UTR motifs is highly conserved in orthologous yeast genes, and some of the motifs are strikingly similar to recently published mammalian 3' UTR motifs. We classified all genes into those regulated only at transcription initiation level, only at degradation level, and those regulated by a combination of both. Interestingly, different biological functionalities and expression patterns correspond to such classification. CONCLUSION: The present motif catalogs are a first step towards the understanding of the regulation of mRNA degradation and subcellular localization, two important processes which - together with transcription regulation - determine the cell transcriptome

    Computational analysis of expressed sequence tags for understanding gene regulation.

    High-throughput sequencing has provided a myriad of genetic data for thousands of organisms. Computational analysis of one data type, expressed sequence tags (ESTs), yields insight into gene expression, alternative splicing, tissue specificity, gene functionality, and the detection and differentiation of pseudogenes. Two computational methods have been developed to analyze alternative splicing events and to detect and characterize pseudogenes using ESTs. A case study of rat phosphodiesterase 4 (PDE4) genes yielded more than twenty-five previously unreported isoforms. These were experimentally verified through wet lab collaboration and found to be tissue specific. In addition, thirteen cytochrome-like gene and pseudogene sequences from the human genome were analyzed for pseudogene properties. Of the thirteen sequences, one was identified as the actual cytochrome gene, two were found to be non-cytochrome-related sequences, and eight were determined to be pseudogenes. The remaining two sequences were identified as duplicates. As a precursor to applying the two new methods, the efficiency of three BLAST implementations (NCBI BLAST, WU BLAST and mpiBLAST) was examined for comparing large numbers of short sequences (ESTs) to fewer large sequences (genomic regions). In general, WU BLAST was found to be the most efficient sequence comparison tool. These approaches illustrate the power of ESTs in understanding gene expression. Efficient computational analysis of ESTs (such as the two tools described) will be vital to understanding the complexity of gene expression as more high-throughput EST data is made available via advances in molecular sequencing technologies, such as the current next-generation approaches.