3,167 research outputs found
No wisdom in the crowd: genome annotation at the time of big data - current status and future prospects
Science and engineering rely on the accumulation and dissemination of knowledge to make discoveries and create new designs. Discovery-driven genome research rests on knowledge passed on via gene annotations. In response to the deluge of sequencing big data, standard annotation practice employs automated procedures that rely on majority rules. We argue this hinders progress through the generation and propagation of errors, leading investigators into blind alleys. More subtly, this inductive process discourages the discovery of novelty, which remains essential in biological research and reflects the nature of biology itself. Annotation systems, rather than being repositories of facts, should be tools that support multiple modes of inference. By combining deduction, induction and abduction, investigators can generate hypotheses when accurate knowledge is extracted from model databases. A key stance is to depart from ‘the sequence tells the structure tells the function’ fallacy, placing function first. We illustrate our approach with examples of critical or unexpected pathways, using MicroScope to demonstrate how tools can be implemented following the principles we advocate. We end with a challenge to the reader
METHODS FOR HIGH-THROUGHPUT COMPARATIVE GENOMICS AND DISTRIBUTED SEQUENCE ANALYSIS
High-throughput sequencing has accelerated applications of genomics throughout the world. The increased production and decentralization of sequencing has also created bottlenecks in computational analysis. In this dissertation, I provide novel computational methods to improve analysis throughput in three areas: whole genome multiple alignment, pan-genome annotation, and bioinformatics workflows.
To aid in the study of populations, tools are needed that can quickly compare multiple genome sequences, millions of nucleotides in length. I present a new multiple alignment tool for whole genomes, named Mugsy, that implements a novel method for identifying syntenic regions. Mugsy is computationally efficient, does not require a reference genome, and is robust in identifying a rich complement of genetic variation including duplications, rearrangements, and large-scale gain and loss of sequence in mixtures of draft and completed genome data. Mugsy is evaluated on the alignment of several dozen bacterial chromosomes on a single computer and was the fastest program evaluated for the alignment of assembled human chromosome sequences from four individuals. A distributed version of the algorithm is also described and provides increased processing throughput using multiple CPUs.
Numerous individual genomes are sequenced to study diversity and evolution and to classify pan-genomes. Pan-genome annotations contain inconsistencies and errors that hinder comparative analysis, even within a single species. I introduce a new tool, Mugsy-Annotator, that identifies orthologs and anomalous gene structure across a pan-genome using whole genome multiple alignments. Identified anomalies include inconsistently located translation initiation sites and genes disrupted by draft genome sequencing or pseudogenization. An evaluation of pan-genomes indicates that such anomalies are common and that the alternative annotations suggested by the tool can improve annotation consistency and quality.
Finally, I describe the Cloud Virtual Resource, CloVR, a desktop application for automated sequence analysis that improves usability and accessibility of bioinformatics software and cloud computing resources. CloVR is installed on a personal computer as a virtual machine and requires minimal installation, addressing challenges in deploying bioinformatics workflows. CloVR also seamlessly accesses remote cloud computing resources for improved processing throughput. In a case study, I demonstrate the portability and scalability of CloVR and evaluate the costs and resources for microbial sequence analysis
Genome-wide Determination Of Splicing Efficiency And Dynamics From RNA-Seq Data
Eukaryotic genes are mostly composed of a series of exons intercalated by sequences with no coding potential called introns. These sequences are generally removed from primary transcripts to form mature RNA molecules in a post-transcriptional process called splicing. Efficient splicing of primary transcripts is an essential step in gene expression, and its misregulation is related to numerous human diseases. Thus, to better understand the dynamics of this process and the perturbations that might be caused by aberrant transcript processing, it is important to quantify splicing efficiency. In this thesis, I introduce SPLICE-q, a fast and user-friendly Python tool for genome-wide SPLICing Efficiency quantification. It supports studies focusing on the implications of splicing efficiency in transcript processing dynamics. SPLICE-q uses aligned reads from RNA-Seq to quantify splicing efficiency for each intron individually and allows the user to select different levels of restrictiveness concerning the introns’ overlap with other genomic elements, such as exons from other genes. I demonstrate SPLICE-q’s application using three use cases including two different species and methodologies. These analyses illustrate that SPLICE-q can detect a progressive increase of splicing efficiency throughout a time course of nascent RNA-Seq and that it may be useful for understanding cancer progression beyond mere gene expression levels. Furthermore, I provide an in-depth study of time course nascent BrU-Seq data to address questions concerning differences in the speed of splicing and the underlying biological features that might be associated with it. SPLICE-q and its documentation are publicly available at: https://github.com/vrmelo/SPLICE-q
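The per-intron quantification described in this abstract can be illustrated with a minimal sketch. The abstract does not state the formula, so the ratio below (spliced reads over spliced plus unspliced reads at the intron's junctions) is an assumption, and `IntronCounts` and `splicing_efficiency` are hypothetical names, not SPLICE-q's actual interface.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IntronCounts:
    """Read counts at one intron (hypothetical container, not SPLICE-q's data model)."""
    split_reads: int   # reads spanning the exon-exon junction, i.e. spliced
    unsplit_5p: int    # reads crossing the 5' splice junction without a split
    unsplit_3p: int    # reads crossing the 3' splice junction without a split


def splicing_efficiency(counts: IntronCounts) -> Optional[float]:
    """Assumed definition: spliced / (spliced + mean unspliced junction coverage).

    Returns None when no reads cover the junctions, so that uncovered
    introns are not reported as 0% spliced.
    """
    # Average the unspliced signal over the two junctions of the intron.
    unsplit = (counts.unsplit_5p + counts.unsplit_3p) / 2
    total = counts.split_reads + unsplit
    if total == 0:
        return None
    return counts.split_reads / total


if __name__ == "__main__":
    early = IntronCounts(split_reads=10, unsplit_5p=30, unsplit_3p=30)
    late = IntronCounts(split_reads=30, unsplit_5p=10, unsplit_3p=10)
    print(splicing_efficiency(early), splicing_efficiency(late))
```

Under this assumed definition, a time course of nascent RNA would show the value rising toward 1 as introns are progressively removed.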
Gene Regulatory Compatibility in Bacteria: Consequences for Synthetic Biology and Evolution
Mechanistic understanding of gene regulation is crucial for rational engineering of new genetic systems through synthetic biology. Genetic engineering efforts in new organisms are often hampered by a lack of knowledge about how regulatory components function in new host contexts. This dissertation focuses on efforts to overcome these challenges through the development of generalizable experimental methods for studying the behavior of DNA regulatory sequences in diverse species at large scale.
Chapter 2 describes experimental approaches for quantitatively assessing the functions of thousands of diverse natural regulatory sequences through a combination of metagenomic mining, high-throughput DNA synthesis and deep sequencing. By employing these methods in three distinct bacterial species, we revealed striking functional differences in gene regulatory capacity. We identified regulatory sequences with activity levels spanning several orders of magnitude, which will aid in efforts to engineer diverse bacterial species. We also demonstrate functional species-selective gene circuits with programmable host behaviors that may be useful for microbial community engineering. In Chapter 3 we provide evidence for the evolution of altered stringency in σ70-mediated transcriptional activation based on patterns of initiation and activity from promoters of diverse compositions. We show that the contrast in GC content between a regulatory element and the host genome dictates both the likelihood and the magnitude of expression. We also discuss the potential implications of this proposed mechanism for horizontal gene transfer.
The next two chapters focus on efforts aimed at extending the high-throughput methods described in earlier chapters to new organisms. Chapter 4 presents an in vitro approach for multiplexed gene expression profiling. Through the development and use of cell-free expression systems made from diverse bacteria, it was possible to rapidly acquire thousands of transcriptional measurements in small volume reactions, enabling functional comparisons of regulatory sequence function across multiple species. In Chapter 5 we characterize the restriction-modification system repertoires of several commensal bacterial species. We also describe ongoing efforts to develop methods for bypassing these systems in order to increase transformation efficiencies in species that are difficult or impossible to transform using current approaches
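The GC-content contrast mentioned for Chapter 3 can be made concrete with a short sketch. The abstract does not say how the contrast is computed, so the simple difference of GC fractions below is an assumption, and `gc_content` and `gc_contrast` are hypothetical helper names.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of a sequence."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    return (seq.count("G") + seq.count("C")) / acgt


def gc_contrast(element: str, host_genome: str) -> float:
    """Assumed contrast measure: element GC fraction minus host genome GC fraction.

    Negative values mean the regulatory element is more AT-rich than the host.
    """
    return gc_content(element) - gc_content(host_genome)


if __name__ == "__main__":
    promoter = "TTGACATATAATGC"        # toy sequence, for illustration only
    host = "GCGCGGCCATGCGCCGGC"        # toy stand-in for a GC-rich genome
    print(round(gc_contrast(promoter, host), 3))
```

Ambiguous bases such as N are ignored in the denominator, which keeps draft sequences from biasing the fraction downward.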
A new reference genome assembly for the microcrustacean Daphnia pulex
Comparing genomes of closely related genotypes from populations with distinct demographic histories can help reveal the impact of effective population size on genome evolution. For this purpose, we present a high quality genome assembly of Daphnia pulex (PA42), and compare this with the first sequenced genome of this species (TCO), which was derived from an isolate from a population with >90% reduction in nucleotide diversity. PA42 has numerous similarities to TCO at the gene level, with an average amino acid sequence identity of 98.8% and >60% of orthologous proteins identical. Nonetheless, there is a highly elevated number of genes in the TCO genome annotation, with ~7000 excess genes appearing to be false positives. This view is supported by the high GC content, lack of introns, and short length of these suspicious gene annotations. Consistent with the view that reduced effective population size can facilitate the accumulation of slightly deleterious genomic features, we observe more proliferation of transposable elements (TEs) and a higher frequency of gained introns in the TCO genome
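The three features this abstract associates with suspicious gene annotations (high GC content, no introns, short length) lend themselves to a simple screening sketch. The abstract reports no cutoffs, so the thresholds below are placeholder assumptions chosen only for illustration, and `GeneModel` and `is_suspicious` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class GeneModel:
    """Summary features of one annotated gene (hypothetical container)."""
    gene_id: str
    gc_fraction: float   # GC fraction of the coding sequence, between 0 and 1
    intron_count: int
    cds_length: int      # coding sequence length in nucleotides


def is_suspicious(gene: GeneModel, gc_min: float = 0.5, max_length: int = 300) -> bool:
    """Flag genes that are GC-rich, intronless and short.

    gc_min and max_length are placeholder thresholds, not values from the study.
    """
    return (
        gene.gc_fraction >= gc_min
        and gene.intron_count == 0
        and gene.cds_length <= max_length
    )


if __name__ == "__main__":
    genes = [
        GeneModel("g1", gc_fraction=0.62, intron_count=0, cds_length=180),
        GeneModel("g2", gc_fraction=0.41, intron_count=4, cds_length=1500),
    ]
    print([g.gene_id for g in genes if is_suspicious(g)])
```

A flag of this kind only marks candidates for review; it does not by itself establish that an annotation is a false positive.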
A catalog of stability-associated sequence elements in 3' UTRs of yeast mRNAs
BACKGROUND: In recent years, intensive computational efforts have been directed towards the discovery of promoter motifs that correlate with mRNA expression profiles. Nevertheless, it is still not always possible to predict steady-state mRNA expression levels based on promoter signals alone, suggesting that other factors may be involved. Other genic regions, in particular 3' UTRs, which are known to exert regulatory effects especially through controlling RNA stability and localization, were less comprehensively investigated, and deciphering regulatory motifs within them is thus crucial. RESULTS: By analyzing 3' UTR sequences and mRNA decay profiles of Saccharomyces cerevisiae genes, we derived a catalog of 53 sequence motifs that may be implicated in stabilization or destabilization of mRNAs. Some of the motifs correspond to known RNA-binding protein sites, and one of them may act in destabilization of ribosome biogenesis genes during stress response. In addition, we present for the first time a catalog of 23 motifs associated with subcellular localization. A significant proportion of the 3' UTR motifs is highly conserved in orthologous yeast genes, and some of the motifs are strikingly similar to recently published mammalian 3' UTR motifs. We classified all genes into those regulated only at transcription initiation level, only at degradation level, and those regulated by a combination of both. Interestingly, different biological functionalities and expression patterns correspond to such classification. CONCLUSION: The present motif catalogs are a first step towards the understanding of the regulation of mRNA degradation and subcellular localization, two important processes which - together with transcription regulation - determine the cell transcriptome
Computational analysis of expressed sequence tags for understanding gene regulation.
High-throughput sequencing has provided a myriad of genetic data for thousands of organisms. Computational analysis of one data type, expressed sequence tags (ESTs), yields insight into gene expression, alternative splicing, tissue specificity, gene functionality, and the detection and differentiation of pseudogenes. Two computational methods have been developed to analyze alternative splicing events and to detect and characterize pseudogenes using ESTs. A case study of rat phosphodiesterase 4 (PDE4) genes yielded more than twenty-five previously unreported isoforms. These were experimentally verified through wet lab collaboration and found to be tissue specific. In addition, thirteen cytochrome-like gene and pseudogene sequences from the human genome were analyzed for pseudogene properties. Of the thirteen sequences, one was identified as the actual cytochrome gene, two were found to be non-cytochrome-related sequences, and eight were determined to be pseudogenes. The remaining two sequences were identified as duplicates. As a precursor to applying the two new methods, the efficiency of three BLAST algorithms (NCBI BLAST, WU BLAST and mpiBLAST) was examined for comparing large numbers of short sequences (ESTs) to fewer large sequences (genomic regions). In general, WU BLAST was found to be the most efficient sequence comparison tool. These approaches illustrate the power of ESTs in understanding gene expression. Efficient computational analysis of ESTs (such as the two tools described) will be vital to understanding the complexity of gene expression as more high-throughput EST data are made available via advances in molecular sequencing technologies, such as the current next-generation approaches