
    Exact parallel alignment of megabase genomic sequences with tunable work distribution

    Sequence alignment is a basic operation in bioinformatics that is performed thousands of times on a daily basis. The exact methods for pairwise alignment have quadratic time complexity, so heuristic methods such as BLAST are widely used. To obtain exact results faster, parallel strategies have been proposed, but most of them fail to align huge biological sequences because both the quadratic time and the memory space must be addressed. In this paper, we evaluate the performance of Z-align, a parallel exact strategy that runs in a user-restricted memory space, and we propose and evaluate a tunable work distribution mechanism. The results obtained on two clusters show that two sequences of 24 MBP (mega base pairs) and 23 MBP, respectively, were successfully aligned with Z-align, and a speedup of 34.35 was achieved on 64 processors when aligning two 3 MBP sequences. The evaluation of our work distribution mechanism shows that execution times can be noticeably reduced when appropriate parameters are chosen. Finally, a comparison of Z-align with BLAST makes clear that, in many cases, Z-align produces alignments with higher scores.
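    The abstract does not give Z-align's recurrence or parameters, but the general idea behind memory-restricted exact alignment can be sketched with a score-only dynamic-programming pass that keeps just two rows instead of the full quadratic matrix. The scoring values below (match/mismatch/gap) are placeholder assumptions, not Z-align's published settings.

```python
def global_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Needleman-Wunsch score in O(len(a)*len(b)) time but only O(len(b)) space.

    Keeping two DP rows instead of the full matrix is the usual trick behind
    memory-restricted exact aligners; the scoring values are placeholders.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[-1]


print(global_score("ACGTACGT", "ACGTTCGT"))  # 6 with these placeholder scores
```

    Recovering the alignment itself, rather than just the score, in linear space takes a divide-and-conquer pass over the same recurrence; distributing the rows or anti-diagonals across processors is what a parallel strategy with tunable work distribution would then control.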

    ALFALFA: fast and accurate mapping of long next generation sequencing reads


    DeepGene: gene finding based on upstream sequence data

    Genome annotation is the process of identifying functional elements along a genome. By correctly locating and interpreting the information stored within a sequence, knowledge about structural features and functional roles can be revealed. With the number of sequences doubling approximately every 18 months, there is a pressing need for automatic annotation of genomes. Many annotation software tools are available today; however, they produce far from perfect results. Here a new project, DeepGene, is presented. Using data from the RefSeq prokaryotic database, we have started an effort to improve the prokaryotic genome annotation process. This thesis presents the initial work of that effort, focusing on distinguishing between coding and non-coding sequences using upstream sequence data from open reading frames. Using the 15 prokaryotic genomes available in the RefSeq database, upstream data were retrieved and processed into two datasets, on which several popular classification models were then trained. The performance of the models was compared with a standard annotation tool to establish a general baseline. The models created from the two datasets show similar metrics: the K-mer data yield a mean precision of 0.22 and a mean recall of 0.74, and the sequential data a mean precision of 0.30 and a mean recall of 0.77. Both datasets performed worse than the standard annotation software, which achieved a mean recall of 0.83 and a mean precision of 0.82. As far as upstream sequences are concerned, the models extracted all the information available from both datasets. The initial results provided limited information on classification and motif presence, indicating that other attributes of the genome should be examined to improve on the annotation problem. A natural next step is to expand the models into a pipeline so that the complex false-negative classifications can be explained.
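    The thesis does not specify DeepGene's exact feature encoding or model, but the K-mer dataset described above can be approximated by counting k-mers in each upstream window and feeding the counts to a standard classifier. The window sequences, the choice of k, and the scikit-learn model below are illustrative assumptions, not the thesis's actual pipeline.

```python
from collections import Counter
from itertools import product

from sklearn.linear_model import LogisticRegression  # assumed model choice


def kmer_counts(seq: str, k: int = 3) -> list[int]:
    """Count all 4**k DNA k-mers in one upstream window (toy encoding)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts.get(m, 0) for m in kmers]


# Toy upstream windows labelled coding (1) / non-coding (0); real training data
# would come from the RefSeq upstream regions described in the abstract.
windows = ["ATGAAAGGAGGTAACC", "TTTTTTTTTTTTTTTT", "AGGAGGTATGCATGCA", "ACACACACACACACAC"]
labels = [1, 0, 1, 0]

X = [kmer_counts(w) for w in windows]
model = LogisticRegression(max_iter=1000).fit(X, labels)
print(model.predict([kmer_counts("AGGAGGTTTATGACGT")]))  # predicted label for an unseen window
```

    Precision and recall of such a classifier would then be measured against the annotated coding regions, exactly the comparison against the baseline annotation tool reported above.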

    Introducing deep learning-based methods into the variant calling analysis pipeline

    Biological interpretation of genetic variation enhances our understanding of normal and pathological phenotypes and may lead to the development of new therapeutics. However, it depends heavily on genomic data analysis, which can be inaccurate due to various sequencing errors and the inconsistencies they cause. Modern analysis pipelines already use heuristic and statistical techniques, but the rate of falsely identified mutations remains high and varies with the particular sequencing technology, settings, and variant type. Recently, several tools based on deep neural networks have been published; the neural networks are expected to find motifs in the data that were not previously seen. The performance of these novel tools is assessed in terms of precision and recall, as well as computational efficiency. Following established best practices in both variant detection and benchmarking, the discussed tools demonstrate accuracy and computational efficiency that warrant further discussion.
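    As a minimal sketch of the precision/recall comparison the abstract refers to, the function below scores a variant call set against a truth set, assuming variants are keyed by chromosome, position, and alleles; this glosses over the representation normalization that a real benchmarking comparator performs.

```python
def precision_recall(called: set, truth: set) -> tuple[float, float]:
    """Precision and recall of a variant call set against a truth set.

    Variants are represented as (chrom, pos, ref, alt) tuples; real benchmarks
    also normalize variant representation, which is skipped here.
    """
    tp = len(called & truth)          # true positives: calls present in the truth set
    fp = len(called - truth)          # false positives: calls absent from the truth set
    fn = len(truth - called)          # false negatives: truth variants that were missed
    precision = tp / (tp + fp) if called else 0.0
    recall = tp / (tp + fn) if truth else 0.0
    return precision, recall


truth = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"), ("chr2", 40, "G", "A")}
called = {("chr1", 100, "A", "G"), ("chr2", 40, "G", "A"), ("chr2", 99, "T", "C")}
print(precision_recall(called, truth))  # (0.666..., 0.666...)
```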

    Capturing the ‘ome’: the expanding molecular toolbox for RNA and DNA library construction

    All sequencing experiments and most functional genomics screens rely on the generation of libraries to comprehensively capture pools of targeted sequences. In the past decade especially, driven by progress in the field of massively parallel sequencing, numerous studies have assessed the impact of particular manipulations on library complexity and quality, and characterized the activities and specificities of several key enzymes used in library construction. Fortunately, careful protocol design and reagent choice can substantially mitigate many of the biases introduced during library construction and enable reliable representation of sequences in libraries. This review aims to guide the reader through the vast literature on the subject to promote informed library generation, independent of the application.

    Bacterial genes and genome dynamics in the environment

    Thesis (Ph.D.), Massachusetts Institute of Technology, Dept. of Biological Engineering, 2013. Cataloged from the PDF version of the thesis. Includes bibliographical references (p. 143-158).
    One of the most marvelous features of microbial life is its ability to thrive in such diverse and dynamic environments. My scientific interest lies in the variety of modes by which microbial life accomplishes this feat. In the first half of this thesis I present tools to leverage high-throughput sequencing for the study of environmental genomes. In the second half, I describe modes of environmental adaptation by bacteria via gene content or gene expression evolution. Associating genes' usage and evolution with adaptation in various environments is a cornerstone of microbiology. New technologies and approaches have revolutionized this pursuit, and I begin by describing the computational challenges I resolved in order to bring these technologies to bear on microbial genomics. In Chapter 1, I describe SHE-RA, an algorithm that increases the usable read length of ultra-high-throughput sequencing technologies, thus extending their range of applications to include environmental sequencing. In Chapter 2, I design a new hybrid assembly approach for short reads and assemble 82 Vibrio genomes. Using the ecologically defined groups of this bacterial family, I investigate the genomic and metabolic correlates of habitat and differentiation, and evaluate a neutral model of gene content. In Chapter 3, I report the extent to which orthologous genes in bacteria exhibit the same transcriptional response to the same change in environment, and describe the features and functions of bacterial transcriptional networks that are conserved. I conclude this thesis with a summary of my tools and results, their use in other studies, and their relevance to future work. In particular, I discuss the future experiments and analytical strategies that I am eager to see applied to compelling open questions in microbial ecology and evolution.
    By Sonia C. Timberlake, Ph.D.
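    The abstract does not describe SHE-RA's internals, but assuming the read-length increase comes from merging overlapping read pairs into longer fragments, the general technique looks roughly like the sketch below. The fixed minimum overlap, the simple identity threshold, and the absence of quality-score weighting are illustrative simplifications, not SHE-RA's published method.

```python
def reverse_complement(seq: str) -> str:
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def merge_pair(r1: str, r2: str, min_overlap: int = 10, min_identity: float = 0.9):
    """Merge a read pair into one longer fragment if their ends overlap.

    r2 is reverse-complemented (typical paired-end orientation) and the longest
    overlap meeting the identity threshold wins. Base qualities, which a real
    merger would use to resolve mismatches, are ignored here.
    """
    r2rc = reverse_complement(r2)
    for olen in range(min(len(r1), len(r2rc)), min_overlap - 1, -1):
        a, b = r1[-olen:], r2rc[:olen]
        matches = sum(x == y for x, y in zip(a, b))
        if matches / olen >= min_identity:
            return r1 + r2rc[olen:]
    return None  # no acceptable overlap found


print(merge_pair("ACGTACGTACGTAAACCC",
                 reverse_complement("ACGTAAACCCGGGTTTAA")))  # ACGTACGTACGTAAACCCGGGTTTAA
```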

    A Cas9 TRIP through chromatin: CRISPR-Cas9 editing and DNA repair in the context of chromatin


    Novel bioinformatics tools for epitope-based peptide vaccine design

    BACKGROUND: T-cells are essential in the mediation of immune responses, helping clear bacteria, viruses and cancerous cells. T-cells recognise anomalies in the cellular proteome associated with infection and neoplasms through the T-cell receptor (TCR). The most common TCRs in humans, αβ TCRs, engage processed peptide epitopes presented on the major histocompatibility complex (pMHC). The TCR-pMHC interaction is critical to vaccination. In this thesis I discuss three pieces of software, and the outcomes derived from them, that contribute to epitope-based vaccine design. RESULTS: Three pieces of software were developed to help scientists study and understand T-cell responses. The first, STACEI, allows users to interrogate TCR-pMHC crystal structures; the time-consuming, error-prone analysis that previously had to be run manually is replaced by a single, flexible package. The second development is the introduction of general-purpose computing on the GPU (GP-GPU) to aid the prediction of T-cell epitopes by scanning protein datasets using data derived from combinatorial peptide libraries (CPLs). Finally, I introduce RECIPIENT, a reverse vaccinology (RV) tool that combines pangenomic and population genetics methods to predict good vaccine targets across multiple pathogen samples. CONCLUSION: Across this thesis, I introduce three different methods that aid the study of T-cells and that will hopefully improve future vaccine design. These methods range across data types and methodologies, focusing on mechanistic understanding of the TCR-pMHC binding event, the application of GP-GPU to CPLs, and the use of microbial genomics to aid the study and understanding of antigen-specific T-cell responses. These three methods have significant potential for further integration, especially the structural methods.
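    The thesis does not publish the GPU scanner's internals here, but the CPL-based scanning idea (scoring every peptide-length window of a protein with position-specific amino-acid weights derived from combinatorial peptide library screens) can be sketched serially as below. The random weight matrix, the 9-mer length, and the protein sequence are illustrative assumptions standing in for real CPL data.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PEPTIDE_LEN = 9  # assumed epitope length, typical for MHC class I

# Illustrative position-specific weights standing in for CPL-derived data.
random.seed(0)
cpl_weights = [{aa: random.random() for aa in AMINO_ACIDS} for _ in range(PEPTIDE_LEN)]


def score_peptide(pep: str) -> float:
    """Sum the position-specific CPL weights across the peptide."""
    return sum(cpl_weights[i][aa] for i, aa in enumerate(pep))


def scan_protein(protein: str, top_n: int = 3):
    """Score every length-9 window and return the best-scoring candidates.

    Window scoring is embarrassingly parallel, which is what makes it a good
    fit for a GP-GPU implementation; here it is a plain serial loop.
    """
    windows = [protein[i:i + PEPTIDE_LEN] for i in range(len(protein) - PEPTIDE_LEN + 1)]
    return sorted(windows, key=score_peptide, reverse=True)[:top_n]


print(scan_protein("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQ"))
```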