
    METHODS FOR HIGH-THROUGHPUT COMPARATIVE GENOMICS AND DISTRIBUTED SEQUENCE ANALYSIS

    High-throughput sequencing has accelerated applications of genomics throughout the world. The increased production and decentralization of sequencing has also created bottlenecks in computational analysis. In this dissertation, I provide novel computational methods to improve analysis throughput in three areas: whole genome multiple alignment, pan-genome annotation, and bioinformatics workflows. To aid in the study of populations, tools are needed that can quickly compare multiple genome sequences, millions of nucleotides in length. I present a new multiple alignment tool for whole genomes, named Mugsy, that implements a novel method for identifying syntenic regions. Mugsy is computationally efficient, does not require a reference genome, and is robust in identifying a rich complement of genetic variation, including duplications, rearrangements, and large-scale gain and loss of sequence in mixtures of draft and completed genome data. Mugsy was evaluated on the alignment of several dozen bacterial chromosomes on a single computer and was the fastest program evaluated for the alignment of assembled human chromosome sequences from four individuals. A distributed version of the algorithm is also described and provides increased processing throughput using multiple CPUs. Numerous individual genomes are sequenced to study diversity and evolution and to classify pan-genomes. Pan-genome annotations contain inconsistencies and errors that hinder comparative analysis, even within a single species. I introduce a new tool, Mugsy-Annotator, that identifies orthologs and anomalous gene structure across a pan-genome using whole genome multiple alignments. Identified anomalies include inconsistently located translation initiation sites and genes disrupted by draft genome sequencing or pseudogenes. An evaluation of pan-genomes indicates that such anomalies are common and that the alternative annotations suggested by the tool can improve annotation consistency and quality. Finally, I describe the Cloud Virtual Resource (CloVR), a desktop application for automated sequence analysis that improves the usability and accessibility of bioinformatics software and cloud computing resources. CloVR is installed on a personal computer as a virtual machine and requires minimal installation, addressing challenges in deploying bioinformatics workflows. CloVR also seamlessly accesses remote cloud computing resources for improved processing throughput. In a case study, I demonstrate the portability and scalability of CloVR and evaluate the costs and resources for microbial sequence analysis.
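    As a rough illustration of the kind of anomaly Mugsy-Annotator reports, the Python sketch below (a toy example, not the tool itself; the genome names and alignment coordinates are hypothetical) flags ortholog groups whose annotated translation initiation sites map to different columns of a shared whole-genome multiple alignment.

        # Toy sketch: flag ortholog groups whose annotated start codons do not
        # fall on the same column of a whole-genome multiple alignment.
        def inconsistent_start_sites(ortholog_groups):
            """ortholog_groups maps a group id to a list of (genome, start_column)
            pairs, where start_column is the alignment column of the start codon."""
            flagged = {}
            for group, members in ortholog_groups.items():
                columns = {col for _, col in members}
                if len(columns) > 1:  # members disagree on where translation starts
                    flagged[group] = sorted(columns)
            return flagged

        example = {
            "groupA": [("genome1", 1200), ("genome2", 1200), ("genome3", 1200)],
            "groupB": [("genome1", 5310), ("genome2", 5310), ("genome3", 5431)],
        }
        print(inconsistent_start_sites(example))  # {'groupB': [5310, 5431]}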

    Bidirectional best hit r-window gene clusters

    Background: Conserved gene clusters are groups of genes that are located close to one another in the genomes of several species. They tend to code for proteins that have a functional interaction. The identification of conserved gene clusters is an important step towards understanding genome evolution and predicting gene function.
    Results: In this paper, we propose a novel pairwise gene cluster model that combines the notion of bidirectional best hits with the r-window model introduced in 2003 by Durand and Sankoff. The bidirectional best hit (BBH) constraint removes the need to specify the minimum number of shared genes in the r-window model and improves the relevance of the results. We design a subquadratic time algorithm to compute the set of BBH r-window gene clusters efficiently.
    Conclusion: We apply our cluster model to the comparative analysis of E. coli K-12 and B. subtilis and perform an extensive comparison between our new model and the gene teams model developed by Bergeron et al. Compared to the gene teams model, our new cluster model has a slightly lower recall but a higher precision at all levels of recall when the results are ranked using statistical tests. An analysis of the most significant BBH r-window gene clusters shows that they correspond to known operons.
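    As a minimal, naive illustration of the model (not the subquadratic algorithm described in the paper; the gene names and similarity scores are invented), the Python sketch below computes bidirectional best hits from pairwise scores and then reports BBH pairs that lie within r consecutive gene positions of each other in both genomes.

        # Naive quadratic illustration of BBH r-window co-occurrence.
        def bidirectional_best_hits(scores):
            """scores maps (gene_in_G, gene_in_H) to a similarity score; returns
            the set of pairs that are each other's best hit."""
            best_g, best_h = {}, {}
            for (g, h), s in scores.items():
                if g not in best_g or s > scores[(g, best_g[g])]:
                    best_g[g] = h
                if h not in best_h or s > scores[(best_h[h], h)]:
                    best_h[h] = g
            return {(g, h) for g, h in best_g.items() if best_h.get(h) == g}

        def bbh_pairs_in_common_r_window(bbh, order_g, order_h, r):
            """Report pairs of BBH gene pairs that lie within r consecutive
            positions of each other in both genome orders."""
            pos_g = {g: i for i, g in enumerate(order_g)}
            pos_h = {h: i for i, h in enumerate(order_h)}
            pairs = sorted(bbh)
            close = []
            for i in range(len(pairs)):
                for j in range(i + 1, len(pairs)):
                    (g1, h1), (g2, h2) = pairs[i], pairs[j]
                    if abs(pos_g[g1] - pos_g[g2]) < r and abs(pos_h[h1] - pos_h[h2]) < r:
                        close.append((pairs[i], pairs[j]))
            return close

        scores = {("gA1", "hB1"): 90, ("gA1", "hB2"): 40, ("gA2", "hB2"): 85,
                  ("gA3", "hB3"): 70, ("gA3", "hB1"): 20}
        bbh = bidirectional_best_hits(scores)
        print(bbh_pairs_in_common_r_window(bbh, ["gA1", "gA2", "gA3"],
                                           ["hB1", "hB2", "hB3"], r=3))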

    Laplacian Mixture Modeling for Network Analysis and Unsupervised Learning on Graphs

    Laplacian mixture models identify overlapping regions of influence in unlabeled graph and network data in a scalable and computationally efficient way, yielding useful low-dimensional representations. By combining Laplacian eigenspace and finite mixture modeling methods, they provide probabilistic or fuzzy dimensionality reductions or domain decompositions for a variety of input data types, including mixture distributions, feature vectors, and graphs or networks. Provable optimal recovery using the algorithm is analytically shown for a nontrivial class of cluster graphs. Heuristic approximations for scalable high-performance implementations are described and empirically tested. Connections to PageRank and community detection in network analysis demonstrate the wide applicability of this approach. The origins of fuzzy spectral methods, beginning with generalized heat or diffusion equations in physics, are reviewed and summarized. Comparisons to other dimensionality reduction and clustering methods for challenging unsupervised machine learning problems are also discussed.
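    A generic sketch of the underlying idea, embedding graph nodes with low eigenvectors of the normalized Laplacian and then fitting a finite (Gaussian) mixture to obtain soft memberships, is shown below. It uses standard NumPy/SciPy/scikit-learn calls and is an illustrative assumption, not the paper's exact algorithm.

        # Sketch: Laplacian eigenspace embedding followed by a Gaussian mixture,
        # giving each node a soft (fuzzy) cluster membership.
        import numpy as np
        from scipy.linalg import eigh
        from sklearn.mixture import GaussianMixture

        def laplacian_mixture_memberships(adjacency, n_components, n_eigvecs=None):
            A = np.asarray(adjacency, dtype=float)
            d = A.sum(axis=1)
            with np.errstate(divide="ignore"):
                d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(d), 0.0)
            # Symmetric normalized Laplacian L = I - D^{-1/2} A D^{-1/2}.
            L = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
            k = n_eigvecs or n_components
            _, vecs = eigh(L, subset_by_index=[0, k - 1])  # smallest eigenvalues
            gmm = GaussianMixture(n_components=n_components, random_state=0).fit(vecs)
            return gmm.predict_proba(vecs)  # rows sum to 1: fuzzy memberships

        # Two triangles joined by a single edge; memberships should split them.
        A = np.array([[0, 1, 1, 0, 0, 0],
                      [1, 0, 1, 0, 0, 0],
                      [1, 1, 0, 1, 0, 0],
                      [0, 0, 1, 0, 1, 1],
                      [0, 0, 0, 1, 0, 1],
                      [0, 0, 0, 1, 1, 0]])
        print(np.round(laplacian_mixture_memberships(A, 2), 2))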

    MMseqs: ultra fast and sensitive clustering and search of large protein sequence databases


    Algorithmic methods for large-scale genomic and metagenomic data analysis

    DNA sequencing technologies have advanced into the realm of big data due to frequent and rapid developments in biomedicine. This has created a pressing need for efficient and highly scalable algorithms. This dissertation focuses on work in read-to-reference alignment, resequencing studies, and metagenomics that was designed with these principles in mind.
    First, consider the compute-intensive task of read-to-reference alignment, where the difficulty of aligning reads to a genome is directly related to the genome's complexity. We investigated three different formulations of sequence complexity as viable tools for measuring genome complexity, examined how they relate to short-read alignment, and found that repeat-based measures of complexity were best suited for this task. In particular, the fraction of distinct substrings of lengths close to the read length was found to correlate very highly with alignment accuracy in terms of precision and recall. This demonstrated how to build models that predict the accuracy of short-read aligners with predictably low error. As a result, practitioners can select the most accurate aligners for an unknown genome by comparing how different models predict alignment accuracy based on the genome's complexity. Furthermore, accurate prediction of recall rates may help practitioners reduce expenses by using just enough reads to reach sufficient sequencing coverage.
    Next, consider the comprehensive task of resequencing studies for analyzing genetic variants of the human population. Using optimal alignments, we revealed that current variant profiles contain thousands of insertion/deletion (INDEL) variants that were constructed in a biased manner. The bias is caused by the existence of many theoretically optimal alignments between the reference genome and reads containing alternative alleles at those INDEL locations. We examined several popular aligners and showed that they can be divided into groups whose alignments yield INDELs that either strongly agree or strongly disagree with reported INDELs. This finding suggests that the agreement or disagreement between an aligner's called INDELs and the reported INDELs is merely a result of the arbitrary selection of an optimal alignment. Also of note is LongAGE, a memory-efficient version of Alignment with Gap Excision (AGE) for defining genomic variant breakpoints, which enables the precise alignment of longer reads or contigs that potentially contain SVs/CNVs, at the cost of additional running time compared to AGE.
    Finally, consider several resource-intensive tasks in metagenomics. We introduce a new algorithmic method for detecting unknown bacteria, those whose genomes have not been sequenced, in microbial communities. Using the 16S ribosomal RNA (16S rRNA) gene instead of whole-genome information is not only computationally efficient but also economical; we provide an analysis demonstrating that the 16S rRNA gene retains sufficient information to detect unknown bacteria in the context of oral microbial communities. Furthermore, the hypothesis that the classification or identification of microbes in metagenomic samples is better done with long reads than with short reads is examined by investigating the performance of popular metagenomic classifiers on short reads and on longer reads assembled from those short reads. Higher overall species-level classification performance was achieved simply by assembling the short reads.
    These topics, namely read-to-reference alignment, resequencing studies, and metagenomics, are the key focal points in the pages to come; my dissertation delves deeper into each as I describe the contributions my work has made to the field.
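    A minimal sketch of the repeat-based complexity measure described above, the fraction of distinct substrings of a given length among all substrings of that length, appears below; the example sequences are made up, and this is not the dissertation's implementation.

        # Fraction of distinct length-l substrings: values near 1 indicate a
        # low-repeat sequence that is easier to align short reads to.
        def distinct_substring_fraction(genome, l):
            n = len(genome) - l + 1  # number of length-l substrings
            if n <= 0:
                return 0.0
            distinct = {genome[i:i + l] for i in range(n)}
            return len(distinct) / n

        print(distinct_substring_fraction("ACGTACGTACGTTTTT", 4))  # repetitive, well below 1
        print(distinct_substring_fraction("ACGTAGCTTGACCAGT", 4))  # less repetitive, equals 1.0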

    SUFFIX TREE, MINWISE HASHING AND STREAMING ALGORITHMS FOR BIG DATA ANALYSIS IN BIOINFORMATICS

    In this dissertation, we worked on several algorithmic problems in bioinformatics using mainly three approaches: (a) a streaming model, (b) suffix-tree based indexing, and (c) minwise hashing (minhash) and locality-sensitive hashing (LSH). Streaming models are useful for large data problems where a good approximation needs to be achieved with limited space usage. We developed an approximation algorithm (Kmer-Estimate) using the streaming approach to obtain a better estimation of the frequency of k-mer counts. A k-mer, a subsequence of length k, plays an important role in many bioinformatics analyses such as genome distance estimation. We also developed new methods that use the suffix tree, a trie-like data structure, in alignment-free, non-pairwise algorithms for the conserved non-coding sequence (CNS) identification problem. We provide two different algorithms: STAG-CNS to identify exact-matched CNSs and DiCE to identify CNSs with mismatches. Using our algorithms, CNSs among various grass species were identified. A different approach was employed for the identification of longer CNSs (≥100 bp, mostly found in animals). In our new method (MinCNE), the minhash approach was used to estimate the Jaccard similarity. Using LSH as well, k-mers extracted from genomic sequences were clustered and CNSs were identified. Another new algorithm (MinIsoClust), which also uses minhash and LSH techniques, was developed for an isoform clustering problem. Isoforms are generated from the same gene through alternative splicing. As the isoform sequences share some exons but in different combinations, regular sequence clustering methods do not work well. Our algorithm generates clusters for isoform sequences based on their shared minhash signatures. Finally, we discuss de novo transcriptome assembly algorithms and how to improve assembly accuracy using ensemble approaches. First, we performed a comprehensive performance analysis of different transcriptome assemblers using simulated benchmark datasets. Then, we developed a new ensemble approach (Minsemble) for the de novo transcriptome assembly problem that integrates minhash-based isoform clustering to identify potentially correct transcripts from various de novo transcriptome assemblers. Minsemble identified more correctly assembled transcripts as well as genes compared to other de novo and ensemble methods. Adviser: Jitender S. Deogu
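    As a minimal sketch of the minhash idea referenced above (the hash scheme and example sequences are illustrative, not the dissertation's implementation), the code below estimates the Jaccard similarity of two sequences' k-mer sets from short minhash signatures instead of comparing the full sets.

        # Estimate Jaccard similarity of two k-mer sets from minhash signatures.
        import hashlib

        def kmers(seq, k):
            return {seq[i:i + k] for i in range(len(seq) - k + 1)}

        def minhash_signature(kmer_set, num_hashes=64):
            sig = []
            for i in range(num_hashes):
                salt = str(i).encode()  # a salt per simulated hash function
                sig.append(min(
                    int.from_bytes(hashlib.blake2b(salt + km.encode(), digest_size=8).digest(), "big")
                    for km in kmer_set))
            return sig

        def estimated_jaccard(sig_a, sig_b):
            # Fraction of hash functions on which the two sets share the same minimum.
            return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

        a = kmers("ACGTACGTTAGCCGATAGGCTTACG", 5)
        b = kmers("ACGTACGTTAGCCGATAGGCATACG", 5)
        print(round(len(a & b) / len(a | b), 2),  # exact Jaccard for comparison
              round(estimated_jaccard(minhash_signature(a), minhash_signature(b)), 2))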