
    Algorithmic methods for large-scale genomic and metagenomic data analysis

    DNA sequencing technologies have advanced into the realm of big data, driven by frequent and rapid developments in biomedicine. This has created a pressing need for efficient, highly scalable algorithms. This dissertation focuses on work in read-to-reference alignment, resequencing studies, and metagenomics designed with these principles in mind.

    First, consider the computationally intensive task of read-to-reference alignment, where the difficulty of aligning reads to a genome is directly related to the genome's complexity. We investigated three formulations of sequence complexity as tools for measuring genome complexity, examined how they relate to short-read alignment, and found that repeat-based measures of complexity were best suited for this task. In particular, the fraction of distinct substrings of lengths close to the read length was found to correlate strongly with alignment accuracy in terms of precision and recall. This allowed us to build models that predict the accuracy of short-read aligners with low error. As a result, practitioners can select the most accurate aligner for an unknown genome by comparing how different models predict alignment accuracy based on the genome's complexity. Furthermore, accurate prediction of recall rates may help practitioners reduce expenses by using just enough reads to obtain sufficient sequencing coverage.

    Next, consider the comprehensive task of resequencing studies for analyzing genetic variants in the human population. By using optimal alignments, we revealed that current variant profiles contain thousands of insertions/deletions (INDELs) that were called in a biased manner. The bias is caused by the existence of many theoretically optimal alignments between the reference genome and reads containing alternative alleles at those INDEL locations. We examined several popular aligners and showed that they could be divided into groups whose alignments yielded INDELs that either strongly agreed or strongly disagreed with the reported INDELs. This finding suggests that the agreement or disagreement between an aligner's called INDELs and the reported INDELs is merely the result of an arbitrary selection among equally optimal alignments. Also of note is LongAGE, a memory-efficient version of Alignment with Gap Excision (AGE) for defining genomic variant breakpoints, which enables the precise alignment of longer reads or contigs that potentially contain SVs/CNVs, at the cost of additional running time compared to AGE.

    Finally, consider several resource-intensive tasks in metagenomics. We introduce a new algorithmic method for detecting unknown bacteria, those whose genomes have not been sequenced, in microbial communities. Using the 16S ribosomal RNA (16S rRNA) gene instead of whole-genome information is not only computationally efficient but also economical; we provide an analysis demonstrating that the 16S rRNA gene retains sufficient information to detect unknown bacteria in the context of oral microbial communities. Furthermore, we revisit the hypothesis that the classification or identification of microbes in metagenomic samples is better done with long reads than with short reads by investigating the performance of popular metagenomic classifiers on short reads and on longer sequences assembled from those short reads. Higher overall species-classification performance was achieved simply by assembling the short reads.

    These topics, read-to-reference alignment, resequencing studies, and metagenomics, are the key focal points in the pages to come; my dissertation delves deeper into each as I cover the contributions my work has made to the field.
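    A minimal sketch of the repeat-based complexity measure described above, assuming it is computed as the number of distinct substrings (k-mers) of a fixed length k, with k close to the read length, divided by the number of k-mer positions; the function and example sequences are illustrative, not the dissertation's code:

        # Repeat-based complexity sketch: fraction of distinct k-mers for k near the read length.
        import random

        def distinct_kmer_fraction(genome: str, k: int) -> float:
            """Return |{distinct k-mers}| divided by the number of k-mer positions in `genome`."""
            n_positions = len(genome) - k + 1
            if n_positions <= 0:
                return 0.0
            kmers = {genome[i:i + k] for i in range(n_positions)}
            return len(kmers) / n_positions

        # A highly repetitive sequence scores low; a random-like sequence scores close to 1.
        print(distinct_kmer_fraction("ACGT" * 25, k=10))                     # low (few distinct 10-mers)
        random.seed(0)
        random_seq = "".join(random.choice("ACGT") for _ in range(100))
        print(distinct_kmer_fraction(random_seq, k=10))                      # near 1.0

    The INDEL bias can also be illustrated with a small sketch: inside a short tandem repeat, several deletion placements produce exactly the same alternative sequence, so each corresponds to an equally optimal alignment and the reported position is an arbitrary choice. The helper below is hypothetical and only enumerates the equivalent placements:

        # Illustrative only: enumerate deletion start positions that yield the same alternative sequence.
        def equivalent_deletion_positions(ref: str, start: int, length: int):
            """Return every start position where deleting `length` bases from `ref`
            gives the same sequence as deleting at `start`."""
            target = ref[:start] + ref[start + length:]
            return [i for i in range(len(ref) - length + 1)
                    if ref[:i] + ref[i + length:] == target]

        ref = "GGCACACATT"
        print(equivalent_deletion_positions(ref, start=2, length=2))   # [2, 3, 4, 5, 6]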

    Sistema de pesquisa automática de sequências de ADN aproximadas e não contíguas

    Master's in Electronics and Telecommunications Engineering. The ability to search for DNA sequences similar to others contained in a larger sequence, such as a chromosome, plays a very important role in the study of organisms and of possible connections between different species. Even though several techniques and algorithms created for sequence search already exist, the problem is still open to the development of new tools that improve on existing ones. This thesis proposes a solution for sequence search based on data compression, or, more specifically, on finite-context models, which yields a measure of similarity between a reference and a target. The method uses finite-context models to build a statistical model of the reference sequence and to obtain the estimated number of bits needed to encode the target sequence using the reference model. In this work we studied the method described above, first under controlled conditions and finally by conducting a study of DNA regions of the modern human genome that are not found in ancient DNA (or are found only with a high degree of dissimilarity).
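    A minimal sketch of the finite-context model idea described above, assuming an order-k Markov model with add-one smoothing: the model is trained on the reference, and the similarity measure is the estimated number of bits needed to encode the target under that model. Names and parameters are illustrative, not the thesis implementation:

        # Finite-context (order-k Markov) model sketch: bits to encode target given reference.
        import math
        from collections import defaultdict

        ALPHABET = "ACGT"

        def train_fcm(reference: str, k: int):
            """Count symbol occurrences after each length-k context in the reference."""
            counts = defaultdict(lambda: defaultdict(int))
            for i in range(len(reference) - k):
                counts[reference[i:i + k]][reference[i + k]] += 1
            return counts

        def estimated_bits(target: str, counts, k: int) -> float:
            """Estimate bits needed to encode the target using the reference model."""
            bits = 0.0
            for i in range(k, len(target)):
                ctx, sym = target[i - k:i], target[i]
                ctx_counts = counts.get(ctx, {})
                total = sum(ctx_counts.values())
                p = (ctx_counts.get(sym, 0) + 1) / (total + len(ALPHABET))   # add-one smoothing
                bits += -math.log2(p)
            return bits

        model = train_fcm("ACGTACGTACGTACGT", k=3)
        print(estimated_bits("ACGTACGT", model, k=3))   # low: target resembles the reference
        print(estimated_bits("TTTTTTTT", model, k=3))   # higher: target is dissimilar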

    Graph theory-based sequence descriptors as remote homology predictors

    Alignment-free (AF) methodologies have increased in popularity in the last decades as alternative tools to alignment-based (AB) algorithms for performing comparative sequence analyses. They have been especially useful for detecting remote homologs within the twilight zone of highly diverse gene/protein families and superfamilies. The most popular alignment-free methodologies, as well as their applications to classification problems, have been described in previous reviews. However, a new set of graph theory-derived sequence/structure descriptors that has been gaining relevance in the detection of remote homology has so far been omitted from discussions of AF predictors. Here, we first go over the most popular AF approaches used for detecting homology signals within the twilight zone and then review the state-of-the-art tools encoding graph theory-derived sequence/structure descriptors and their success in identifying remote homologs. We also highlight the tendency to integrate AF features/measures with AB ones, either within the same prediction model or by assembling the predictions of different algorithms using voting/weighting strategies, to improve the detection of remote signals. Lastly, we briefly discuss the efforts made to scale up AB and AF features/measures for the comparison of multiple genomes and proteomes. Alongside the experience achieved in remote homology detection by the most popular AF tools and by other, less well-known ones, we report our own using the graphical-numerical methodologies MARCH-INSIDE, TI2BioP, and ProtDCal. We also present a new Python-based tool (SeqDivA) with a friendly graphical user interface (GUI) for delimiting the twilight zone using several similarity criteria. https://www.mdpi.com/2218-273X/10/1/2
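    As a rough illustration of the voting/weighting strategies mentioned above, the sketch below combines homology scores from alignment-based and alignment-free predictors with fixed weights; the predictor names, scores, and weights are invented for the example and are not from the review:

        # Hypothetical weighted-voting combination of AB and AF homology scores.
        def weighted_vote(scores: dict[str, float], weights: dict[str, float]) -> float:
            """Combine per-predictor scores (e.g. probabilities of remote homology)."""
            total_weight = sum(weights[name] for name in scores)
            return sum(weights[name] * score for name, score in scores.items()) / total_weight

        scores = {"AB_hmm": 0.35, "AF_kmer": 0.70, "AF_graph": 0.60}    # hypothetical predictor outputs
        weights = {"AB_hmm": 0.5, "AF_kmer": 0.25, "AF_graph": 0.25}
        print(weighted_vote(scores, weights))   # combined homology score (0.5 here)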