26 research outputs found

    Application of compression-based distance measures to protein sequence classification: a methodological study

    Motivation: Distance measures built on the notion of text compression have been used for the comparison and classification of entire genomes and mitochondrial genomes. The present study was undertaken in order to explore their utility in the classification of protein sequences. Results: We constructed compression-based distance measures (CBMs) using the Lempel-Ziv and the PPMZ compression algorithms and compared their performance with that of the Smith–Waterman algorithm and BLAST, using nearest neighbour or support vector machine classification schemes. The datasets included a subset of the SCOP protein structure database to test distant protein similarities, 3-phosphoglycerate kinase sequences selected from archaean, bacterial and eukaryotic species, as well as low- and high-complexity sequence segments of the human proteome. CBM values show a dependence on the length and the complexity of the sequences compared. In classification tasks CBMs performed especially well on distantly related proteins, where the performance of a combined measure, constructed from a CBM and a BLAST score, approached or even slightly exceeded that of the Smith–Waterman algorithm and two hidden Markov model-based algorithms.
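    To make the idea of a compression-based distance concrete, the sketch below computes the normalized compression distance, one commonly used measure of this kind. It is only an illustration under stated assumptions: the abstract does not give the exact formula the study used, and Python's `zlib` (DEFLATE, an LZ77-family compressor) stands in for the Lempel-Ziv and PPMZ compressors named above. The function names `compressed_size` and `ncd` are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (assumed stand-in compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two sequences.

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C(.) is the compressed size. Values near 0 suggest similar
    sequences; values near 1 suggest unrelated ones.
    """
    bx, by = x.encode("ascii"), y.encode("ascii")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)  # compress the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)
```

    Because a fixed compressor overhead is included in every compressed size, the value depends on sequence length as well as content, which is in line with the length and complexity dependence the abstract reports for CBM values.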

    ALIGNMENT-FREE METHODS AND ITS APPLICATIONS

    Comparing biological sequences remains one of the most vital activities in bioinformatics. It reveals the relatedness between species and identifies similar structures that might lead to similar functions. Sequence alignment is the default method and has been used in the domain for over four decades. It has gained a lot of trust, but limitations and even failures have been reported, especially with newly generated genomes. These genomes are larger and, to some extent, contain errors, which come mainly from the sequencing machine. Such sequencing errors should be considered when submitting sequences to GenBank; in sequence comparison, the problem is often hard to address or even trace. Alignment-based methods can fail on such errors, and although biologists still trust them, reports have shown failures. The poor results of alignment-based methods on error-prone sequences motivated researchers to look for alternatives. These alternative methods are alignment-free and are intended to overcome the shortcomings of alignment-based methods. This thesis is based on alignment-free methods: it conducts an in-depth study to evaluate them and to find the right application domain for them. One way to find that domain is to apply the methods to data subjected to manufactured errors and to test whether they provide better comparison results on data with naturally severe errors. The two techniques used in this work are compression-based and motif-based (also called k-mer-based or signal-based). We also addressed the selection of motifs in the second technique, and how choosing specific motifs can improve the quality of the results. In addition, we applied an alignment-free method to a different domain, gene prediction, to speed up the production of high-quality results and to accurately predict stretches of the DNA sequence that would be considered parts of genes.

    Comparison of the lossless compressibility of sequence databases with and without gaps (Confronto della comprimibilità lossless con e senza gap di database di sequenze)

    Analysis of the compression levels of sequence databases such as RNA, the Linux kernel, and XML, using the LZWA compression algorithm based on patterns with gaps

    Automatic search system for approximate and non-contiguous DNA sequences (Sistema de pesquisa automática de sequências de ADN aproximadas e não contíguas)

    Master's in Electronics and Telecommunications Engineering. The ability to search for DNA sequences similar to others contained in a larger sequence, such as a chromosome, plays a very important role in the study of organisms and of possible connections between different species. Although several techniques and algorithms for sequence search already exist, the problem remains open to new tools that improve on existing ones. This thesis proposes a solution for sequence search based on data compression, or, more specifically, on finite-context models, which yields a measure of similarity between a reference and a target. The method uses finite-context models to build a statistical model of the reference sequence and then estimates the number of bits needed to encode the target sequence using the reference model. In this work we studied the method, initially under controlled conditions and finally in a study of DNA regions of the modern human genome that cannot be found in ancient DNA (or can only be found with a high degree of dissimilarity).
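    The reference-model idea described in this abstract can be sketched as follows. This is a minimal illustration, not the thesis implementation: it assumes an order-k finite-context model over the alphabet A, C, G, T, additive smoothing with a parameter `alpha`, and a model that stays fixed after training on the reference. The names `train_fcm` and `bits_per_base` are hypothetical.

```python
import math

ALPHABET = "ACGT"  # assumption: sequences contain only these four symbols


def train_fcm(reference: str, k: int) -> dict:
    """Count, for each length-k context in the reference, which symbol follows it."""
    counts = {}
    for i in range(k, len(reference)):
        context, symbol = reference[i - k:i], reference[i]
        counts.setdefault(context, dict.fromkeys(ALPHABET, 0))[symbol] += 1
    return counts


def bits_per_base(target: str, counts: dict, k: int, alpha: float = 1.0) -> float:
    """Estimated average bits per symbol to encode the target with the reference model.

    Each symbol costs -log2 P(symbol | context), with
    P = (n(symbol, context) + alpha) / (n(context) + alpha * |alphabet|).
    A lower value means the target resembles the reference more closely.
    """
    if len(target) <= k:
        raise ValueError("target must be longer than the context length k")
    total_bits = 0.0
    for i in range(k, len(target)):
        ctx_counts = counts.get(target[i - k:i])
        n_symbol = ctx_counts[target[i]] if ctx_counts else 0
        n_context = sum(ctx_counts.values()) if ctx_counts else 0
        p = (n_symbol + alpha) / (n_context + alpha * len(ALPHABET))
        total_bits -= math.log2(p)
    return total_bits / (len(target) - k)
```

    With this smoothing, a context never seen in the reference costs 2 bits per base (a uniform guess over four symbols), so target regions that score close to 2 bits are candidates for sequence absent from the reference, while regions well below 2 bits are likely shared.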

    A microwear study of Clovis blades from the Gault site, Bell County, Texas

    Prehistoric quarries in America are poorly understood and thus problematical to take into account when making inferences about past behavior. A microwear analysis of Clovis blades from the 2000 Texas A&M University excavations at the Gault site (41BL323), located in southern Bell County, Texas, provided a window into this problem. Texas A&M excavations on the site produced an extraordinarily large number of Clovis artifacts in two bounded geologic units, 3a and 3b. Included in the artifact types are blades, specialized elongate flakes associated with a core and blade technology. In conducting a microwear analysis of the Clovis blades from Gault, I proposed the following questions: (1) were the Clovis blades utilized at Gault?; (2) is there a difference in the use-wear patterns of Clovis blades from the geological units 3a and 3b?; and (3) is Gault, as a quarry/workshop site, a place to just obtain raw materials or did it also serve as a craft site? Observations from experiments, stereomicroscope analysis, compound microscope analysis, and SEM/EDS analysis led to answers for two research questions: (1) blades were used at Gault and (2) there is a difference between Clovis units 3a and 3b. Eight Clovis 3a blades, or 3.0% of the total Clovis 3a blade/blade fragment population (n=264), exhibit use-wear. Six Clovis 3b blades, 3.3% of the total Clovis 3b blade/blade fragment population (n=182), exhibit use-wear. In general, Clovis 3b blades were used on harder contact materials (wood to bone) than those in Clovis Unit 3a (softer contact materials similar to grass, sinew, and rawhide). The function(s) of quarries and quarry-related workshops were interpreted by William Henry Holmes as a place to obtain raw materials, while Kirk Bryan interpreted them as a place to bring other materials to work in craft activities. 
Following the microwear analysis of Clovis blades/blade fragments at Gault, I compared Gault to three other Paleoindian quarry-workshop sites (Wells Creek, Dutchess Quarry, and West Athens Hill). My intent is to provide supplemental data for consideration when applying Holmes’ and Bryan’s respective hypotheses.