Application of compression-based distance measures to protein sequence classification: a methodological study
Abstract
Motivation: Distance measures built on the notion of text compression have been used for the comparison and classification of entire genomes and mitochondrial genomes. The present study was undertaken in order to explore their utility in the classification of protein sequences.
Results: We constructed compression-based distance measures (CBMs) using the Lempel-Ziv and the PPMZ compression algorithms and compared their performance with that of the Smith–Waterman algorithm and BLAST, using nearest neighbour or support vector machine classification schemes. The datasets included a subset of the SCOP protein structure database to test distant protein similarities, 3-phosphoglycerate kinase sequences selected from archaean, bacterial and eukaryotic species, and low- and high-complexity sequence segments of the human proteome. CBM values depend on the length and the complexity of the sequences compared. In classification tasks CBMs performed especially well on distantly related proteins, where the performance of a combined measure, constructed from a CBM and a BLAST score, approached or even slightly exceeded that of the Smith–Waterman algorithm and two hidden Markov model-based algorithms.
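As an illustration of the general idea behind a compression-based distance, the following minimal sketch computes the widely used normalized compression distance. It is not the authors' implementation: it assumes Python's standard `zlib` (DEFLATE, an LZ77-family compressor) as a stand-in for the Lempel-Ziv and PPMZ compressors named above, and the function names `compressed_size` and `ncd` are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (assumed stand-in compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two sequences.

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C(.) is the compressed size. Smaller values suggest that
    the two sequences share more compressible structure.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

A nearest-neighbour classifier of the kind mentioned in the abstract could then assign a query sequence the label of the training sequence with the smallest `ncd` value. As the abstract notes, such values depend on sequence length and complexity, so short sequences may be dominated by compressor overhead.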
ALIGNMENT-FREE METHODS AND ITS APPLICATIONS
Comparing biological sequences remains one of the most vital activities in bioinformatics. It reveals how closely species are related and identifies similar structures that may indicate similar functions.
Sequence alignment is the default method and has been used in the field for over four decades. It is widely trusted, but limitations and even failures have been reported, especially with newly generated genomes. These genomes are larger and, to some extent, contain errors, which arise mainly from the sequencing machine. Such sequencing errors should be considered when sequences are submitted to GenBank; in sequence comparison, however, the problem is often hard to address or even trace.
Alignment-based methods can fail in the presence of such errors, and although biologists still trust them, failures have been reported.
The poor results of alignment-based methods on error-prone sequences motivated researchers to look for alternatives. These alternative methods are alignment-free and aim to overcome the shortcomings of alignment-based methods. This thesis is based on alignment-free methods: it conducts an in-depth study to evaluate them and to find the application domain that suits them. One way to identify that domain is to apply the methods to data with artificially introduced errors, and then to test whether they give better comparison results on data that naturally contains severe errors. The two techniques used in this work are compression-based and motif-based (also called k-mer-based or signal-based). We also address how the motifs used in the second technique are selected, and how choosing specific motifs can improve the quality of the results.
In addition, we applied an alignment-free method to a different domain, gene prediction. There, the alignment-free approach is used to speed up the production of high-quality results and to predict accurately the stretches of a DNA sequence that are likely to be parts of genes.
Comparison of the lossless compressibility, with and without gaps, of sequence databases
Analysis of the compression levels of sequence databases such as RNA, the Linux kernel and XML, using the LZWA compression algorithm based on patterns with gaps
Automatic search system for approximate and non-contiguous DNA sequences
Master's in Electronics and Telecommunications Engineering
The ability to search for DNA sequences similar to others contained in a larger sequence, such as a chromosome, plays an important role in the study of organisms and of possible connections between different species.
Although several techniques and algorithms already exist for performing sequence searches, the problem remains open to new tools that improve on existing ones.
This thesis proposes a solution for sequence search based on data compression or, more specifically, on finite-context models, which yields a measure of similarity between a reference and a target. The method uses finite-context models to build a statistical model of the reference sequence and then estimates the number of bits needed to encode the target sequence using that reference model.
In this work we studied the method described above, initially under controlled conditions and finally in a study of DNA regions of the modern human genome that cannot be found in ancient DNA (or can be found only with a high degree of dissimilarity).
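The reference-model idea described in this abstract can be sketched roughly as follows. This is only an illustrative sketch, not the thesis's implementation: it assumes a single order-k finite-context model, a four-letter DNA alphabet, and additive smoothing with a hypothetical parameter `alpha`; the names `build_model` and `estimated_bits` are also hypothetical.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # assumed DNA alphabet


def build_model(reference: str, k: int):
    """Count how often each symbol follows each length-k context in the reference."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(k, len(reference)):
        counts[reference[i - k:i]][reference[i]] += 1
    return counts


def estimated_bits(model, target: str, k: int, alpha: float = 1.0) -> float:
    """Estimate the bits needed to encode the target using the reference model.

    Each symbol costs -log2 of its smoothed probability given the k
    preceding symbols; contexts unseen in the reference fall back to
    a uniform distribution over the alphabet.
    """
    bits = 0.0
    for i in range(k, len(target)):
        ctx_counts = model.get(target[i - k:i], {})
        total = sum(ctx_counts.values())
        p = (ctx_counts.get(target[i], 0) + alpha) / (total + alpha * len(ALPHABET))
        bits -= math.log2(p)
    return bits
```

Under these assumptions, a target that resembles the reference needs few bits per symbol, while a dissimilar target costs about two bits per symbol or more, so the bit estimate can serve as a similarity measure between reference and target.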
A microwear study of Clovis blades from the Gault site, Bell County, Texas
Prehistoric quarries in America are poorly understood and thus problematical to
take into account when making inferences about past behavior. A microwear analysis of
Clovis blades from the 2000 Texas A&M University excavations at the Gault site
(41BL323), located in southern Bell County, Texas, provided a window into this
problem. Texas A&M excavations on the site produced an extraordinarily large number
of Clovis artifacts in two bounded geologic units, 3a and 3b. Included in the artifact
types are blades, specialized elongate flakes associated with a core and blade
technology. In conducting a microwear analysis of the Clovis blades from Gault, I
posed the following questions: (1) were the Clovis blades utilized at Gault; (2) is
there a difference in the use-wear patterns of Clovis blades from geological units 3a
and 3b; and (3) is Gault, as a quarry/workshop site, a place just to obtain raw materials,
or did it also serve as a craft site?
Observations from experiments, stereomicroscope analysis, compound
microscope analysis, and SEM/EDS analysis led to answers for two research questions:
(1) blades were used at Gault and (2) there is a difference between Clovis units 3a and 3b. Eight Clovis 3a blades, or 3.0% of the total Clovis 3a blade/blade fragment
population (n=264), exhibit use-wear. Six Clovis 3b blades, 3.3% of the total Clovis 3b
blade/blade fragment population (n=182), exhibit use-wear. In general, Clovis 3b blades
were used on harder contact materials (wood to bone) than those in Clovis Unit 3a
(softer contact materials similar to grass, sinew, and rawhide).
The function(s) of quarries and quarry-related workshops were interpreted by
William Henry Holmes as a place to obtain raw materials, while Kirk Bryan interpreted
them as a place to bring other materials to work in craft activities. Following the
microwear analysis of Clovis blades/blade fragments at Gault, I compared Gault to three
other Paleoindian quarry-workshop sites (Wells Creek, Dutchess Quarry, and West
Athens Hill). My intent is to provide supplemental data for consideration when
applying Holmes’ and Bryan’s respective hypotheses.