22 research outputs found

    Comparative analysis of long DNA sequences by per element information content using different contexts

    BACKGROUND: Features of a DNA sequence can be found by compressing the sequence under a suitable model; good compression implies low information content. Good DNA compression models consider repetition, differences between repeats, and base distributions. From a linear DNA sequence, a compression model can produce a linear information sequence. Linear space complexity is important when exploring long DNA sequences of the order of millions of bases. Compressing a sequence in isolation will include information on self-repetition, whereas compressing a sequence Y in the context of another sequence X reveals what new information X gives about Y. This paper presents a methodology for performing comparative analysis to find features exposed by such models. RESULTS: We apply such a model to find features across chromosomes of Cyanidioschyzon merolae. We present a tool that provides useful linear transformations to investigate and save new sequences. Various examples illustrate the methodology, finding features for sequences alone and in different contexts. We also show how to highlight all sets of self-repetition features, in this case within Plasmodium falciparum chromosome 2. CONCLUSION: The methodology finds features that are significant and that biologists confirm. The exploration of long information sequences in linear time and space is fast, and the saved results are self-documenting.
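
    The per-element information idea can be sketched in a few lines of Python. The order-k Markov model with add-one smoothing below is an assumption for illustration only; the paper's model is richer, exploiting repeats and differences between repeats, and for simplicity the sketch trains on a whole sequence rather than adaptively.

        # Sketch: per-element information content under a simple order-k Markov
        # model (an assumption; not the paper's actual compression model).
        from collections import defaultdict
        from math import log2

        def markov_counts(seq, k=3):
            """Count context -> next-base occurrences in seq."""
            counts = defaultdict(lambda: defaultdict(int))
            for i in range(k, len(seq)):
                counts[seq[i - k:i]][seq[i]] += 1
            return counts

        def information_sequence(y, counts, k=3, alphabet="ACGT"):
            """Per-element information of y in bits: -log2 P(base | k-mer context)."""
            info = []
            for i in range(k, len(y)):
                ctx = counts[y[i - k:i]]
                total = sum(ctx.values()) + len(alphabet)
                info.append(-log2((ctx[y[i]] + 1) / total))
            return info

        # Y compressed alone (self-context) vs. Y in the context of X;
        # x and y are hypothetical sequences used only for illustration.
        x = "ACGTACGTTTACGGTACG" * 50
        y = "ACGTACGAACGTTAGACG" * 50
        info_y_alone = information_sequence(y, markov_counts(y))
        info_y_given_x = information_sequence(y, markov_counts(x))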

    A genome alignment algorithm based on compression


    Polyome: A Learning System for Extracting Bioinformatic Data

    Abstract: The exponential growth in the quantity of publicly available genetic data and the proliferation of bioinformatic databases mean that scientists need computerized tools more than ever. Existing approaches suffer from one or more basic problems. This paper describes Polyome, the core of a system for the integration and querying o

    A Simple Statistical Algorithm for Biological Sequence Compression

    This paper introduces a novel algorithm for biological sequence compression that makes use of both statistical properties and repetition within sequences. A panel of experts is maintained to estimate the probability distribution of the next symbol in the sequence to be encoded. Expert probabilities are combined to obtain the final distribution. The resulting information sequence provides insight for further study of the biological sequence. Each symbol is then encoded by arithmetic coding. Experiments show that our algorithm outperforms existing compressors on typical DNA and protein sequence datasets while maintaining a practical running time.
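
    A minimal Python sketch of the expert-panel idea follows: a few fixed-order Markov experts predict the next base, their predictions are blended with weights that track past performance, and the per-symbol cost -log2(p) is recorded. The expert set and weighting scheme are illustrative assumptions, and the ideal code length stands in for actual arithmetic coding.

        # Sketch of expert blending (orders and weighting are assumptions, not the
        # published algorithm's expert set).
        from math import log2

        ALPHABET = "ACGT"

        def expert_prob(seq, i, order, counts):
            """Add-one-smoothed P(seq[i] | previous `order` symbols) seen so far."""
            hist = counts.setdefault((order, seq[max(0, i - order):i]), {})
            total = sum(hist.values()) + len(ALPHABET)
            return (hist.get(seq[i], 0) + 1) / total

        def update_expert(seq, i, order, counts):
            hist = counts.setdefault((order, seq[max(0, i - order):i]), {})
            hist[seq[i]] = hist.get(seq[i], 0) + 1

        def code_lengths(seq, orders=(0, 2, 4)):
            counts = {}
            weights = {o: 1.0 / len(orders) for o in orders}  # equal trust at start
            bits = []
            for i in range(len(seq)):
                probs = {o: expert_prob(seq, i, o, counts) for o in orders}
                p = sum(weights[o] * probs[o] for o in orders)  # blended prediction
                bits.append(-log2(p))
                for o in orders:                 # reward experts that predicted well
                    weights[o] *= probs[o]
                    update_expert(seq, i, o, counts)
                s = sum(weights.values())        # renormalise to avoid underflow
                weights = {o: w / s for o, w in weights.items()}
            return bits

        total_bits = sum(code_lengths("ACGTACGTACGTTTGACGTACGT" * 40))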

    Robust estimation of evolutionary distances with information theory

    Methods for measuring genetic distances in phylogenetics are known to be sensitive to the evolutionary model assumed. However, there is a lack of established methodology to accommodate the trade-off between incorporating sufficient biological reality and avoiding model overfitting. In addition, as traditional methods measure distances based on the observed number of substitutions, they tend to underestimate distances between diverged sequences due to backward and parallel substitutions. Various techniques have been proposed to correct this, but they lack robustness against sequences that are distantly related or have unequal base frequencies. In this article, we present a novel genetic distance estimate based on information theory that overcomes these two hurdles. Instead of examining the observed number of substitutions, this method estimates genetic distances using Shannon's mutual information. This naturally provides an effective framework for balancing model complexity and goodness of fit. Our distance estimate is shown to be approximately linear in elapsed time and hence is less sensitive to the divergence of sequence data and to compositionally biased sequences. Using extensive simulation data, we show that our method 1) consistently reconstructs more accurate phylogeny topologies than existing methods, 2) is robust in extreme conditions such as diverged phylogenies, unequal base frequencies, and heterogeneous mutation patterns, and 3) scales well with large phylogenies.
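
    The core quantity, Shannon mutual information between two aligned sequences, can be estimated from the joint frequencies of aligned base pairs, as sketched below. The way the sketch turns mutual information into a dissimilarity (1 minus I over the joint entropy) is a generic stand-in, not the estimator derived in the article.

        # Sketch: mutual information between two aligned DNA sequences; the
        # dissimilarity transform is illustrative, not the article's estimator.
        from collections import Counter
        from math import log2

        def mutual_information(x, y):
            """I(X;Y) in bits from the joint frequencies of aligned site pairs."""
            pairs = [(a, b) for a, b in zip(x, y) if a in "ACGT" and b in "ACGT"]
            n = len(pairs)
            joint = Counter(pairs)
            px = Counter(a for a, _ in pairs)
            py = Counter(b for _, b in pairs)
            return sum((c / n) * log2(c * n / (px[a] * py[b]))
                       for (a, b), c in joint.items())

        def mi_dissimilarity(x, y):
            """1 - I(X;Y)/H(X,Y): 0 for identical, near 1 for unrelated sequences."""
            pairs = [(a, b) for a, b in zip(x, y) if a in "ACGT" and b in "ACGT"]
            n = len(pairs)
            h_joint = -sum((c / n) * log2(c / n) for c in Counter(pairs).values())
            return 1.0 - mutual_information(x, y) / h_joint if h_joint else 0.0

        d = mi_dissimilarity("ACGTACGTAAGTCCGT" * 40, "ACGTACCTATGTCGGT" * 40)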

    Building classification models from microarray data with tree-based classification algorithms

    Abstract: Building classification models plays an important role in DNA microarray data analyses. An essential feature of DNA microarray data sets is that the number of input variables (genes) is far greater than the number of samples. As such, most classification schemes employ variable selection or feature selection methods to pre-process DNA microarray data. This paper investigates various aspects of building classification models from microarray data with tree-based classification algorithms, using Partial Least-Squares (PLS) regression as a feature selection method. Experimental results show that PLS regression is an appropriate feature selection method and that tree-based ensemble models are capable of delivering high-performance classification models for microarray data.
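
    A minimal scikit-learn sketch of this pipeline is given below, assuming a gene-expression matrix X (samples by genes) and class labels y; the synthetic data, component count, number of selected genes, and ensemble size are illustrative choices, not those used in the paper.

        # Sketch only: PLS-based gene ranking followed by a tree-based ensemble.
        import numpy as np
        from sklearn.cross_decomposition import PLSRegression
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import cross_val_score

        rng = np.random.default_rng(0)
        X = rng.normal(size=(60, 2000))        # 60 samples, 2000 genes (synthetic)
        y = rng.integers(0, 2, size=60)        # binary class labels

        # PLS as feature selection: rank genes by the magnitude of their loading
        # weights on the first few PLS components and keep the top-ranked ones.
        pls = PLSRegression(n_components=3).fit(X, y)
        gene_scores = np.abs(pls.x_weights_).sum(axis=1)
        top_genes = np.argsort(gene_scores)[::-1][:50]

        # Tree-based ensemble on the selected genes. (In practice the selection
        # step should be nested inside cross-validation to avoid selection bias.)
        forest = RandomForestClassifier(n_estimators=500, random_state=0)
        print(cross_val_score(forest, X[:, top_genes], y, cv=5).mean())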