Data Discovery and Anomaly Detection Using Atypicality: Theory
A central question in the era of 'big data' is what to do with the enormous
amount of information. One possibility is to characterize it through
statistics, e.g., averages, or classify it using machine learning, in order to
understand the general structure of the overall data. The perspective in this
paper is the opposite, namely that most of the value in the information in some
applications is in the parts that deviate from the average, that are unusual,
atypical. We define what we mean by 'atypical' in an axiomatic way as data that
can be encoded with fewer bits in itself rather than using the code for the
typical data. We show that this definition has good theoretical properties. We
then develop an implementation based on universal source coding, and apply this
to a number of real-world data sets. Comment: 40 pages
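The coding-based definition in this abstract can be illustrated with a minimal sketch. It assumes binary data, a fixed i.i.d. Bernoulli code standing in for the "typical" code, and a Krichevsky-Trofimov sequential estimator standing in for a universal source coder. These choices, the function names, and the `overhead_bits` parameter are illustrative assumptions, not the paper's implementation.

```python
import math

def typical_codelength(bits, p_one):
    """Ideal codelength (in bits) under a fixed i.i.d. Bernoulli(p_one) code."""
    ones = sum(bits)
    zeros = len(bits) - ones
    return -(ones * math.log2(p_one) + zeros * math.log2(1.0 - p_one))

def universal_codelength(bits):
    """Ideal codelength under the Krichevsky-Trofimov sequential estimator,
    which learns the symbol statistics from the sequence itself."""
    total = 0.0
    counts = [0, 0]
    for b in bits:
        p = (counts[b] + 0.5) / (counts[0] + counts[1] + 1.0)
        total -= math.log2(p)
        counts[b] += 1
    return total

def is_atypical(bits, p_one=0.5, overhead_bits=0.0):
    """Flag a sequence when coding it 'in itself' is cheaper than the typical code.
    overhead_bits is a hypothetical penalty for signalling an atypical segment."""
    return universal_codelength(bits) + overhead_bits < typical_codelength(bits, p_one)

print(is_atypical([1] * 64))     # constant run: far cheaper to code in itself
print(is_atypical([0, 1] * 32))  # balanced sequence: no gain for this memoryless coder
```

Under the fair-coin typical model, the constant run is flagged while the balanced sequence is not. A coder with memory would be needed to detect the alternating pattern.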
Developing and applying heterogeneous phylogenetic models with XRate
Modeling sequence evolution on phylogenetic trees is a useful technique in
computational biology. Especially powerful are models which take account of the
heterogeneous nature of sequence evolution according to the "grammar" of the
encoded gene features. However, beyond a modest level of model complexity,
manual coding of models becomes prohibitively labor-intensive. We demonstrate,
via a set of case studies, the new built-in model-prototyping capabilities of
XRate (macros and Scheme extensions). These features allow rapid implementation
of phylogenetic models which would have previously been far more
labor-intensive. XRate's new capabilities for lineage-specific models,
ancestral sequence reconstruction, and improved annotation output are also
discussed. XRate's flexible model-specification capabilities and computational
efficiency make it well-suited to developing and prototyping phylogenetic
grammar models. XRate is available as part of the DART software package:
http://biowiki.org/DART . Comment: 34 pages, 3 figures, glossary of XRate model terminology
Electronic transport in DNA
We study the electronic properties of DNA by way of a tight-binding model applied to four particular DNA sequences. The charge transfer properties are presented in terms of localization lengths (crudely speaking, the length over which electrons travel). Various types of disorder, including random potentials, are employed to account for different real environments. We have performed calculations on poly(dG)-poly(dC), telomeric-DNA, random-ATGC DNA, and λ-DNA. We find that random and λ-DNA have localization lengths allowing for electron motion among a few dozen basepairs only. A novel enhancement of localization lengths is observed at particular energies for an increasing binary backbone disorder. We comment on the possible biological relevance of sequence-dependent charge transfer in DNA.
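As a rough illustration of how a localization length can be extracted from a tight-binding model, the sketch below uses the standard transfer-matrix recursion on a single disordered chain (a 1D Anderson model with uniformly distributed onsite energies). This is a deliberate simplification: it ignores the backbone and sequence structure the abstract refers to. The function name and all parameter values are assumptions, and the length is defined here as the inverse Lyapunov exponent of the wavefunction amplitude.

```python
import math
import random

def localization_length(energy, disorder_width, n_sites=20000, hopping=1.0, seed=0):
    """Estimate the localization length (in lattice sites) of a 1D chain
    from the growth rate of the tight-binding recursion
    psi[n+1] = ((E - eps[n]) / t) * psi[n] - psi[n-1]."""
    rng = random.Random(seed)
    psi_prev, psi = 0.0, 1.0
    log_growth = 0.0
    for _ in range(n_sites):
        # Random onsite energy drawn uniformly from [-W/2, W/2].
        onsite = rng.uniform(-disorder_width / 2.0, disorder_width / 2.0)
        psi_next = (energy - onsite) / hopping * psi - psi_prev
        psi_prev, psi = psi, psi_next
        # Renormalize at every step to avoid overflow, accumulating the log of the norm.
        norm = math.hypot(psi_prev, psi)
        log_growth += math.log(norm)
        psi_prev /= norm
        psi /= norm
    lyapunov = log_growth / n_sites
    return math.inf if lyapunov <= 0.0 else 1.0 / lyapunov

weak = localization_length(energy=0.0, disorder_width=1.0)
strong = localization_length(energy=0.0, disorder_width=4.0)
print(weak, strong)  # stronger disorder should give the shorter length
```

The estimate fluctuates with the random seed and chain length, so it is best read as an order-of-magnitude guide.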
Hidden Markov Models for Gene Sequence Classification: Classifying the VSG genes in the Trypanosoma brucei Genome
The article presents an application of Hidden Markov Models (HMMs) for
pattern recognition on genome sequences. We apply HMMs to identify genes
encoding the Variant Surface Glycoprotein (VSG) in the genomes of Trypanosoma
brucei (T. brucei) and other African trypanosomes. These parasitic protozoa are
the causative agents of sleeping sickness and several diseases in domestic and
wild animals. These parasites have a peculiar strategy to evade the host's immune
system that consists in periodically changing their predominant cellular
surface protein (VSG). The motivation for using pattern recognition methods to
identify these genes, instead of traditional homology-based ones, is that the
levels of sequence identity (amino acid and DNA sequence) amongst these genes
are often below what is considered reliable for these methods. Among pattern
recognition approaches, HMMs are particularly suitable to tackle this problem
because they can handle the determination of gene edges more naturally. We
evaluate the performance of the model using different numbers of states in the
Markov model, as well as several performance metrics. The model is applied
using public genomic data. Our empirical results show that the VSG genes in T.
brucei can be safely identified (high sensitivity and low rate of false
positives) using HMMs. Comment: Accepted in July 2015 in Pattern Analysis and Applications,
Springer. The article contains 23 pages, 4 figures, 8 tables and 51
references
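To make the point about gene edges concrete, here is a minimal sketch of Viterbi decoding with a two-state HMM over DNA symbols. The state names, the transition and emission probabilities, and the toy sequence are all made-up assumptions for illustration. They are not the model or parameters used in the article.

```python
import math

STATES = ("background", "gene")

# Hypothetical toy parameters: the "gene" state is GC-rich, "background" is AT-rich.
START = {"background": 0.5, "gene": 0.5}
TRANS = {"background": {"background": 0.9, "gene": 0.1},
         "gene": {"background": 0.1, "gene": 0.9}}
EMIT = {"background": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
        "gene": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4}}

def viterbi(sequence):
    """Return the most probable state path for `sequence` (log-space Viterbi)."""
    best = [{s: math.log(START[s]) + math.log(EMIT[s][sequence[0]]) for s in STATES}]
    back = []
    for symbol in sequence[1:]:
        scores, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for landing in state s at this position.
            prev = max(STATES, key=lambda r: best[-1][r] + math.log(TRANS[r][s]))
            scores[s] = best[-1][prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][symbol])
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

path = viterbi("AAAATTTT" + "GGGGCCCCGGCC" + "AAAATTTT")
print("".join("G" if s == "gene" else "-" for s in path))
```

The decoded path switches state exactly where the composition changes, which is the sense in which an HMM locates segment boundaries directly rather than through sequence similarity to known genes.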