    Outlier detection with partial information: Application to emergency mapping

    This paper addresses the problem of novelty detection in the case where the observed data are a mixture of a known 'background' process contaminated with an unknown other process, which generates the outliers, or novel observations. The framework we describe here is quite general, employing univariate classification with incomplete information, based on knowledge of the distribution (the probability density function, 'pdf') of the data generated by the 'background' process. The relative proportion of this 'background' component (the prior 'background' probability), the pdf and the prior probabilities of all other components are all assumed unknown. The main contribution is a new classification scheme that identifies the maximum proportion of observed data following the known 'background' distribution. The method exploits the Kolmogorov-Smirnov test to estimate this proportion, after which the data are Bayes-optimally separated. Results, demonstrated with synthetic data, show that this approach can produce more reliable results than a standard novelty detection scheme. The classification algorithm is then applied to the problem of identifying outliers in the SIC2004 data set, in order to detect the radioactive release simulated in the 'joker' data set. We propose this method as a reliable means of novelty detection in an emergency situation, which can also be used to identify outliers prior to the application of a more general automatic mapping algorithm. © Springer-Verlag 2007
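
    A minimal illustrative sketch of the central idea, assuming a standard-normal background and a simple grid scan; this is a simplification for intuition, not the authors' exact estimator or their Bayes-optimal separation rule:

```python
# Simplified sketch, not the paper's exact method: estimate the largest
# proportion of the sample consistent (in the Kolmogorov-Smirnov sense) with a
# known 'background' distribution, then flag the remaining mass as outliers by
# thresholding the background density.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
background = stats.norm(0.0, 1.0)                       # known background pdf/cdf
data = np.concatenate([rng.normal(0.0, 1.0, 900),       # background observations
                       rng.normal(5.0, 0.5, 100)])      # unknown contaminating process

x = np.sort(data)
ecdf = np.arange(1, x.size + 1) / x.size
ks_tol = 1.36 / np.sqrt(x.size)                         # ~5% two-sided KS band

def consistent(pi):
    """Mixture constraint: 0 <= F_emp - pi*F_bg <= 1 - pi, up to KS tolerance."""
    diff = ecdf - pi * background.cdf(x)
    return diff.min() >= -ks_tol and diff.max() <= (1.0 - pi) + ks_tol

# Largest background proportion still consistent with the observed sample.
pi_hat = max(pi for pi in np.linspace(0.0, 1.0, 201) if consistent(pi))

# Separate: keep the pi_hat fraction with the highest background density;
# everything else is declared novel/outlying.
dens = background.pdf(data)
is_outlier = dens < np.quantile(dens, 1.0 - pi_hat)
print(f"estimated background proportion: {pi_hat:.2f}, outliers flagged: {is_outlier.sum()}")
```

    On this synthetic mixture the scan recovers a background proportion close to the true 0.9 and flags roughly the contaminated tenth of the sample.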

    Predictive gene lists for breast cancer prognosis: A topographic visualisation study

    Background: The controversy surrounding the non-uniqueness of predictive gene lists (PGLs), small selected subsets of genes from the very large pool of candidates available in DNA microarray experiments, is now widely acknowledged [1]. Many of these studies have focused on constructing discriminative semi-parametric models and as such are also subject to the issue of random correlations of sparse model selection in high-dimensional spaces. In this work we outline a different approach based around an unsupervised, patient-specific nonlinear topographic projection of predictive gene lists. Methods: We construct nonlinear topographic projection maps based on inter-patient gene-list relative dissimilarities. The Neuroscale, Stochastic Neighbor Embedding (SNE) and Locally Linear Embedding (LLE) techniques are used to construct two-dimensional projective visualisation plots of the 70-dimensional PGLs per patient. Classifiers are also constructed to identify the prognosis indicator of each patient from the resulting projections and to investigate whether, a posteriori, the two prognosis groups are separable on the evidence of the gene lists. A literature-proposed predictive gene list for breast cancer is benchmarked against a separate gene list using the above methods. Generalisation ability is investigated by using the mapping capability of Neuroscale to visualise the follow-up study, based on the projections derived from the original dataset. Results: The results indicate that small subsets of patient-specific PGLs have insufficient prognostic dissimilarity to permit a distinction between the two prognosis groups. Uncertainty and diversity across multiple gene expressions prevent unambiguous or even confident patient grouping. Comparative projections across different PGLs provide similar results. Conclusion: The random-correlation effect to an arbitrary outcome, induced by small-subset selection from very high-dimensional, interrelated gene expression profiles, leads to an outcome with associated uncertainty. This continuum and uncertainty preclude any attempt at constructing discriminative classifiers. However, a patient's gene expression profile could possibly be used in treatment planning, based on knowledge of other patients' responses. We conclude that many of the patients involved in such medical studies are intrinsically unclassifiable on the basis of the provided PGL evidence. This additional category of 'unclassifiable' should be accommodated within medical decision support systems if serious errors and unnecessary adjuvant therapy are to be avoided.
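
    A short sketch of the kind of projection-plus-classifier probe the abstract describes, on synthetic 70-dimensional per-patient vectors and hypothetical prognosis labels; Neuroscale has no standard scikit-learn implementation, so only LLE and t-SNE (the widely used variant of SNE) are shown here, and none of this reproduces the study's pipeline:

```python
# Illustrative only: project synthetic 70-dimensional "gene list" vectors to 2-D
# and test whether two (here random, hence inseparable) prognosis groups are
# distinguishable in the projection space.
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding, TSNE
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 70))           # 120 hypothetical patients x 70 genes
labels = rng.integers(0, 2, size=120)    # hypothetical good/poor prognosis flags

lle_2d = LocallyLinearEmbedding(n_components=2, n_neighbors=12).fit_transform(X)
sne_2d = TSNE(n_components=2, perplexity=30, init="pca",
              random_state=0).fit_transform(X)

# A downstream classifier on each 2-D projection probes a-posteriori separability
# of the prognosis groups in the visualisation space.
for name, proj in [("LLE", lle_2d), ("t-SNE", sne_2d)]:
    acc = cross_val_score(KNeighborsClassifier(5), proj, labels, cv=5).mean()
    print(f"{name}: CV accuracy on projection = {acc:.2f}")   # ~0.5 for random data
```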

    Gene prediction in metagenomic fragments: A large scale machine learning approach

    Background: Metagenomics is an approach to the characterization of microbial genomes via the direct isolation of genomic sequences from the environment without prior cultivation. The amount of metagenomic sequence data is growing fast while computational methods for metagenome analysis are still in their infancy. In contrast to genomic sequences of single species, which can usually be assembled and analyzed by many available methods, a large proportion of metagenome data remains as unassembled, anonymous sequencing reads. One of the aims of all metagenomic sequencing projects is the identification of novel genes. The short length of the fragments (Sanger sequencing, for example, yields on average 700 bp) and the unknown phylogenetic origin of most fragments require approaches to gene prediction that differ from the currently available methods for genomes of single species. In particular, the large size of metagenomic samples requires fast and accurate methods with small numbers of false positive predictions. Results: We introduce a novel gene prediction algorithm for metagenomic fragments based on a two-stage machine learning approach. In the first stage, we use linear discriminants for monocodon usage, dicodon usage and translation initiation sites to extract features from DNA sequences. In the second stage, an artificial neural network combines these features with open reading frame length and fragment GC-content to compute the probability that the open reading frame encodes a protein. This probability is used for the classification and scoring of gene candidates. With large-scale training, our method provides fast single-fragment predictions with good sensitivity and specificity on artificially fragmented genomic DNA. Additionally, the method is able to predict translation initiation sites accurately and distinguishes complete from incomplete genes with high reliability. Conclusion: Large-scale machine learning methods are well suited for gene prediction in metagenomic DNA fragments. In particular, the combination of linear discriminants and neural networks is promising and should be considered for integration into metagenomic analysis pipelines. The data sets can be downloaded from the URL provided (see Availability and requirements section).
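
    A compact sketch of the two-stage structure described above, with synthetic fragments and hypothetical helper functions (monocodon_usage, gc_content); it is not the authors' implementation, and only the monocodon-usage discriminant is shown for stage one:

```python
# Stage 1: linear discriminant on codon-usage counts -> one score per fragment.
# Stage 2: a small neural network combines that score with ORF/fragment length
# and GC-content to estimate the probability that the fragment is coding.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

CODONS = [a + b + c for a in "ACGT" for b in "ACGT" for c in "ACGT"]

def monocodon_usage(seq: str) -> np.ndarray:
    """Relative frequency of each of the 64 codons in an in-frame sequence."""
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    counts = np.array([codons.count(c) for c in CODONS], dtype=float)
    return counts / max(counts.sum(), 1.0)

def gc_content(seq: str) -> float:
    return (seq.count("G") + seq.count("C")) / max(len(seq), 1)

# Synthetic 'coding' vs 'non-coding' fragments, purely for illustration.
rng = np.random.default_rng(1)
def random_fragment(coding: bool, n: int = 300) -> str:
    p = [0.2, 0.3, 0.3, 0.2] if coding else [0.25, 0.25, 0.25, 0.25]
    return "".join(rng.choice(list("ACGT"), size=n, p=p))

frags = [random_fragment(coding=(i % 2 == 0)) for i in range(400)]
y = np.array([i % 2 == 0 for i in range(400)], dtype=int)

usage = np.array([monocodon_usage(f) for f in frags])
lda = LinearDiscriminantAnalysis().fit(usage, y)
codon_score = lda.decision_function(usage)

features = np.column_stack([codon_score,
                            [len(f) for f in frags],
                            [gc_content(f) for f in frags]])
net = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000,
                                  random_state=0)).fit(features, y)
print("P(coding) for first fragment:", net.predict_proba(features[:1])[0, 1])
```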

    Discovering Conformational Sub-States Relevant to Protein Function

    Background: Internal motions enable proteins to explore a range of conformations, even in the vicinity of the native state. The role of conformational fluctuations in the designated function of a protein is widely debated. Emerging evidence suggests that sub-groups within the range of conformations (or sub-states) have properties that may be functionally relevant. However, the low populations of these sub-states and the transient nature of conformational transitions between them present significant challenges for their identification and characterization. Methods and Findings: To overcome these challenges we have developed a new computational technique, quasi-anharmonic analysis (QAA). QAA utilizes higher-order statistics of protein motions to identify sub-states in the conformational landscape. Further, the focus on anharmonicity allows identification of conformational fluctuations that enable transitions between sub-states. QAA applied to equilibrium simulations of human ubiquitin and T4 lysozyme reveals functionally relevant sub-states and protein motions involved in molecular recognition. In combination with a reaction pathway sampling method, QAA characterizes conformational sub-states associated with cis/trans peptidyl-prolyl isomerization catalyzed by the enzyme cyclophilin A. In these three proteins, QAA allows identification of conformational sub-states with critical structural and dynamical features relevant to protein function. Conclusions: Overall, QAA provides a novel framework to intuitively understand the biophysical basis of conformational diversity and its relevance to protein function. © 2011 Ramanathan et al.
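
    A conceptual sketch of the higher-order-statistics idea the abstract relies on, using synthetic "coordinate fluctuation" data and FastICA as a stand-in; this is not the authors' QAA implementation, only an illustration of how a mode with anharmonic (two-state) character stands out from near-harmonic ones via fourth-order statistics:

```python
# Second-order analysis (PCA) versus a higher-order decomposition (FastICA as a
# stand-in) on synthetic fluctuations containing one hidden two-state coordinate.
import numpy as np
from sklearn.decomposition import PCA, FastICA

rng = np.random.default_rng(0)
n_frames, n_dof = 5000, 30
# One anharmonic degree of freedom hops between two sub-states; the remaining
# degrees of freedom are near-harmonic (Gaussian) fluctuations.
hop = np.where(rng.random(n_frames) < 0.5, -1.5, 1.5) + 0.2 * rng.normal(size=n_frames)
fluct = 0.3 * rng.normal(size=(n_frames, n_dof))
mixing = rng.normal(size=n_dof)
X = fluct + np.outer(hop, mixing)            # observed coordinate fluctuations

pca_modes = PCA(n_components=3).fit_transform(X)
ica_modes = FastICA(n_components=3, random_state=0).fit_transform(X)

def excess_kurtosis(a):
    """Fourth-order statistic: ~0 for harmonic modes, strongly negative here
    for the bimodal (two sub-state) mode."""
    a = (a - a.mean()) / a.std()
    return (a ** 4).mean() - 3.0

print("PCA mode kurtosis:", [round(excess_kurtosis(m), 2) for m in pca_modes.T])
print("ICA mode kurtosis:", [round(excess_kurtosis(m), 2) for m in ica_modes.T])
```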

    Benchmarking beat classification algorithms

    The aim of this study is to compare the accuracy of a range of advanced and classical pattern recognition algorithms for beat and arrhythmia classification from ECG using a principled statistical framework. These are to be used in an application where no patient-specific adaptation of features or model is possible, which means that models must be able to generalise across subjects. Our results demonstrate that non-linear classification models offer significant advantages in ECG beat classification and that, with a principled approach to feature selection, pre-processing and model development, it is possible to obtain robust inter-subject generalisation even on ambulatory data.
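
    A brief sketch of the evaluation constraint the abstract emphasises, on synthetic features rather than real ECG: since no patient-specific adaptation is allowed, comparisons between a linear and a non-linear classifier should use subject-grouped cross-validation so that whole patients are held out. The models and feature dimensions below are illustrative assumptions, not the study's benchmarked algorithms:

```python
# Inter-subject evaluation: GroupKFold holds out entire patients per fold, so a
# model's score reflects generalisation across subjects rather than within them.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)
n_beats, n_features, n_subjects = 2000, 12, 20
X = rng.normal(size=(n_beats, n_features))             # hypothetical beat features
y = rng.integers(0, 2, size=n_beats)                   # normal vs ectopic label
subjects = rng.integers(0, n_subjects, size=n_beats)   # which patient each beat came from

cv = GroupKFold(n_splits=5)                            # every fold holds out whole patients
models = [("linear", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
          ("non-linear", make_pipeline(StandardScaler(), SVC(kernel="rbf")))]
for name, model in models:
    acc = cross_val_score(model, X, y, cv=cv, groups=subjects).mean()
    print(f"{name}: inter-subject accuracy = {acc:.2f}")   # ~0.5 on random data
```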