20 research outputs found

    Hidden Markov Model Variants and their Application

    Markov statistical methods may make it possible to develop an unsupervised learning process that automatically and comprehensively identifies genomic structure in prokaryotes. The approach is based on mutual information, probabilistic measures, hidden Markov models, and other purely statistical inputs, and it provides a uniquely common ground for comparative prokaryotic genomics. It is by nature an ongoing, multi-pass learning process: each round is more informed than the last, which allows a shift at each iteration to the more powerful methods available for supervised learning. It is envisaged that this "bootstrap" learning process will also be useful as a knowledge discovery tool. For such an ab initio prokaryotic gene-finder to work, however, it needs a mechanism to identify critical motif structures, such as those around the start of coding or the start of transcription (and then, hopefully, more). For eukaryotes, even with better start-of-coding identification, parsing of coding regions by the HMM is still limited by the HMM's single-gene assumption, as evidenced by poor performance in alternatively spliced regions. To address these complications, an approach is described that expands the states in a eukaryotic gene-predictor HMM so that it operates with two layers of DNA parsing. This extension of the single-layer gene prediction parse is indicated by preliminary analysis of the C. elegans alt-splice statistics. State profiles make use of a novel hash-interpolating MM (hIMM) method. A new implementation of an HMM-with-Duration is also described, with far-reaching application to gene-structure identification and to the analysis of channel current blockade data.
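The abstract above names mutual information as one of its statistical inputs without giving a formula. As a rough illustration only, the sketch below estimates the mutual information (in bits) between the bases found `gap` positions apart in a DNA string. The function name `mutual_information`, the `gap` parameter, and the toy sequences are assumptions made for this example; they are not taken from the paper.

```python
from collections import Counter
from math import log2


def mutual_information(seq, gap):
    """Estimate mutual information (bits) between seq[i] and seq[i + gap].

    Hypothetical helper for illustration; not the paper's implementation.
    """
    pairs = [(seq[i], seq[i + gap]) for i in range(len(seq) - gap)]
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)                # counts of (left base, right base)
    left = Counter(x for x, _ in pairs)   # marginal counts of the left base
    right = Counter(y for _, y in pairs)  # marginal counts of the right base
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * log2(p_xy / ((left[x] / n) * (right[y] / n)))
    return mi


# A strictly periodic toy sequence: each base fully determines the next one.
periodic = "ACGT" * 50
# A constant toy sequence: neighbouring positions carry no information.
constant = "A" * 200

mi_periodic = mutual_information(periodic, 1)
mi_constant = mutual_information(constant, 1)
```

On the periodic toy sequence the estimate comes out close to 2 bits (four roughly equiprobable bases, each determining its successor), and on the constant sequence it is 0. Plotting such estimates against `gap` is one plausible way to look for positional structure, under the assumptions stated above.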

    Analysis of Nanopore Detector Measurements using Machine Learning Methods, with Application to Single-Molecule Kinetics

    At its core, a nanopore detector has a nanometer-scale biological membrane across which a voltage is applied. The voltage draws a DNA molecule into an α-hemolysin channel in the membrane. A distinctive channel current blockade signal is then created as the molecule flexes and interacts with the channel. This flexing of the molecule is characterized by different blockade levels in the channel current signal. Previous experiments have shown that a nanopore detector is sensitive enough that nearly identical DNA molecules were classified successfully using machine learning techniques such as Hidden Markov Models and Support Vector Machines in a channel current based signal analysis platform [4-9]. In this paper, methods for improving feature extraction are presented, both to improve classification and to give biologists and chemists a better understanding of the physical properties of a given molecule.

    Duration learning for analysis of nanopore ionic current blockades

    Background: Ionic current blockade signal processing, for use in nanopore detection, offers a promising new way to analyze single-molecule properties, with potential implications for DNA sequencing. The alpha-Hemolysin transmembrane channel interacts with a translocating molecule in a nontrivial way, frequently evidenced by a complex ionic flow blockade pattern. Typically, recorded current blockade signals have several levels of blockade, with various durations, all obeying a fixed statistical profile for a given molecule. Hidden Markov Model (HMM) based duration learning experiments on artificial two-level Gaussian blockade signals helped us identify a proper modeling framework. We then apply our framework to the real multi-level DNA hairpin blockade signal. Results: The identified upper-level blockade state is observed with durations that are geometrically distributed (consistent with a physical decay process for remaining in any given state). We show that a mixture of convolution chains of geometrically distributed states is better for representing multimodal, long-tailed duration phenomena. Based on learned HMM profiles, we are able to classify 9 base-pair DNA hairpins with accuracy up to 99.5% on signals from same-day experiments. Conclusion: We have demonstrated several implementations for de novo estimation of the duration distribution probability density function within an HMM framework and applied our model topology to real data. The proposed design could be useful in molecular analysis based on nanopore current blockade signals.
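To make the phrase "mixture of convolution chains of geometrically distributed states" concrete, here is a minimal sketch under stated assumptions. A single state with self-transition probability a has a geometric dwell time, P(d) = a^(d-1) (1 - a); a chain of such states visited in sequence has a dwell time given by the convolution of the per-state distributions; a mixture is a weighted sum of chains. The helper names and the numeric parameters below are hypothetical and are not taken from the paper.

```python
def geometric_pmf(a, max_d):
    """Dwell-time pmf of one state with self-transition probability a.

    Index d of the returned list holds P(duration = d); index 0 is 0.0.
    """
    return [0.0] + [(a ** (d - 1)) * (1.0 - a) for d in range(1, max_d + 1)]


def convolve(p, q, max_d):
    """Pmf of the sum of two independent durations, truncated at max_d."""
    out = [0.0] * (max_d + 1)
    for i, p_i in enumerate(p):
        for j, q_j in enumerate(q):
            if i + j <= max_d:
                out[i + j] += p_i * q_j
    return out


def chain_pmf(self_probs, max_d):
    """Total dwell-time pmf of states visited one after another."""
    pmf = geometric_pmf(self_probs[0], max_d)
    for a in self_probs[1:]:
        pmf = convolve(pmf, geometric_pmf(a, max_d), max_d)
    return pmf


def mixture_pmf(weights, chains, max_d):
    """Weighted mixture of several chains (weights should sum to 1)."""
    out = [0.0] * (max_d + 1)
    for w, chain in zip(weights, chains):
        for d, p in enumerate(chain_pmf(chain, max_d)):
            out[d] += w * p
    return out


# Hypothetical parameters: a short two-state chain mixed with a long one.
two_state = chain_pmf([0.5, 0.5], max_d=50)
mixed = mixture_pmf([0.6, 0.4], [[0.5, 0.5], [0.9] * 6], max_d=50)
```

A single geometric state always has its mode at d = 1, whereas the two-state chain assigns zero probability to d = 1 and peaks later. Mixing a short chain with a long one is one way to obtain a multimodal, long-tailed shape, which is the qualitative behaviour the abstract describes.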

    Analysis of nanopore detector measurements using Machine-Learning methods, with application to single-molecule kinetic analysis

    Background: A nanopore detector has a nanometer-scale trans-membrane channel across which a potential difference is established, resulting in an ionic current through the channel in the pA-nA range. A distinctive channel current blockade signal is created as individually "captured" DNA molecules interact with the channel and modulate the channel's ionic current. The nanopore detector is sensitive enough that nearly identical DNA molecules can be classified with very high accuracy using machine learning techniques such as Hidden Markov Models (HMMs) and Support Vector Machines (SVMs). Results: A non-standard implementation of an HMM, emission inversion, is used for improved classification. Additional features are also considered for the feature vector employed by the SVM for classification: the addition of a single feature representing spike density is shown to notably improve classification results. Another, much larger, feature set expansion was studied (2500 additional features instead of 1), derived by including all of the HMM's transition probabilities. The expanded features can introduce redundant, noisy information (as well as diagnostic information) into the current feature set, and thus degrade classification performance. A hybrid Adaptive Boosting approach was used for feature selection to alleviate this problem. Conclusion: The methods shown here, for more informed feature extraction, both improve classification and provide biologists and chemists with tools for obtaining a better understanding of the kinetic properties of molecules of interest.

    A novel, fast, HMM-with-Duration implementation – for application with a new, pattern recognition informed, nanopore detector

    Background: Hidden Markov Models (HMMs) provide an excellent means for structure identification and feature extraction on stochastic sequential data. An HMM-with-Duration (HMMwD) is an HMM that can also exactly model the hidden-label length (recurrence) distributions, whereas the regular HMM imposes a best-fit geometric distribution in its modeling/representation. Results: A novel, fast HMM-with-Duration (HMMwD) implementation is presented, and experimental results demonstrate its performance on two-state synthetic data designed to model Nanopore Detector data. The HMMwD experimental results are compared to (i) the ideal model and (ii) the conventional HMM. Its accuracy is clearly an improvement over the standard HMM, and matches that of the ideal solution in many cases where the standard HMM does not. Computationally, the new HMMwD has all the speed advantages of the conventional (simpler) HMM implementation. In preliminary work shown here, HMM feature extraction is then used to establish the first pattern recognition-informed (PRI) sampling control of a Nanopore Detector device (on a "live" data stream). Conclusion: The improved accuracy of the new HMMwD implementation, at the same order of computational cost as the standard HMM, is an important augmentation for applications in gene structure identification and channel current analysis, especially PRI sampling control, where speed is essential. The PRI experiment was designed to inherit the high accuracy of the well-characterized and distinctive blockades of the DNA hairpin molecules used as controls (or blockade "test-probes"). For this test set, the accuracy inherited is 99.9%.
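The statement that a regular HMM "imposes a best-fit geometric distribution" on state durations can be illustrated with a small, hedged sketch. It assumes durations are counted in samples (d = 1, 2, ...) and that the state has a single self-transition probability a, so that P(d) = a^(d-1) (1 - a), with maximum-likelihood estimate a = 1 - 1/mean(d). The dwell times and function names below are invented for illustration and do not come from the paper, and this is not the HMMwD algorithm itself.

```python
def best_fit_self_transition(durations):
    """Maximum-likelihood self-transition probability for geometric dwell times."""
    mean = sum(durations) / len(durations)
    return 1.0 - 1.0 / mean


def geometric_duration_pmf(a, d):
    """Probability that a standard HMM state with self-transition a lasts d samples."""
    return a ** (d - 1) * (1.0 - a)


def empirical_pmf(durations, d):
    """Fraction of observed dwell times equal to d."""
    return durations.count(d) / len(durations)


# Hypothetical dwell times clustered around 5 samples (clearly not geometric).
durations = [5, 5, 5, 5, 6, 4]
a = best_fit_self_transition(durations)

# Compare what the standard HMM would assume with what was observed.
model_at_1 = geometric_duration_pmf(a, 1)
model_at_5 = geometric_duration_pmf(a, 5)
observed_at_1 = empirical_pmf(durations, 1)
observed_at_5 = empirical_pmf(durations, 5)
```

With these toy values the fitted geometric model puts its largest probability on d = 1, a duration that never occurs in the data, and gives d = 5 much less weight than the observed frequency. An explicit duration model avoids this mismatch by representing the dwell-time distribution directly.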

    Cheminformatics methods for novel nanopore analysis of HIV DNA termini


    Data mining on biological communities (Minería de datos sobre comunidades biológicas)

    Scientific and technological practice often brings together concepts originating in different disciplines to develop profiles and potential uses that acquire a certain conceptual unity and independence. Such is the case of data mining, which, starting from database technology, gradually incorporated ideas from artificial intelligence and statistics to classify and/or predict results across a wide variety of systems. The research project presented here studies bioinformatics techniques applied to soil microbial communities. These methods aim to classify the organisms that form part of the environment and to predict their diversity. The analysis starts from the computational representation of the DNA that encodes genetic information and establishes, with data obtained from samples, the properties of the set of microorganisms that make up the community. This type of study, called metagenomics, allows the different types of organisms to be grouped into clusters that represent a taxonomic category such as species, genus, or family. From these groupings it is also possible to estimate biodiversity, providing information on the potential and richness of the soil. The research project has two objectives: first, to establish a Markovian bioinformatics model for comparing DNA sequences for classification purposes, and second, to present a critical analysis of the data mining procedures applied to evaluating richness in different ecosystems. Track: Databases and data mining. Red de Universidades con Carreras en Informática (RedUNCI)