414 research outputs found

    Use of a novel grammatical inference approach in classification of amyloidogenic hexapeptides

    This paper contributes to bioinformatics by applying grammatical inference to data analysis. We developed an algorithm for generating star-free regular expressions, which turned out to be good recommendation tools: they show a relatively high correlation coefficient between the observed and predicted binary classifications. Experiments were performed on three datasets of amyloidogenic hexapeptides, and our results are compared with those obtained using graph approaches, the current state-of-the-art methods in heuristic automata induction, and the support vector machine. The results showed the superior performance of the new grammatical inference algorithm on fixed-length amyloid datasets.
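A correlation coefficient between observed and predicted binary classifications is commonly computed as the Matthews correlation coefficient (MCC). The abstract does not name the exact measure, so the sketch below assumes MCC, and the function name and 0/1 label encoding are illustrative only.

```python
from math import sqrt


def matthews_corrcoef(y_true, y_pred):
    """Return the MCC for two equal-length sequences of 0/1 labels.

    Assumption: 1 marks the positive class (e.g. amyloidogenic), 0 the negative.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have the same length")

    # Count the four cells of the confusion matrix.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0.0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

A value of 1.0 means perfect agreement, -1.0 total disagreement, and 0.0 no better than chance.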

    Learning the Language of Biological Sequences

    Learning the language of biological sequences is an appealing challenge for the grammatical inference research field. While some first successes have already been recorded, such as the inference of profile hidden Markov models or stochastic context-free grammars, which are now part of the classical bioinformatics toolbox, it is still a source of open and inspiring problems for grammatical inference, enabling us to confront our ideas with real fundamental applications. As an introduction to this field, we survey here the main ideas and concepts behind the approaches developed in pattern/motif discovery and grammatical inference to characterize successfully the biological sequences with their specificities.

    Simple and Efficient Local Codes for Distributed Stable Network Construction

    In this work, we study protocols so that populations of distributed processes can construct networks. In order to highlight the basic principles of distributed network construction, we keep the model minimal in all respects. In particular, we assume finite-state processes that all begin from the same initial state and all execute the same protocol (i.e. the system is homogeneous). Moreover, we assume pairwise interactions between the processes that are scheduled by an adversary. The only constraint on the adversary scheduler is that it must be fair. In order to allow processes to construct networks, we let them activate and deactivate their pairwise connections. When two processes interact, the protocol takes as input the states of the processes and the state of their connection and updates all of them. Initially all connections are inactive and the goal is for the processes, after interacting and activating/deactivating connections for a while, to end up with a desired stable network. We give protocols (optimal in some cases) and lower bounds for several basic network construction problems such as spanning line, spanning ring, spanning star, and regular network. We provide proofs of correctness for all of our protocols and analyze the expected time to convergence of most of them under a uniform random scheduler that selects the next pair of interacting processes uniformly at random from all such pairs. Finally, we prove several universality results by presenting generic protocols that are capable of simulating a Turing Machine (TM) and exploiting it in order to construct a large class of networks.
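The model described above can be sketched as a small simulation. The rule set below (states `c` for "center" and `p` for "peripheral") is an assumption made for illustration and is not claimed to be the paper's exact spanning-star protocol. The function name, the seed, and the step limit are likewise hypothetical.

```python
import random
from itertools import combinations


def simulate_spanning_star(n, seed=0, max_steps=1_000_000):
    """Run an assumed two-state protocol under a uniform random scheduler.

    Assumed rules (process states plus the state of their connection):
      (c, c, inactive) -> (c, p, active)    one center survives each meeting
      (p, p, active)   -> (p, p, inactive)  peripherals drop mutual links
      (c, p, inactive) -> (c, p, active)    a center attaches to a peripheral
    Returns (states, active_edges, number_of_interactions).
    """
    rng = random.Random(seed)
    state = ["c"] * n                  # homogeneous start: all processes equal
    active = set()                     # all connections initially inactive
    pairs = list(combinations(range(n), 2))

    def is_star():
        centers = [i for i in range(n) if state[i] == "c"]
        if len(centers) != 1:
            return False
        c = centers[0]
        expected = {frozenset((c, v)) for v in range(n) if v != c}
        return active == expected

    for step in range(max_steps):
        if is_star():
            return state, active, step
        u, v = rng.choice(pairs)       # uniform random scheduler
        edge = frozenset((u, v))
        if state[u] == "c" and state[v] == "c":
            state[v] = "p"
            active.add(edge)
        elif state[u] == "p" and state[v] == "p":
            active.discard(edge)
        else:                          # exactly one of the two is a center
            active.add(edge)
    raise RuntimeError("did not stabilize within max_steps")
```

Once a single center remains with active links to every peripheral and no peripheral-to-peripheral links, none of the assumed rules changes anything, so the network is stable.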

    Path-equivalent developments in acyclic weighted automata

    Weighted finite automata (WFA) are used with FPGA accelerating hardware to scan large genomic banks. Hardwiring such automata raises surface area and clock frequency constraints, requiring efficient ε-transitions-removal techniques. In this paper, we present bounds on the number of new transitions for the development of acyclic WFA, which is a special case of the ε-transitions-removal problem. We introduce a new problem, a partial removal of ε-transitions while accepting short chains of ε-transitions.

    Comparison of two recent algorithms for grammatical inference of regular languages using nondeterministic automata

    Developing new algorithms that are both convergent and efficient is a necessary step toward making profitable use of grammatical inference on real, larger-scale problems. This work presents two algorithms, DeLeTe2 and MRIA, that perform grammatical inference by means of nondeterministic automata, in contrast to the most commonly used algorithms, which rely on deterministic automata. The advantages and disadvantages of this change in representation model are considered through a detailed description and comparison of the two inference algorithms with respect to their implementation approach, computational complexity, termination criteria, and performance on a corpus of synthetic data.

    Probabilistic grammatical model of protein language and its application to helix-helix contact site classification

    BACKGROUND: Hidden Markov Models power many state-of-the-art tools in the field of protein bioinformatics. While excelling in their tasks, these methods of protein analysis do not directly convey information on medium- and long-range residue-residue interactions. This requires an expressive power of at least context-free grammars. However, application of more powerful grammar formalisms to protein analysis has been surprisingly limited. RESULTS: In this work, we present a probabilistic grammatical framework for problem-specific protein languages and apply it to classification of transmembrane helix-helix pair configurations. The core of the model consists of a probabilistic context-free grammar, automatically inferred by a genetic algorithm from only a generic set of expert-based rules and positive training samples. The model was applied to produce sequence-based descriptors of four classes of transmembrane helix-helix contact site configurations. The highest performance of the classifiers reached AUCROC of 0.70. The analysis of grammar parse trees revealed the ability of representing structural features of helix-helix contact sites. CONCLUSIONS: We demonstrated that our probabilistic context-free framework for analysis of protein sequences outperforms the state of the art in the task of helix-helix contact site classification. However, this is achieved without necessarily requiring modeling of long-range dependencies between interacting residues. A significant feature of our approach is that grammar rules and parse trees are human-readable. Thus they could provide biologically meaningful information for molecular biologists.

    Workflow-based systematic design of high throughput genome annotation

    The genus Eimeria belongs to the phylum Apicomplexa, which includes many obligate intracellular protozoan parasites of man and livestock. E. tenella is one of seven species that infect the domestic chicken and cause the intestinal disease coccidiosis, which is economically important for the poultry industry. E. tenella is highly pathogenic and is often used as a model species for Eimeria biology studies. In this PhD thesis, a comprehensive annotation system named "WAGA" (Workflow-based Automatically Genome Annotation) was built and applied to the E. tenella genome. InforSense KDE and its BioSense plug-in (products of the InforSense Company) were the core software used to build the workflows. Workflows were made by integrating individual bioinformatics tools into a single platform. Each workflow was designed to provide a standalone service for a particular task. Three major workflows were developed based on the genomic resources currently available for E. tenella: EST-based gene construction, HMM-based gene prediction, and protein-based annotation. Finally, a combining workflow was built to sit above the individual ones to generate a set of automatic annotations using all of the available information. The overall system and its three major components were deployed as web servers that are fully tuneable and reusable for end users. WAGA does not require users to have programming skills or knowledge of the underlying algorithms or mechanisms of its low-level components. E. tenella was the target genome here and all the results obtained were displayed by GBrowse. A sample of the results was selected for experimental validation. For evaluation purposes, WAGA was also applied to another Apicomplexa parasite, Plasmodium falciparum, the causative agent of human malaria, which has been extensively annotated. The results obtained were compared with gene predictions of PHAT, a gene finder designed for and used in the P. falciparum genome project.

    On the Analysis of DNA Methylation

    Recent genome-wide studies lend support to the idea that the patterns of DNA methylation are in some way related, either causally or as a readout, to cell-type specific protein binding. We lay the groundwork for a framework to test whether the pattern of DNA methylation levels in a cell combined with protein binding models is sufficient to completely describe the location of the component of proteins binding to its genome in an assayed context. There is only one method, whole-genome bisulfite sequencing (WGBS), available to study DNA methylation genome-wide at such high resolution; however, its accuracy has not been determined on the scale of individual binding locations. We address this with a two-fold approach. First, we developed an alternative high-resolution, whole-genome assay, methylCRF, using a combination of an enrichment-based and a restriction-enzyme-based assay of methylation. While both assays are considered inferior to WGBS, by using two distinct assays, this method has the advantage that each assay in part cancels out the biases of the other. Additionally, this method is up to 15 times lower in cost than WGBS. By formulating the estimation of methylation from the two methods as a structured prediction problem using a conditional random field, this work will also address the general problem of incorporating data of varying qualities (a common characteristic of biological data) for the purpose of prediction. We show that methylCRF is concordant with WGBS within the range of two WGBS methylomes. Due to the lower cost, we were able to analyze methylation at high resolution across more cell types than previously possible and estimate that 28% of CpGs, in regions comprising 11% of the genome, show variable methylation and are enriched in regulatory regions.
    Secondly, we show that WGBS has inherent resolution limitations in a read-count-dependent manner and that the identification of unmethylated regions is highly affected by GC-bias in the underlying protocol, suggesting that simple estimation procedures may not be sufficient for high-resolution analysis. To address this, we propose a novel approach to DNA methylation analysis using change-point detection instead of estimating methylation level directly. However, we show that current change-point detection methods are not robust to methylation signal; we therefore explore how to extend current non-parametric methods to simultaneously find change-points as well as characteristic methylation levels. We believe this framework may have the power to examine the connection between changes in methylation and transcription factor binding in the context of cell-type specific behaviors.

    Prospex: Protocol Specification Extraction

    Protocol reverse engineering is the process of extracting application-level specifications for network protocols. Such specifications are very useful in a number of security-related contexts, for example, to perform deep packet inspection and black-box fuzzing, or to quickly understand custom botnet command and control (C&C) channels. Since manual reverse engineering is a time-consuming and tedious process, a number of systems have been proposed that aim to automate this task. These systems either analyze network traffic directly or monitor the execution of the application that receives the protocol messages. While previous systems show that precise message formats can be extracted automatically, they do not provide a protocol specification. The reason is that they do not reverse engineer the protocol state machine. In this paper, we focus on closing this gap by presenting a system that is capable of automatically inferring state machines. This greatly enhances the results of automatic protocol reverse engineering, while further reducing the need for human interaction. We extend previous work that focuses on behavior-based message format extraction, and introduce techniques for identifying and clustering different types of messages not only based on their structure, but also according to the impact of each message on server behavior. Moreover, we present an algorithm for extracting the state machine. We have applied our techniques to a number of real-world protocols, including the command and control protocol used by a malicious bot. Our results demonstrate that we are able to extract format specifications for different types of messages and meaningful protocol state machines. We use these protocol specifications to automatically generate input for a stateful fuzzer, allowing us to discover security vulnerabilities in real-world applications.