    Automated DNA Motif Discovery

    Ensembl's human non-coding and protein-coding genes are used to automatically find DNA pattern motifs. Genetic programming uses the Backus-Naur form (BNF) grammar for regular expressions (RE) to ensure that the generated strings are legal. The evolved motif suggests that the presence of thymine followed by one or more adenines, etc., early in transcripts indicates a non-protein-coding gene. Keywords: pseudogene, short and microRNAs, non-coding transcripts, systems biology, machine learning, bioinformatics, motif, regular expression, strongly typed genetic programming, context-free grammar. Comment: 12 pages, 2 figures
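
As an illustration of the kind of pattern this abstract describes, the sketch below tests whether a thymine followed by one or more adenines occurs near the start of a sequence. The regular expression `TA+`, the 20-base window, and the function name `has_early_ta_motif` are assumptions made for this example; the abstract does not give the evolved motif or its position constraint in full.

```python
import re

# Hypothetical stand-in for the evolved motif: T followed by one or more A.
EARLY_MOTIF = re.compile(r"TA+")


def has_early_ta_motif(transcript: str, window: int = 20) -> bool:
    """Return True if the pattern occurs entirely within the first `window` bases.

    The window size is an assumed parameter, not a value from the paper.
    """
    prefix = transcript[:window].upper()
    return EARLY_MOTIF.search(prefix) is not None
```

A call such as `has_early_ta_motif("GGCTAAAGC")` checks only the leading bases, so a match further downstream in the transcript is ignored.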

    The EM Algorithm and the Rise of Computational Biology

    In the past decade, computational biology has grown from a cottage industry with a handful of researchers into an attractive interdisciplinary field, catching the attention and imagination of many quantitatively minded scientists. Of interest to us is the key role played by the EM algorithm during this transformation. We survey the use of the EM algorithm in a few important computational biology problems surrounding the "central dogma" of molecular biology: from DNA to RNA and then to proteins. Topics of this article include sequence motif discovery, protein sequence alignment, population genetics, evolutionary models, and mRNA expression microarray data analysis. Comment: Published at http://dx.doi.org/10.1214/09-STS312 in Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org)

    Identification of motifs in biological sequences using genetic programming

    Current motif discovery tools search for patterns that are over-represented in DNA sequences but do not use DNA curvature or the cofactors associated with protein binding. We developed a tool that searches for motifs with a variable gap between patterns. The search uses a genetic programming algorithm that proposes candidate motif models and tries to fit them to a set of positive sequences containing the motif, against a control dataset. To evaluate the fitness of the organisms, we created an energy model for each component of the regulated bacterial promoters. The final genetic algorithm is able to find hidden motifs in both synthetic and real biological sequences.
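
The idea of a motif with a variable gap, scored on positive sequences against a control set, can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names (`gapped_motif`, `hit_fraction`, `simple_fitness`) are hypothetical, and the score used here (difference in hit fractions) is a simple stand-in, not the energy model described in the abstract.

```python
import re


def gapped_motif(left: str, right: str, min_gap: int, max_gap: int) -> re.Pattern:
    """Build a regex for two fixed boxes separated by a spacer of variable length."""
    # Any base may fill the spacer; its length ranges over [min_gap, max_gap].
    return re.compile(
        f"{re.escape(left)}[ACGT]{{{min_gap},{max_gap}}}{re.escape(right)}"
    )


def hit_fraction(pattern: re.Pattern, sequences: list[str]) -> float:
    """Fraction of sequences that contain at least one match."""
    if not sequences:
        return 0.0
    return sum(1 for s in sequences if pattern.search(s)) / len(sequences)


def simple_fitness(pattern: re.Pattern, positives: list[str], controls: list[str]) -> float:
    """Assumed stand-in score: hit rate on positives minus hit rate on controls."""
    return hit_fraction(pattern, positives) - hit_fraction(pattern, controls)
```

In a genetic programming loop, a score of this kind would rank candidate models, so that models matching many positive sequences and few control sequences are favoured.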

    Some results on more flexible versions of Graph Motif

    The problems studied in this paper originate from Graph Motif, a problem introduced in 2006 in the context of biological networks. Informally speaking, it consists of deciding whether a multiset of colors occurs in a connected subgraph of a vertex-colored graph. Because of the high rate of noise in biological data, more flexible definitions of the problem have been outlined. In this paper we present two inapproximability results for two different optimization variants of Graph Motif: one where the size of the solution is maximized, the other where the number of color substitutions needed to obtain the motif from the solution is minimized. We also study a decision version of Graph Motif in which the connectivity constraint is replaced by the well-known notion of graph modularity. While the problem remains NP-complete, it admits FPT algorithms for biologically relevant parameterizations
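
The basic decision problem stated informally above can be made concrete with a brute-force sketch. It enumerates every vertex subset of the right size, so it takes exponential time and is meant only to pin down the definition; it is not one of the paper's algorithms. The function names and the adjacency-dictionary representation are assumptions for this example.

```python
from collections import Counter
from itertools import combinations


def is_connected(vertices, adjacency):
    """Check that `vertices` induce a connected subgraph (depth-first search)."""
    vertices = set(vertices)
    start = next(iter(vertices))
    seen = {start}
    stack = [start]
    while stack:
        v = stack.pop()
        for w in adjacency.get(v, ()):
            if w in vertices and w not in seen:
                seen.add(w)
                stack.append(w)
    return seen == vertices


def graph_motif(adjacency, color, motif):
    """Return a vertex set whose colors equal the multiset `motif` and which
    induces a connected subgraph, or None if no such set exists."""
    target = Counter(motif)
    k = sum(target.values())
    if k == 0:
        return None  # an empty motif is treated as having no occurrence here
    for subset in combinations(sorted(color), k):
        same_colors = Counter(color[v] for v in subset) == target
        if same_colors and is_connected(subset, adjacency):
            return set(subset)
    return None
```

For example, on the path a-b-c with colors red, blue, red, the motif {red, blue} occurs (vertices a and b), whereas {red, red} does not, because the two red vertices are not adjacent.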

    Motif kernel generated by genetic programming improves remote homology and fold detection

    BACKGROUND: Protein remote homology detection is a central problem in computational biology. Most recent methods train support vector machines to discriminate between related and unrelated sequences, and these studies have introduced several types of kernels. One successful approach is to base a kernel on shared occurrences of discrete sequence motifs. Still, many protein sequences fail to be classified correctly for lack of a suitable set of motifs for these sequences. RESULTS: We introduce the GPkernel, a motif kernel based on discrete sequence motifs that are evolved using genetic programming. All proteins can be grouped according to evolutionary relations and structure, and the method uses this inherent structure to create groups of motifs that discriminate between different families of evolutionary origin. When tested on two SCOP benchmarks, the superfamily and fold recognition problems, the GPkernel gives significantly better results than related methods of remote homology detection. CONCLUSION: The GPkernel gives particularly good results on the more difficult fold recognition problem compared to the other methods. This is mainly because the method creates motif sets that describe similarities among subgroups of both the related and unrelated proteins. This rich set of motifs gives a better description of the similarities and differences between different folds than previous motif-based methods do
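
A kernel based on shared occurrences of discrete motifs can be sketched generically as follows. This is not the GPkernel itself, whose construction the abstract does not specify. It assumes binary occurrence features and a plain dot product, and the example motifs are hypothetical regular expressions rather than evolved ones.

```python
import re


def motif_features(sequence, motifs):
    """Binary feature vector: 1 if the motif occurs in the sequence, else 0."""
    return [1 if m.search(sequence) else 0 for m in motifs]


def motif_kernel(seq_a, seq_b, motifs):
    """Dot product of the two feature vectors, i.e. the number of shared motifs."""
    fa = motif_features(seq_a, motifs)
    fb = motif_features(seq_b, motifs)
    return sum(x * y for x, y in zip(fa, fb))


# Hypothetical motif set, for illustration only.
EXAMPLE_MOTIFS = [re.compile("C..C"), re.compile("G[ST]G"), re.compile("KK")]
```

Because the kernel is a dot product of explicit feature vectors, it is positive semidefinite and could be passed to a support vector machine as a precomputed kernel matrix.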

    Cooperative Metaheuristics for Exploring Proteomic Data

    Most combinatorial optimization problems cannot be solved exactly. A class of methods, called metaheuristics, has proved its efficiency in giving good approximate solutions in a reasonable time. Cooperative metaheuristics are a subset of metaheuristics in which several entities explore the search space in parallel and exchange information with one another. The importance of information exchange in the optimization process is related to the building block hypothesis of evolutionary algorithms, which is based on two questions: what is the pertinent information of a given potential solution, and how can this information be shared? A classification of cooperative metaheuristic methods according to the nature of the cooperation involved is presented, and the specific properties of each class, as well as a way to combine them, are discussed. Several improvements in the field of metaheuristics are also given. In particular, a method to regulate the use of classical genetic operators and to define new, more pertinent ones is proposed, taking advantage of a building-block-structured representation of the explored space. A hierarchical approach resting on multiple levels of cooperative metaheuristics is finally presented, leading to the definition of a complete concerted cooperation strategy. Some applications of these concepts to difficult proteomics problems, including automatic protein identification, biological motif inference and multiple sequence alignment, are presented. For each application, an innovative method based on the cooperation concept is given and compared with classical approaches. In the protein identification problem, a first level of cooperation using swarm intelligence is applied to the comparison of mass spectrometric data with biological sequence databases, followed by a genetic programming method to discover an optimal scoring function. The multiple sequence alignment problem is decomposed into three steps involving several evolutionary processes to infer different kinds of biological motifs and a concerted cooperation strategy to build the sequence alignment according to their motif content