Abstract

Background: Assessing the outcome of motif mining algorithms is an essential task, as the number of reported motifs can be very large. Significance measures play a central role in automatically ranking those motifs, thereby reducing the analysis work. Spotting the most interesting and relevant motifs then depends on choosing the right measures. The combined use of several measures may provide more robust results. However, caution must be taken to avoid spurious evaluations.

Results: The experiments showed that several of the selected significance measures behave very similarly across a wide range of situations and therefore provide redundant information. Some measures proved more appropriate for ranking highly conserved motifs, while others are more appropriate for weakly conserved ones. Support appears to be a very important feature to consider for correct motif ranking. We observed that not all measures are suitable for situations with poorly balanced class information, for instance when positive data is far scarcer than negative data. Finally, a visualization scheme was proposed that, when several measures are applied, enables easy identification of high-scoring motifs.

Conclusion: In this work we have surveyed and categorized 14 significance measures for pattern evaluation. Their ability to rank three types of deterministic motifs was evaluated. The measures were applied under different testing conditions, and relations among them were identified. This study provides pertinent insights into choosing the right set of significance measures for evaluating deterministic motifs extracted from protein databases.
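The remarks on support and on poorly balanced class information can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not name the 14 measures, so accuracy and the Matthews correlation coefficient are used only as stand-ins, support is taken to be the fraction of all sequences that contain the motif, the motif is a hypothetical regular expression, and the sequences are toy data.

```python
import re
from math import sqrt


def motif_counts(pattern, positives, negatives):
    """Return (tp, fp, fn, tn) for a regex motif over two sequence sets."""
    rx = re.compile(pattern)
    tp = sum(1 for s in positives if rx.search(s))
    fp = sum(1 for s in negatives if rx.search(s))
    fn = len(positives) - tp
    tn = len(negatives) - fp
    return tp, fp, fn, tn


def support(tp, fp, fn, tn):
    """Assumed definition: fraction of all sequences containing the motif."""
    return (tp + fp) / (tp + fp + fn + tn)


def accuracy(tp, fp, fn, tn):
    """Fraction of sequences classified correctly by motif presence."""
    return (tp + tn) / (tp + fp + fn + tn)


def matthews(tp, fp, fn, tn):
    """Matthews correlation coefficient; 0.0 when a marginal is empty."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Toy, heavily imbalanced data: 2 positive and 98 negative sequences.
positives = ["MKCADLL", "GGCTELV"]
negatives = ["MKLLVAG"] * 98

# Hypothetical motif C-x-[DE], written as a regular expression.
good = motif_counts(r"C.[DE]", positives, negatives)
# A motif that occurs nowhere in the data.
absent = motif_counts(r"WWW", positives, negatives)

for name, counts in (("C.[DE]", good), ("WWW", absent)):
    print(name, support(*counts), accuracy(*counts), matthews(*counts))
```

In this toy setting the absent motif still reaches an accuracy of 0.98 because negatives dominate, whereas its support and Matthews coefficient are both zero. That is one way a single measure can give a spurious evaluation when classes are poorly balanced.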

Paulo J Azevedo

Pedro Gabriel Ferreira

English

PubMed

Springer - Publisher Connector

Evaluating deterministic motif significance measures in protein databases

Crossref

Directory of Open Access Journals

Algorithms for Molecular Biology

http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2254621
