    Motif Discovery with Compact Approaches - Design and Applications

    In the post-genomic era, the ability to predict the behavior, function, or structure of biological entities, as well as the interactions among them, plays a fundamental role in the discovery of information that helps biologists explain biological mechanisms. In this context, appropriate characterization of the structures under analysis, and the exploitation of combinatorial properties of sequences, are crucial steps towards the development of efficient algorithms and data structures for the analysis of biological sequences. Similarity is a fundamental concept in biology. Several functional and structural properties, and evolutionary mechanisms, can be predicted by comparing new elements with already classified elements, or by comparing elements with a similar structure or function to infer the common mechanism that underlies the observed similar behavior. Such elements are commonly called motifs. Comparison-based methods for sequence analysis find their application in several biological contexts, such as the identification of transcription factor binding sites, the search for structural and functional similarities in proteins, and phylogeny. Therefore, the development of adequate methodologies for motif discovery is of paramount interest to several fields in computational biology. In motif discovery in biosequences, it is common to assume that statistically significant candidates are those that are likely to hide some biologically significant property. For this purpose, all the possible candidates are ranked according to some statistics on words (frequency, over/under-representation, etc.). They are then presented in output for further inspection by a biologist, who identifies the most promising subsequences and tests them in the laboratory to confirm their biological significance. 
Therefore, when designing algorithms for motif discovery, besides the obvious aim of time and space efficiency, particular attention should be devoted to the output representation. In fact, even for strings of fixed length, the size of the candidate set becomes exponential if exhaustive enumeration is applied. This is already true when only exact matches are considered as candidate occurrences, and it worsens if some kind of variability is allowed (for example, a fixed number of mismatches). Alternatively, heuristics could be used, but without the guarantee of finding the optimal solution. The computational power of modern computers can partially reduce these effects, in particular for short candidates. However, if the output is too large to be analyzed by human inspection, the risk is that biologists are provided with very fast but useless tools. A possible solution relies on compact approaches. Compact approaches are based on the partition of the search space into classes. The classes must be designed in such a way that the score used to rank the candidates has a monotone behavior within each class. This allows the identification of a representative of each class, which is the element with the highest score. Consequently, it suffices to compute, and report in output, the score only for the representatives. In fact, we are guaranteed that for each element that has not been ranked there is another one (the representative of the class it belongs to) that is at least equally significant. The final user can then be presented with an output that has the size of the partition, rather than the size of the candidate space, with obvious advantages for the human-based analysis that follows the computer-based filtering of the pattern discovery algorithm. Compact approaches find applications in both searching and discovery frameworks. 
They can also be applied to several motif models: exact patterns, patterns with a given mismatch distribution, patterns with an unknown mismatch distribution, and profiles (i.e., matrices), under both i.i.d. and Markov distributions. The purpose of this chapter is to describe the basis of compact approaches and to provide readers with the conceptual tools for applying compact approaches to the design of their own algorithms for biosequence analysis. Moreover, examples of compact approaches that have been successfully developed for several motif models (e.g., exact words, co-occurrences, words with mismatches, etc.) will be explained, and experimental results that illustrate their power will be presented.
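The partition-and-representative idea described above can be illustrated with a minimal sketch for exact words. Everything in it is an assumption made for illustration and is not taken from the chapter: the function names (`occurrences`, `score`, `compact_ranking`), the brute-force enumeration, the uniform i.i.d. background over a four-letter alphabet, and the score "observed count minus expected count". Words are grouped by their set of start positions; within such a class the count is constant and the expected count shrinks as the word gets longer, so this score is monotone and the longest word is the representative.

```python
from collections import defaultdict


def occurrences(text, max_len):
    """Map every substring of length 1..max_len to the tuple of its start positions."""
    occ = defaultdict(list)
    n = len(text)
    for i in range(n):
        for m in range(1, min(max_len, n - i) + 1):
            occ[text[i:i + m]].append(i)
    return {word: tuple(pos) for word, pos in occ.items()}


def score(word, count, n, alphabet_size=4):
    """Observed count minus expected count under a uniform i.i.d. model (assumed score)."""
    expected = (n - len(word) + 1) * (1.0 / alphabet_size) ** len(word)
    return count - expected


def compact_ranking(text, max_len):
    """Score one representative per class instead of every candidate word."""
    n = len(text)
    # Partition: words with identical start-position sets fall into the same class.
    classes = defaultdict(list)
    for word, positions in occurrences(text, max_len).items():
        classes[positions].append(word)
    reps = []
    for positions, words in classes.items():
        # The count is constant in the class and the expectation decreases with
        # length, so the longest word has the highest score in its class.
        rep = max(words, key=len)
        reps.append((rep, len(positions), score(rep, len(positions), n), len(words)))
    # Rank representatives by decreasing score.
    reps.sort(key=lambda r: r[2], reverse=True)
    return reps


if __name__ == "__main__":
    for rep, count, s, class_size in compact_ranking("ACGACGT", 3):
        print(rep, count, round(s, 3), class_size)
```

On the toy input the table of representatives is shorter than the list of distinct candidate words, which is the effect the compact approach aims for. The brute-force enumeration is only for readability; it is not meant to reflect the efficient data structures the chapter refers to.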

    Bases of motifs for generating repeated patterns with wild cards

    Motif inference represents one of the most important areas of research in computational biology, and one of its oldest. Despite this, the problem remains very much open in the sense that no existing definition is fully satisfying, either in formal terms, or in relation to the biological questions that involve finding such motifs. Two main types of motifs have been considered in the literature: matrices (of letter frequency per position in the motif) and patterns. There is no conclusive evidence in favor of either, and recent work has attempted to integrate the two types into a single model. In this paper, we address the formal issue in relation to motifs as patterns. This is essential to reach a better understanding of motifs in general. In particular, we consider a promising idea that was recently proposed, which attempts to avoid the combinatorial explosion in the number of motifs by means of a generator set for the motifs. Instead of exhibiting a complete list of motifs satisfying some input constraints, what is produced is a basis of such motifs from which all the other ones can be generated. We study the computational cost of determining such a basis of repeated motifs with wild cards in a sequence. We give new upper and lower bounds on this cost, introducing a notion of basis that is provably contained in (and, thus, smaller than) previously defined ones. Our basis can be computed in less time and space, and is still able to generate the same set of motifs. We also prove that the number of motifs in all bases defined so far grows exponentially with the quorum, that is, with the minimal number of times a motif must appear in a sequence, something unnoticed in previous work. We show that there is no hope of efficiently computing such bases unless the quorum is fixed.
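To make the terms "wild card" and "quorum" concrete, here is a minimal sketch that only checks whether a single pattern meets a quorum in a sequence. It does not compute a basis and is not the algorithm studied in the paper. The function names and the choice of `.` as the wild card symbol are assumptions made for illustration.

```python
def motif_occurrences(motif, sequence, wildcard="."):
    """Return the start positions where the motif matches the sequence.

    The wildcard symbol matches any single character; all other symbols
    must match exactly.
    """
    m = len(motif)
    return [
        i
        for i in range(len(sequence) - m + 1)
        if all(c == wildcard or c == sequence[i + j] for j, c in enumerate(motif))
    ]


def satisfies_quorum(motif, sequence, quorum, wildcard="."):
    """True if the motif occurs at least `quorum` times in the sequence."""
    return len(motif_occurrences(motif, sequence, wildcard)) >= quorum


if __name__ == "__main__":
    seq = "ACAGACATA"
    print(motif_occurrences("A.A", seq))    # occurrences of the pattern with a wild card
    print(satisfies_quorum("A.A", seq, 3))  # quorum met thanks to the wild card
    print(satisfies_quorum("ACA", seq, 3))  # the solid pattern occurs fewer times
```

In this toy sequence, replacing a solid character by a wild card lets the pattern reach a quorum that the fully specified pattern misses, which hints at why the number of candidate motifs grows so quickly.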

    Advanced Text Searching of Electronic Information Related to Forensic Discovery

    The Federal Rules of Civil Procedure regarding production of electronic evidence, together with court rulings and penalties, have highlighted the need for timely and accurate production of electronically stored responsive evidence. Key criteria for meeting the legal requirements include the cost of production, the identification of responsive information, and the identification of privileged information within the responsive information. Currently, the two primary methods of compliance are manual review of the documents and electronic Boolean text searches. Text searching technology has been studied for over fifty years, generating thousands of documents and books for a literature review. The focus of the literature includes accuracy of searching, optimization of searching, and completeness of searching. Some of the literature is based on a specific field of interest such as library cards or patent filings, but most is either generic or relates to peer-to-peer searching or Internet searching. The documents related to the field of electronic evidence are very limited in number and present no new search techniques directly. We identified and classified the search techniques from the literature study after considering their applicability to electronic evidence. Using electronic evidence from actual litigation cases, the techniques were implemented to determine how thoroughly each one identified the documents in the population and the related costs (time) required to identify such documents. The results from the various techniques were compared, along with the costs, to identify the "best" text searching method. Based on the results, we recommend implementing a combination of the techniques to allow responsiveness to different requirements based on the legal circumstances.