192 research outputs found

    Identifying statistical dependence in genomic sequences via mutual information estimates

    Questions of understanding and quantifying the representation and amount of information in organisms have become a central part of biological research, as they potentially hold the key to fundamental advances. In this paper, we demonstrate the use of information-theoretic tools for the task of identifying segments of biomolecules (DNA or RNA) that are statistically correlated. We develop a precise and reliable methodology, based on the notion of mutual information, for finding and extracting statistical as well as structural dependencies. A simple threshold function is defined, and its use in quantifying the level of significance of dependencies between biological segments is explored. These tools are used in two specific applications. First, we identify correlations between different parts of the maize zmSRp32 gene, where we find significant dependencies between the 5' untranslated region in zmSRp32 and its alternatively spliced exons. This observation may indicate the presence of as-yet unknown alternative splicing mechanisms or structural scaffolds. Second, using data from the FBI's Combined DNA Index System (CODIS), we demonstrate that our approach is particularly well suited for the problem of discovering short tandem repeats, an application of importance in genetic profiling. Comment: Preliminary version. Final version in EURASIP Journal on Bioinformatics and Systems Biology. See http://www.hindawi.com/journals/bsb
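As a rough illustration of the quantity this abstract builds on, the sketch below computes a plug-in (empirical frequency) estimate of the mutual information between two paired symbol sequences. The function name `mutual_information` and the pairing of symbols by position are assumptions made for this example; the abstract does not specify the paper's estimator or its threshold function, and neither is reproduced here.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from paired observations.

    xs and ys are equal-length sequences (e.g. strings over A, C, G, T);
    position i of xs is paired with position i of ys. This is a
    hypothetical helper, not the estimator from the paper.
    """
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of (x, y) pairs
    px = Counter(xs)              # marginal counts of x
    py = Counter(ys)              # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * log2(c * n / (px[x] * py[y]))
    return mi


if __name__ == "__main__":
    print(mutual_information("ACGT" * 5, "ACGT" * 5))  # fully dependent
    print(mutual_information("AACC", "ACAC"))          # empirically independent
```

Plug-in estimates of this kind are biased upward on short sequences, which is one reason a significance threshold is needed before a dependency is reported.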

    Functional classification of G-Protein coupled receptors, based on their specific ligand coupling patterns

    Functional identification of G-Protein Coupled Receptors (GPCRs) is one of the current focus areas of pharmaceutical research. Although thousands of GPCR sequences are known, many of them remain orphan sequences (the activating ligand is unknown). Therefore, classification methods for automated characterization of orphan GPCRs are imperative. In this study, for predicting Level 2 subfamilies of Amine GPCRs, a novel method for obtaining fixed-length feature vectors, based on the existence of activating-ligand-specific patterns, has been developed and utilized for a Support Vector Machine (SVM)-based classification. Exploiting the fact that there is a non-promiscuous relationship between the specific binding of GPCRs to their ligands and their functional classification, our method classifies Level 2 subfamilies of Amine GPCRs with a high predictive accuracy of 97.02% in a ten-fold cross-validation test. The presented machine learning approach bridges the gulf between the excess amount of GPCR sequence data and their poor functional characterization.

    Parallel Pattern Discovery

    An interesting research problem in dataset analysis is the discovery of patterns. Patterns can show how the dataset was formed and how it repeats itself. Due to the fast growth of data collection, there is a need for algorithms that can scale with the data. In this thesis we examine how to take an existing algorithm and make it parallel using three ideas: generalization, decomposition and reification of the existing algorithm. We apply these ideas to SPEXS, a pattern discovery algorithm, and derive a new parallel algorithm, SPEXS2, which we also implement. We also analyze several problems that arise when implementing a generic algorithm. The ideas described could be used to generalize and parallelize other algorithms as well.

    Fast frequent pattern mining.

    Yabo Xu. Thesis (M.Phil.), Chinese University of Hong Kong, 2003. Includes bibliographical references (leaves 57-60). Abstracts in English and Chinese.

    Contents:
    Abstract
    Acknowledgement
    Chapter 1: Introduction
        1.1 Frequent Pattern Mining
        1.2 Biosequence Pattern Mining
        1.3 Organization of the Thesis
    Chapter 2: PP-Mine: Fast Mining Frequent Patterns In-Memory
        2.1 Background
        2.2 The Overview
        2.3 PP-tree Representations and Its Construction
        2.4 PP-Mine
        2.5 Discussions
        2.6 Performance Study
    Chapter 3: Fast Biosequence Patterns Mining
        3.1 Background
            3.1.1 Differences in Biosequences
            3.1.2 Mining Sequential Patterns
            3.1.3 Mining Long Patterns
            3.1.4 Related Works in Bioinformatics
        3.2 The Overview
            3.2.1 The Problem
            3.2.2 The Overview of Our Approach
        3.3 The Segment Phase
            3.3.1 Finding Frequent Segments
            3.3.2 The Index-based Querying
            3.3.3 The Compression-based Querying
        3.4 The Pattern Phase
            3.4.1 The Pruning Strategies
            3.4.2 The Querying Strategies
        3.5 Experiment
            3.5.1 Synthetic Data Sets
            3.5.2 Biological Data Sets
    Chapter 4: Conclusion
    Bibliography

    String Matching with Variable Length Gaps

    We consider string matching with variable length gaps. Given a string $T$ and a pattern $P$ consisting of strings separated by variable length gaps (arbitrary strings of length in a specified range), the problem is to find all ending positions of substrings in $T$ that match $P$. This problem is a basic primitive in computational biology applications. Let $m$ and $n$ be the lengths of $P$ and $T$, respectively, and let $k$ be the number of strings in $P$. We present a new algorithm achieving time $O(n \log k + m + \alpha)$ and space $O(m + A)$, where $A$ is the sum of the lower bounds of the lengths of the gaps in $P$ and $\alpha$ is the total number of occurrences of the strings in $P$ within $T$. Compared to the previous results this bound essentially achieves the best known time and space complexities simultaneously. Consequently, our algorithm obtains the best known bounds for almost all combinations of $m$, $n$, $k$, $A$, and $\alpha$. Our algorithm is surprisingly simple and straightforward to implement. We also present algorithms for finding and encoding the positions of all strings in $P$ for every match of the pattern. Comment: draft of full version, extended abstract at SPIRE 201
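To make the problem statement concrete, here is a minimal brute-force sketch of matching with variable length gaps. It is a naive baseline written for illustration only and does not implement the algorithm from the abstract or achieve its time and space bounds. The function name `vlg_match_ends`, the representation of the pattern as a list of strings plus a list of `(lo, hi)` gap ranges, and the choice to report exclusive end indices are all assumptions of this example.

```python
def vlg_match_ends(text, parts, gaps):
    """Return sorted end positions (exclusive) in text of matches of a
    variable-length-gap pattern.

    parts: the k strings of the pattern, in order.
    gaps:  k-1 (lo, hi) pairs; gap i sits between parts[i] and parts[i+1]
           and may be any string whose length is in [lo, hi].
    Naive baseline, not the algorithm described in the abstract.
    """
    if not parts or len(gaps) != len(parts) - 1:
        raise ValueError("need k >= 1 parts and exactly k-1 gaps")
    first = parts[0]
    # End positions of every occurrence of the first string.
    ends = {
        i + len(first)
        for i in range(len(text) - len(first) + 1)
        if text.startswith(first, i)
    }
    for part, (lo, hi) in zip(parts[1:], gaps):
        next_ends = set()
        last_start = len(text) - len(part)
        for e in ends:
            # The next string may start lo..hi characters after position e.
            for start in range(e + lo, min(e + hi, last_start) + 1):
                if text.startswith(part, start):
                    next_ends.add(start + len(part))
        ends = next_ends
    return sorted(ends)


if __name__ == "__main__":
    print(vlg_match_ends("abxxcd", ["ab", "cd"], [(1, 3)]))
    print(vlg_match_ends("abxxcd", ["ab", "cd"], [(0, 1)]))
```

Storing only the set of reachable end positions after each string keeps the sketch short, at the cost of rescanning the text for every candidate start.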