Identifying statistical dependence in genomic sequences via mutual information estimates
Questions of understanding and quantifying the representation and amount of
information in organisms have become a central part of biological research, as
they potentially hold the key to fundamental advances. In this paper, we
demonstrate the use of information-theoretic tools for the task of identifying
segments of biomolecules (DNA or RNA) that are statistically correlated. We
develop a precise and reliable methodology, based on the notion of mutual
information, for finding and extracting statistical as well as structural
dependencies. A simple threshold function is defined, and its use in
quantifying the level of significance of dependencies between biological
segments is explored. These tools are used in two specific applications. First,
we identify correlations between different parts of the maize
zmSRp32 gene. There, we find significant dependencies between the 5'
untranslated region in zmSRp32 and its alternatively spliced exons. This
observation may indicate the presence of as-yet unknown alternative splicing
mechanisms or structural scaffolds. Second, using data from the FBI's Combined
DNA Index System (CODIS), we demonstrate that our approach is particularly well
suited for the problem of discovering short tandem repeats, an application of
importance in genetic profiling.
Comment: Preliminary version. Final version in EURASIP Journal on
Bioinformatics and Systems Biology. See http://www.hindawi.com/journals/bsb
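To make the idea of a mutual information estimate concrete, here is a minimal sketch in Python. It is not the authors' method: it assumes two aligned, equal-length segments and uses the simple plug-in (empirical frequency) estimator. The function names and the `threshold` parameter are hypothetical; the abstract does not give the form of the paper's threshold function, so the cutoff is left to the caller.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from position-wise symbol pairs.

    Assumption: xs and ys are aligned, equal-length sequences (e.g. strings
    over the alphabet A, C, G, T). This is an illustrative estimator only.
    """
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of (x, y) pairs
    px = Counter(xs)              # marginal counts of x
    py = Counter(ys)              # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * log2(c * n / (px[x] * py[y]))
    return mi


def is_dependent(xs, ys, threshold):
    """Hypothetical decision rule: flag the pair if the estimate exceeds
    a caller-supplied threshold (not the paper's threshold function)."""
    return mutual_information(xs, ys) > threshold
```

Two identical segments that use four symbols uniformly give an estimate of 2 bits, while segments whose symbols pair up uniformly give 0 bits. On short segments the plug-in estimator is biased upward, which is one reason a significance threshold is needed in practice.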
Functional classification of G-Protein coupled receptors, based on their specific ligand coupling patterns
Functional identification of G-Protein Coupled Receptors (GPCRs) is one of the current focus areas of pharmaceutical research. Although thousands of GPCR sequences are known, many of them remain orphan sequences (the activating ligand is unknown). Therefore, classification methods for automated characterization of orphan GPCRs are imperative. In this study, for predicting Level 2 subfamilies of Amine GPCRs, a novel method for obtaining fixed-length feature vectors, based on the existence of activating ligand specific patterns, has been developed and utilized for a Support Vector Machine (SVM)-based classification. Exploiting the fact that there is a non-promiscuous relationship between the specific binding of GPCRs to their ligands and their functional classification, our method classifies Level 2 subfamilies of Amine GPCRs with a high predictive accuracy of 97.02% in a ten-fold cross validation test. The presented machine learning approach bridges the gulf between the excess amount of GPCR sequence data and their poor functional characterization.
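As a rough illustration of what a fixed-length feature vector based on pattern existence could look like, the following Python sketch maps a sequence to a binary vector with one entry per pattern. It is an assumption-laden simplification, not the study's procedure: the function name is hypothetical, the patterns are treated as plain substrings, and the example motifs are placeholders rather than the ligand-specific patterns used by the authors.

```python
def pattern_features(sequence, patterns):
    """Return a binary vector: entry i is 1 if patterns[i] occurs in sequence.

    Assumption: each pattern is a plain substring. The vector length equals
    len(patterns), so every sequence maps to the same number of features
    regardless of its own length.
    """
    return [1 if p in sequence else 0 for p in patterns]


# Placeholder motifs and sequence, chosen only for illustration.
example_patterns = ["DRY", "NPXXY", "LLD"]
example_vector = pattern_features("MKTLLDRYAC", example_patterns)
```

Vectors of this kind have the same length for every sequence, which is what allows them to be passed to a standard SVM implementation.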
Topology-based protein structure comparison using a pattern discovery technique
Parallel Pattern Discovery
An interesting research problem in data analysis is the discovery of patterns. Patterns can show how a dataset was formed and how it repeats itself. Because data volumes are growing quickly, there is a need for algorithms that scale across multiple processes. In this thesis we examine how to parallelize an existing algorithm using three ideas: generalization, decomposition and reification. We apply these ideas to SPEXS, a pattern discovery algorithm, and derive a parallel algorithm, SPEXS2, which we also implement. We also examine the problems that arose when implementing this algorithm. The ideas presented in this thesis could be used to generalize and parallelize other algorithms as well.
Fast frequent pattern mining.
Yabo Xu.
Thesis (M.Phil.)--Chinese University of Hong Kong, 2003.
Includes bibliographical references (leaves 57-60).
Abstracts in English and Chinese.
Contents:
Abstract
Acknowledgement
Chapter 1: Introduction
  1.1 Frequent Pattern Mining
  1.2 Biosequence Pattern Mining
  1.3 Organization of the Thesis
Chapter 2: PP-Mine: Fast Mining Frequent Patterns In-Memory
  2.1 Background
  2.2 The Overview
  2.3 PP-tree Representations and Its Construction
  2.4 PP-Mine
  2.5 Discussions
  2.6 Performance Study
Chapter 3: Fast Biosequence Patterns Mining
  3.1 Background
    3.1.1 Differences in Biosequences
    3.1.2 Mining Sequential Patterns
    3.1.3 Mining Long Patterns
    3.1.4 Related Works in Bioinformatics
  3.2 The Overview
    3.2.1 The Problem
    3.2.2 The Overview of Our Approach
  3.3 The Segment Phase
    3.3.1 Finding Frequent Segments
    3.3.2 The Index-based Querying
    3.3.3 The Compression-based Querying
  3.4 The Pattern Phase
    3.4.1 The Pruning Strategies
    3.4.2 The Querying Strategies
  3.5 Experiment
    3.5.1 Synthetic Data Sets
    3.5.2 Biological Data Sets
Chapter 4: Conclusion
Bibliography
String Matching with Variable Length Gaps
We consider string matching with variable length gaps. Given a string $T$ and
a pattern $P$ consisting of strings separated by variable length gaps
(arbitrary strings of length in a specified range), the problem is to find all
ending positions of substrings in $T$ that match $P$. This problem is a basic
primitive in computational biology applications. Let $m$ and $n$ be the lengths
of $P$ and $T$, respectively, and let $k$ be the number of strings in $P$. We
present a new algorithm achieving time $O(n \log k + m + \alpha)$ and space
$O(m + A)$, where $A$ is the sum of the lower bounds of the lengths of the gaps in
$P$ and $\alpha$ is the total number of occurrences of the strings in $P$
within $T$. Compared to the previous results this bound essentially achieves
the best known time and space complexities simultaneously. Consequently, our
algorithm obtains the best known bounds for almost all combinations of $m$,
$n$, $k$, $A$, and $\alpha$. Our algorithm is surprisingly simple and
straightforward to implement. We also present algorithms for finding and
encoding the positions of all strings in $P$ for every match of the pattern.
Comment: draft of full version, extended abstract at SPIRE 201
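The problem statement can be illustrated with a naive Python sketch. This is not the algorithm from the paper and does not achieve its time or space bounds; it only shows what "matching with variable length gaps" means. The names `occurrences` and `match_with_gaps`, and the representation of a pattern as a list of strings plus a list of `(lo, hi)` gap ranges, are assumptions made for this example. End positions are reported as exclusive indices into the text.

```python
def occurrences(text, s):
    """Start indices of all (possibly overlapping) occurrences of s in text."""
    starts = []
    i = text.find(s)
    while i != -1:
        starts.append(i)
        i = text.find(s, i + 1)
    return starts


def match_with_gaps(text, strings, gaps):
    """Naive matcher for a pattern s_1 g_1 s_2 ... g_{k-1} s_k.

    strings: the k non-empty pattern strings, in order.
    gaps:    k-1 pairs (lo, hi); gap i must have length between lo and hi.
    Returns the sorted exclusive end positions of all matches in text.
    """
    if not strings or len(gaps) != len(strings) - 1:
        raise ValueError("need k >= 1 strings and exactly k-1 gaps")
    # End positions of partial matches covering the first string only.
    ends = {st + len(strings[0]) for st in occurrences(text, strings[0])}
    for s, (lo, hi) in zip(strings[1:], gaps):
        next_ends = set()
        for st in occurrences(text, s):
            # Extend a partial match if some previous end lies at an
            # allowed gap distance before this occurrence starts.
            if any((st - g) in ends for g in range(lo, hi + 1)):
                next_ends.add(st + len(s))
        ends = next_ends
    return sorted(ends)
```

For example, searching the text `"abxxcdabxcd"` for `"ab"`, then a gap of one or two characters, then `"cd"` yields matches ending at positions 6 and 11. Restricting the gap to exactly one character leaves only the match ending at 11.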