192 research outputs found

Recommended from our members

A constraint based structure description language for Biosequences

Authors: Eidhammer I, Gilbert D, Grindhaug SH, Jonassen J, Ratnayake R
Publication venue: Springer Fachmedien Wiesbaden GmbH
Publication date: 01/01/2001

Brunel University Research Archive

Identifying statistical dependence in genomic sequences via mutual information estimates

Authors: Aktulga HM, Grama AY, Kontoyiannis I, Lyznik LA, Szpankowski L, Szpankowski W
Publication date: 01/01/2007

Questions of understanding and quantifying the representation and amount of information in organisms have become a central part of biological research, as they potentially hold the key to fundamental advances. In this paper, we demonstrate the use of information-theoretic tools for identifying segments of biomolecules (DNA or RNA) that are statistically correlated. We develop a precise and reliable methodology, based on the notion of mutual information, for finding and extracting statistical as well as structural dependencies. A simple threshold function is defined, and its use in quantifying the level of significance of dependencies between biological segments is explored. These tools are used in two specific applications. First, we identify correlations between different parts of the maize zmSRp32 gene, where we find significant dependencies between the 5' untranslated region in zmSRp32 and its alternatively spliced exons. This observation may indicate the presence of as-yet unknown alternative splicing mechanisms or structural scaffolds. Second, using data from the FBI's Combined DNA Index System (CODIS), we demonstrate that our approach is particularly well suited to the problem of discovering short tandem repeats, an application of importance in genetic profiling.

Comment: Preliminary version. Final version in EURASIP Journal on Bioinformatics and Systems Biology. See http://www.hindawi.com/journals/bsb
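To make the notion of mutual information used in this abstract concrete, here is a minimal Python sketch of the standard plug-in (empirical frequency) estimator for two aligned symbol sequences. It is only an illustration: the function name `mutual_information` and the toy inputs are assumptions, and the paper's threshold function and exact estimation procedure are not reproduced here.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from paired symbols.

    xs and ys are equal-length sequences (for example two DNA segments
    read position by position). Hypothetical helper, not the paper's code.
    """
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))   # counts of symbol pairs (x, y)
    px = Counter(xs)               # marginal counts of x
    py = Counter(ys)               # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi
```

For two identical segments over a uniform four-letter alphabet the estimate is 2 bits, and for segments whose symbols pair up independently it is 0. In practice the plug-in estimate is biased upward on short segments, which is one reason a significance threshold of the kind the abstract mentions is needed.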

arXiv.org e-Print Archive

CiteSeerX

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

CUED - Cambridge University Engineering Department

Functional classification of G-Protein coupled receptors, based on their specific ligand coupling patterns

Authors: Bakır Burcu, Sezerman Uğur
Publication venue: Springer Science and Business Media LLC
Publication date: 01/04/2006

Functional identification of G-Protein Coupled Receptors (GPCRs) is one of the current focus areas of pharmaceutical research. Although thousands of GPCR sequences are known, many of them remain orphan sequences (the activating ligand is unknown). Therefore, classification methods for automated characterization of orphan GPCRs are imperative. In this study, for predicting Level 2 subfamilies of Amine GPCRs, a novel method for obtaining fixed-length feature vectors, based on the existence of activating-ligand-specific patterns, has been developed and utilized for Support Vector Machine (SVM)-based classification. Exploiting the fact that there is a non-promiscuous relationship between the specific binding of GPCRs to their ligands and their functional classification, our method classifies Level 2 subfamilies of Amine GPCRs with a high predictive accuracy of 97.02% in a ten-fold cross-validation test. The presented machine learning approach bridges the gulf between the excess amount of GPCR sequence data and their poor functional characterization.
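The idea of a fixed-length feature vector built from the existence of patterns can be sketched as follows. This is a hedged illustration only: the function `pattern_feature_vector`, the example motifs and the example sequence are all hypothetical, and the abstract does not say how the ligand-specific patterns are actually discovered or encoded.

```python
import re


def pattern_feature_vector(sequence, patterns):
    """Return one 0/1 entry per pattern: 1 if the pattern occurs in sequence.

    The vector length equals len(patterns), regardless of sequence length,
    which is what makes it usable as input to a classifier such as an SVM.
    """
    return [1 if re.search(p, sequence) else 0 for p in patterns]


# Hypothetical motifs written as regular expressions (assumptions, not
# patterns taken from the study).
EXAMPLE_PATTERNS = ["DRY", "NP..Y", "W$"]

# Hypothetical toy sequence, far shorter than a real receptor sequence.
example_vector = pattern_feature_vector("MDRYLAW", EXAMPLE_PATTERNS)
```

Each sequence, whatever its length, maps to a vector with one entry per pattern, so sequences of different lengths become directly comparable rows of a training matrix.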

Sabanci University Research Database

Topology-based protein structure comparison using a pattern discovery technique

Authors: Gilbert D, Thornton J, Viksna J, Westhead D
Publication venue: University of Birmingham Library Services
Publication date: 01/01/2000

Brunel University Research Archive

Parallel Pattern Discovery

Author: Elbre Egon
Publication venue: Tartu Ülikool
Publication date: 01/01/2013

An interesting research problem in dataset analysis is the discovery of patterns. Patterns can show how a dataset was formed and how it repeats itself. Because the volume of collected data is growing quickly, there is a need for algorithms that can scale with the data across multiple processes. In this thesis we examine how to take an existing algorithm and make it parallel using three ideas: generalization, decomposition and reification. We apply these ideas to SPEXS, a pattern discovery algorithm, and derive a new parallel algorithm, SPEXS2, which we also implement. We also analyze several problems that arose when implementing the algorithm. The ideas described could be used to generalize and parallelize other algorithms as well.

DSpace at Tartu University Library

Fast frequent pattern mining.

Author: Yabo Xu
Publication date: 01/01/2003

Thesis (M.Phil.), Chinese University of Hong Kong, 2003. Includes bibliographical references (leaves 57-60). Abstracts in English and Chinese.

Contents:
Abstract
Acknowledgement
Chapter 1. Introduction
  1.1 Frequent Pattern Mining
  1.2 Biosequence Pattern Mining
  1.3 Organization of the Thesis
Chapter 2. PP-Mine: Fast Mining Frequent Patterns In-Memory
  2.1 Background
  2.2 The Overview
  2.3 PP-tree Representations and Its Construction
  2.4 PP-Mine
  2.5 Discussions
  2.6 Performance Study
Chapter 3. Fast Biosequence Patterns Mining
  3.1 Background
    3.1.1 Differences in Biosequences
    3.1.2 Mining Sequential Patterns
    3.1.3 Mining Long Patterns
    3.1.4 Related Works in Bioinformatics
  3.2 The Overview
    3.2.1 The Problem
    3.2.2 The Overview of Our Approach
  3.3 The Segment Phase
    3.3.1 Finding Frequent Segments
    3.3.2 The Index-based Querying
    3.3.3 The Compression-based Querying
  3.4 The Pattern Phase
    3.4.1 The Pruning Strategies
    3.4.2 The Querying Strategies
  3.5 Experiment
    3.5.1 Synthetic Data Sets
    3.5.2 Biological Data Sets
Chapter 4. Conclusion
Bibliography

CUHK Digital Repository

String Matching with Variable Length Gaps

Authors: Philip Bille, Inge Li Gørtz, Hjalte Wedel Vildhøj, David Kofoed Wind
Publication date: 01/01/2010

We consider string matching with variable length gaps. Given a string $T$ and a pattern $P$ consisting of strings separated by variable length gaps (arbitrary strings of length in a specified range), the problem is to find all ending positions of substrings in $T$ that match $P$. This problem is a basic primitive in computational biology applications. Let $m$ and $n$ be the lengths of $P$ and $T$, respectively, and let $k$ be the number of strings in $P$. We present a new algorithm achieving time $O(n \log k + m + \alpha)$ and space $O(m + A)$, where $A$ is the sum of the lower bounds of the lengths of the gaps in $P$ and $\alpha$ is the total number of occurrences of the strings in $P$ within $T$. Compared to the previous results this bound essentially achieves the best known time and space complexities simultaneously. Consequently, our algorithm obtains the best known bounds for almost all combinations of $m$, $n$, $k$, $A$, and $\alpha$. Our algorithm is surprisingly simple and straightforward to implement. We also present algorithms for finding and encoding the positions of all strings in $P$ for every match of the pattern.

Comment: draft of full version, extended abstract at SPIRE 201
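The problem statement can be illustrated with a small Python sketch. This is a naive baseline written for clarity, not the algorithm from the paper, and it does not achieve the stated time or space bounds. The names `find_occurrences` and `match_variable_gaps`, and the representation of the pattern as a list of strings plus a list of `(lo, hi)` gap ranges, are assumptions made for this example.

```python
def find_occurrences(text, s):
    """Start indices of all (possibly overlapping) occurrences of s in text."""
    starts = []
    i = text.find(s)
    while i != -1:
        starts.append(i)
        i = text.find(s, i + 1)
    return starts


def match_variable_gaps(text, strings, gaps):
    """Ending positions of substrings of text that match the gapped pattern.

    strings: the k non-empty strings of the pattern, in order.
    gaps:    k-1 pairs (lo, hi); gap i allows between lo and hi arbitrary
             characters between strings[i] and strings[i + 1].
    Returns sorted 0-based indices of the last character of each match.
    """
    if not strings or len(gaps) != len(strings) - 1:
        raise ValueError("need k >= 1 strings and exactly k-1 gaps")
    # ends holds exclusive end positions where strings[0..i] can finish.
    ends = {st + len(strings[0]) for st in find_occurrences(text, strings[0])}
    for s, (lo, hi) in zip(strings[1:], gaps):
        new_ends = set()
        for start in find_occurrences(text, s):
            # The previous string must end between hi and lo characters
            # before this occurrence starts.
            if any((start - g) in ends for g in range(lo, hi + 1)):
                new_ends.add(start + len(s))
        ends = new_ends
    return sorted(e - 1 for e in ends)
```

For example, with the text `abxxcdabxcd`, the strings `ab` and `cd`, and a single gap of 1 to 2 characters, the matches end at positions 5 and 10. Tightening the gap to exactly 2 characters leaves only the match ending at position 5.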

arXiv.org e-Print Archive

CiteSeerX

Elsevier - Publisher Connector

Crossref

Online Research Database In Technology