45 research outputs found

Finding and counting vertex-colored subtrees

Author: A. Björklund
A.M. Ambalath
E. Alm
F. Hüffner
Florian Sikora
G. Blin
I. Koutis
I. Koutis
J. Flum
J. Flum
J. Nederlof
M.R. Fellows
N. Alon
N. Betzler
R. Dondi
R. Dondi
R. Karp
R. Sharan
R. Williams
S. Bruckner
S. Böcker
S. Guillemot
S. Schbath
Sylvain Guillemot
V. Arvind
V. Lacroix
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 24/02/2012

The problems studied in this article originate from the Graph Motif problem introduced by Lacroix et al. in the context of biological networks. The problem is to decide if a vertex-colored graph has a connected subgraph whose colors equal a given multiset of colors M. It is a variant of graph pattern matching in which the structure of the pattern's occurrence is not of interest; the only requirement is connectedness. Using an algebraic framework recently introduced by Koutis et al., we obtain new FPT algorithms for Graph Motif and variants, with improved running times. We also obtain results on the counting versions of this problem, proving that the counting problem is FPT if M is a set, but becomes W[1]-hard if M is a multiset with two colors. Finally, we present an experimental evaluation of this approach on real datasets, showing that its performance compares favorably with existing software. Comment: Conference version in International Symposium on Mathematical Foundations of Computer Science (MFCS), Brno, Czech Republic (2010). Journal version in Algorithmic
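To make the decision problem concrete, here is a minimal brute-force sketch in Python. It is not the algebraic FPT algorithm the abstract refers to; it simply tries every vertex subset of the right size, so it is exponential and only suitable for tiny graphs. The function names and the adjacency-dictionary representation are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def is_connected(vertices, adj):
    """Return True if the subgraph induced by `vertices` is connected."""
    vertices = set(vertices)
    if not vertices:
        return False
    start = next(iter(vertices))
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for v in adj.get(u, ()):
            if v in vertices and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen == vertices


def graph_motif(adj, color, motif):
    """Brute-force Graph Motif (illustrative only, exponential time).

    adj:   dict mapping each vertex to an iterable of its neighbours
    color: dict mapping each vertex to its color
    motif: iterable of colors, interpreted as a multiset M
    Returns a tuple of vertices forming an occurrence, or None.
    """
    target = Counter(motif)
    k = sum(target.values())
    for subset in combinations(sorted(adj), k):
        same_colors = Counter(color[v] for v in subset) == target
        if same_colors and is_connected(subset, adj):
            return subset
    return None


# Hypothetical toy instance: a path 1 - 2 - 3 - 4.
adj = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
color = {1: "r", 2: "g", 3: "r", 4: "b"}
found = graph_motif(adj, color, ["r", "g", "r"])
missing = graph_motif(adj, color, ["g", "b"])
```

On this toy path, the multiset {r, g, r} is matched by vertices 1, 2, 3, while {g, b} has no occurrence because the only green and blue vertices are not adjacent.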

arXiv.org e-Print Archive

Crossref

String Matching and 1d Lattice Gases

Author: A. D. Barbour
A. Dembo
B. Prum
D. Achlioptas
D. E. Knuth
E. Rivals
F. Gürsey
G. E. Uhlenbeck
G. Reinert
H. Harborth
H. S. Wilf
I. Fudos
I. Z. Fisher
J. Kleffe
Jane F. Gentleman
L. Goldstein
L. J. Guibas
L. J. Guibas
L. J. Guibas
M. Mézard
M. Régnier
M. Régnier
M. S. Waterman
M. X. Geske
Muhittin Mungan
O. Chrysaphinou
O. Chrysaphinou
O. Chrysaphinou
P. Pevzner
R. Monasson
S. B. Boyer
S. Karlin
S. Kirkpatrick
S. Robin
S. Robin
S. Robin
S. Schbath
W. Feller
Y. Fu
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 25/08/2005

We calculate the probability distributions for the number of occurrences $n$ of a given $l$-letter word in a random string of $k$ letters. Analytical expressions for the distribution are known for the asymptotic regimes (i) $k \gg r^l \gg 1$ (Gaussian) and (ii) $k, l \to \infty$ such that $k/r^l$ is finite (Compound Poisson). However, it is known that these distributions do not work well in the intermediate regime $k \gtrsim r^l \gtrsim 1$. We show that the problem of calculating the string matching probability can be cast as determining the configurational partition function of a 1d lattice gas with interacting particles, so that the matching probability becomes the grand-partition sum of the lattice gas, with the number of particles corresponding to the number of matches. We perform a virial expansion of the effective equation of state and obtain the probability distribution. Our result reproduces the behavior of the distribution in all regimes. We are also able to show analytically how the limiting distributions arise. Our analysis builds on the fact that the effective interactions between the particles consist of a relatively strong core of size $l$, the word length, followed by a weak, exponentially decaying tail. We find that the asymptotic regimes correspond to the case where the tail of the interactions can be neglected, while in the intermediate regime the tail needs to be kept in the analysis. Our results are readily generalized to the case where the random strings are generated by more complicated stochastic processes, such as a non-uniform letter probability distribution or Markov chains. We show that in these cases the tails of the effective interactions can be made even more dominant, thus rendering the asymptotic approximations less accurate in such a regime. Comment: 44 pages and 8 figures. Major revision of previous version. The lattice gas analogy has been worked out in full, including virial expansion and equation of state. This constitutes the main part of the paper now. Connections with existing work are made and references should be up to date now. To be submitted for publicatio
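For small instances, the distribution of the number of occurrences can be computed exactly, which gives a reference point for any approximation. The sketch below is not the lattice-gas method of the abstract; it is a plain dynamic program over a prefix-matching automaton, assuming that letters are independent and uniform and that r denotes the alphabet size (the abstract does not define r). Overlapping occurrences are counted. All function names are illustrative.

```python
from fractions import Fraction


def next_state(word, s, a):
    """Length of the longest prefix of `word` that is a suffix of word[:s] + a."""
    text = word[:s] + a
    for t in range(min(len(text), len(word)), -1, -1):
        if text.endswith(word[:t]):  # word[:0] is "", which always matches
            return t


def occurrence_distribution(word, alphabet, k):
    """Exact P(n occurrences of `word`) in a uniform random string of k letters.

    Returns a dict mapping n to a Fraction. Overlapping matches are counted.
    """
    l = len(word)
    p = Fraction(1, len(alphabet))  # each letter has probability 1/r
    probs = {(0, 0): Fraction(1)}   # (automaton state, matches so far) -> probability
    for _ in range(k):
        nxt = {}
        for (s, n), pr in probs.items():
            for a in alphabet:
                t = next_state(word, s, a)
                key = (t, n + (1 if t == l else 0))
                nxt[key] = nxt.get(key, 0) + pr * p
        probs = nxt
    dist = {}
    for (_, n), pr in probs.items():
        dist[n] = dist.get(n, 0) + pr
    return dist


# Word "aa" over the alphabet {a, b}, strings of length k = 3.
dist = occurrence_distribution("aa", "ab", 3)
mean = sum(n * pr for n, pr in dist.items())
```

Here the mean number of occurrences equals $(k - l + 1)/r^l = 2/4$, as expected from linearity of expectation; the shape of the distribution, however, depends on how the word overlaps itself.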

arXiv.org e-Print Archive

Crossref

SIGffRid: A tool to search for sigma factor binding sites in bacterial genomes using comparative approach and biologically driven statistics

Background: Many programs have been developed to identify transcription factor binding sites. However, most of them are not able to infer two-word motifs with variable spacer lengths. This case is encountered for RNA polymerase Sigma (σ) Factor Binding Sites (SFBSs), usually composed of two boxes, called -35 and -10 in reference to the transcription initiation point. Our goal is to design an algorithm detecting SFBSs by using combinational and statistical constraints deduced from biological observations. Results: We describe a new approach to identify SFBSs by comparing two related bacterial genomes. The method, named SIGffRid (SIGma Factor binding sites Finder using R'MES to select Input Data), performs a simultaneous analysis of pairs of promoter regions of orthologous genes. SIGffRid uses a prior identification of over-represented patterns in whole genomes as selection criteria for potential -35 and -10 boxes. These patterns are then grouped using pairs of short seeds (of which one is possibly gapped), allowing a variable-length spacer between them. Next, the motifs are extended, guided by statistical considerations, a feature that ensures a selection of motifs with statistically relevant properties. We applied our method to the pair of related bacterial genomes of Streptomyces coelicolor and Streptomyces avermitilis. A cross-check with the well-defined SFBSs of the SigR regulon in S. coelicolor is detailed, validating the algorithm. SFBSs for HrdB and BldN were also found, and the results suggested some new targets for these σ factors. In addition, consensus motifs for BldD and new SFBSs were defined, overlapping previously proposed consensuses. Relevant tests were also carried out on bacteria with moderate GC content (i.e. Escherichia coli/Salmonella typhimurium and Bacillus subtilis/Bacillus licheniformis pairs). Motifs of house-keeping σ factors were found, as well as other SFBSs such as that of SigW in Bacillus strains. Conclusion: We demonstrate that our approach, combining statistical and biological criteria, successfully predicts SFBSs. The versatility of the method allows the recognition of other kinds of two-box regulatory sites.

HAL - Lille 3

Crossref

Directory of Open Access Journals

INRIA a CCSD electronic archive server

PubMed Central

HAL Descartes

Queensland University of Technology ePrints Archive

Swinburne Research Bank

ProdInra

Genome-scale phylogenetic and DNA composition analyses of Antarctic Pseudoalteromonas bacteria reveal inconsistencies in current taxonomic affiliation

Author: Alain Filloux
Angelina Lo Giudice
B Wilmes
BJ McCarthy
C Holmström
C Médigue
CL Schildkraut
Donatella de Pascale
DT Pride
E Parrilli
E Stackebrandt
Elena Perrin
Emanuele Bosi
Ermenegilda Parrilli
F Pardi
G Feller
G Gauthier
H Teeling
H Teeling
Isabel Maida
J Goris
JD McAuliffe
JL Corchero
K Geuten
K Tamura
LG Wayne
M Galardini
M Giuliani
M Kim
M Richter
M Yu
MA Larkin
Marco Fondi
Maria Luisa Tutino
MC Papaleo
MC Papaleo
P Vandamme
PJ Cock
R Papa
R Rosselló-Móra
Renato Fani
S Egan
S Karlin
S Karlin
S Schbath
SR Eddy
T Kobayashi
V Rippa
Y Paitan
Publication venue: 'Springer Science and Business Media LLC'

Crossref

Churchill: an ultra-fast, deterministic, highly scalable and balanced parallelization strategy for the discovery of human genetic variation in clinical and population-scale genomics

Author: Amy N Wetzel
AN Desai
B Langmead
Benjamin J Kelly
C Gonzaga-Jauregui
CD Warden
CJ Saunders
D Marin
David L Newsom
Donald J Corsmeier
DP Rodgers
E Afgan
ER Mardis
F Lescai
GA Auwera Van der
GG Faust
GR Abecasis
H Li
H Li
Huachun Zhong
HY Lam
James R Fitch
JG Reid
JM Zook
JR Collins-Underwood
LD Stein
MA Depristo
MJ Puckelwartz
Peter White
PJ Cock
R Nielsen
Russell D Nordquist
S Schbath
SA Forbes
TF Smith
The Boston Children’s Hospital CLARITY Challenge Consortium
US Evani
WJ Youden
Yangqiu Hu
Publication venue: 'Springer Science and Business Media LLC'

Crossref

Waiting times for clumps of patterns and for structured motifs in random sequences

Author: Robin S.
Schbath S.
Stefanov V.T.
Publication venue: Elsevier B.V.
Publication date: 01/04/2007

This paper provides exact probability results for waiting times associated with occurrences of two types of motifs in a random sequence. First, we provide an explicit expression for the probability generating function of the interarrival time between two clumps of a pattern. In particular, it allows one to measure the quality of the Poisson approximation currently used to evaluate the distribution of the number of clumps of a pattern. Second, we provide explicit expressions for the probability generating functions of both the waiting time until the first occurrence, and the interarrival time between consecutive occurrences, of a structured motif. Distributional results for structured motifs are of interest in genome analysis because such motifs are promoter candidates. As an application, we determine significant structured motifs in a data set of DNA regulatory sequences.
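As an illustration of the quantity being studied, the sketch below scans one observed sequence for the first occurrence of a structured motif and reports the corresponding waiting time. It assumes a structured motif made of two exact boxes separated by a spacer whose length lies between `dmin` and `dmax`; that definition, the function name, and the toy sequence are assumptions for illustration, and the code does not compute the probability generating functions derived in the paper.

```python
def first_structured_motif(seq, box1, box2, dmin, dmax):
    """Find the structured-motif occurrence that ends earliest in `seq`.

    The motif is box1, then a spacer of dmin..dmax letters, then box2.
    Returns (start of box1, start of box2, waiting time), where the waiting
    time is the number of letters read when the occurrence is completed,
    or None if the motif does not occur.
    """
    l1, l2 = len(box1), len(box2)
    for j in range(len(seq) - l2 + 1):      # candidate start of box2
        if seq[j:j + l2] != box2:
            continue
        for d in range(dmin, dmax + 1):     # candidate spacer length
            i = j - d - l1                  # implied start of box1
            if i >= 0 and seq[i:i + l1] == box1:
                return i, j, j + l2
    return None


# Hypothetical toy sequence: "AB", a 3-letter spacer, then "CD".
hit = first_structured_motif("xxABxxxCDxx", "AB", "CD", 2, 4)
miss = first_structured_motif("xxABxxxCDxx", "AB", "CD", 0, 2)
```

In the toy sequence the motif is completed after 9 letters when spacers of length 2 to 4 are allowed, and it does not occur when the spacer is restricted to at most 2 letters.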

Elsevier - Publisher Connector

Probabilistic and Statistical Properties of Words: An Overview

Author: Gesine Reinert
Michael S. Waterman
Sophie Schbath
Publication date: 01/01/2000

In the following, an overview is given on statistical and probabilistic properties of words, as occurring in the analysis of biological sequences. Counts of occurrence, counts of clumps, and renewal counts are distinguished, and exact distributions as well as normal approximations, Poisson process approximations, and compound Poisson approximations are derived. Here, a sequence is modelled as a stationary ergodic Markov chain; a test for determining the appropriate order of the Markov chain is described. The convergence results take the error made by estimating the Markovian transition probabilities into account. The main tools involved are moment generating functions, martingales, Stein’s method, and the Chen-Stein method. Similar results are given for occurrences of multiple patterns, and, as an example, the problem of unique recoverability of a sequence from SBH chip data is discussed. Special emphasis lies on disentangling the complicated dependence structure between word occurrences, due to self-overlap as well as due to overlap between words. The results can be used to derive approximate, and conservative, confidence intervals for tests. Key words: word counts, renewal counts, Markov model, exact distribution, normal approximation, Poisson process approximation, compound Poisson approximation, occurrences of multiple words, sequencing by hybridization, martingales, moment generating functions, Stein’s method, Chen-Stein method.
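The three kinds of counts distinguished in the abstract differ only in how overlapping occurrences are treated. The sketch below computes them for a single word in a single sequence, assuming the usual definitions: every (possibly overlapping) match is an occurrence, a clump is a maximal group of mutually chained overlapping occurrences, and a renewal is an occurrence that does not overlap the previous renewal. The function names are illustrative.

```python
def occurrences(seq, word):
    """Start positions of all (possibly overlapping) occurrences of `word`."""
    l = len(word)
    return [i for i in range(len(seq) - l + 1) if seq[i:i + l] == word]


def clumps(seq, word):
    """Group occurrences into clumps: maximal runs of overlapping occurrences."""
    l = len(word)
    groups = []
    for pos in occurrences(seq, word):
        # Two occurrences overlap when their start positions differ by less than l.
        if groups and pos - groups[-1][-1] < l:
            groups[-1].append(pos)
        else:
            groups.append([pos])
    return groups


def renewals(seq, word):
    """Occurrences that do not overlap the previously selected renewal."""
    l = len(word)
    selected = []
    for pos in occurrences(seq, word):
        if not selected or pos - selected[-1] >= l:
            selected.append(pos)
    return selected


seq = "aaaabaa"
word = "aa"
n_occurrences = len(occurrences(seq, word))
n_clumps = len(clumps(seq, word))
n_renewals = len(renewals(seq, word))
```

For the self-overlapping word "aa" in "aaaabaa", the three counts are 4, 2 and 3 respectively, which shows why each needs its own distributional treatment.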

CiteSeerX

HAL Descartes

Hal-Diderot