White Rose Research Online

Exact distribution of a pattern in a set of random sequences generated by a Markov source: applications to biological data

Author: A Bairoch
A Denise
AC Camproux
Anne-Claude Camproux
AP Godbole
B Prum
C Gautier
DA Benson
DL Antzoulakos
E Rocha
G Churchill
G Nuel
G Nuel
G Nuel
G Nuel
G Nuel
G Nuelg
G Reinert
G Reinert
GD Stormo
Gregory Nuel
J Becq
J Do
J Fu
J Kleffe
J Martin
J Van Helden
JAD Aston
JC Fu
JC Fu
JE Hopcroft
JM Claverie
Juliette Martin
JW Fickett
K Liolios
L Regad
Leslie Regad
M Crochemore
M Reignier
M Thomas-Chollier
MC Frith
ME Lladser
MX Geske
MY Leung
N Hulo
P Nicodème
P Nicolas
P Pevzner
P Ribeca
R Cowan
S Karlin
S Sourice
T Erhardsson
V Boeva
V Boeva
V Stefanov
V Stefanov
VT Stefanov
YM Chang
Publication venue: BioMed Central
Publication date: 01/01/2010

Abstract Background In bioinformatics it is common to search for a pattern of interest in a potentially large set of rather short sequences (upstream gene regions, proteins, exons, etc.). Although many methodological approaches allow practitioners to compute the distribution of a pattern count in a random sequence generated by a Markov source, no specific developments have taken into account the counting of occurrences in a set of independent sequences. We aim to address this problem by deriving efficient approaches and algorithms to perform these computations for both low- and high-complexity patterns in the framework of homogeneous or heterogeneous Markov models. Results The latest advances in the field allowed us to use a technique of optimal Markov chain embedding based on deterministic finite automata to introduce three innovative algorithms. Algorithm 1 is the only one able to deal with heterogeneous models. It also avoids any convolution product of the pattern distributions in individual sequences. When working with homogeneous models, Algorithm 2 yields a dramatic reduction in complexity by taking advantage of previous computations to obtain moment generating functions efficiently. In the particular case of low or moderate complexity patterns, Algorithm 3 exploits power computation and binary decomposition to further reduce the time complexity to a logarithmic scale. All these algorithms, and their relative merits in comparison with existing ones, were then tested and discussed on a toy example and three biological data sets: structural patterns in protein loop structures, PROSITE signatures in a bacterial proteome, and transcription factors in upstream gene regions. On these data sets, we also compared our exact approaches to the tempting approximation that consists of concatenating the sequences in the data set into a single sequence.
Conclusions Our algorithms prove to be effective and able to handle real data sets with multiple sequences, as well as biological patterns of interest, even when the latter display a high complexity (PROSITE signatures, for example). In addition, these exact algorithms allow us to avoid the edge effect observed under the single-sequence approximation, which leads to erroneous results, especially when the marginal distribution of the model converges slowly toward the stationary distribution. We conclude with a discussion of our method and its potential improvements.
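
For orientation, the problem this abstract addresses can be sketched in a few lines of Python. The sketch below is not any of the paper's three algorithms: it is a naive baseline, assuming a single word as the pattern, overlapping occurrences, and a homogeneous first-order Markov model. It embeds the pattern's automaton in the chain for each sequence and then combines the independent sequences by explicit convolution (the step that Algorithm 1 is said to avoid). All function and variable names are hypothetical.

```python
def kmp_automaton(pattern, alphabet):
    """DFA whose state is the length of the longest prefix of `pattern`
    that is a suffix of the text read so far."""
    m = len(pattern)
    delta = [dict() for _ in range(m + 1)]
    for q in range(m + 1):
        for a in alphabet:
            s = pattern[:q] + a
            k = min(m, len(s))
            while k > 0 and s[-k:] != pattern[:k]:
                k -= 1
            delta[q][a] = k
    return delta


def count_distribution(pattern, alphabet, start, trans, length):
    """Exact distribution of the number of overlapping occurrences of
    `pattern` in one sequence of `length` >= 1 letters drawn from a
    first-order Markov chain (initial law `start`, transitions `trans`)."""
    delta = kmp_automaton(pattern, alphabet)
    m = len(pattern)
    # State of the embedded chain: (last letter, automaton state, count).
    dist = {}
    for a in alphabet:
        q = delta[0][a]
        key = (a, q, 1 if q == m else 0)
        dist[key] = dist.get(key, 0.0) + start[a]
    for _ in range(length - 1):
        nxt = {}
        for (a, q, n), p in dist.items():
            for b in alphabet:
                q2 = delta[q][b]
                key = (b, q2, n + (1 if q2 == m else 0))
                nxt[key] = nxt.get(key, 0.0) + p * trans[a][b]
        dist = nxt
    # Marginalise over letter and automaton state.
    out = {}
    for (_, _, n), p in dist.items():
        out[n] = out.get(n, 0.0) + p
    return [out.get(n, 0.0) for n in range(max(out) + 1)]


def convolve(p, q):
    """Distribution of the sum of two independent counts."""
    r = [0.0] * (len(p) + len(q) - 1)
    for i, x in enumerate(p):
        for j, y in enumerate(q):
            r[i + j] += x * y
    return r


def count_distribution_in_set(pattern, alphabet, start, trans, lengths):
    """Total-count distribution over independent sequences of given lengths."""
    total = [1.0]
    for length in lengths:
        single = count_distribution(pattern, alphabet, start, trans, length)
        total = convolve(total, single)
    return total


# Toy model (an assumption for illustration): two letters, uniform transitions.
alphabet = "ab"
start = {"a": 0.5, "b": 0.5}
trans = {"a": {"a": 0.5, "b": 0.5}, "b": {"a": 0.5, "b": 0.5}}
single = count_distribution("aa", alphabet, start, trans, 3)
dist = count_distribution_in_set("aa", alphabet, start, trans, [2, 3])
```

Here `dist[k]` is the probability of observing exactly `k` occurrences of `aa` across one sequence of length 2 and one of length 3. Because each sequence restarts from `start`, no occurrence can straddle two sequences, which is the edge effect that the concatenation approximation gets wrong.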

HAL Evry

Springer - Publisher Connector

Mining protein loops using a structural alphabet and statistical exceptionality

Author: A Dembo
A Efimov
A Golovin
A Sacan
A Via
AC Camproux
AC Camproux
AC Camproux
Anne-Claude Camproux
AR Panchenko
AR Panchenko
B Oliva
BJ Polacco
BL Sibanda
BL Sibanda
BL Sibanda
BW Matthews
C Kiss
CG Hunter
CM Venkatachalam
D Leader
D Stuart
DF Burke
E Rocha
EG Hutchinson
EJ Milner-White
EJ Milner-White
F den Hollander
G Ausiello
G Ausiello
G Nuel
G Nuel
G Nuel
G Pugalenthi
GD Rose
Gregory Nuel
J Espadaler
J Martin
J Martin
J van Helden
J Wojcik
JF Leszczynski
JM Kwasigroch
JS Fetrow
JS Richardson
Juliette Martin
JW Sammon
JW Torrance
KC Chou
L Regad
LE Donate
Leslie Regad
LN Johnson
LR Rabiner
LS Bernstein
M Hollander
M Mönnigmann
M Saraste
MY Leung
N Colloc'h
N Fernandez-Fuentes
N Fernandez-Fuentes
O Sander
P Fuchs
PA Rice
PN Lewis
R Kolodny
S Karlin
S Kim
S Kullback
S Sourice
SA Benner
SA Benner
SD Rufino
V Pavone
W Kabsch
W Li
W Li
WL DeLano
Publication venue: BioMed Central
Publication date: 01/01/2010

Abstract Background Protein loops encompass 50% of protein residues in available three-dimensional structures. These regions are often involved in protein functions, e.g. binding sites and catalytic pockets. However, describing protein loops with conventional tools is a difficult task. Regular secondary structures, helices and strands, have been widely studied, whereas loops, because they are highly variable in terms of sequence and structure, are difficult to analyze. Due to data sparsity, long loops have rarely been systematically studied. Results We developed a simple and accurate method that allows the description and analysis of the structures of short and long loops using structural motifs without restriction on loop length. This method is based on the structural alphabet HMM-SA. HMM-SA allows the simplification of a three-dimensional protein structure into a one-dimensional string of states, where each state is a four-residue prototype fragment, called a structural letter. The difficult task of the structural grouping of huge data sets is thus easily accomplished by handling structural letter strings as in conventional protein sequence analysis. We systematically extracted all seven-residue fragments in a bank of 93000 protein loops and grouped them according to their structural-letter sequence, called a structural word. This approach permits a systematic analysis of loops of all sizes, since we consider structural motifs of seven residues rather than complete loops. We focused the analysis on highly recurrent words of loops (observed more than 30 times). Our study reveals that 73% of loop lengths are covered by only 3310 highly recurrent structural words (out of 28274 observed words). These structural words have low structural variability (mean RMSd of 0.85 Å). As expected, half of these motifs display a flanking-region preference but, interestingly, two thirds are shared by short (less than 12 residues) and long loops.
Moreover, half of the recurrent motifs exhibit a significant level of amino-acid conservation, with at least four significant positions, and 87% of long loops contain at least one such word. We complement our analysis with the detection of statistically over-represented patterns of structural letters, as in conventional DNA sequence analysis. About 30% (930) of structural words are over-represented, and cover about 40% of loop lengths. Interestingly, these words exhibit lower structural variability and higher sequential specificity, suggesting structural or functional constraints. Conclusions We developed a method to systematically decompose and study protein loops using recurrent structural motifs. This method is based on the structural alphabet HMM-SA and not on structural alignment and geometrical parameters. We extracted meaningful structural motifs that are found in both short and long loops. To our knowledge, this is the first time that pattern mining has helped to increase the signal-to-noise ratio in protein loops. This finding helps to better describe protein loops and might help reduce the complexity of long-loop analysis. Detailed results are available at http://www.mti.univ-paris-diderot.fr/publication/supplementary/2009/ACCLoop/.

Springer - Publisher Connector

Indeterminism is a modal notion: branching spacetimes and Earman’s pruning

Author: A. N. Prior
D. Lewis
G. S. Hall
J. Bennett
J. D. Norton
J. Earman
J. Earman
J. MacFarlane
K. R. Popper
L. Wroński
M. Weiner
M. Xu
N. Belnap
N. Belnap
N. Belnap
N. Belnap
N. Belnap
N. Belnap
Nuel Belnap
P. S. Laplace
P. Øhrstrøm
P. Øhrstrøm
R. H. Thomason
R. Haag
T. Müller
T. Müller
T. Müller
T. Placek
T. Placek
T. Placek
T. Placek
Tomasz Placek
Publication venue: 'Springer Science and Business Media LLC'

An analysis of single amino acid repeats as use case for application specific background models

Author: C Notredame
David P Kreil
DP Depledge
DP Kreil
E Birney
E Delot
EL Sonnhammer
EM Marcotte
G Gouridis
G Nuel
G Reinert
H Gerber
H Nielsen
H Nielsen
IB Kuznetsov
J Thompson
J Wootton
J Xie
JD Bendtsen
JM Hancock
JW Fondon
L Brown
L Zhang
M Hoebeke
M Mar Alba
M Thomas-Chollier
M Tipping
M Tipping
MA Huntley
O Weiss
OB Ptitsyn
P Siwach
P Siwach
Paweł P Łabaj
Peter Sykacek
PP Łabaj
R Lopez
R Lyne
RI Sadreyev
RS Hegde
S Caburet
S Hands
S Henikoff
S Karlin
S Karlin
SF Altschul
SF Altschul
SF Altschul
T Koestler
VJ Promponas
VR Chechetkin
VS Pande
WR Pearson
Y Kashi
Publication venue: BioMed Central
Publication date: 01/01/2011

Background Sequence analysis aims to identify biologically relevant signals against a backdrop of functionally meaningless variation. Increasingly, it is recognized that the quality of the background model directly affects the performance of analyses. State-of-the-art approaches rely on classical sequence models that are adapted to the studied dataset. Although performing well in the analysis of globular protein domains, these models break down in regions of stronger compositional bias or low complexity. While these regions are typically filtered, there is increasing anecdotal evidence of functional roles. This motivates an exploration of more complex sequence models and application-specific approaches for the investigation of biased regions. Results Traditional Markov-chains and application-specific regression models are compared using the example of predicting runs of single amino acids, a particularly simple class of biased regions. Cross-fold validation experiments reveal that the alternative regression models capture the multi-variate trends well, despite their low dimensionality and in contrast even to higher-order Markov-predictors. We show how the significance of unusual observations can be computed for such empirical models. The power of a dedicated model in the detection of biologically interesting signals is then demonstrated in an analysis identifying the unexpected enrichment of contiguous leucine-repeats in signal-peptides. Considering different reference sets, we show how the question examined actually defines what constitutes the 'background'. Results can thus be highly sensitive to the choice of appropriate model training sets. Conversely, the choice of reference data determines the questions that can be investigated in an analysis. 
Conclusions Using a specific case of studying biased regions as an example, we have demonstrated that the construction of application-specific background models is both necessary and feasible in a challenging sequence analysis situation.

Springer - Publisher Connector

Publikationsserver der Universitätsbibliothek Bodenkultur Wien

Warwick Research Archives Portal Repository

Should We Abandon the t-Test in the Analysis of Gene Expression Microarray Data: A Comparison of Variance Modeling Strategies

Author: Aurelien de Reynies
B Wu
C Kooperberg
C Murie
C Yauk
Caroline Paccard
D Allison
D Chessel
D Rickman
F Jaffrezic
G Marot
G Smyth
G Wright
Gregory Nuel
I Jeffery
J Soulier
JD Storey
Kerby Shedden
L Lamant
L Van 't Veer
L Zhou
Laetitia Marisa
M Kerr
M McCall
M Pirooznia
M Sullivan Pepe
Marine Jeanmougin
Mickael Guedj
N Jain
P Bertheau
P Delmar
R Simon
S Boyault
S Dudoit
S Zhang
T Mary-Huard
T Sorlie
V Tusher
X Huang
Y Benjamini
Publication venue: Public Library of Science
Publication date: 01/01/2010

High-throughput post-genomic studies are now routine, and show promise, in biological and biomedical research. The main statistical approach to selecting genes differentially expressed between two groups is to apply a t-test, which is the subject of criticism in the literature. Numerous alternatives have been developed based on different and innovative variance modeling strategies. However, a critical issue is that selecting a different test usually leads to a different gene list. In this context, and given the current tendency to apply the t-test, identifying the most efficient approach in practice remains crucial. To help answer this question, we conduct a comparison of eight tests representative of variance modeling strategies in gene expression data: Welch's t-test, ANOVA [1], Wilcoxon's test, SAM [2], RVM [3], limma [4], VarMixt [5] and SMVar [6]. Our comparison process relies on four steps (gene list analysis, simulations, spike-in data and re-sampling) to formulate comprehensive and robust conclusions about test performance, in terms of statistical power, false-positive rate, execution time and ease of use. Our results raise concerns about the ability of some methods to control the expected number of false positives at a desirable level. In addition, two tests (limma and VarMixt) show significant improvement compared to the t-test, particularly for small sample sizes. Since limma also presents several practical advantages, we advocate its application to analyze gene expression data.
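
As a point of reference for the baseline named in this abstract, Welch's t-test for a single gene can be sketched as follows. This is a minimal illustration using only the Python standard library, not the study's code; the function name and the expression values are hypothetical, and the p-value lookup from the t distribution is left out.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(x, y):
    """Return (t, df) for Welch's unequal-variance t-test on two samples."""
    nx, ny = len(x), len(y)
    # Squared standard errors of the two group means (sample variances).
    vx, vy = variance(x) / nx, variance(y) / ny
    t = (mean(x) - mean(y)) / sqrt(vx + vy)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (vx + vy) ** 2 / (vx ** 2 / (nx - 1) + vy ** 2 / (ny - 1))
    return t, df


# Hypothetical expression values for one gene in two groups.
group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [2.0, 4.0, 6.0, 8.0, 10.0]
t_stat, dof = welch_t(group_a, group_b)
print(f"t = {t_stat:.3f}, df = {dof:.2f}")
```

The per-gene variance estimates in this statistic are what the alternative methods listed above model differently, which is why small sample sizes are where the tests diverge most.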

CiteSeerX

Public Library of Science (PLOS)

HAL Evry

Public Library of Science (PLOS)

HAL Descartes

ProdInra

Deciphering Normal Blood Gene Expression Variation—The NOWAC Postgenome Study

There is growing evidence that gene expression profiling of peripheral blood cells is a valuable tool for assessing gene signatures related to exposure, drug response, or disease. However, the true promise of this approach cannot be estimated until the scientific community has robust baseline data describing variation in gene expression patterns in normal individuals. Using a large representative sample set of postmenopausal women (N = 286) in the Norwegian Women and Cancer (NOWAC) postgenome study, we investigated variability of whole blood gene expression in the general population. In particular, we examined changes in blood gene expression caused by technical variability, normal inter-individual differences, and exposure variables at proportions and levels relevant to real-life situations. We observe that the overall changes in gene expression are subtle, implying the need for careful analytic approaches to the data. In particular, technical variability may not be ignored and subsequent adjustments must be considered in any analysis. Many new candidate genes were identified that are differentially expressed according to inter-individual (e.g. fasting, BMI) and exposure (e.g. smoking) factors, thus establishing that these effects are mirrored in blood. By focusing on the biological implications instead of directly comparing gene lists from several related studies in the literature, our analytic approach was able to identify significant similarities and effects consistent across these reports. This establishes the feasibility of blood gene expression profiling, provided that such studies are predicated upon careful experimental design and analysis in order to minimize confounding signals, artifacts of sample preparation and processing, and inter-individual differences.

Munin - Open Research Archive

HAL Descartes

NORA - Norwegian Open Research Archives

Genome wide linkage study, using a 250K SNP map, of Plasmodium falciparum infection and mild malaria attack in a Senegalese population

Author: A Garcia
A Garcia
A Jepson
A Matsuzawa
A Sakuntabhai
A Spiegel
AL Boyles
André Garcia
AV Hill
B Gyan
BC Schutte
C Bellenguez
C Bougeret
C Timmann
CC Kim
DA Meyers
David Courtin
DP Kwiatkowski
DP Mathanga
DR Nyholt
E Lander
F Canal
F Fillol
F Fillol
F Migot-Nabias
F Verra
FC Hartgers
FE McKenzie
Florence Migot-Nabias
FS Machado
Georges Snounou
GR Abecasis
GR Abecasis
GR Abecasis
Gregory Nuel
H Kikutani
I Chabbert-de Ponnat
J Li
J Little
J Wattavidanage
Jacqueline Milet
JC Barrett
JC Knight
JE Wigginton
JN Wilson
K Ghosh
L Almasy
L Flori
L Molineaux
L Romani
Laurence Watier
M Cot
M Hommel
M Jallow
M Mizui
M Nacher
M Otten
M Yamamoto
MJ Mackinnon
MR Comeau
N Thawani
N Valin
Oumar Gaye
P Rihet
P Zhang
Paul Senghor
PC Sham
PI de Bakker
RI Chima
RS Spielman
RW Snow
S Cabantous
S Horvath
S Males
S Marquet
S Minamoto
SJ Ceesay
SL Lake
SM Jeronimo
T Fukushima
TG Clark
TH Leu
TN Williams
TN Williams
V Briand
V Robert
W McGuire
W McGuire
X Peng
Yousri Slaoui
Z Taoufiq
Publication date: 01/01/2010

Multiple factors are involved in the variability of the host's response to P. falciparum infection, such as the intensity and seasonality of malaria transmission, the virulence of the parasite and host characteristics such as age or genetic make-up. Although now accepted, the involvement of host genetic factors remains unclear. Discordant results exist, even concerning the best-known malaria resistance genes that determine the structure or function of red blood cells. Here we report on a genome-wide linkage and association study of P. falciparum infection intensity and mild malaria attack in a Senegalese population of children and young adults from 2 to 18 years old. A high-density single nucleotide polymorphism (SNP) genome scan (Affymetrix GeneChip Human Mapping 250K-nsp) was performed for 626 individuals: 249 parents and 377 of the 504 children included in the follow-up. The population belongs to a single ethnic group and was closely followed up for 3 years. Genome-wide linkage analyses were performed on four clinical and parasitological phenotypes, and association analyses using the family-based association tests (FBAT) method were carried out in regions previously linked to malaria phenotypes in the literature and in the regions for which we identified a linkage peak. Analyses revealed three strongly suggestive linkage signals: between mild malaria attack and both the 6p25.1 and the 12q22 regions (empirical p-value = 5 × 10^-5 and 96 × 10^-5 respectively), and between the 20p11q11 region and the prevalence of parasite density in asymptomatic children (empirical p-value = 1.5 × 10^-4). Family-based association analysis pointed out one significant association between the intensity of plasmodial infection and a polymorphism located in the ARHGAP26 gene in the 5q31-q33 region (p-value = 3.7 × 10^-5).
This study identified three candidate regions, two of them containing genes that could point out new pathways implicated in the response to malaria infection. Furthermore, we detected one gene associated with malaria infection in the 5q31-q33 region.

HAL - Normandie Université

Red de Bibliotecas Virtuales de Ciencias Sociales de América Latina y El Caribe

HAL Descartes

Horizon / Pleins textes

Hal-Diderot

CIFOL: Case-Intensional First Order Logic

Author: A Bressan
A Bressan
A Gibbard
A Gupta
D Gallin
D Wiggins
DK Lewis
EJ Lowe
GE Hughes
J Bacon
J Butterfield
J Garson
M Bugno
M Dummett
MC Fitting
N Belnap
Nuel Belnap
P Suppes
P Tichý
PT Geach
R Barcan
R Carnap
R Muskens
RH Thomason
RH Thomason
S Kripke
S Kripke
T Sider
T Williamson
Thomas Müller
W Quine
Z Parks
Publication venue: 'Springer Science and Business Media LLC'

Public Library of Science (PLOS)

ISL1 Directly Regulates FGF10 Transcription during Human Cardiac Outflow Formation

Author: A Marguerie
A Megarbane
A Moorman
A Rojas
A Sizarov
A Stolfi
AM Riazi
Arnold Munnich
B Thienpont
BM Riley
BS Snarr
C Golzio
C-L Cai
CA Reznikoff
Candice Babarit
Christelle Golzio
E Dodou
E Goldmuntz
E Goldmuntz
E Havis
E Rohmann
Emmanuelle Havis
EP Kirk
F Pasutto
F Vitelli
FA Stennard
FL Conlon
GG Loots
Gregory Nuel
H Lickert
H Min
H Ohuchi
H Ohuchi
H Sasaki
H Xu
Heather C. Etchevers
I Kostetskii
I Ovcharenko
I Sanchez-Garcia
J Hadchouel
J Klar
J Martinovic-Bouriel
JE VanderMeer
JI Hoffman
JK Takeuchi
JK Takeuchi
JS Waxman
K Lilleväli
K Sekine
KL McBride
KL Waldo
KN Stevens
L Pinson
L Ryckebusch
L Yang
LD Urness
LELM Vissers
M Blanchette
M Delous
M Entesarian
M Merika
M Vega-Hernández
M Vettese-Dadey
MG Posch
Michael Schubert
Michel Vekemans
MJ McCabe
P Agarwal
P Bouillet
Philippe Daubas
R Genead
R O'Rahilly
RF Arauz
RG Kelly
S Benko
S Yuan
SA Miller
SK Lee
SL Pfaff
Stanislas Lyonnet
Stéphane Zaffran
T Brade
T Fairbanks
TJ Desai
U Ahlgren
W Gong
W Herzog
W Liu
Y Kawakami
Y Tomita
Y Watanabe
Y Zhou
Publication venue: Public Library of Science
Publication date: 01/01/2012

The LIM homeodomain gene Islet-1 (ISL1) encodes a transcription factor that has been associated with the multipotency of human cardiac progenitors, and in mice enables the correct deployment of second heart field (SHF) cells to become the myocardium of atria, right ventricle and outflow tract. Other markers have been identified that characterize subdomains of the SHF, such as the fibroblast growth factor Fgf10 in its anterior region. While functional evidence of its essential contribution has been demonstrated in many vertebrate species, SHF expression of Isl1 has been shown in only some models. We examined the relationship between human ISL1 and FGF10 within the embryonic time window during which the linear heart tube remodels into four chambers. ISL1 transcription demarcated an anatomical region supporting the conserved existence of a SHF in humans, and transcription factors of the GATA family were co-expressed therein. In conjunction, we identified a novel enhancer containing a highly conserved ISL1 consensus binding site within the FGF10 first intron. ChIP and EMSA demonstrated its direct occupation by ISL1. Transcription mediated by ISL1 from this FGF10 intronic element was enhanced by the presence of the GATA4 and TBX20 cardiac transcription factors. Finally, transgenic mice confirmed that endogenous factors bound the human FGF10 intronic enhancer to drive reporter expression in the developing cardiac outflow tract. These findings highlight the value of examining developmental regulatory networks directly in human tissues, when possible, to assess candidate non-coding regions that may be responsible for congenital malformations.

HAL-Inserm