Computing Individual Risks based on Family History in Genetic Disease in the Presence of Competing Risks

Author: Bouaziz O
Lefebvre Antoine
Nuel G
Publication date: 14/09/2017

When considering a genetic disease with variable age at onset (e.g., diabetes, familial amyloid neuropathy, cancers), computing the individual risk of the disease based on family history (FH) is of critical interest to both clinicians and patients. Such a risk is very challenging to compute because: 1) the genotype X of the individual of interest is in general unknown; 2) the posterior distribution P(X | FH, T > t) changes with t (T is the age at disease onset for the targeted individual); 3) the competing risk of death is not negligible. In this work, we model this problem using a Bayesian network combined with (right-censored) survival outcomes, where hazard rates depend only on the genotype of each individual. We explain how belief propagation can be used to obtain the posterior distribution of genotypes given the FH, and how to obtain a time-dependent posterior hazard rate for any individual in the pedigree. Finally, we use this posterior hazard rate to compute individual risk, with or without the competing risk of death. Our method is illustrated using the Claus-Easton model for breast cancer (BC). This model assumes an autosomal dominant genetic risk factor such that non-carriers (genotype 00) have a BC hazard rate $\lambda_0(t)$ while carriers (genotypes 01, 10 and 11) have a (much greater) hazard rate $\lambda_1(t)$. Both hazard rates are assumed to be piecewise constant with known values (cuts at 20, 30, ..., 80 years). The competing risk of death is derived from the national French registry.
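The piecewise-constant hazard setting in the abstract above can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names, the hazard values, and the `prior` argument, which stands in for the carrier probability that belief propagation would supply given the FH. These are not the Claus-Easton parameters, and the competing risk of death is left out.

```python
import math

def cumulative_hazard(t, cuts, rates):
    """Integrate a piecewise-constant hazard from 0 to t.

    cuts  : increasing breakpoints in years, e.g. [20, 30, ..., 80]
    rates : len(cuts) + 1 values; rates[i] applies on the i-th interval
    """
    total, left = 0.0, 0.0
    for right, rate in zip(list(cuts) + [math.inf], rates):
        if t <= left:
            break
        total += rate * (min(t, right) - left)
        left = right
    return total

def carrier_posterior(t, prior, cuts, rates0, rates1):
    """P(carrier | still disease-free at age t), by Bayes' rule."""
    s0 = math.exp(-cumulative_hazard(t, cuts, rates0))  # non-carrier survival
    s1 = math.exp(-cumulative_hazard(t, cuts, rates1))  # carrier survival
    return prior * s1 / (prior * s1 + (1.0 - prior) * s0)

def disease_risk(s, t, prior, cuts, rates0, rates1):
    """P(T <= t | T > s) for the genotype mixture, ignoring death."""
    def mixture_survival(u):
        s0 = math.exp(-cumulative_hazard(u, cuts, rates0))
        s1 = math.exp(-cumulative_hazard(u, cuts, rates1))
        return prior * s1 + (1.0 - prior) * s0
    return 1.0 - mixture_survival(t) / mixture_survival(s)

# Hypothetical values, for illustration only.
CUTS = [20, 30]
NON_CARRIER = [0.0, 0.001, 0.002]
CARRIER = [0.0, 0.01, 0.02]
risk_40_to_50 = disease_risk(40, 50, 0.1, CUTS, NON_CARRIER, CARRIER)
```

Because carriers leave the disease-free population faster, `carrier_posterior` decreases with age, which is why the posterior hazard has to be treated as time-dependent.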

arXiv.org e-Print Archive

Directory of Open Access Journals

HAL Descartes

Hal-Diderot

Moments of the Count of a Regular Expression in a Heterogeneous Random Sequence

Author: Nuel G.
Publication venue: Springer Science and Business Media LLC
Publication date: 01/09/2019

We focus here on the distribution of the random count N of a regular expression in a multi-state random sequence generated by a heterogeneous Markov source. We first briefly recall how classical Markov chain embedding techniques reduce the problem to counting specific transitions in a (heterogeneous) order-1 Markov chain over the state space of a deterministic finite automaton. From this result we derive expressions for the mgf/pgf of N as well as its factorial and non-factorial moments. We then introduce the notion of evidence-based constraints in this context. Following the classical forward/backward algorithm for hidden Markov models, we provide explicit recursions for computing the mgf/pgf of N under the evidence constraint. All results are illustrated with a toy example.
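The embedding idea in the abstract above can be sketched in a deliberately simplified setting. The assumptions are a single word rather than a general regular expression, i.i.d. letters rather than a heterogeneous Markov source, and hypothetical function names. The sketch builds a deterministic finite automaton for the word and propagates the joint distribution of (automaton state, count) along the sequence.

```python
def build_dfa(word, alphabet):
    """delta[q][a] = length of the longest prefix of `word` that is a
    suffix of word[:q] + a (the classical pattern-matching automaton)."""
    m = len(word)
    delta = []
    for q in range(m + 1):
        row = {}
        for a in alphabet:
            s = word[:q] + a
            k = min(m, len(s))
            while k > 0 and not s.endswith(word[:k]):
                k -= 1
            row[a] = k
        delta.append(row)
    return delta

def count_distribution(word, probs, n):
    """Exact distribution of the number of (overlapping) occurrences of
    `word` in an i.i.d. sequence of length n with letter law `probs`."""
    alphabet = list(probs)
    delta = build_dfa(word, alphabet)
    m = len(word)
    dist = {(0, 0): 1.0}  # (automaton state, occurrences so far) -> probability
    for _ in range(n):
        nxt = {}
        for (q, c), p in dist.items():
            for a in alphabet:
                q2 = delta[q][a]
                c2 = c + (1 if q2 == m else 0)  # entering state m counts one hit
                nxt[(q2, c2)] = nxt.get((q2, c2), 0.0) + p * probs[a]
        dist = nxt
    out = {}
    for (_, c), p in dist.items():
        out[c] = out.get(c, 0.0) + p
    return out

dist_aa = count_distribution("aa", {"a": 0.5, "b": 0.5}, 3)
```

Tracking the count explicitly keeps the example short. Moment or generating-function recursions of the kind the abstract describes avoid enumerating count values.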

Effective p-value computations using Finite Markov Chain Imbedding (FMCI): application to local score and to pattern statistics

Author: C Chothia
G Nuel
Grégory Nuel
J Kyte
JC Fu
JE Hopcroft
RB Bapat
S Karlin
S Mercier
S Robin
SF Altschul
WH Press
WYW Lou
Publication venue: BioMed Central
Publication date: 01/01/2006

The technique of Finite Markov Chain Imbedding (FMCI) is a classical approach to complex combinatorial problems related to sequences. To obtain efficient algorithms, such approaches are known to require rewriting in terms of recursive relations first. We propose here general recursive algorithms that compute, in a numerically stable manner, the exact cumulative distribution function (CDF) or complementary CDF (CCDF). These algorithms are then applied in two particular cases: the local score of one sequence and pattern statistics. In both cases, asymptotic developments are derived. For the local score, our new approach makes it possible, for the very first time, to compute exact p-values for a practical study (finding hydrophobic segments in a protein database) where only approximations were available before. In this study, the asymptotic approximations appear to be completely unreliable for 99.5% of the considered sequences. Concerning pattern statistics, the new FMCI algorithms dramatically outperform the previous ones: they are more reliable, easier to implement, faster, and have lower memory requirements.
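For the local score application mentioned in the abstract above, a small sketch under stated assumptions shows what an embedded-chain recursion can look like. It uses the standard embedding of the local score in a reflected random walk with an absorbing threshold, i.i.d. integer scores, and hypothetical names. It is not claimed to reproduce the paper's numerically stable algorithms.

```python
def local_score_pvalue(score_probs, n, a):
    """P(H_n >= a), where H_n is the local score (best segment sum) of n
    i.i.d. integer scores drawn from `score_probs` ({score: probability}).

    The chain W_k = max(0, W_{k-1} + X_k) is followed on states 0..a,
    with state a made absorbing; H_n >= a iff the chain reaches a.
    """
    state = [0.0] * (a + 1)
    state[0] = 1.0
    for _ in range(n):
        nxt = [0.0] * (a + 1)
        nxt[a] = state[a]  # absorbed mass stays absorbed
        for w in range(a):
            if state[w] == 0.0:
                continue
            for x, p in score_probs.items():
                w2 = min(a, max(0, w + x))
                nxt[w2] += state[w] * p
        state = nxt
    return state[a]

# Toy scoring scheme (an assumption): +1 or -1 with equal probability.
p_value = local_score_pvalue({1: 0.5, -1: 0.5}, n=2, a=1)
```

Each step costs on the order of `a` times the number of distinct scores, so the exact p-value is obtained in time linear in the sequence length.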

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

HAL Descartes

Exact distribution of a pattern in a set of random sequences generated by a Markov source: applications to biological data

Author: A Bairoch
A Denise
AC Camproux
Anne-Claude Camproux
AP Godbole
B Prum
C Gautier
DA Benson
DL Antzoulakos
E Rocha
G Churchill
G Nuel
G Reinert
GD Stormo
Gregory Nuel
J Becq
J Do
J Fu
J Kleffe
J Martin
J Van Helden
JAD Aston
JC Fu
JE Hopcroft
JM Claverie
Juliette Martin
JW Fickett
K Liolios
L Regad
Leslie Regad
M Crochemore
M Reignier
M Thomas-Chollier
MC Frith
ME Lladser
MX Geske
MY Leung
N Hulo
P Nicodème
P Nicolas
P Pevzner
P Ribeca
R Cowan
S Karlin
S Sourice
T Erhardsson
V Boeva
V Stefanov
VT Stefanov
YM Chang
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: In bioinformatics it is common to search for a pattern of interest in a potentially large set of rather short sequences (upstream gene regions, proteins, exons, etc.). Although many methodological approaches allow practitioners to compute the distribution of a pattern count in a random sequence generated by a Markov source, no specific developments have addressed the counting of occurrences in a set of independent sequences. We aim to address this problem by deriving efficient approaches and algorithms to perform these computations for both low- and high-complexity patterns in the framework of homogeneous or heterogeneous Markov models.

Results: The latest advances in the field allowed us to use a technique of optimal Markov chain embedding based on deterministic finite automata to introduce three innovative algorithms. Algorithm 1 is the only one able to deal with heterogeneous models. It also avoids any convolution product of the pattern distributions in individual sequences. When working with homogeneous models, Algorithm 2 yields a dramatic reduction in complexity by taking advantage of previous computations to obtain moment generating functions efficiently. In the particular case of low- or moderate-complexity patterns, Algorithm 3 exploits power computation and binary decomposition to further reduce the time complexity to a logarithmic scale. All these algorithms, and their relative merits in comparison with existing ones, were then tested and discussed on a toy example and three biological data sets: structural patterns in protein loop structures, PROSITE signatures in a bacterial proteome, and transcription factors in upstream gene regions. On these data sets, we also compared our exact approaches to the tempting approximation that consists of concatenating the sequences in the data set into a single sequence.

Conclusions: Our algorithms prove to be effective and able to handle real data sets with multiple sequences, as well as biological patterns of interest, even when the latter display a high complexity (PROSITE signatures, for example). In addition, these exact algorithms allow us to avoid the edge effect observed under the single-sequence approximation, which leads to erroneous results, especially when the marginal distribution of the model converges slowly toward the stationary distribution. We end with a discussion of our method and its potential improvements.
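To make the multi-sequence setting concrete, here is a hedged sketch of the naive baseline, not of the paper's Algorithms 1 to 3. If the pattern count in one sequence has a known distribution and the r sequences are independent and identically distributed, the total count follows the r-fold convolution of that distribution. Binary decomposition of r brings the number of convolutions down to O(log r), in the spirit of the power computation the abstract mentions. All names are hypothetical.

```python
def convolve(p, q):
    """Distribution of the sum of two independent counts.
    p[i] = P(first count = i), q[j] = P(second count = j)."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out

def convolution_power(p, r):
    """r-fold convolution of p with itself by repeated squaring."""
    result = [1.0]  # distribution of the constant 0
    base = list(p)
    while r > 0:
        if r & 1:
            result = convolve(result, base)
        r >>= 1
        if r:
            base = convolve(base, base)
    return result

# Toy single-sequence distribution (an assumption): 0 or 1 occurrence, equally likely.
total_over_3 = convolution_power([0.5, 0.5], 3)
```

For sequences of different lengths or heterogeneous models, the single-sequence distributions differ and this shortcut no longer applies directly.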

HAL Evry

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

Mining protein loops using a structural alphabet and statistical exceptionality

Author: A Dembo
A Efimov
A Golovin
A Sacan
A Via
AC Camproux
Anne-Claude Camproux
AR Panchenko
B Oliva
BJ Polacco
BL Sibanda
BW Matthews
C Kiss
CG Hunter
CM Venkatachalam
D Leader
D Stuart
DF Burke
E Rocha
EG Hutchinson
EJ Milner-White
F den Hollander
G Ausiello
G Nuel
G Pugalenthi
GD Rose
Gregory Nuel
J Espadaler
J Martin
J van Helden
J Wojcik
JF Leszczynski
JM Kwasigroch
JS Fetrow
JS Richardson
Juliette Martin
JW Sammon
JW Torrance
KC Chou
L Regad
LE Donate
Leslie Regad
LN Johnson
LR Rabiner
LS Bernstein
M Hollander
M Mönnigmann
M Saraste
MY Leung
N Colloc'h
N Fernandez-Fuentes
O Sander
P Fuchs
PA Rice
PN Lewis
R Kolodny
S Karlin
S Kim
S Kullback
S Sourice
SA Benner
SD Rufino
V Pavone
W Kabsch
W Li
WL DeLano
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: Protein loops encompass 50% of protein residues in available three-dimensional structures. These regions are often involved in protein functions, e.g. binding sites and catalytic pockets. However, describing protein loops with conventional tools is a difficult task. Regular secondary structures (helices and strands) have been widely studied, whereas loops are difficult to analyze because they are highly variable in terms of sequence and structure. Due to data sparsity, long loops have rarely been systematically studied.

Results: We developed a simple and accurate method that allows the description and analysis of the structures of short and long loops using structural motifs, without restriction on loop length. This method is based on the structural alphabet HMM-SA. HMM-SA simplifies a three-dimensional protein structure into a one-dimensional string of states, where each state is a four-residue prototype fragment called a structural letter. The difficult task of structurally grouping huge data sets is thus easily accomplished by handling structural-letter strings as in conventional protein sequence analysis. We systematically extracted all seven-residue fragments in a bank of 93000 protein loops and grouped them according to their structural-letter sequence, called a structural word. This approach permits a systematic analysis of loops of all sizes, since we consider structural motifs of seven residues rather than complete loops. We focused the analysis on highly recurrent words of loops (observed more than 30 times). Our study reveals that 73% of loop lengths are covered by only 3310 highly recurrent structural words (out of 28274 observed words). These structural words have low structural variability (mean RMSd of 0.85 Å). As expected, half of these motifs display a flanking-region preference but, interestingly, two thirds are shared by short (fewer than 12 residues) and long loops. Moreover, half of the recurrent motifs exhibit a significant level of amino-acid conservation, with at least four significant positions, and 87% of long loops contain at least one such word. We complement our analysis with the detection of statistically over-represented patterns of structural letters, as in conventional DNA sequence analysis. About 30% (930) of structural words are over-represented, and cover about 40% of loop lengths. Interestingly, these words exhibit lower structural variability and higher sequential specificity, suggesting structural or functional constraints.

Conclusions: We developed a method to systematically decompose and study protein loops using recurrent structural motifs. This method is based on the structural alphabet HMM-SA and not on structural alignment and geometrical parameters. We extracted meaningful structural motifs that are found in both short and long loops. To our knowledge, this is the first time that pattern mining has helped to increase the signal-to-noise ratio in protein loops. This finding helps to better describe protein loops and might reduce the complexity of long-loop analysis. Detailed results are available at http://www.mti.univ-paris-diderot.fr/publication/supplementary/2009/ACCLoop/.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

Future Contingents and the Logic of Temporal Omniscience

At least since Aristotle's famous 'sea-battle' passages in On Interpretation 9, some substantial minority of philosophers has been attracted to the doctrine of the open future: the doctrine that future contingent statements are not true. But, prima facie, such views seem inconsistent with the following intuition: if something has happened, then (looking back) it was the case that it would happen. How can it be that, looking forwards, it isn't true that there will be a sea battle, while also being true that, looking backwards, it was the case that there would be a sea battle? This tension forms, in large part, what might be called the problem of future contingents. A dominant trend in temporal logic and semantic theorizing about future contingents seeks to validate both intuitions. Theorists in this tradition (including some interpreters of Aristotle but, paradigmatically, Thomason (1970), as well as more recent developments in Belnap et al. (2001) and MacFarlane (2003, 2014)) have argued that the tension between the intuitions is merely apparent. In short, such theorists seek to maintain both of the following theses: (i) the open future: future contingents are not true, and (ii) retro-closure: from the fact that something is true, it follows that it was the case that it would be true. It is well known that reflection on the problem of future contingents has in many ways been inspired by importantly parallel issues regarding divine foreknowledge and indeterminism. In this paper, we take up this perspective and ask what accepting both the open future and retro-closure predicts about omniscience. When we theorize about a perfect knower, we are theorizing about what an ideal agent ought to believe. Our contention is that there isn't an acceptable view of ideally rational belief given the assumptions of the open future and retro-closure, and that this casts doubt on the conjunction of those assumptions.

PhilPapers

Crossref

Should We Abandon the t-Test in the Analysis of Gene Expression Microarray Data: A Comparison of Variance Modeling Strategies

Author: Aurelien de Reynies
B Wu
C Kooperberg
C Murie
C Yauk
Caroline Paccard
D Allison
D Chessel
D Rickman
F Jaffrezic
G Marot
G Smyth
G Wright
Gregory Nuel
I Jeffery
J Soulier
JD Storey
Kerby Shedden
L Lamant
L Van 't Veer
L Zhou
Laetitia Marisa
M Kerr
M McCall
M Pirooznia
M Sullivan Pepe
Marine Jeanmougin
Mickael Guedj
N Jain
P Bertheau
P Delmar
R Simon
S Boyault
S Dudoit
S Zhang
T Mary-Huard
T Sorlie
V Tusher
X Huang
Y Benjamini
Publication venue: Public Library of Science
Publication date: 01/01/2010

High-throughput post-genomic studies are now routinely carried out, with promising results, in biological and biomedical research. The main statistical approach for selecting genes differentially expressed between two groups is to apply a t-test, which is the subject of criticism in the literature. Numerous alternatives have been developed based on different and innovative variance modeling strategies. However, a critical issue is that selecting a different test usually leads to a different gene list. In this context, and given the current tendency to apply the t-test, identifying the most efficient approach in practice remains crucial. To help answer this question, we compare eight tests representative of variance modeling strategies in gene expression data: Welch's t-test, ANOVA [1], Wilcoxon's test, SAM [2], RVM [3], limma [4], VarMixt [5] and SMVar [6]. Our comparison process relies on four steps (gene list analysis, simulations, spike-in data and re-sampling) to formulate comprehensive and robust conclusions about test performance in terms of statistical power, false-positive rate, execution time and ease of use. Our results raise concerns about the ability of some methods to control the expected number of false positives at a desirable level. In addition, two tests (limma and VarMixt) show significant improvement over the t-test, in particular for small sample sizes. limma also presents several practical advantages, so we advocate its application to the analysis of gene expression data.
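As a point of reference for the baseline test named in the abstract above, here is a minimal standard-library sketch of Welch's t statistic and its Welch-Satterthwaite degrees of freedom for a single gene. The function name and the toy expression values are assumptions. The moderated-variance methods compared in the paper (limma, VarMixt and others) are not reproduced here.

```python
import math
from statistics import mean, variance

def welch_t(x, y):
    """Welch's t statistic and approximate degrees of freedom for two
    independent samples with possibly unequal variances."""
    vx = variance(x) / len(x)  # squared standard error of mean(x)
    vy = variance(y) / len(y)  # squared standard error of mean(y)
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df

# Toy expression values for one gene in two groups (illustrative only).
t_stat, dof = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
```

With only a handful of replicates per group, the per-gene variance estimates used here are noisy. Variance modeling strategies aim to mitigate that weakness by sharing information across genes.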

CiteSeerX

Public Library of Science (PLOS)

HAL Evry

Crossref

Directory of Open Access Journals

PubMed Central

HAL Descartes

ProdInra