68 research outputs found

    Statistical potentials for evolutionary studies

    Protein sequences are the net result of the interplay of mutation, natural selection and stochastic variation. Probabilistic models of molecular evolution accounting for these processes have been substantially improved in recent years. In particular, models that explicitly incorporate protein structure and site interdependencies have recently been developed, as well as statistical tools for assessing their performance. Despite major advances in this direction, only simple representations of protein structure have been used so far. In this context, the main theme of this dissertation has been the modeling of three-dimensional protein structure for evolutionary studies, taking into account the limitations imposed by computationally demanding phylogenetic methods. First, a general statistical framework for optimizing the parameters of a statistical potential (an energy-like scoring system for sequence-structure compatibility) is presented. The functional form of the potential is then refined, increasing the detail of structural description without inflating computational costs. Always at the residue level, several structural elements are investigated: pairwise distance interactions, solvent accessibility, backbone conformation and flexibility of the residues. The potentials are then incorporated into an evolutionary model and their performance is assessed in terms of model fit, compared to standard evolutionary models. Finally, this new structurally constrained phylogenetic model is used to better understand the selective forces behind the differences in conservation found in genes of very different expression levels.

    Fast optimization of statistical potentials for structurally constrained phylogenetic models

    BACKGROUND: Statistical approaches for protein design are relevant in the field of molecular evolutionary studies. In recent years, new, so-called structurally constrained (SC) models of protein-coding sequence evolution have been proposed, which use statistical potentials to assess sequence-structure compatibility. In a previous work, we defined a statistical framework for optimizing knowledge-based potentials especially suited to SC models. Our method used the maximum likelihood principle and provided what we call the joint potentials. However, the method required numerical estimations by the use of computationally heavy Markov chain Monte Carlo sampling algorithms. RESULTS: Here, we develop an alternative optimization procedure, based on a leave-one-out argument coupled to fast gradient descent algorithms. We show that the leave-one-out potential yields very similar results to the joint approach developed previously, both in terms of the resulting potential parameters, and by Bayes factor evaluation in a phylogenetic context. On the other hand, the leave-one-out approach results in a considerable computational benefit (up to a 1,000-fold decrease in computational time for the optimization procedure). CONCLUSION: Due to its computational speed, the optimization method we propose offers an attractive alternative for the design and empirical evaluation of alternative forms of potentials, using large data sets and high-dimensional parameterizations.

    A maximum likelihood framework for protein design

    BACKGROUND: The aim of protein design is to predict amino-acid sequences compatible with a given target structure. Traditionally envisioned as a purely thermodynamic question, this problem can also be understood in a wider context, where additional constraints are captured by learning the sequence patterns displayed by natural proteins of known conformation. In this latter perspective, however, we still need a theoretical formalization of the question, leading to general and efficient learning methods, and allowing for the selection of fast and accurate objective functions quantifying sequence/structure compatibility. RESULTS: We propose a formulation of the protein design problem in terms of model-based statistical inference. Our framework uses the maximum likelihood principle to optimize the unknown parameters of a statistical potential, which we call an inverse potential to contrast with classical potentials used for structure prediction. We propose an implementation based on Markov chain Monte Carlo, in which the likelihood is maximized by gradient descent and is numerically estimated by thermodynamic integration. The fit of the models is evaluated by cross-validation. We apply this to a simple pairwise contact potential, supplemented with a solvent-accessibility term, and show that the resulting models have a better predictive power than currently available pairwise potentials. Furthermore, the model comparison method presented here allows one to measure the relative contribution of each component of the potential, and to choose the optimal number of accessibility classes, which turns out to be much higher than classically considered. CONCLUSION: Altogether, this reformulation makes it possible to test a wide diversity of models, using different forms of potentials, or accounting for other factors than just the constraint of thermodynamic stability. 
    Ultimately, such model-based statistical analyses may help to understand the forces shaping protein sequences, and driving their evolution.
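
The maximum likelihood idea in the abstract above can be illustrated with a toy sketch. This is not the authors' implementation: the abstract describes Markov chain Monte Carlo sampling with thermodynamic integration, whereas the sketch below assumes a sequence space small enough to enumerate exactly. The two-letter alphabet, the single H-H contact parameter `theta`, the contact map, and the four example sequences are all hypothetical. It assumes a Boltzmann-like model p(s | c) proportional to exp(-theta * n_HH(s, c)), for which the gradient of the mean log-likelihood is the model expectation of n_HH minus its observed mean.

```python
import itertools
import math

ALPHABET = "HP"  # hypothetical two-letter alphabet (hydrophobic / polar)


def n_hh(seq, contacts):
    """Number of contacts whose two residues are both 'H'."""
    return sum(1 for i, j in contacts if seq[i] == "H" and seq[j] == "H")


def expected_n_hh(theta, contacts, length):
    """Exact Boltzmann average of n_hh over all sequences of this length."""
    z = 0.0      # partition function
    total = 0.0  # weighted sum of the feature
    for seq in itertools.product(ALPHABET, repeat=length):
        f = n_hh(seq, contacts)
        w = math.exp(-theta * f)  # weight exp(-E), with E = theta * n_hh
        z += w
        total += w * f
    return total / z


def fit_theta(sequences, contacts, steps=2000, rate=0.5):
    """Gradient ascent on the mean log-likelihood of the toy potential."""
    length = len(sequences[0])
    observed = sum(n_hh(s, contacts) for s in sequences) / len(sequences)
    theta = 0.0
    for _ in range(steps):
        # d(mean log L)/d(theta) = E_model[n_hh] - mean observed n_hh
        theta += rate * (expected_n_hh(theta, contacts, length) - observed)
    return theta


# Hypothetical data: one shared contact map and four "observed" sequences.
CONTACTS = [(0, 2), (1, 3), (0, 3)]
SEQUENCES = ["HHPH", "HPHP", "PHHP", "HHHH"]
theta_hat = fit_theta(SEQUENCES, CONTACTS)
```

At the optimum the model's expected number of H-H contacts matches the observed mean, which is the usual moment-matching property of maximum likelihood in exponential-family models. For realistic alphabets and sequence lengths the enumeration in `expected_n_hh` is infeasible, which is where sampling-based estimates would come in.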

    The Baryon Oscillation Spectroscopic Survey of SDSS-III

    The Baryon Oscillation Spectroscopic Survey (BOSS) is designed to measure the scale of baryon acoustic oscillations (BAO) in the clustering of matter over a larger volume than the combined efforts of all previous spectroscopic surveys of large scale structure. BOSS uses 1.5 million luminous galaxies as faint as i=19.9 over 10,000 square degrees to measure BAO to redshifts z<0.7. Observations of neutral hydrogen in the Lyman alpha forest in more than 150,000 quasar spectra (g<22) will constrain BAO over the redshift range 2.15<z<3.5. Early results from BOSS include the first detection of the large-scale three-dimensional clustering of the Lyman alpha forest and a strong detection from the Data Release 9 data set of the BAO in the clustering of massive galaxies at an effective redshift z = 0.57. We project that BOSS will yield measurements of the angular diameter distance D_A to an accuracy of 1.0% at redshifts z=0.3 and z=0.57 and measurements of H(z) to 1.8% and 1.7% at the same redshifts. Forecasts for Lyman alpha forest constraints predict a measurement of an overall dilation factor that scales the highly degenerate D_A(z) and H^{-1}(z) parameters to an accuracy of 1.9% at z~2.5 when the survey is complete. Here, we provide an overview of the selection of spectroscopic targets, planning of observations, and analysis of data and data quality of BOSS.
    Comment: 49 pages, 16 figures, accepted by A

    A C19MC-LIN28A-MYCN Oncogenic Circuit Driven by Hijacked Super-enhancers Is a Distinct Therapeutic Vulnerability in ETMRs: A Lethal Brain Tumor

    © 2019 Elsevier Inc. Embryonal tumors with multilayered rosettes (ETMRs) are highly lethal infant brain cancers with characteristic amplification of Chr19q13.41 miRNA cluster (C19MC) and enrichment of pluripotency factor LIN28A. Here we investigated C19MC oncogenic mechanisms and discovered a C19MC-LIN28A-MYCN circuit fueled by multiple complex regulatory loops including an MYCN core transcriptional network and super-enhancers resulting from long-range MYCN DNA interactions and C19MC gene fusions. Our data show that this powerful oncogenic circuit, which entraps an early neural lineage network, is potently abrogated by bromodomain inhibitor JQ1, leading to ETMR cell death. Sin-Chan et al. uncover a C19MC-LIN28A-MYCN super-enhancer-dependent oncogenic circuit in embryonal tumors with multilayered rosettes (ETMRs). The circuit entraps an early neural lineage network to sustain embryonic epigenetic programming and is vulnerable to bromodomain inhibition, which promotes ETMR cell death

    Identification of novel risk loci, causal insights, and heritable risk for Parkinson's disease: a meta-analysis of genome-wide association studies

    BACKGROUND: Genome-wide association studies (GWAS) in Parkinson's disease have increased the scope of biological knowledge about the disease over the past decade. We aimed to use the largest aggregate of GWAS data to identify novel risk loci and gain further insight into the causes of Parkinson's disease. METHODS: We did a meta-analysis of 17 datasets from Parkinson's disease GWAS available from European ancestry samples to nominate novel loci for disease risk. These datasets incorporated all available data. We then used these data to estimate heritable risk and develop predictive models of this heritability. We also used large gene expression and methylation resources to examine possible functional consequences as well as tissue, cell type, and biological pathway enrichments for the identified risk factors. Additionally, we examined shared genetic risk between Parkinson's disease and other phenotypes of interest via genetic correlations followed by Mendelian randomisation. FINDINGS: Between Oct 1, 2017, and Aug 9, 2018, we analysed 7·8 million single nucleotide polymorphisms in 37 688 cases, 18 618 UK Biobank proxy-cases (ie, individuals who do not have Parkinson's disease but have a first degree relative that does), and 1·4 million controls. We identified 90 independent genome-wide significant risk signals across 78 genomic regions, including 38 novel independent risk signals in 37 loci. These 90 variants explained 16–36% of the heritable risk of Parkinson's disease depending on prevalence. Integrating methylation and expression data within a Mendelian randomisation framework identified putatively associated genes at 70 risk signals underlying GWAS loci for follow-up functional studies. Tissue-specific expression enrichment analyses suggested Parkinson's disease loci were heavily brain-enriched, with specific neuronal cell types being implicated from single cell data. We found significant genetic correlations with brain volumes (false discovery rate-adjusted p=0·0035 for intracranial volume, p=0·024 for putamen volume), smoking status (p=0·024), and educational attainment (p=0·038). Mendelian randomisation between cognitive performance and Parkinson's disease risk showed a robust association (p=8·00 × 10^−7). INTERPRETATION: These data provide the most comprehensive survey of genetic risk within Parkinson's disease to date, to the best of our knowledge, by revealing many additional Parkinson's disease risk loci, providing a biological context for these risk factors, and showing that a considerable genetic component of this disease remains unidentified. These associations derived from European ancestry datasets will need to be followed-up with more diverse data. FUNDING: The National Institute on Aging at the National Institutes of Health (USA), The Michael J Fox Foundation, and The Parkinson's Foundation (see appendix for full list of funding sources).

    The Baryon Oscillation Spectroscopic Survey of SDSS-III

    The Baryon Oscillation Spectroscopic Survey (BOSS) is designed to measure the scale of baryon acoustic oscillations (BAO) in the clustering of matter over a larger volume than the combined efforts of all previous spectroscopic surveys of large-scale structure. BOSS uses 1.5 million luminous galaxies as faint as i = 19.9 over 10,000 deg^2 to measure BAO to redshifts z < 0.7. Observations of neutral hydrogen in the Lyα forest in more than 150,000 quasar spectra (g < 22) will constrain BAO over the redshift range 2.15 < z < 3.5. Early results from BOSS include the first detection of the large-scale three-dimensional clustering of the Lyα forest and a strong detection from the Data Release 9 data set of the BAO in the clustering of massive galaxies at an effective redshift z = 0.57. We project that BOSS will yield measurements of the angular diameter distance D_A to an accuracy of 1.0% at redshifts z = 0.3 and z = 0.57 and measurements of H(z) to 1.8% and 1.7% at the same redshifts. Forecasts for Lyα forest constraints predict a measurement of an overall dilation factor that scales the highly degenerate D_A(z) and H^{-1}(z) parameters to an accuracy of 1.9% at z ~ 2.5 when the survey is complete. Here, we provide an overview of the selection of spectroscopic targets, planning of observations, and analysis of data and data quality of BOSS.

    Failure of human rhombic lip differentiation underlies medulloblastoma formation

    Medulloblastoma (MB) comprises a group of heterogeneous paediatric embryonal neoplasms of the hindbrain with strong links to early development of the hindbrain [1–4]. Mutations that activate Sonic hedgehog signalling lead to Sonic hedgehog MB in the upper rhombic lip (RL) granule cell lineage [5–8]. By contrast, mutations that activate WNT signalling lead to WNT MB in the lower RL [9,10]. However, little is known about the more commonly occurring group 4 (G4) MB, which is thought to arise in the unipolar brush cell lineage [3,4]. Here we demonstrate that somatic mutations that cause G4 MB converge on the core binding factor alpha (CBFA) complex and mutually exclusive alterations that affect CBFA2T2, CBFA2T3, PRDM6, UTX and OTX2. CBFA2T2 is expressed early in the progenitor cells of the cerebellar RL subventricular zone in Homo sapiens, and G4 MB transcriptionally resembles these progenitors but is stalled in developmental time. Knockdown of OTX2 in model systems relieves this differentiation blockade, which allows MB cells to spontaneously proceed along normal developmental differentiation trajectories. The specific nature of the split human RL, which is destined to generate most of the neurons in the human brain, and its high level of susceptible EOMES+KI67+ unipolar brush cell progenitor cells probably predispose our species to the development of G4 MB.

    Analysis for chromswitch Applications Note

    Analysis, code, and data for the applications note chromswitch: A flexible method for detecting chromatin state switches (Selin Jessa and Claudia L. Kleinman