188 research outputs found

    Identification of genetic factors involved in autism spectrum disorders and dyslexia

    Autism spectrum disorders (ASD) affect approximately 1% of the general population. These disorders are characterized by deficits in social communication as well as stereotyped behaviors and restricted interests. Several genes involved in the determination of ASD have been identified, such as NLGN3-4X, NRXN1-3 and SHANK1-3. In past years, ASD were considered a complex set of monogenic disorders; recent whole-genome studies nevertheless suggest the presence of modifier genes ("multiple hits" model). Dyslexia is characterized by difficulties in learning to read and write and affects 5-15% of the general population. The genetic factors involved remain unknown: only candidate genes or loci have been identified. My thesis had two main objectives: to pursue the identification of genetic factors involved in ASD, and to discover a first genetic factor for dyslexia. I therefore studied two types of populations: on the one hand, patients with ASD (N > 600) from France, Sweden and the Faroe Islands; on the other hand, patients with dyslexia (N > 200) from France, in particular a family with 11 affected individuals over three generations. I used both Illumina microarray technology (600K and 5M arrays) and whole-genome sequencing to conduct linkage and association analyses. Regarding ASD, analyses of CNVs (copy number variants) allowed me to identify new candidate genes for autism and to confirm the association of several synaptic genes with the disorder. In particular, the study of a population of 30 patients from the Faroe Islands confirmed the involvement of the NLGN1 and NRXN1 genes in autism and identified a new candidate gene, IQSEC3. In parallel, I explored PRRT2, located in 16p11.2. PRRT2 encodes a member of the synaptic SNARE complex that mediates the release of synaptic vesicles. I was not able to demonstrate an association with ASD, but I showed that this gene, which is important for several neurological diseases, is under different selective pressures depending on the population considered. Regarding dyslexia, I performed a linkage analysis (lod-score method) on a large family with 11 affected individuals across three generations. This study identified CNTNAP2 as a vulnerability gene for dyslexia, an important finding because this gene is also associated with ASD. However, none of the 20 rare variants discovered by whole-genome sequencing is located in the coding parts of the gene; several variants located in regulatory regions are candidates. In conclusion, my thesis identified new candidate genes for ASD, confirmed the role of synaptic genes in this disorder, and showed for the first time, through linkage analysis, the role of CNTNAP2 in dyslexia.

    Geodesic Sinkhorn: optimal transport for high-dimensional datasets

    Understanding the dynamics and reactions of cells from population snapshots is a major challenge in single-cell transcriptomics. Here, we present Geodesic Sinkhorn, a method for interpolating populations along a data manifold that leverages existing kernels developed for single-cell dimensionality reduction and visualization methods. Our Geodesic Sinkhorn method uses a heat-geodesic ground distance that, compared to Euclidean ground distances, is more accurate for interpolating single-cell dynamics on a wide variety of datasets and significantly speeds up the computation for sparse kernels. We first apply Geodesic Sinkhorn to 10 single-cell transcriptomics time series interpolation datasets as a drop-in replacement for existing interpolation methods, where it outperforms them on all datasets, showing its effectiveness in modeling cell dynamics. Second, we show how to efficiently approximate the operator with polynomial kernels, allowing us to improve scaling to large datasets. Finally, we define the conditional Wasserstein-average treatment effect and show how it can elucidate the treatment effect on single-cell populations in a drug screen. Comment: 15 pages, 5 tables, 5 figures, submitted to RECOMB 202
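    As a sketch of the entropic-transport core this method builds on, the code below runs standard Sinkhorn iterations on a precomputed ground-cost matrix. The function name and toy data are illustrative, not from the paper; plain squared-Euclidean cost is used here, whereas Geodesic Sinkhorn's contribution is to substitute a heat-geodesic distance.

    ```python
    import numpy as np

    def sinkhorn_plan(C, a, b, eps=0.05, n_iter=1000):
        """Entropy-regularized OT via standard Sinkhorn iterations.
        C: (n, m) ground-cost matrix; a, b: source/target marginals.
        Geodesic Sinkhorn changes the ground distance, not this loop."""
        K = np.exp(-C / eps)
        u = np.ones_like(a)
        for _ in range(n_iter):
            v = b / (K.T @ u)       # scale to match column marginals
            u = a / (K @ v)         # scale to match row marginals
        return u[:, None] * K * v[None, :]

    # Toy example with a squared-Euclidean cost; a heat-geodesic
    # distance matrix would be dropped in here instead.
    rng = np.random.default_rng(0)
    x = rng.normal(size=(5, 2))
    y = rng.normal(size=(6, 2)) + 1.0
    C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    C = C / C.max()                 # normalize cost scale for stability
    a = np.full(5, 1 / 5)
    b = np.full(6, 1 / 6)
    P = sinkhorn_plan(C, a, b)      # transport plan with marginals a, b
    ```

    The resulting plan P is a joint distribution whose row and column sums recover the two input histograms, which is what makes it usable for interpolating between population snapshots.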

    A Heat Diffusion Perspective on Geodesic Preserving Dimensionality Reduction

    Diffusion-based manifold learning methods have proven useful in representation learning and dimensionality reduction of modern high-dimensional, high-throughput, noisy datasets. Such datasets are especially present in fields like biology and physics. While it is thought that these methods preserve the underlying manifold structure of data by learning a proxy for geodesic distances, no specific theoretical links have been established. Here, we establish such a link via results in Riemannian geometry explicitly connecting heat diffusion to manifold distances. In this process, we also formulate a more general heat-kernel-based manifold embedding method that we call heat geodesic embeddings. This novel perspective makes clearer the choices available in manifold learning and denoising. Results show that our method outperforms the existing state of the art in preserving ground-truth manifold distances and cluster structure in toy datasets. We also showcase our method on single-cell RNA-sequencing datasets with both continuum and cluster structure, where it enables interpolation of withheld timepoints of data. Finally, we show that the parameters of our more general method can be configured to give results similar to PHATE (a state-of-the-art diffusion-based manifold learning method) as well as SNE (an attraction/repulsion neighborhood-based method that forms the basis of t-SNE). Comment: 31 pages, 13 figures, 10 tables
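    A minimal illustration of the heat-kernel idea: build a Gaussian affinity graph, exponentiate its Laplacian to obtain the heat kernel e^{-tL}, and measure distances between heat-kernel rows. This is a simplified diffusion-style proxy, not the paper's exact heat-geodesic construction; the function name and parameters are illustrative.

    ```python
    import numpy as np

    def heat_kernel_distances(X, t=0.5, sigma=1.0):
        """Pairwise distances between rows of the heat kernel e^{-tL}
        on a Gaussian affinity graph over the points in X."""
        D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        W = np.exp(-D2 / (2 * sigma ** 2))        # Gaussian affinities
        L = np.diag(W.sum(1)) - W                 # combinatorial graph Laplacian
        lam, Phi = np.linalg.eigh(L)              # L is symmetric PSD
        H = Phi @ np.diag(np.exp(-t * lam)) @ Phi.T   # heat kernel at time t
        G = H @ H.T                               # Gram matrix of kernel rows
        sq = np.diag(G)[:, None] + np.diag(G)[None, :] - 2 * G
        return np.sqrt(np.maximum(sq, 0.0))

    # Two nearby points and one far point on a line: heat has diffused
    # between the close pair but barely reached the outlier.
    X = np.array([[0.0], [0.2], [5.0]])
    D = heat_kernel_distances(X)
    ```

    Because the distance is mediated by diffusion on the graph rather than by straight lines in ambient space, nearby on-manifold points end up much closer than points separated by low-affinity regions.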

    Manifold Interpolating Optimal-Transport Flows for Trajectory Inference

    We present a method called Manifold Interpolating Optimal-Transport Flow (MIOFlow) that learns stochastic, continuous population dynamics from static snapshot samples taken at sporadic timepoints. MIOFlow combines dynamic models, manifold learning, and optimal transport by training neural ordinary differential equations (Neural ODEs) to interpolate between static population snapshots, as penalized by optimal transport with a manifold ground distance. Further, we ensure that the flow follows the geometry by operating in the latent space of an autoencoder that we call a geodesic autoencoder (GAE). In the GAE, the latent-space distance between points is regularized to match a novel multiscale geodesic distance on the data manifold that we define. We show that this method is superior to normalizing flows, Schrödinger bridges, and other generative models designed to flow from noise to data in terms of interpolating between populations. Theoretically, we link these trajectories with dynamic optimal transport. We evaluate our method on simulated data with bifurcations and merges, as well as scRNA-seq data from embryoid body differentiation and acute myeloid leukemia treatment. Comment: Presented at NeurIPS 2022, 24 pages, 7 tables, 14 figures
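    The GAE regularization described above can be sketched as a simple distance-matching penalty. The function below is illustrative, not the authors' code; the paper's multiscale geodesic distance is assumed to be precomputed in D_geo.

    ```python
    import numpy as np

    def gae_distance_loss(Z, D_geo):
        """Geodesic autoencoder regularizer sketch: penalize mismatch
        between latent pairwise distances and precomputed geodesic
        distances on the data manifold."""
        Dz = np.sqrt(((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1))
        return float(((Dz - D_geo) ** 2).mean())

    # If latent distances already match the geodesic targets, the loss
    # is zero; any distortion of the latent space raises it.
    Z = np.array([[0.0], [1.0], [3.0]])
    D_geo = np.abs(Z - Z.T)          # 1-D targets: |z_i - z_j|
    loss_matched = gae_distance_loss(Z, D_geo)
    loss_off = gae_distance_loss(Z * 2.0, D_geo)
    ```

    Adding this term to the autoencoder's reconstruction loss is what lets the Neural ODE operate in a latent space whose Euclidean distances approximate manifold distances.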

    Simulation-free Schrödinger bridges via score and flow matching

    We present simulation-free score and flow matching ([SF]²M), a simulation-free objective for inferring stochastic dynamics given unpaired source and target samples drawn from arbitrary distributions. Our method generalizes both the score-matching loss used in the training of diffusion models and the recently proposed flow matching loss used in the training of continuous normalizing flows. [SF]²M interprets continuous-time stochastic generative modeling as a Schrödinger bridge (SB) problem. It relies on static entropy-regularized optimal transport, or a minibatch approximation, to efficiently learn the SB without simulating the learned stochastic process. We find that [SF]²M is more efficient and gives more accurate solutions to the SB problem than simulation-based methods from prior work. Finally, we apply [SF]²M to the problem of learning cell dynamics from snapshot data. Notably, [SF]²M is the first method to accurately model cell dynamics in high dimensions and can recover known gene regulatory networks from simulated data. Comment: A version of this paper appeared in the New Frontiers in Learning, Control, and Dynamical Systems workshop at ICML 2023. Code: https://github.com/atong01/conditional-flow-matchin
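    The "static entropy-regularized optimal transport, or a minibatch approximation" amounts to drawing source-target index pairs from a fixed coupling matrix rather than simulating trajectories. A hedged sketch of that sampling step (the function and toy plan are illustrative, not from the paper's code):

    ```python
    import numpy as np

    def sample_pairs(P, n, rng):
        """Draw (i, j) index pairs with probability proportional to a
        static OT plan P, giving training pairs without simulating the
        underlying stochastic process."""
        p = (P / P.sum()).ravel()
        flat = rng.choice(P.size, size=n, p=p)
        return np.unravel_index(flat, P.shape)

    # With a diagonal (deterministic) toy plan, every draw pairs each
    # source point with its own target.
    rng = np.random.default_rng(1)
    P = np.eye(4) / 4.0
    i, j = sample_pairs(P, 100, rng)
    ```

    In practice P would come from an entropic OT solver on the source and target minibatches; the sampled pairs then feed the score- and flow-matching regression losses.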

    Improving and generalizing flow-based generative models with minibatch optimal transport

    Continuous normalizing flows (CNFs) are an attractive generative modeling technique, but they have been held back by limitations in their simulation-based maximum likelihood training. We introduce the generalized conditional flow matching (CFM) technique, a family of simulation-free training objectives for CNFs. CFM features a stable regression objective like that used to train the stochastic flow in diffusion models, but enjoys the efficient inference of deterministic flow models. In contrast to both diffusion models and prior CNF training algorithms, CFM does not require the source distribution to be Gaussian or require evaluation of its density. A variant of our objective is optimal transport CFM (OT-CFM), which creates simpler flows that are more stable to train and lead to faster inference, as evaluated in our experiments. Furthermore, OT-CFM is the first method to compute dynamic OT in a simulation-free way. Training CNFs with CFM improves results on a variety of conditional and unconditional generation tasks, such as inferring single-cell dynamics, unsupervised image translation, and Schrödinger bridge inference. Comment: A version of this paper appeared in the New Frontiers in Learning, Control, and Dynamical Systems workshop at ICML 2023. Title change from v1. Code: https://github.com/atong01/conditional-flow-matchin
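    To make the minibatch-OT idea concrete, here is a sketch of how one OT-CFM training batch could be assembled: pair minibatch samples with an exact OT assignment (Hungarian algorithm on squared-Euclidean cost), then build regression targets from the constant velocity of the straight path between paired points. Function and variable names are illustrative; the neural vector field and its loss are omitted.

    ```python
    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def ot_cfm_targets(x0, x1, rng):
        """One training batch: OT-pair minibatch samples, then return
        (x_t, u_t, t) regression targets for the vector field along the
        straight path between paired points."""
        C = ((x0[:, None, :] - x1[None, :, :]) ** 2).sum(-1)
        i, j = linear_sum_assignment(C)      # exact minibatch OT pairing
        x0p, x1p = x0[i], x1[j]
        t = rng.uniform(size=(len(x0), 1))
        xt = (1 - t) * x0p + t * x1p         # sample on the straight path
        ut = x1p - x0p                       # target velocity, constant in t
        return xt, ut, t

    # OT pairing matches 0 with 1 and 10 with 9, so target velocities
    # are the short moves +1 and -1 rather than the crossing +9 and -9
    # that random pairing would produce.
    x0 = np.array([[0.0], [10.0]])
    x1 = np.array([[9.0], [1.0]])
    xt, ut, t = ot_cfm_targets(x0, x1, np.random.default_rng(0))
    ```

    This pairing step is what yields the "simpler flows" the abstract mentions: straight conditional paths no longer cross, so the regressed field is easier to learn and faster to integrate.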

    Improving and generalizing flow-based generative models with minibatch optimal transport

    Continuous normalizing flows (CNFs) are an attractive generative modeling technique, but they have been held back by limitations in their simulation-based maximum likelihood training. We introduce the generalized conditional flow matching (CFM) technique, a family of simulation-free training objectives for CNFs. CFM features a stable regression objective like that used to train the stochastic flow in diffusion models, but enjoys the efficient inference of deterministic flow models. In contrast to both diffusion models and prior CNF training algorithms, CFM does not require the source distribution to be Gaussian or require evaluation of its density. A variant of our objective is optimal transport CFM (OT-CFM), which creates simpler flows that are more stable to train and lead to faster inference, as evaluated in our experiments. Furthermore, we show that when the true OT plan is available, our OT-CFM method approximates dynamic OT. Training CNFs with CFM improves results on a variety of conditional and unconditional generation tasks, such as inferring single-cell dynamics, unsupervised image translation, and Schrödinger bridge inference.

    Assessing Neural Network Representations During Training Using Noise-Resilient Diffusion Spectral Entropy

    Entropy and mutual information in neural networks provide rich information on the learning process, but they have proven difficult to compute reliably in high dimensions. Indeed, in noisy and high-dimensional data, traditional estimates in ambient dimensions approach a fixed entropy and are prohibitively hard to compute. To address these issues, we leverage data geometry to access the underlying manifold and reliably compute these information-theoretic measures. Specifically, we define diffusion spectral entropy (DSE) in neural representations of a dataset as well as diffusion spectral mutual information (DSMI) between different variables representing data. First, we show that they form noise-resistant measures of intrinsic dimensionality and relationship strength in high-dimensional simulated data that outperform classic Shannon entropy, nonparametric estimation, and mutual information neural estimation (MINE). We then study the evolution of representations in classification networks with supervised learning, self-supervision, or overfitting. We observe that (1) DSE of neural representations increases during training; (2) DSMI with the class label increases during generalizable learning but stays stagnant during overfitting; (3) DSMI with the input signal shows differing trends: on MNIST it increases, while on CIFAR-10 and STL-10 it decreases. Finally, we show that DSE can be used to guide better network initialization and that DSMI can be used to predict downstream classification accuracy across 962 models on ImageNet. The official implementation is available at https://github.com/ChenLiu-1996/DiffusionSpectralEntropy
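    A minimal sketch of the DSE computation described above. The Gaussian kernel and symmetric normalization here are plausible defaults rather than necessarily the authors' exact choices: build a diffusion operator on the data, then take the Shannon entropy of its normalized eigenvalue spectrum.

    ```python
    import numpy as np

    def diffusion_spectral_entropy(X, t=1, sigma=1.0):
        """Sketch of DSE: Shannon entropy of the powered eigenvalue
        spectrum of a diffusion operator built on the points in X."""
        D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        K = np.exp(-D2 / (2 * sigma ** 2))             # Gaussian affinities
        d = K.sum(1)
        A = K / np.sqrt(d[:, None] * d[None, :])       # symmetric normalization,
        lam = np.abs(np.linalg.eigvalsh(A)) ** t       # same spectrum as the
        p = lam / lam.sum()                            # random-walk operator
        p = p[p > 1e-12]                               # drop numerical zeros
        return float(-(p * np.log(p)).sum())

    # Spread-out data excites many diffusion modes; degenerate data
    # collapses the spectrum to a single mode (entropy near zero).
    rng = np.random.default_rng(0)
    H_spread = diffusion_spectral_entropy(rng.normal(size=(50, 3)))
    H_tight = diffusion_spectral_entropy(np.zeros((50, 3)))
    ```

    Because the entropy is taken over the operator's spectrum rather than over density estimates in ambient space, it stays informative in high dimensions where classic Shannon-entropy estimators saturate.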

    Investigating the contributions of circadian pathway and insomnia risk genes to autism and sleep disturbances

    Sleep disturbance is prevalent in youth with Autism Spectrum Disorder (ASD). Researchers have posited that circadian dysfunction may contribute to sleep problems or exacerbate ASD symptomatology. However, there is limited genetic evidence of this. It is also unclear how insomnia risk genes identified through GWAS in general populations relate to ASD and to common sleep problems such as insomnia traits in ASD. We investigated the contribution of copy number variants (CNVs) encompassing circadian pathway genes and insomnia risk genes to ASD risk, as well as to sleep disturbances in children with ASD. We studied 5860 ASD probands and 2092 unaffected siblings from the Simons Simplex Collection (SSC) and the MSSNG database, as well as 7509 individuals from two unselected populations (IMAGEN and Generation Scotland). Sleep duration and insomnia symptoms were parent-reported for SSC probands. We identified 335 and 616 rare CNVs encompassing circadian and insomnia risk genes, respectively. Deletions and duplications containing circadian genes were overrepresented in ASD probands compared to siblings and unselected controls. For insomnia risk genes, deletions (but not duplications) were associated with ASD in both cohorts. Results remained significant after adjusting for cognitive ability. CNVs containing circadian pathway and insomnia risk genes showed a stronger association with ASD than CNVs containing other genes. Circadian genes did not influence sleep duration or insomnia traits in ASD. Insomnia risk genes intolerant to haploinsufficiency increased risk for insomnia when duplicated. CNVs encompassing circadian and insomnia risk genes increase ASD liability with little to no observable impact on sleep disturbances.