188 research outputs found
Identification of genetic factors involved in autism spectrum disorders and dyslexia
Autism spectrum disorders (ASD) affect approximately 1% of the general population. These disorders are characterized by deficits in social communication, as well as stereotyped behaviors and restricted interests. Several genes involved in the determination of ASD have been identified, such as NLGN3-4X, NRXN1-3 and SHANK1-3. In previous years, ASD were considered a complex set of monogenic disorders. However, recent whole-genome studies suggest the presence of modifier genes (the "multiple hits model"). Dyslexia is characterized by difficulty in learning to read and write, and affects 5-15% of the general population. The genetic factors involved remain unknown for now, since only candidate genes or loci have been identified. My thesis had two main objectives: to pursue the identification of genetic factors involved in ASD, and to discover a first genetic factor for dyslexia. I therefore studied two types of populations: on the one hand, patients with ASD (N > 600) from France, Sweden and the Faroe Islands; on the other hand, patients with dyslexia (N > 200) from France, in particular a family with 11 affected individuals across 3 generations. I used both Illumina microarray technology (600K and 5M) and whole-genome sequencing to conduct linkage and association analyses. Regarding ASD, analyses of copy number variants (CNVs) allowed me to identify new candidate genes and to confirm the association of several synaptic genes with autism. In particular, the study of a population of 30 patients from the Faroe Islands confirmed the involvement of the NLGN1 and NRXN1 genes in autism and identified a new candidate gene, IQSEC3. In parallel, I explored PRRT2, located in 16p11.2. PRRT2 encodes a member of the synaptic SNARE complex that allows the release of synaptic vesicles. I was not able to demonstrate any association with ASD, but I showed that this gene, which is important for some neurological diseases, is under different selection pressures depending on the population considered. Regarding dyslexia, I performed a linkage analysis (lod-score method) on a large family with 11 affected individuals across three generations. This study identified CNTNAP2 as a vulnerability gene for dyslexia. This finding is important because the same gene is also associated with ASD. However, none of the 20 rare variants discovered by whole-genome sequencing is located in the coding parts of the gene; several variants located in regulatory regions are candidates. In conclusion, the results of my thesis enabled the identification of candidate genes for ASD, the confirmation of the role of synaptic genes in this disorder, and the first demonstration, through linkage analysis, of the role of CNTNAP2 in dyslexia.
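The lod-score method mentioned in this abstract can be illustrated with a minimal sketch. It assumes the simplest textbook setting (two-point linkage, phase-known meioses); the function name and the counts in the usage note are hypothetical and are not taken from the thesis, which does not state how its lod scores were computed.

```python
import math


def lod_score(n_meioses, n_recombinants, theta):
    """Two-point LOD score for phase-known meioses (textbook case, an assumption).

    LOD(theta) = log10( L(theta) / L(0.5) ), where
    L(theta) = theta^k * (1 - theta)^(n - k) for k recombinants in n meioses.
    """
    n, k = n_meioses, n_recombinants
    if not 0 <= k <= n:
        raise ValueError("n_recombinants must lie between 0 and n_meioses")
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie in [0, 0.5]")
    if theta == 0.0:
        # Any observed recombinant is impossible under complete linkage.
        if k > 0:
            return float("-inf")
        log_l_theta = 0.0
    else:
        log_l_theta = k * math.log10(theta) + (n - k) * math.log10(1.0 - theta)
    # Null hypothesis: no linkage, recombination fraction 0.5.
    log_l_null = n * math.log10(0.5)
    return log_l_theta - log_l_null
```

With these hypothetical inputs, ten informative meioses and no recombinants give `lod_score(10, 0, 0.0)`, which equals 10 * log10(2), about 3.01, just above the conventional threshold of 3 for declaring linkage.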
Geodesic Sinkhorn: optimal transport for high-dimensional datasets
Understanding the dynamics and reactions of cells from population snapshots
is a major challenge in single-cell transcriptomics. Here, we present Geodesic
Sinkhorn, a method for interpolating populations along a data manifold that
leverages existing kernels developed for single-cell dimensionality reduction
and visualization methods. Our Geodesic Sinkhorn method uses a heat-geodesic
ground distance that, as compared to Euclidean ground distances, is more
accurate for interpolating single-cell dynamics on a wide variety of datasets
and significantly speeds up the computation for sparse kernels. We first apply
Geodesic Sinkhorn to 10 single-cell transcriptomics time series interpolation
datasets as a drop-in replacement for existing interpolation methods, where it
outperforms them on all datasets, showing its effectiveness in modeling cell
dynamics. Second, we show how to efficiently approximate the operator with
polynomial kernels allowing us to improve scaling to large datasets. Finally,
we define the conditional Wasserstein-average treatment effect and show how it
can elucidate the treatment effect on single-cell populations on a drug screen.
Comment: 15 pages, 5 tables, 5 figures, submitted to RECOMB 202
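As a rough illustration of the idea in this abstract, the sketch below runs standard Sinkhorn iterations with a graph heat kernel in place of the usual Euclidean Gibbs kernel. It is an assumption-laden toy, not the authors' implementation: the function names are hypothetical, the heat kernel is computed by dense eigendecomposition rather than the polynomial approximation the abstract mentions, and the graph is a five-node path.

```python
import numpy as np


def heat_kernel(adjacency, t):
    """Dense heat kernel exp(-t L) for the combinatorial Laplacian L = D - A."""
    A = np.asarray(adjacency, dtype=float)
    L = np.diag(A.sum(axis=1)) - A
    evals, evecs = np.linalg.eigh(L)  # L is symmetric
    return (evecs * np.exp(-t * evals)) @ evecs.T


def sinkhorn_plan(a, b, K, n_iter=500):
    """Sinkhorn scaling: find u, v so that diag(u) K diag(v) has marginals a, b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iter):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]


# Toy path graph 0-1-2-3-4 (hypothetical data).
A = np.diag(np.ones(4), 1) + np.diag(np.ones(4), -1)
K = heat_kernel(A, t=1.0)
a = np.array([0.5, 0.5, 0.0, 0.0, 0.0])  # source mass on the left end
b = np.array([0.0, 0.0, 0.0, 0.5, 0.5])  # target mass on the right end
plan = sinkhorn_plan(a, b, K)
```

Because the kernel is built from the graph, mass can only be "cheap" to move along edges, which is the sense in which the ground distance follows the data manifold rather than straight lines in ambient space.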
A Heat Diffusion Perspective on Geodesic Preserving Dimensionality Reduction
Diffusion-based manifold learning methods have proven useful in
representation learning and dimensionality reduction of modern high
dimensional, high throughput, noisy datasets. Such datasets are especially
present in fields like biology and physics. While it is thought that these
methods preserve underlying manifold structure of data by learning a proxy for
geodesic distances, no specific theoretical links have been established. Here,
we establish such a link via results in Riemannian geometry explicitly
connecting heat diffusion to manifold distances. In this process, we also
formulate a more general heat kernel based manifold embedding method that we
call heat geodesic embeddings. This novel perspective makes clearer the choices
available in manifold learning and denoising. Results show that our method
outperforms existing state of the art in preserving ground truth manifold
distances, and preserving cluster structure in toy datasets. We also showcase
our method on single cell RNA-sequencing datasets with both continuum and
cluster structure, where our method enables interpolation of withheld
timepoints of data. Finally, we show that parameters of our more general method
can be configured to give results similar to PHATE (a state-of-the-art
diffusion based manifold learning method) as well as SNE (an
attraction/repulsion neighborhood based method that forms the basis of t-SNE).
Comment: 31 pages, 13 figures, 10 tables
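A classical result linking heat diffusion to geodesic distance is Varadhan's formula, d(x, y)^2 = lim_{t -> 0} -4t log h_t(x, y), where h_t is the heat kernel. Assuming this is the kind of link the abstract refers to, the sketch below checks it numerically on the real line, where the heat kernel has a closed form. The function names are hypothetical, and this is not the heat geodesic embedding method itself.

```python
import math


def log_euclidean_heat_kernel(x, y, t, dim=1):
    """Log of the Euclidean heat kernel (4 pi t)^(-dim/2) exp(-|x - y|^2 / (4t)).

    Working in log space avoids underflow for small t.
    """
    sq_dist = (x - y) ** 2
    return -0.5 * dim * math.log(4.0 * math.pi * t) - sq_dist / (4.0 * t)


def varadhan_distance(x, y, t):
    """Distance estimate sqrt(-4 t log h_t(x, y)) at a finite diffusion time t."""
    value = -4.0 * t * log_euclidean_heat_kernel(x, y, t)
    return math.sqrt(max(value, 0.0))


# The estimate approaches the true distance |0 - 2| = 2 as t shrinks.
estimates = {t: varadhan_distance(0.0, 2.0, t) for t in (1.0, 1e-2, 1e-3)}
```

At finite t the estimate is biased by the normalization term of the kernel, which is one reason the choice of diffusion time matters in practice.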
Manifold Interpolating Optimal-Transport Flows for Trajectory Inference
We present a method called Manifold Interpolating Optimal-Transport Flow
(MIOFlow) that learns stochastic, continuous population dynamics from static
snapshot samples taken at sporadic timepoints. MIOFlow combines dynamic models,
manifold learning, and optimal transport by training neural ordinary
differential equations (Neural ODE) to interpolate between static population
snapshots as penalized by optimal transport with manifold ground distance.
Further, we ensure that the flow follows the geometry by operating in the
latent space of an autoencoder that we call a geodesic autoencoder (GAE). In
GAE the latent space distance between points is regularized to match a novel
multiscale geodesic distance on the data manifold that we define. We show that
this method is superior to normalizing flows, Schrödinger bridges and other
generative models that are designed to flow from noise to data in terms of
interpolating between populations. Theoretically, we link these trajectories
with dynamic optimal transport. We evaluate our method on simulated data with
bifurcations and merges, as well as scRNA-seq data from embryoid body
differentiation, and acute myeloid leukemia treatment.
Comment: Presented at NeurIPS 2022, 24 pages, 7 tables, 14 figures
Simulation-free Schrödinger bridges via score and flow matching
We present simulation-free score and flow matching ([SF]M), a
simulation-free objective for inferring stochastic dynamics given unpaired
source and target samples drawn from arbitrary distributions. Our method
generalizes both the score-matching loss used in the training of diffusion
models and the recently proposed flow matching loss used in the training of
continuous normalizing flows. [SF]M interprets continuous-time stochastic
generative modeling as a Schrödinger bridge (SB) problem. It relies on static
entropy-regularized optimal transport, or a minibatch approximation, to
efficiently learn the SB without simulating the learned stochastic process. We
find that [SF]M is more efficient and gives more accurate solutions to the
SB problem than simulation-based methods from prior work. Finally, we apply
[SF]M to the problem of learning cell dynamics from snapshot data. Notably,
[SF]M is the first method to accurately model cell dynamics in high
dimensions and can recover known gene regulatory networks from simulated data.
Comment: A version of this paper appeared in the New Frontiers in Learning,
Control, and Dynamical Systems workshop at ICML 2023. Code:
https://github.com/atong01/conditional-flow-matchin
Improving and generalizing flow-based generative models with minibatch optimal transport
Continuous normalizing flows (CNFs) are an attractive generative modeling
technique, but they have been held back by limitations in their
simulation-based maximum likelihood training. We introduce the generalized
conditional flow matching (CFM) technique, a family of simulation-free training
objectives for CNFs. CFM features a stable regression objective like that used
to train the stochastic flow in diffusion models but enjoys the efficient
inference of deterministic flow models. In contrast to both diffusion models
and prior CNF training algorithms, CFM does not require the source distribution
to be Gaussian or require evaluation of its density. A variant of our objective
is optimal transport CFM (OT-CFM), which creates simpler flows that are more
stable to train and lead to faster inference, as evaluated in our experiments.
Furthermore, OT-CFM is the first method to compute dynamic OT in a
simulation-free way. Training CNFs with CFM improves results on a variety of
conditional and unconditional generation tasks, such as inferring single cell
dynamics, unsupervised image translation, and Schrödinger bridge inference.
Comment: A version of this paper appeared in the New Frontiers in Learning,
Control, and Dynamical Systems workshop at ICML 2023. Title change from v1.
Code: https://github.com/atong01/conditional-flow-matchin
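To make the "regression objective" concrete, here is a minimal sketch of how training targets could be formed under one common choice: straight-line conditional paths x_t = (1 - t) x0 + t x1 with constant target velocity x1 - x0. This choice, the function names, and the toy data are assumptions for illustration and are not taken from the linked repository. For the optimal transport variant, the sketch pairs 1-D minibatch samples by sorting, which is the exact OT coupling only in one dimension with a convex cost.

```python
import numpy as np


def cfm_training_batch(x0, x1, rng, sigma=0.0):
    """Sample (t, x_t, u_t) for straight-line conditional paths (an assumed choice)."""
    t = rng.uniform(size=(len(x0), 1))
    noise = sigma * rng.standard_normal(x0.shape)
    x_t = (1.0 - t) * x0 + t * x1 + noise
    u_t = x1 - x0  # target velocity the model should regress onto
    return t, x_t, u_t


def ot_pair_1d(x0, x1):
    """Exact minibatch OT pairing for 1-D samples: match points in sorted order."""
    return x0[np.argsort(x0[:, 0])], x1[np.argsort(x1[:, 0])]


def cfm_loss(model, t, x_t, u_t):
    """Mean squared error between a velocity model and the target velocity."""
    return float(np.mean(np.sum((model(t, x_t) - u_t) ** 2, axis=1)))


rng = np.random.default_rng(0)
x0 = rng.standard_normal((64, 1))        # source samples (hypothetical)
x1 = rng.standard_normal((64, 1)) + 3.0  # target samples (hypothetical)
x0_ot, x1_ot = ot_pair_1d(x0, x1)
t, x_t, u_t = cfm_training_batch(x0_ot, x1_ot, rng)
```

No ODE is simulated at any point: each training example is built directly from a sampled pair and a sampled time, which is what "simulation-free" means here. The sorted pairing has a lower mean squared displacement than the random pairing, consistent with the abstract's claim that OT pairing yields simpler flows.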
Improving and generalizing flow-based generative models with minibatch optimal transport
Continuous normalizing flows (CNFs) are an attractive generative modeling technique, but they have been held back by limitations in their simulation-based maximum likelihood training. We introduce the generalized conditional flow matching (CFM) technique, a family of simulation-free training objectives for CNFs. CFM features a stable regression objective like that used to train the stochastic flow in diffusion models but enjoys the efficient inference of deterministic flow models. In contrast to both diffusion models and prior CNF training algorithms, CFM does not require the source distribution to be Gaussian or require evaluation of its density. A variant of our objective is optimal transport CFM (OT-CFM), which creates simpler flows that are more stable to train and lead to faster inference, as evaluated in our experiments. Furthermore, we show that when the true OT plan is available, our OT-CFM method approximates dynamic OT. Training CNFs with CFM improves results on a variety of conditional and unconditional generation tasks, such as inferring single cell dynamics, unsupervised image translation, and Schrödinger bridge inference.
Assessing Neural Network Representations During Training Using Noise-Resilient Diffusion Spectral Entropy
Entropy and mutual information in neural networks provide rich information on
the learning process, but they have proven difficult to compute reliably in
high dimensions. Indeed, in noisy and high-dimensional data, traditional
estimates in ambient dimensions approach a fixed entropy and are prohibitively
hard to compute. To address these issues, we leverage data geometry to access
the underlying manifold and reliably compute these information-theoretic
measures. Specifically, we define diffusion spectral entropy (DSE) in neural
representations of a dataset as well as diffusion spectral mutual information
(DSMI) between different variables representing data. First, we show that they
form noise-resistant measures of intrinsic dimensionality and relationship
strength in high-dimensional simulated data that outperform classic Shannon
entropy, nonparametric estimation, and mutual information neural estimation
(MINE). We then study the evolution of representations in classification
networks with supervised learning, self-supervision, or overfitting. We observe
that (1) DSE of neural representations increases during training; (2) DSMI with
the class label increases during generalizable learning but stays stagnant
during overfitting; (3) DSMI with the input signal shows differing trends: on
MNIST it increases, while on CIFAR-10 and STL-10 it decreases. Finally, we show
that DSE can be used to guide better network initialization and that DSMI can
be used to predict downstream classification accuracy across 962 models on
ImageNet. The official implementation is available at
https://github.com/ChenLiu-1996/DiffusionSpectralEntropy
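One way to read "diffusion spectral entropy" is as the Shannon entropy of the normalized eigenvalue spectrum of a diffusion matrix built from the data. The sketch below follows that reading under stated assumptions: a Gaussian kernel with bandwidth `sigma`, symmetric normalization, and eigenvalues raised to a diffusion time `t`. These choices and the function name are illustrative and may differ from the official implementation linked above.

```python
import numpy as np


def diffusion_spectral_entropy(X, sigma=1.0, t=1):
    """Entropy (in bits) of the normalized spectrum of a diffusion matrix.

    Assumed construction: Gaussian affinities, symmetric normalization
    D^{-1/2} K D^{-1/2} (same eigenvalues as the row-stochastic D^{-1} K).
    """
    X = np.asarray(X, dtype=float)
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq_dists / (2.0 * sigma ** 2))
    degrees = K.sum(axis=1)
    M = K / np.sqrt(np.outer(degrees, degrees))
    spectrum = np.abs(np.linalg.eigvalsh(M)) ** t
    p = spectrum / spectrum.sum()
    p = p[p > 0]  # treat 0 * log(0) as 0
    return float(-np.sum(p * np.log2(p)))


# Two extreme toy cases (hypothetical data).
collapsed = np.zeros((8, 2))                   # all points identical
spread = np.arange(8, dtype=float)[:, None] * 100.0  # points far apart
low = diffusion_spectral_entropy(collapsed)
high = diffusion_spectral_entropy(spread)
```

In the collapsed case the spectrum has a single dominant eigenvalue and the entropy is near zero; when the eight points are far apart relative to `sigma`, the diffusion matrix is the identity and the entropy reaches log2(8) = 3 bits.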
Investigating the contributions of circadian pathway and insomnia risk genes to autism and sleep disturbances
Sleep disturbance is prevalent in youth with Autism Spectrum Disorder (ASD). Researchers have posited that circadian dysfunction may contribute to sleep problems or exacerbate ASD symptomatology. However, there is limited genetic evidence of this. It is also unclear how insomnia risk genes identified through GWAS in general populations are related to ASD and common sleep problems like insomnia traits in ASD. We investigated the contribution of copy number variants (CNVs) encompassing circadian pathway genes and insomnia risk genes to ASD risk as well as sleep disturbances in children with ASD. We studied 5860 ASD probands and 2092 unaffected siblings from the Simons Simplex Collection (SSC) and MSSNG database, as well as 7509 individuals from two unselected populations (IMAGEN and Generation Scotland). Sleep duration and insomnia symptoms were parent reported for SSC probands. We identified 335 and 616 rare CNVs encompassing circadian and insomnia risk genes respectively. Deletions and duplications encompassing circadian genes were overrepresented in ASD probands compared to siblings and unselected controls. For insomnia-risk genes, deletions (not duplications) were associated with ASD in both cohorts. Results remained significant after adjusting for cognitive ability. CNVs containing circadian pathway and insomnia risk genes showed a stronger association with ASD, compared to CNVs containing other genes. Circadian genes did not influence sleep duration or insomnia traits in ASD. Insomnia risk genes intolerant to haploinsufficiency increased risk for insomnia when duplicated. CNVs encompassing circadian and insomnia risk genes increase ASD liability with little to no observable impacts on sleep disturbances.