Integration of “omics” Data and Phenotypic Data Within a Unified Extensible Multimodal Framework
Analysis of “omics” data is often a long and segmented process, encompassing multiple stages from initial data collection to processing, quality control and visualization. The cross-modal nature of recent genomic analyses renders this process challenging to both automate and standardize; consequently, users often resort to manual interventions that compromise data reliability and reproducibility. This in turn can produce multiple versions of datasets across storage systems. As a result, scientists can lose significant time and resources trying to execute and monitor their analytical workflows and encounter difficulties sharing versioned data. In 2015, the Ludmer Centre for Neuroinformatics and Mental Health at McGill University brought together expertise from the Douglas Mental Health University Institute, the Lady Davis Institute and the Montreal Neurological Institute (MNI) to form a genetics/epigenetics working group. The objectives of this working group are to: (i) design an automated and seamless process for (epi)genetic data that consolidates heterogeneous datasets into the LORIS open-source data platform; (ii) streamline data analysis; (iii) integrate results with provenance information; and (iv) facilitate structured and versioned sharing of pipelines for optimized reproducibility using high-performance computing (HPC) environments via the CBRAIN processing portal. 
This article outlines the resulting generalizable “omics” framework and its benefits, specifically the ability to: (i) integrate multiple types of biological and multimodal datasets (imaging, clinical, demographic and behavioral); (ii) automate the process of launching analysis pipelines on HPC platforms; (iii) remove the bioinformatic barriers that are inherent to this process; (iv) ensure standardization and transparent sharing of processing pipelines to improve computational consistency; (v) store results in a queryable web interface; (vi) offer visualization tools to better view the data; and (vii) provide the mechanisms to ensure usability and reproducibility. This framework for workflows facilitates brain research discovery by reducing human error through automation of analysis pipelines and seamless linking of multimodal data, allowing investigators to focus on research instead of data handling.
Dealing with heterogeneity in the prediction of clinical diagnosis
Computer-assisted diagnosis has emerged as a popular area of research at the intersection
of medical imaging and machine learning. Medical data are very heterogeneous in nature
and therefore require careful attention when one wants to train prediction models. In
this thesis, I explored two sources of heterogeneity, multisite aggregation and clinical
label heterogeneity, in an application of magnetic resonance imaging to the diagnosis
of Alzheimer’s disease. In the process, I learned about the feasibility of multisite data
aggregation and how to leverage that heterogeneity in order to improve generalizability
of prediction models. Part one of the document is a general context introduction to
Alzheimer’s disease, magnetic resonance imaging, and machine learning challenges in
medical imaging. In part two, I present my research through three articles (two published
and one in preparation). Finally, part three provides a discussion of my contributions
and hints at possible future developments. The first article shows that data aggregation
across multiple acquisition sites incurs some loss, compared to single site analysis, that
tends to diminish as the sample size increases. These results were obtained through semisynthetic
Monte-Carlo simulations based on real data. The second article investigates the
generalizability of prediction models with various cross-validation schemes. I showed
that training and testing on the same batch of sites over-estimates the accuracy of the
model, compared to testing on unseen sites. However, I also showed that training on a
large number of sites improves the accuracy on unseen sites. The third article, on clinical
label heterogeneity, proposes a new framework where we can identify a subgroup of
individuals that share a homogeneous signature highly predictive of AD dementia. That
signature could also be found in patients with mild symptoms, 90% of whom progressed
to dementia within three years. The thesis thus makes new contributions to dealing
with heterogeneity in medical diagnostic applications and proposes ways to leverage
that heterogeneity to our benefit.
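The distinction the second article draws between testing on sites seen during training and testing on unseen sites can be illustrated with a leave-one-site-out split. The following is only a minimal sketch, not the thesis's own code: the function name `leave_one_site_out` and the site labels are hypothetical, and the abstract does not specify which cross-validation schemes were actually used.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterator, List, Sequence, Tuple


def leave_one_site_out(
    sites: Sequence[Hashable],
) -> Iterator[Tuple[Hashable, List[int], List[int]]]:
    """Yield (held_out_site, train_idx, test_idx), one fold per site.

    All subjects from the held-out site go to the test set, so each fold
    evaluates the model on a site that contributed no training data.
    """
    # Group subject indices by their acquisition site.
    by_site: Dict[Hashable, List[int]] = defaultdict(list)
    for idx, site in enumerate(sites):
        by_site[site].append(idx)

    # One fold per site: test on that site, train on all the others.
    for held_out, test_idx in by_site.items():
        train_idx = [i for i, s in enumerate(sites) if s != held_out]
        yield held_out, train_idx, test_idx


# Hypothetical site labels for six subjects scanned at three sites.
sites = ["A", "A", "B", "B", "C", "C"]
folds = list(leave_one_site_out(sites))
for held_out, train_idx, test_idx in folds:
    print(held_out, train_idx, test_idx)
```

By contrast, a split that shuffles subjects without regard to site places subjects from the same site in both the training and test sets, which is the setting the abstract reports as over-estimating accuracy.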