285 research outputs found

    Novel clustering methods for complex cluster structures in behavioral sciences

    Large-scale data sets with many variables are becoming increasingly available in behavioral research. Encompassing a wide range of measurements and indicators, they give behavioral scientists unprecedented opportunities to synthesize different pieces of information so that novel (and sometimes subtle) subgroups of populations, also called clusters, can be identified. The successful detection of clusters is of great practical significance for a wide range of social and behavioral research topics. For example, in treating depressed patients, the first step in generating personalized recommendations is to accurately link the patients to the many subtypes of depression. In the organizational context, it is highly problematic to assume that all leaders should follow the same developmental paths; in fact, tailoring training programs to the unique strengths of different leadership subgroups (e.g., the down-to-earth leaders and the excessively charismatic leaders) is always more effective than general developmental programs. When trying to understand the cognitive process underlying one's voting behavior, once again, a one-size-fits-all approach likely produces erroneous descriptions. The broad social context as well as the surrounding environment in which a person grows up likely yields clusters of voters; only those belonging to the same cluster share a similar decision-making process for voting. To provide behavioral researchers with the best tool for accurately recovering the clusters hidden in large, complex data sets, this dissertation developed new statistical models and computational tools and implemented these novel approaches in publicly accessible software. Generally speaking, the novel methods developed here advance previous approaches by addressing the following three major challenges. First, as noise is ubiquitous in psychological measures, a considerable number of the variables collected may be completely irrelevant to the hidden clusters. 
These irrelevant variables have to be completely and automatically filtered out during data analysis. Second, when integrating variables from diverse data sources (for example questionnaires, genetic information, GPS coordinates, social media footprints, etc.), it is desirable to capture both the unique characteristics pertaining to each data source and the shared or connected characteristics across the many data sources. Third, when translating data analytics results into substantive conclusions so as to inform critical decisions (e.g., medical decisions, personnel selection, etc.), effective and accurate communication is vital yet not necessarily easy to achieve. The two most prominent difficulties are communicating the confidence and (un)certainty in the clusters recovered and visualizing the results through very accessible graphs. With a variety of computer-simulated data and empirical behavioral data covering topics in clinical, social, personality, and organizational psychology, we were able to conclude that the various methods developed in the dissertation are more versatile, effective, and accurate in identifying subtle clusters in complex data sets, provide rich and unique insights in interpreting these clusters, and, thanks to the accompanying software, can be readily accessed without many technical barriers. These methods are therefore useful for behavioral researchers to navigate an increasingly digitized world and to recognize structures in massive information.

    Computational approaches for single-cell omics and multi-omics data

    Single-cell omics and multi-omics technologies have enabled the study of cellular heterogeneity with unprecedented resolution and the discovery of new cell types. Identifying heterogeneous cell types, both existing and novel ones, relies on efficient computational approaches, especially cluster analysis. Additionally, gene regulatory network analysis and various integrative approaches are needed to combine data across studies and different multi-omics layers. This thesis comprehensively compared Bayesian clustering models for single-cell RNA-sequencing (scRNA-seq) data, and selected integrative approaches were used to study the cell-type-specific gene regulation of the uterus. Additionally, single-cell multi-omics data integration approaches for cell heterogeneity analysis were investigated. Article I investigated analytical approaches for cluster analysis in scRNA-seq data, particularly latent Dirichlet allocation (LDA) and hierarchical Dirichlet process (HDP) models. The comparison of LDA and HDP with existing state-of-the-art methods revealed that topic-modeling-based models can be useful in scRNA-seq cluster analysis. Evaluation of the cluster qualities for LDA and HDP with intrinsic and extrinsic cluster quality metrics indicated that the clustering performance of these methods is dataset dependent. Article II and Article III focused on cell-type-specific integrative analysis of uterine or decidual stromal (dS) and natural killer (dNK) cells, which are important for successful pregnancy. Article II integrated the existing preeclampsia RNA-seq studies of the decidua together with recent scRNA-seq datasets in order to investigate cell-type-specific contributions of early-onset preeclampsia (EOP) and late-onset preeclampsia (LOP). It was discovered that the dS marker genes were enriched for LOP downregulated genes and the dNK marker genes were enriched for upregulated EOP genes. 
Article III presented a gene regulatory network analysis for the subpopulations of dS and dNK cells. This study identified novel subpopulation-specific transcription factors that promote decidualization of stromal cells and dNK-mediated maternal immunotolerance. In Article IV, different strategies and methodological frameworks for data integration in single-cell multi-omics data analysis were reviewed in detail. Data integration methods were grouped into early, late and intermediate data integration strategies. The specific stage and order of data integration can have a substantial effect on the results of the integrative analysis. The central details of the approaches were presented, and potential future directions were discussed.
Finnish summary (translated): Computational methods for the analysis of single-cell sequencing and multi-omics results. Single-cell sequencing technologies enable the study of cellular heterogeneity at unprecedented resolution and the discovery of new cell types. Grouping, i.e. cluster analysis, plays a central role in identifying cell types. Combining gene regulatory networks and different molecular data layers is also central to the analysis. The dissertation compares Bayesian clustering methods and combines data collected with different methods in a cell-type-specific gene regulation analysis of the uterus. In addition, integration methods for single-cell data are surveyed comprehensively. Publication I focuses on analytical methods, in particular models based on latent Dirichlet allocation (LDA) and the hierarchical Dirichlet process (HDP), in the cluster analysis of single-cell data. A comprehensive comparison of these two models with existing methods revealed that topic-modeling-based methods can be useful in the cluster analysis of single-cell data. The performance of the methods also depended on the characteristics of each dataset analyzed. Publications II and III focus on cell-type-specific analysis of uterine stromal cells and NK immune cells, which are important for female reproductive health. Article II combined existing results on preeclampsia with the latest single-cell sequencing results and found cell-type-specific effects of early-onset preeclampsia (EOP) and late-onset preeclampsia (LOP). Expression of the marker genes of differentiated stroma was found to decrease in LOP, and expression of NK marker genes to increase in EOP. Publication III analyzes the subpopulation-specific gene regulatory networks of stromal and NK cells and their transcription factors. The study identified novel subpopulation-specific regulators that promote stromal differentiation and NK-cell-mediated immunotolerance. Publication IV examines in detail strategies and methods for integrating different single-cell data layers (multi-omics). The integration methods were grouped into early, late, and intermediate strategies, and the methods of each approach were presented in more detail. In addition, possible future directions were discussed.

    Contributions to the study of Autism Spectrum Brain connectivity

    164 p. Autism Spectrum Disorder (ASD) is a highly prevalent neurodevelopmental condition with a large social and economic impact that affects the entire life of families. There is an intense search for biomarkers that can be assessed as early as possible in order to initiate treatment and to prepare the family for the challenges imposed by the condition. Brain imaging biomarkers are of special interest. Specifically, functional connectivity data extracted from resting-state functional magnetic resonance imaging (rs-fMRI) should allow the detection of brain connectivity alterations. Machine learning pipelines encompass the estimation of the functional connectivity matrix from brain parcellations, feature extraction, and the building of classification models for ASD prediction. The works reported in the literature are very heterogeneous from the computational and methodological point of view. In this thesis we carry out a comprehensive computational exploration of the impact of the choices involved in building these machine learning pipelines.
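
    The first two pipeline stages named in this abstract (estimating a functional connectivity matrix from parcellated signals, then extracting features) can be illustrated with a minimal sketch. It assumes Pearson correlation as the connectivity measure and the upper triangle of the matrix as the feature vector; both are common choices, but the abstract does not say which ones the thesis uses. The function names and the toy time series are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long time series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant signal is treated as uncorrelated
    return cov / sqrt(var_x * var_y)


def connectivity_matrix(region_series):
    """Region-by-region correlation matrix; one time series per brain region."""
    k = len(region_series)
    return [
        [1.0 if i == j else pearson(region_series[i], region_series[j])
         for j in range(k)]
        for i in range(k)
    ]


def upper_triangle_features(matrix):
    """Flatten the strictly upper triangle into a feature vector.

    The matrix is symmetric, so this keeps each region pair exactly once.
    """
    k = len(matrix)
    return [matrix[i][j] for i in range(k) for j in range(i + 1, k)]


# Toy example: three hypothetical regions, four time points each.
series = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [4.0, 3.0, 2.0, 1.0],
]
features = upper_triangle_features(connectivity_matrix(series))
```

    The resulting feature vector has k(k-1)/2 entries for k regions and would be the input to whichever classifier the pipeline uses.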

    Clusterwise Independent Component Analysis (C-ICA): using fMRI resting state networks to cluster subjects and find neurofunctional subtypes

    Background: fMRI resting state networks (RSNs) are used to characterize brain disorders. They also show extensive heterogeneity across patients. Identifying systematic differences between RSNs in patients, i.e. discovering neurofunctional subtypes, may further increase our understanding of disease heterogeneity. Currently, no methodology is available to estimate neurofunctional subtypes and their associated RSNs simultaneously. New method: We present an unsupervised learning method for fMRI data, called Clusterwise Independent Component Analysis (C-ICA). This enables the clustering of patients into neurofunctional subtypes based on differences in shared ICA-derived RSNs. The parameters are estimated simultaneously, which leads to an improved estimation of subtypes and their associated RSNs. Results: In five simulation studies, the C-ICA model is successfully validated using both artificially and realistically simulated data (N = 30-40). The successful performance of the C-ICA model is also illustrated on an empirical data set consisting of Alzheimer's disease patients and elderly control subjects (N = 250). C-ICA is able to uncover a meaningful clustering that partially matches (balanced accuracy = .72) the diagnostic labels and identifies differences in RSNs between the Alzheimer and control cluster. Comparison with other methods: Both in the simulation study and the empirical application, C-ICA yields better results compared to competing clustering methods (i.e., a two-step clustering procedure based on single-subject ICAs and a Group ICA plus dual regression variant thereof) that do not simultaneously estimate a clustering and associated RSNs. Indeed, the overall mean adjusted Rand Index, a measure for cluster recovery, equals 0.65 for C-ICA and ranges from 0.27 to 0.46 for competing methods. Conclusions: The successful performance of C-ICA indicates that it is a promising method to extract neurofunctional subtypes from multi-subject resting state fMRI data. 
This method can be applied to fMRI scans of patient groups to study (neurofunctional) subtypes, which may eventually further increase understanding of disease heterogeneity.
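
    The adjusted Rand Index reported in this abstract compares a recovered clustering with a reference partition, corrected for chance agreement. The sketch below is a plain-Python version of the standard Hubert-Arabie formula, not the authors' code; the function name and the example labels are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two partitions of the same items.

    1.0 means identical partitions (up to relabeling); values near 0 mean
    agreement no better than chance; negative values are possible.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("both label lists must have the same length")
    n = len(labels_true)
    if n < 2:
        return 1.0  # assumption: trivially identical partitions

    # Pair counts from the contingency table and from its margins.
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. every item in a single cluster
    return (sum_cells - expected) / (max_index - expected)


# Cluster labels are arbitrary, so a pure relabeling scores as perfect recovery.
perfect = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Each recovered cluster splits every true cluster evenly.
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

    Because the index depends only on which items are grouped together, it does not require matching cluster labels between the two partitions.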

    Quantum Approaches to Data Science and Data Analytics

    This thesis explores different research directions related to both the use of classical data analysis techniques for the study of quantum systems and the use of quantum computing to speed up hard machine learning tasks.

    Aspects of localization in disordered many-body quantum systems

    For a quantum system to be permanently out-of-equilibrium, some non-trivial mechanism must be at play, to counteract the general tendency of entropy increase and flow toward equilibration. Among the possible ways to protect a system against local thermalization, the phenomenon of localization induced by quenched disorder appears to be one of the most promising. Although the problem of localization was introduced almost sixty years ago, its many-body version is still partly unresolved, despite the recent theoretical effort to tackle it. In this thesis we address a few aspects of the localized phase, mainly focusing on the interacting case. A large part of the thesis is devoted to investigating the underlying “integrable” structure of many-body localized systems, i.e., the existence of non-trivial conservation laws that prevent ergodicity and thermalization. In particular, we show that such conserved operators can be explicitly constructed by dressing perturbatively the non-interacting conserved quantities, in a procedure that converges when scattering processes are weak enough. This is reminiscent of the quasiparticle theory in Fermi liquids, although in the disordered case the construction extends to the full many-body energy spectrum, and it results in operators that are exactly conserved. As an example of how to use the constructive recipe for the conserved quantities, we compute the long-time limit of an order parameter for the MBL phase in antiferromagnetic spin systems. Similar analytical tools as the ones exploited for the construction of the conserved operators are then applied to the problem of the stability of single-particle localization with respect to the coupling to a finite bath. In this context, we identify a quantum-Zeno-type effect, whereby the bath unexpectedly enhances the particle’s localization. 
In the final part of the thesis, we discuss several mechanisms by which thermal fluctuations may influence the spatial localization of excitations in interacting many-body states

    Learning Representations of Social Media Users

    User representations are routinely used in recommendation systems by platform developers, in targeted advertisements by marketers, and by public policy researchers to gauge public opinion across demographic groups. Computer scientists consider the problem of inferring user representations more abstractly: how does one extract a stable user representation, effective for many downstream tasks, from a medium as noisy and complicated as social media? The quality of a user representation is ultimately task-dependent (e.g. does it improve classifier performance or make more accurate recommendations in a recommendation system), but there are proxies that are less sensitive to the specific task. Is the representation predictive of latent properties such as a person's demographic features, socioeconomic class, or mental health state? Is it predictive of the user's future behavior? In this thesis, we begin by showing how user representations can be learned from multiple types of user behavior on social media. We apply several extensions of generalized canonical correlation analysis to learn these representations and evaluate them on three tasks: predicting future hashtag mentions, friending behavior, and demographic features. We then show how user features can be employed as distant supervision to improve topic model fit. Finally, we show how user features can be integrated into and improve existing classifiers in the multitask learning framework. We treat user representations (ground truth gender and mental health features) as auxiliary tasks to improve mental health state prediction. We also use distributed user representations learned in the first chapter to improve tweet-level stance classifiers, showing that distant user information can inform classification tasks at the granularity of a single message. Comment: PhD thesis
