66 research outputs found

    SWIFT: Scalable Wasserstein Factorization for Sparse Nonnegative Tensors

    Existing tensor factorization methods assume that the input tensor follows a specific distribution (e.g., Poisson, Bernoulli, or Gaussian) and solve the factorization by minimizing an empirical loss function defined for that distribution. However, this approach suffers from several drawbacks: 1) In reality, the underlying distributions are complicated and unknown, so a simple distribution cannot approximate them well. 2) The correlation across dimensions of the input tensor is not well utilized, leading to sub-optimal performance. Although heuristics have been proposed to incorporate such correlation as side information under a Gaussian distribution, they cannot easily be generalized to other distributions. Thus, a more principled way of utilizing the correlation in tensor factorization models is still an open challenge. Without assuming any explicit distribution, we formulate tensor factorization as an optimal transport problem with Wasserstein distance, which can handle non-negative inputs. We introduce SWIFT, which minimizes the Wasserstein distance between the input tensor and its reconstruction. In particular, we define the N-th order tensor Wasserstein loss for the widely used tensor CP factorization and derive the optimization algorithm that minimizes it. By leveraging sparsity structure and different equivalent formulations to optimize computational efficiency, SWIFT is as scalable as other well-known CP algorithms. Using the factor matrices as features, SWIFT achieves up to 9.65% and 11.31% relative improvement over baselines for downstream prediction tasks. Under noisy conditions, SWIFT achieves up to 15% and 17% relative improvements over the best competitors for the prediction tasks. Comment: Accepted by AAAI-2
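
    To illustrate why a Wasserstein loss can exploit correlation between neighbouring dimensions, the following minimal sketch compares it with a squared-error loss on one-dimensional histograms. It is not the SWIFT algorithm: it assumes evenly spaced bins with unit ground cost, for which the Wasserstein-1 distance equals the sum of absolute differences between the two cumulative distributions. The function names and the toy vectors are hypothetical.

```python
from itertools import accumulate


def normalize(v):
    """Scale a nonnegative vector so that its entries sum to 1."""
    total = sum(v)
    if total <= 0 or any(x < 0 for x in v):
        raise ValueError("expected a nonnegative vector with positive mass")
    return [x / total for x in v]


def wasserstein_1d(p, q):
    """Wasserstein-1 distance between two histograms on the same bins.

    Assumes unit spacing between adjacent bins, so the distance is the
    sum of absolute differences between the cumulative distributions.
    """
    cdf_p = accumulate(normalize(p))
    cdf_q = accumulate(normalize(q))
    return sum(abs(a - b) for a, b in zip(cdf_p, cdf_q))


def squared_error(p, q):
    """Elementwise squared error between the normalized histograms."""
    return sum((a - b) ** 2 for a, b in zip(normalize(p), normalize(q)))


# Toy data (assumption): all mass in one bin, shifted by 1 bin or by 3 bins.
target = [1, 0, 0, 0]
near = [0, 1, 0, 0]
far = [0, 0, 0, 1]

w_near = wasserstein_1d(target, near)
w_far = wasserstein_1d(target, far)
se_near = squared_error(target, near)
se_far = squared_error(target, far)
```

    In this toy case the squared error is identical for both shifts, while the Wasserstein distance grows with how far the mass has moved, which is the kind of cross-dimension structure an elementwise loss ignores.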

    Unsupervised learning methods for identifying and evaluating disease clusters in electronic health records

    Introduction: Clustering algorithms are a class of algorithms that can discover groups of observations in complex data and are often used to identify subtypes of heterogeneous diseases in electronic health records (EHR). Evaluating clustering experiments for biological and clinical significance is a vital but challenging task due to the lack of consensus on best practices. As a result, the translation of findings from clustering experiments to clinical practice is limited. Aim: The aim of this thesis was to investigate and evaluate approaches that enable the evaluation of clustering experiments using EHR. Methods: We conducted a scoping review of clustering studies in EHR to identify common evaluation approaches. We systematically investigated the performance of the identified approaches using a cohort of Alzheimer's Disease (AD) patients as an exemplar, comparing four different clustering methods (K-means, Kernel K-means, Affinity Propagation and Latent Class Analysis). Using the same population, we developed and evaluated a method (MCHAMMER) that tested whether clusterable structures exist in EHR. To develop this method we tested several cluster validation indices and methods of generating null data to see which are best at discovering clusters. To enable the robust benchmarking of evaluation approaches, we created a tool that generates synthetic EHR data containing known cluster labels across a range of clustering scenarios. Results: Across 67 EHR clustering studies, the most popular internal evaluation metric was comparing cluster results across multiple algorithms (30% of studies). We examined this approach by conducting a clustering experiment on a population of 10,065 AD patients with 21 demographic, symptom and comorbidity features. K-means found 5 clusters, Kernel K-means found 2 clusters, Affinity Propagation found 5 and Latent Class Analysis found 6. K-means 4 was found to have the best clustering solution, with the highest silhouette score (0.19), and was more predictive of outcomes. The five clusters found were: typical AD (n=2026), non-typical AD (n=1640), a cardiovascular disease cluster (n=686), a cancer cluster (n=1710) and a cluster of mental health issues, smoking and early disease onset (n=1528), which has been found in previous research as well as in the results of other clustering methods. We created a synthetic data generation tool that allows for the generation of realistic EHR clusters that can vary in separation and number of noise variables to alter the difficulty of the clustering problem. We found that decreasing cluster separation increased clustering difficulty significantly, whereas noise variables increased clustering difficulty but not significantly. To develop the tool to assess cluster existence we tested different methods of null dataset generation and cluster validation indices. The best performing null dataset method was the min-max method, and the best performing indices were the Calinski-Harabasz index, which had an accuracy of 94%, the Davies-Bouldin index (97%), the silhouette score (93%) and the BWC index (90%). We further found that when clusters were identified using the Calinski-Harabasz index they were more likely to have significantly different outcomes between clusters. Lastly, we repeated the initial clustering experiment, comparing 10 different pre-processing methods. The three best performing methods were RBF kernel (2 clusters), MCA (4 clusters) and MCA and PCA (6 clusters). The MCA approach gave the best results, with the highest silhouette score (0.23) and meaningful clusters, producing 4 clusters: heart and circulatory (n=1379), early onset mental health (n=1761), male cluster with memory loss (n=1823) and female with more problem (n=2244). Conclusion: We have developed and tested a series of methods and tools to enable the evaluation of EHR clustering experiments. We developed and proposed a novel cluster evaluation metric and provided a tool for benchmarking evaluation approaches in synthetic but realistic EHR
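
    The silhouette score reported above can be sketched as follows. This is the standard textbook definition, not the thesis code: for each point, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to the members of any other cluster, and the point's score is (b - a) / max(a, b). The distance function, the convention of scoring singleton clusters as 0, and the toy data are assumptions made for illustration.

```python
def silhouette_score(points, labels, dist):
    """Mean silhouette over all points for a given cluster labelling."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # singleton cluster: score 0 by convention
        # a: mean distance to the other members of the same cluster
        a = sum(dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


def abs_dist(x, y):
    """Distance between two scalars (assumption: 1-D toy data)."""
    return abs(x - y)


# Toy data (assumption): two well-separated groups on a line.
points = [0.0, 1.0, 10.0, 11.0]
good = silhouette_score(points, [0, 0, 1, 1], abs_dist)
bad = silhouette_score(points, [0, 1, 0, 1], abs_dist)
```

    A labelling that matches the visible groups scores close to 1, while one that splits each group scores below 0; values such as 0.19 or 0.23 therefore indicate weakly separated clusters.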

    Repeatable and reusable research - Exploring the needs of users for a Data Portal for Disease Phenotyping

    Background: Big data research in the field of health sciences is hindered by a lack of agreement on how to identify and define different conditions and their medications. This means that researchers and health professionals often have different phenotype definitions for the same condition. This lack of agreement makes it hard to compare different study findings and hinders the ability to conduct repeatable and reusable research. Objective: This thesis aims to examine the requirements of various users, such as researchers, clinicians, machine learning experts, and managers, for both new and existing data portals for phenotypes (concept libraries). Methods: Exploratory sequential mixed methods were used in this thesis to look at which concept libraries are available, how they are used, what their characteristics are, where there are gaps, and what needs to be done in the future from the point of view of the people who use them. This thesis consists of three phases: 1) two qualitative studies, including one-to-one interviews with researchers, clinicians, machine learning experts, and senior research managers in health data science, as well as focus group discussions with researchers working with the Secured Anonymized Information Linkage databank, 2) the creation of an email survey (i.e., the Concept Library Usability Scale), and 3) a quantitative study with researchers, health professionals, and clinicians. Results: Most of the participants thought that the prototype concept library would be a very helpful resource for conducting repeatable research, but they specified that many requirements are needed before its development. Although all the participants stated that they were aware of some existing concept libraries, most of them expressed negative perceptions about them. 
The participants mentioned several facilitators that would encourage them to: 1) share their work, such as receiving citations from other researchers; and 2) reuse the work of others, such as saving the considerable time and effort they frequently spend on creating new code lists from scratch. They also pointed out several barriers that could inhibit them from: 1) sharing their work, such as concerns about intellectual property (e.g., if they shared their methods before publication, other researchers would use them as their own); and 2) reusing others' work, such as a lack of confidence in the quality and validity of their code lists. Participants suggested some developments that they would like to see in order to make research done with routine data more reproducible, such as a drive for more transparency in research methods documentation, including publishing complete phenotype definitions and clear code lists. Conclusions: The findings of this thesis indicated that most participants valued a concept library for phenotypes. However, only half of the participants felt that they would contribute by providing definitions for the concept library, and they reported many barriers regarding sharing their work on a publicly accessible platform such as the CALIBER research platform. Analysis of interviews, focus group discussions, and qualitative studies revealed that different users have different requirements, facilitators, barriers, and concerns about concept libraries. This work set out to investigate whether concept libraries should be developed in Kuwait to facilitate improved data sharing. However, the recommendation at the end of this thesis is that this would be unlikely to be cost-effective or highly valued by users, and that investment in open access research publications may be of more value to the Kuwait research/academic community

    Temperament & Character account for brain functional connectivity at rest: A diathesis-stress model of functional dysregulation in psychosis

    The online version contains supplementary material available at https://doi.org/10.1038/s41380-023-02039-6. The human brain’s resting-state functional connectivity (rsFC) provides stable trait-like measures of differences in the perceptual, cognitive, emotional, and social functioning of individuals. The rsFC of the prefrontal cortex is hypothesized to mediate a person’s rational self-government, as is also measured by personality, so we tested whether its connectivity networks account for vulnerability to psychosis and related personality configurations. Young adults were recruited as outpatients or controls from the same communities around psychiatric clinics. Healthy controls (n = 30) and clinically stable outpatients with bipolar disorder (n = 35) or schizophrenia (n = 27) were diagnosed by structured interviews, and then were assessed with standardized protocols of the Human Connectome Project. Data-driven clustering identified five groups of patients with distinct patterns of rsFC regardless of diagnosis. These groups were distinguished by rsFC networks that regulate specific biopsychosocial aspects of psychosis: sensory hypersensitivity, negative emotional balance, impaired attentional control, avolition, and social mistrust. The rsFC group differences were validated by independent measures of white matter microstructure, personality, and clinical features not used to identify the subjects. We confirmed that each connectivity group was organized by differential collaborative interactions among six prefrontal and eight other automatically-coactivated networks. The temperament and character traits of the members of these groups strongly accounted for the differences in rsFC between groups, indicating that configurations of rsFC are internal representations of personality organization. These representations involve weakly self-regulated emotional drives of fear, irrational desire, and mistrust, which predispose to psychopathology. However, stable outpatients with different diagnoses (bipolar or schizophrenic psychoses) were highly similar in rsFC and personality. This supports a diathesis-stress model in which different complex adaptive systems regulate predisposition (which is similar in stable outpatients despite diagnosis) and stress-induced clinical dysfunction (which differs by diagnosis). Funding: EU FEDER grants through the Spanish Ministry of Science and Technology PID2021-125017OB-I00, RTI2018-098983-B-I00, D43 TW011793-06A1; United States Department of Health & Human Services National Institutes of Health (NIH) - USA R01-MH124060; Psychosis-Risk Outcomes Network U01 MH12463

    Graph Representation Learning in Biomedicine

    Biomedical networks are universal descriptors of systems of interacting elements, from protein interactions to disease networks, all the way to healthcare systems and scientific knowledge. With the remarkable success of representation learning in providing powerful predictions and insights, we have witnessed a rapid expansion of representation learning techniques into modeling, analyzing, and learning with such networks. In this review, we put forward an observation that long-standing principles of networks in biology and medicine -- while often unspoken in machine learning research -- can provide the conceptual grounding for representation learning, explain its current successes and limitations, and inform future advances. We synthesize a spectrum of algorithmic approaches that, at their core, leverage graph topology to embed networks into compact vector spaces, and capture the breadth of ways in which representation learning is proving useful. Areas of profound impact include identifying variants underlying complex traits, disentangling behaviors of single cells and their effects on health, assisting in diagnosis and treatment of patients, and developing safe and effective medicines