66 research outputs found
SWIFT: Scalable Wasserstein Factorization for Sparse Nonnegative Tensors
Existing tensor factorization methods assume that the input tensor follows
some specific distribution (e.g., Poisson, Bernoulli, or Gaussian), and solve
the factorization by minimizing some empirical loss functions defined based on
the corresponding distribution. However, they suffer from several drawbacks: 1)
In reality, the underlying distributions are complicated and unknown, making it
infeasible to be approximated by a simple distribution. 2) The correlation
across dimensions of the input tensor is not well utilized, leading to
sub-optimal performance. Although heuristics were proposed to incorporate such
correlation as side information under Gaussian distribution, they cannot
easily be generalized to other distributions. Thus, a more principled way of
utilizing the correlation in tensor factorization models is still an open
challenge. Without assuming any explicit distribution, we formulate the tensor
factorization as an optimal transport problem with Wasserstein distance, which
can handle non-negative inputs.
We introduce SWIFT, which minimizes the Wasserstein distance between the
input tensor and its reconstruction. In
particular, we define the N-th order tensor Wasserstein loss for the widely
used tensor CP factorization and derive the optimization algorithm that
minimizes it. By leveraging sparsity structure and different equivalent
formulations for optimizing computational efficiency, SWIFT is as scalable as
other well-known CP algorithms. Using the factor matrices as features, SWIFT
achieves up to 9.65% and 11.31% relative improvement over baselines for
downstream prediction tasks. Under noisy conditions, SWIFT achieves up to
15% and 17% relative improvements over the best competitors for the prediction
tasks.
Comment: Accepted by AAAI-2
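To make the optimal transport loss mentioned in this abstract more concrete, here is a minimal sketch of an entropy-regularized transport cost between two nonnegative histograms, computed with standard Sinkhorn scaling. It is not SWIFT's algorithm, which the abstract does not detail. The function name `sinkhorn_distance`, the parameters `reg` and `n_iters`, and the toy inputs are all assumptions made for illustration.

```python
import math

def sinkhorn_distance(a, b, cost, reg=0.05, n_iters=500):
    """Entropy-regularized transport cost between histograms a and b.

    a, b  : nonnegative lists that each sum to the same total mass
    cost  : cost[i][j] is the cost of moving one unit from bin i to bin j
    reg   : regularization strength (illustrative default)
    """
    n, m = len(a), len(b)
    # Gibbs kernel K[i][j] = exp(-cost[i][j] / reg)
    K = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns so the plan's marginals
        # approach a and b
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Transport plan P[i][j] = u[i] * K[i][j] * v[j]; return its total cost
    return sum(
        u[i] * K[i][j] * v[j] * cost[i][j] for i in range(n) for j in range(m)
    )

# Toy example (hypothetical data): all mass moves from bin 0 to bin 2
cost = [[abs(i - j) for j in range(3)] for i in range(3)]
distance = sinkhorn_distance([1.0, 0.0, 0.0], [0.0, 0.0, 1.0], cost)
```

Unlike a Gaussian or Poisson likelihood, this cost depends only on how much mass must move and how far, which is why it applies to any nonnegative input without a distributional assumption.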
Mining structured matrices in high dimensions
Structured matrices refer to matrix-valued data that are embedded in an inherent lower dimensional manifold with smaller degrees of freedom compared to the ambient or observed dimensions. Such hidden (or latent) structures allow for statistically consistent estimation in high dimensional settings, wherein the number of observations is much smaller than the number of parameters to be estimated. This dissertation makes significant contributions to statistical models, algorithms, and applications of structured matrix estimation in high dimensional settings. The proposed estimators and algorithms are motivated by and evaluated on applications in e-commerce, healthcare, and neuroscience. In the first line of contributions, substantial generalizations of existing results are derived for a widely studied problem of matrix completion. Tractable estimators with strong statistical guarantees are developed for matrix completion under (a) generalized observation models subsuming heterogeneous data types, such as count, binary, etc., and heterogeneous noise models beyond additive Gaussian, (b) general structural constraints beyond low rank assumptions, and (c) collective estimation from multiple sources of data. The second line of contributions focuses on the algorithmic and application specific ideas for generalized structured matrix estimation. Two specific applications of structured matrix estimation are discussed: (a) a constrained latent factor estimation framework that extends the ideas and techniques hitherto discussed, and applies them for the task of learning clinically relevant phenotypes from Electronic Health Records (EHRs), and (b) a novel, efficient, and highly generalized algorithm for collaborative learning to rank (LETOR) applications.
Electrical and Computer Engineerin
Unsupervised learning methods for identifying and evaluating disease clusters in electronic health records
Introduction
Clustering algorithms are a class of algorithms that can discover groups of observations in
complex data and are often used to identify subtypes of heterogeneous diseases in electronic
health records (EHR). Evaluating clustering experiments for biological and clinical significance is
a vital but challenging task due to the lack of consensus on best practices. As a result, the
translation of findings from clustering experiments to clinical practice is limited.
Aim
The aim of this thesis was to investigate and evaluate approaches that enable the evaluation of
clustering experiments using EHR.
Methods
We conducted a scoping review of clustering studies in EHR to identify common evaluation
approaches. We systematically investigated the performance of the identified approaches using
a cohort of Alzheimer's Disease (AD) patients as an exemplar comparing four different
clustering methods (K-means, Kernel K-means, Affinity Propagation and Latent Class
Analysis). Using the same population, we developed and evaluated a method (MCHAMMER)
that tested whether clusterable structures exist in EHR. To develop this method we tested
several cluster validation indexes and methods of generating null data to see which are the best
at discovering clusters. In order to enable the robust benchmarking of evaluation approaches,
we created a tool that generated synthetic EHR data that contain known cluster labels across a
range of clustering scenarios.
Results
Across 67 EHR clustering studies, the most popular internal evaluation metric was comparing
cluster results across multiple algorithms (30% of studies). We examined this approach
by conducting a clustering experiment on a population of 10,065 AD patients with
21 demographic, symptom and comorbidity features. K-means found 5 clusters, Kernel K-means found 2, Affinity Propagation found 5 and Latent Class Analysis found 6. K-means
was found to have the best clustering solution with the highest silhouette score (0.19) and was
more predictive of outcomes. The five clusters found were: typical AD (n=2026), non-typical AD
(n=1640), cardiovascular disease cluster (n=686), a cancer cluster (n=1710) and a cluster of
mental health issues, smoking and early disease onset (n=1528), which has been found in
previous research as well as in the results of other clustering methods. We created a synthetic
data generation tool which allows for the generation of realistic EHR clusters that can vary in
separation and number of noise variables to alter the difficulty of the clustering problem. We
found that decreasing cluster separation significantly increased clustering difficulty, whereas
adding noise variables increased difficulty but not significantly. To develop the tool to assess
cluster existence, we tested different methods of null dataset generation and cluster validation
indices. The best performing null dataset method was the min-max method, and the best
performing indices were the Calinski-Harabasz index (accuracy of 94%), Davies-Bouldin
index (97%), silhouette score (93%) and BWC index (90%). We further found that when clusters
were identified using the Calinski-Harabasz index they were more likely to have significantly
different outcomes between clusters. Lastly we repeated the initial clustering experiment,
comparing 10 different pre-processing methods. The three best performing methods were RBF
kernel (2 clusters), MCA (4 clusters) and MCA and PCA (6 clusters). The MCA approach gave
the best results, with the highest silhouette score (0.23) and meaningful clusters. Its 4 clusters were:
heart and circulatory (n=1379), early onset mental health (n=1761), male cluster with memory
loss (n=1823), and female with more problems (n=2244).
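For readers unfamiliar with the silhouette score used in these results, the following is a minimal sketch of how it is commonly computed, assuming Euclidean distance and toy two-dimensional points. It is not the thesis's pipeline, and the function name and data are hypothetical. Values near 1 indicate compact, well separated clusters, while values near 0 (such as the 0.19 and 0.23 reported above) indicate heavily overlapping clusters.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            continue  # a singleton cluster contributes 0 by convention
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)

# Toy data (hypothetical): two tight groups that are far apart
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(points, [0, 0, 1, 1])  # labels match the groups
bad = silhouette_score(points, [0, 1, 0, 1])   # labels cut across the groups
```

Here `good` is close to 1 and `bad` is negative, which illustrates why the score can be used to compare clustering solutions across algorithms.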
Conclusion
We have developed and tested a series of methods and tools to enable the evaluation of EHR
clustering experiments. We developed and proposed a novel cluster evaluation metric and
provided a tool for benchmarking evaluation approaches in synthetic but realistic EHR
Repeatable and reusable research - Exploring the needs of users for a Data Portal for Disease Phenotyping
Background: Big data research in the field of health sciences is hindered by a lack of agreement on how to identify and define different conditions and their medications. This means that researchers and health professionals often have different phenotype definitions for the same condition. This lack of agreement makes it hard to compare different study findings and hinders the ability to conduct repeatable and reusable research. Objective: This thesis aims to examine the requirements of various users, such as researchers, clinicians, machine learning experts, and managers, for both new and existing data portals for phenotypes (concept libraries). Methods: Exploratory sequential mixed methods were used in this thesis to look at which concept libraries are available, how they are used, what their characteristics are, where there are gaps, and what needs to be done in the future from the point of view of the people who use them. This thesis consists of three phases: 1) two qualitative studies, including one-to-one interviews with researchers, clinicians, machine learning experts, and senior research managers in health data science, as well as focus group discussions with researchers working with the Secure Anonymised Information Linkage (SAIL) Databank, 2) the creation of an email survey (i.e., the Concept Library Usability Scale), and 3) a quantitative study with researchers, health professionals, and clinicians. Results: Most of the participants thought that the prototype concept library would be a very helpful resource for conducting repeatable research, but they specified that many requirements would need to be met before its development. Although all the participants stated that they were aware of some existing concept libraries, most of them expressed negative perceptions about them.
The participants mentioned several facilitators that would encourage them to: 1) share their work, such as receiving citations from other researchers; and 2) reuse the work of others, such as saving the considerable time and effort they frequently spend on creating new code lists from scratch. They also pointed out several barriers that could inhibit them from: 1) sharing their work, such as concerns about intellectual property (e.g., if they shared their methods before publication, other researchers would use them as their own); and 2) reusing others' work, such as a lack of confidence in the quality and validity of their code lists. Participants suggested some developments that they would like to see in order to make research done with routine data more reproducible, such as a drive for more transparency in research methods documentation, including publishing complete phenotype definitions and clear code lists. Conclusions: The findings of this thesis indicated that most participants valued a concept library for phenotypes. However, only half of the participants felt that they would contribute by providing definitions for the concept library, and they reported many barriers regarding sharing their work on a publicly accessible platform such as the CALIBER research platform. Analysis of interviews, focus group discussions, and qualitative studies revealed that different users have different requirements, facilitators, barriers, and concerns about concept libraries. This work set out to investigate whether concept libraries should be developed in Kuwait to facilitate improved data sharing. However, the final recommendation of this thesis is that this would be unlikely to be cost effective or highly valued by users, and that investment in open access research publications may be of more value to the Kuwait research/academic community
Temperament & Character account for brain functional connectivity at rest: A diathesis-stress model of functional dysregulation in psychosis
The online version contains supplementary material
available at https://doi.org/10.1038/s41380-023-02039-6
The human brain’s resting-state functional connectivity (rsFC) provides stable trait-like measures of differences in the perceptual,
cognitive, emotional, and social functioning of individuals. The rsFC of the prefrontal cortex is hypothesized to mediate a person’s
rational self-government, as is also measured by personality, so we tested whether its connectivity networks account for
vulnerability to psychosis and related personality configurations. Young adults were recruited as outpatients or controls from the
same communities around psychiatric clinics. Healthy controls (n = 30) and clinically stable outpatients with bipolar disorder
(n = 35) or schizophrenia (n = 27) were diagnosed by structured interviews, and then were assessed with standardized protocols of
the Human Connectome Project. Data-driven clustering identified five groups of patients with distinct patterns of rsFC regardless of
diagnosis. These groups were distinguished by rsFC networks that regulate specific biopsychosocial aspects of psychosis: sensory
hypersensitivity, negative emotional balance, impaired attentional control, avolition, and social mistrust. The rsFC group differences
were validated by independent measures of white matter microstructure, personality, and clinical features not used to identify the
subjects. We confirmed that each connectivity group was organized by differential collaborative interactions among six prefrontal
and eight other automatically-coactivated networks. The temperament and character traits of the members of these groups
strongly accounted for the differences in rsFC between groups, indicating that configurations of rsFC are internal representations of
personality organization. These representations involve weakly self-regulated emotional drives of fear, irrational desire, and
mistrust, which predispose to psychopathology. However, stable outpatients with different diagnoses (bipolar or schizophrenic
psychoses) were highly similar in rsFC and personality. This supports a diathesis-stress model in which different complex adaptive
systems regulate predisposition (which is similar in stable outpatients despite diagnosis) and stress-induced clinical dysfunction
(which differs by diagnosis).
EU FEDER grants through the Spanish Ministry of Science and Technology
PID2021-125017OB-I00,
RTI2018-098983-B-I00,
D43 TW011793-06A1
United States Department of Health & Human Services
National Institutes of Health (NIH) - USA
R01-MH124060
Psychosis-Risk Outcomes Network
U01 MH12463
Graph Representation Learning in Biomedicine
Biomedical networks are universal descriptors of systems of interacting
elements, from protein interactions to disease networks, all the way to
healthcare systems and scientific knowledge. With the remarkable success of
representation learning in providing powerful predictions and insights, we have
witnessed a rapid expansion of representation learning techniques into
modeling, analyzing, and learning with such networks. In this review, we put
forward an observation that long-standing principles of networks in biology and
medicine -- while often unspoken in machine learning research -- can provide
the conceptual grounding for representation learning, explain its current
successes and limitations, and inform future advances. We synthesize a spectrum
of algorithmic approaches that, at their core, leverage graph topology to embed
networks into compact vector spaces, and capture the breadth of ways in which
representation learning is proving useful. Areas of profound impact include
identifying variants underlying complex traits, disentangling behaviors of
single cells and their effects on health, assisting in diagnosis and treatment
of patients, and developing safe and effective medicines
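As a rough illustration of how graph topology can be used to build node representations, the sketch below performs one round of mean neighborhood aggregation, a basic building block of many message-passing approaches. It is an assumption chosen for illustration, not a method taken from the review, and the function name and toy graph are hypothetical.

```python
def mean_aggregate(adjacency, features):
    """One round of mean aggregation over each node and its neighbors."""
    updated = {}
    for node, neighbors in adjacency.items():
        group = [node] + list(neighbors)  # include the node itself
        dim = len(features[node])
        # Average each feature dimension over the node's neighborhood
        updated[node] = [
            sum(features[n][d] for n in group) / len(group) for d in range(dim)
        ]
    return updated

# Toy path graph 0 - 1 - 2 (hypothetical data); only node 0 carries a signal
adjacency = {0: [1], 1: [0, 2], 2: [1]}
features = {0: [1.0], 1: [0.0], 2: [0.0]}
embeddings = mean_aggregate(adjacency, features)
```

After one round the signal from node 0 has reached node 1 but not node 2, so each node's vector reflects its position in the network; repeating the step widens the neighborhood that each vector summarizes.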
- …