379 research outputs found
Topic Modeling on Health Journals with Regularized Variational Inference
Topic modeling enables exploration and compact representation of a corpus.
The CaringBridge (CB) dataset is a massive collection of journals written by
patients and caregivers during a health crisis. Topic modeling on the CB
dataset, however, is challenging due to the asynchronous nature of multiple
authors writing about their health journeys. To overcome this challenge we
introduce the Dynamic Author-Persona topic model (DAP), a probabilistic
graphical model designed for temporal corpora with multiple authors. The
novelty of the DAP model lies in its representation of authors by a persona ---
where personas capture the propensity to write about certain topics over time.
Further, we present a regularized variational inference algorithm, which we use
to encourage the DAP model's personas to be distinct. Our results show
significant improvements over competing topic models --- particularly after
regularization, and highlight the DAP model's unique ability to capture common
journeys shared by different authors. Comment: Published in Thirty-Second AAAI Conference on Artificial
Intelligence, February 2018, New Orleans, Louisiana, US
Dirichlet belief networks for topic structure learning
Recently, considerable research effort has been devoted to developing deep
architectures for topic models to learn topic structures. Although several deep
models have been proposed to learn better topic proportions of documents, how
to leverage the benefits of deep structures for learning word distributions of
topics has not yet been rigorously studied. Here we propose a new multi-layer
generative process on word distributions of topics, where each layer consists
of a set of topics and each topic is drawn from a mixture of the topics of the
layer above. As the topics in all layers can be directly interpreted by words,
the proposed model is able to discover interpretable topic hierarchies. As a
self-contained module, our model can be flexibly adapted to different kinds of
topic models to improve their modelling accuracy and interpretability.
Extensive experiments on text corpora demonstrate the advantages of the
proposed model. Comment: accepted in NIPS 201
Advancing Biomedicine with Graph Representation Learning: Recent Progress, Challenges, and Future Directions
Graph representation learning (GRL) has emerged as a pivotal field that has
contributed significantly to breakthroughs in various fields, including
biomedicine. The objective of this survey is to review the latest advancements
in GRL methods and their applications in the biomedical field. We also
highlight key challenges currently faced by GRL and outline potential
directions for future research. Comment: Accepted by 2023 IMIA Yearbook of Medical Informatic
Probabilistic Models and Natural Language Processing in Health
The treatment of mental disorders still involves many unsolved problems, such as
misdiagnosis and delayed diagnosis. In this doctoral thesis we study and develop models
that can serve as potential tools for clinicians. Our proposals follow two main lines of
research: Natural Language Processing and probabilistic methods.
In Chapter 2, we open the thesis with a regularization mechanism for language models that
is especially effective in Transformer-based architectures. We call it NoRBERT, from Noisy
Regularized Bidirectional Representations from Transformers [9], [15]. According to the
literature, regularization in NLP is a little-explored field, limited to general
mechanisms such as dropout [57] or early stopping [58]. In this landscape, we propose a
novel approach that combines any LM with Variational Auto-Encoders (VAEs) [23]. VAEs are
deep generative models that build a regular latent space from which input samples can be
reconstructed through an encoder and a decoder network. Our VAE uses a mixture-of-Gaussians
prior (GMVAE), which lets the model capture multimodal information. Combining Transformers
and GMVAEs, we build an architecture capable of imputing missing words in a text corpus
across a diverse topic space, as well as improving the BLEU score in the reconstruction of
the dataset. Both results depend on the depth of the regularized layer in the Transformer
encoder. In essence, the regularization consists of the GMVAE reconstruction of the
Transformer embeddings at some point in the architecture, which adds structured noise that
helps the model generalize better. We show improvements in BERT [15], RoBERTa [16] and
XLM-R [17] models, verified on different datasets, and we also provide explicit examples of
sentences reconstructed by Top NoRBERT. In addition, we validate the model for data
augmentation, improving classification accuracy and F1 score on various datasets and
scenarios thanks to augmented samples generated by NoRBERT. We study several variants of
the model: Top, Deep and contextual NoRBERT, the latter based on the use of contextual
words to reconstruct the embeddings in the corresponding Transformer layer.
We continue the Transformer line of research in Chapter 3, proposing PsyBERT. PsyBERT, as
its name suggests, is a BERT-based [15] architecture modified to work with Electronic
Health Records (EHRs) from psychiatric patients. It is inspired by BEHRT [19], which is
also devoted to EHRs, in general health. Our model differs in its training methodology and
its embedding layer. As with NoRBERT, we find it useful to apply a Masked Language Modeling
(MLM) policy without any fine-tuning or task-specific layer. In NoRBERT, we used MLM to
impute missing words, with the ultimate aim of generating new sentences from inputs with
missing information. With PsyBERT, we first propose using the model as a tool to fill in
missing diagnoses in the EHR and to correct misdiagnosed cases. After this task, we also
apply PsyBERT to delusional disorder detection. In this scenario, by contrast, we add a
multi-label classification layer that computes the probability of the different diagnoses
at the patient's last visit to the hospital. From these probabilities, we analyse
delusional cases and propose a tool to detect potential candidates for this mental
disorder. In both tasks, we use several fields obtained from the patient EHR, such as age,
sex, diagnoses and treatments from the psychiatric history, and propose a method capable of
combining heterogeneous data to support diagnosis in mental health. Throughout this work,
we point out the problems with the quality of EHR data [104], [105] and the great advantage
that medical assistance tools like our model can provide. We not only solve a
classification problem with more than 700 different illnesses, but also provide a model to
help doctors diagnose very complex scenarios involving comorbidity, long periods of patient
exploration by traditional methodology, or low-prevalence cases. We present a powerful
method that addresses a pressing problem.
Following the health line of research and its psychiatric application, in Chapter 4 we
analyse a probabilistic method to search for behavioral patterns in patients with mental
disorders. In this case the contribution of the work is not the method but its application
and results, interpreted in collaboration with clinicians. The model is called SPFM (Sparse
Poisson Factorization Model) [22] and consists of a non-parametric probabilistic model
based on the Indian Buffet Process (IBP) [20], [21]. It is an exploratory method capable of
decomposing the input data into sparse matrices. To do so, it places a Poisson distribution
on the product of two matrices, Z and B, drawn from the IBP and a Gamma distribution,
respectively. Hence Z is a binary matrix representing the latent features active in a
patient's data, and B weights the contribution of the data characteristics to the latent
features. The data used in the three works described in the chapter come from different
questions in e-health questionnaires. The data characteristics are thus the answer or score
on each question, and the latent features are different behavioral patterns of a patient,
given by the selection of features active in their questionnaires. For example, patient X
may present features 1 and 2 and patient Y features 1 and 3, resulting in two different
behavioral profiles. With this procedure we study three scenarios. In the first, we relate
the profiles to the diagnoses, finding common patterns among patients and connections
between diseases. We also analyse the degree of critical state and contrast it with the
clinician's judgment via the Clinical Global Impression (CGI). In the second scenario, we
pursue a similar study and find connections between disturbed sleeping patterns and
clinical markers of the wish to die. We focus this analysis on patients with suicidal
thoughts because of the major public health issue that these individuals represent [175].
In this case we vary the questionnaire and the data sample, obtaining different profiles
that also carry important information for the psychiatrist to interpret. The main
contribution of this work is to provide a mechanism capable of helping with the detection
and prevention of suicide. Finally, the third work comprises a study of behavioral patterns
in mental health patients before and during the COVID-19 lockdown. We did not want to miss
the chance to contribute during the coronavirus disease outbreak and present a study of the
changes in psychiatric patients during the state of alarm. We again analyse the profiles
with the previous e-health questionnaire and discover that self-reported suicide risk
decreased during the lockdown. These results contrast with other studies [237] and may
signal an increase in suicidal ideation once the crisis ceases.
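The SPFM construction described in this abstract (a binary matrix Z drawn from the IBP, a weight matrix B with Gamma-distributed entries, and Poisson observations whose rate is the product of the two) can be illustrated with a small generative sketch. This is only a toy sampler under stated assumptions, not the thesis's implementation or its inference procedure: the function names, the hyperparameters `alpha`, `shape` and `scale`, and the use of Knuth's method for Poisson draws are all hypothetical choices made for illustration.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw one Poisson(rate) sample with Knuth's multiplication method."""
    if rate <= 0.0:
        return 0
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_ibp(num_patients, alpha, rng):
    """Sample a binary matrix Z from the Indian Buffet Process.

    Patient i reuses an existing feature k with probability m_k / i, where
    m_k counts the earlier patients holding k, and then opens
    Poisson(alpha / i) new features.
    """
    rows = []    # one list of 0/1 flags per patient (ragged until padded)
    counts = []  # m_k: number of patients holding feature k so far
    for i in range(1, num_patients + 1):
        row = [1 if rng.random() < m / i else 0 for m in counts]
        new_features = sample_poisson(alpha / i, rng)
        row.extend([1] * new_features)
        counts = [m + z for m, z in zip(counts, row)] + [1] * new_features
        rows.append(row)
    num_features = len(counts)
    # Pad earlier rows with zeros so that Z is rectangular.
    return [row + [0] * (num_features - len(row)) for row in rows]


def sample_spfm(num_patients, num_questions, alpha=2.0, shape=1.0,
                scale=1.0, seed=0):
    """Generate (Z, B, X) with X[n][d] ~ Poisson(sum_k Z[n][k] * B[k][d]).

    alpha, shape and scale are illustrative values, not those of the thesis.
    """
    rng = random.Random(seed)
    Z = sample_ibp(num_patients, alpha, rng)
    num_features = len(Z[0]) if Z else 0
    B = [[rng.gammavariate(shape, scale) for _ in range(num_questions)]
         for _ in range(num_features)]
    X = [[sample_poisson(sum(z * b[d] for z, b in zip(z_row, B)), rng)
          for d in range(num_questions)]
         for z_row in Z]
    return Z, B, X


if __name__ == "__main__":
    Z, B, X = sample_spfm(num_patients=5, num_questions=4, seed=1)
    for z_row, x_row in zip(Z, X):
        print("active features:", z_row, "-> questionnaire scores:", x_row)
```

Each row of Z plays the role of a behavioral profile in the sense of the abstract: two patients who share feature 1 but differ in their second active feature receive different Poisson rates and therefore different expected questionnaire scores.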
Finally, Chapter 5 proposes a regularization mechanism based on a theoretical idea from
[245] to reduce the variance of the real risk. We interpret the robust regularized risk
proposed by those authors as a two-step mechanism, formed by the minimization of the
weighted risk and the maximization of a robust objective, and suggest a way to apply this
methodology to select the samples of the mini-batch in a deep learning setup. We study
different variations of repeating the worst-performing samples from the previous mini-batch
during training and show evidence of improved accuracy and faster convergence rates in an
image classification problem with different architectures and datasets.
Doctoral Programme in Multimedia and Communications of the Universidad Carlos III de Madrid and the Universidad Rey Juan Carlos. President: Joaquín Míguez Arenas. Secretary: Francisco Jesús Rodríguez Ruiz. Member: Santiago Ovejero Garcí
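The Chapter 5 idea of repeating the worst-performing samples of the previous mini-batch can be sketched as a simple batch-composition rule. The abstract does not specify which variation the thesis uses, so the following is only one possible reading: the function name, the `num_repeat` parameter and the policy of replacing the tail of the fresh batch are assumptions made for illustration.

```python
def next_batch_with_hard_examples(fresh_indices, prev_indices, prev_losses,
                                  num_repeat):
    """Compose the next mini-batch from hard and fresh samples.

    The num_repeat highest-loss samples of the previous mini-batch come
    first, followed by fresh samples, so that the batch size stays equal to
    len(fresh_indices). This replacement policy is a hypothetical choice.
    """
    batch_size = len(fresh_indices)
    # Rank the previous mini-batch from highest to lowest per-sample loss.
    ranked = sorted(zip(prev_losses, prev_indices),
                    key=lambda pair: pair[0], reverse=True)
    hardest = [index for _, index in ranked[:min(num_repeat, batch_size)]]
    # Fill the remaining slots with fresh samples, skipping duplicates.
    kept_fresh = [i for i in fresh_indices if i not in hardest]
    kept_fresh = kept_fresh[:batch_size - len(hardest)]
    return hardest + kept_fresh


if __name__ == "__main__":
    batch = next_batch_with_hard_examples(
        fresh_indices=[10, 11, 12, 13],
        prev_indices=[0, 1, 2, 3],
        prev_losses=[0.2, 1.5, 0.1, 0.9],
        num_repeat=2,
    )
    print(batch)  # [1, 3, 10, 11]
```

In a training loop, `prev_losses` would be the per-sample losses recorded on the previous step, before they are averaged for the gradient update.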
Variational Counterfactual Prediction under Runtime Domain Corruption
To date, various neural methods have been proposed for causal effect
estimation based on observational data, where a default assumption is the same
distribution and availability of variables at both training and inference
(i.e., runtime) stages. However, distribution shift (i.e., domain shift) could
happen during runtime, and bigger challenges arise from the impaired
accessibility of variables. This is commonly caused by increasing privacy and
ethical concerns, which can make arbitrary variables unavailable in the entire
runtime data and imputation impractical. We term the co-occurrence of domain
shift and inaccessible variables runtime domain corruption, which seriously
impairs the generalizability of a trained counterfactual predictor. To counter
runtime domain corruption, we subsume counterfactual prediction under the
notion of domain adaptation. Specifically, we upper-bound the error w.r.t. the
target domain (i.e., runtime covariates) by the sum of source domain error and
inter-domain distribution distance. In addition, we build an adversarially
unified variational causal effect model, named VEGAN, with a novel two-stage
adversarial domain adaptation scheme to reduce the latent distribution
disparity between treated and control groups first, and between training and
runtime variables afterwards. We demonstrate that VEGAN outperforms other
state-of-the-art baselines on individual-level treatment effect estimation in
the presence of runtime domain corruption on benchmark datasets.
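The upper bound mentioned in this abstract (target-domain error bounded by source-domain error plus an inter-domain distribution distance) has the generic shape of a domain adaptation bound. The form below is only a schematic reading of that sentence: the symbols are assumed notation, with h the counterfactual predictor, \epsilon_S and \epsilon_T its errors on the source (training) and target (runtime) domains, and d an unspecified distance between the two covariate distributions. The paper's actual bound may use a specific distance and include additional terms or constants.

```latex
\epsilon_{T}(h) \;\le\; \epsilon_{S}(h) + d\left(P_{S},\, P_{T}\right)
```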
No Pattern, No Recognition: a Survey about Reproducibility and Distortion Issues of Text Clustering and Topic Modeling
Extracting knowledge from unlabeled texts using machine learning algorithms
can be complex. Document categorization and information retrieval are two
applications that may benefit from unsupervised learning (e.g., text clustering
and topic modeling), including exploratory data analysis. However, the
unsupervised learning paradigm poses reproducibility issues. The initialization
can lead to variability depending on the machine learning algorithm.
Furthermore, the distortions can be misleading when regarding cluster geometry.
Amongst the causes, the presence of outliers and anomalies can be a determining
factor. Despite the relevance of initialization and outlier issues for text
clustering and topic modeling, the authors did not find an in-depth analysis of
them. This survey provides a systematic literature review (2011-2022) of these
subareas and proposes a common terminology since similar procedures have
different terms. The authors describe research opportunities, trends, and open
issues. The appendices summarize the theoretical background of the text
vectorization, the factorization, and the clustering algorithms that are
directly or indirectly related to the reviewed works.
Machine Learning and Finance: A Review using Latent Dirichlet Allocation Technique (LDA)
The aim of this paper is to provide a first comprehensive structuring of the literature applying machine learning to finance. We use a probabilistic topic modelling approach to make sense of this diverse body of research spanning the disciplines of finance, economics, computer sciences, and decision sciences. Through the topic modelling approach, a Latent Dirichlet Allocation Technique (LDA), we extract the 14 coherent research topics that are the focus of the 6,148 academic articles analysed from the years 1990-2019. We first describe and structure these topics, and then show how the topic focus has evolved over the last two decades. Our study thus provides a structured topography for finance researchers seeking to integrate machine learning research approaches in their exploration of finance phenomena. We also showcase the benefits to finance researchers of probabilistic topic modelling for deep comprehension of a body of literature, especially when that literature has diverse multi-disciplinary actors.