379 research outputs found

    Topic Modeling on Health Journals with Regularized Variational Inference

    Topic modeling enables exploration and compact representation of a corpus. The CaringBridge (CB) dataset is a massive collection of journals written by patients and caregivers during a health crisis. Topic modeling on the CB dataset, however, is challenging due to the asynchronous nature of multiple authors writing about their health journeys. To overcome this challenge we introduce the Dynamic Author-Persona topic model (DAP), a probabilistic graphical model designed for temporal corpora with multiple authors. The novelty of the DAP model lies in its representation of authors by a persona, where personas capture the propensity to write about certain topics over time. Further, we present a regularized variational inference algorithm, which we use to encourage the DAP model's personas to be distinct. Our results show significant improvements over competing topic models, particularly after regularization, and highlight the DAP model's unique ability to capture common journeys shared by different authors. Comment: Published in the Thirty-Second AAAI Conference on Artificial Intelligence, February 2018, New Orleans, Louisiana, US
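    A minimal sketch of how a distinctness regularizer like the one described above can enter the variational objective; the persona matrix, the overlap penalty, and the weight rho are illustrative assumptions, not the DAP paper's exact formulation.

```python
# Toy regularizer in the spirit of "encourage personas to be distinct":
# penalize overlap between per-persona topic distributions. All shapes
# and the penalty form are assumptions, not the DAP model's exact math.
import numpy as np

def persona_overlap(personas: np.ndarray) -> float:
    """personas: (P, K) array; each row is one persona's topic distribution."""
    gram = personas @ personas.T               # pairwise persona similarity
    return float(gram.sum() - np.trace(gram))  # off-diagonal overlap only

def regularized_objective(elbo: float, personas: np.ndarray, rho: float = 10.0) -> float:
    # Maximizing this trades model fit (the ELBO) against persona
    # overlap, pushing the inferred personas toward distinct profiles.
    return elbo - rho * persona_overlap(personas)
```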

    Dirichlet belief networks for topic structure learning

    Recently, considerable research effort has been devoted to developing deep architectures for topic models to learn topic structures. Although several deep models have been proposed to learn better topic proportions of documents, how to leverage the benefits of deep structures for learning word distributions of topics has not yet been rigorously studied. Here we propose a new multi-layer generative process on word distributions of topics, where each layer consists of a set of topics and each topic is drawn from a mixture of the topics of the layer above. As the topics in all layers can be directly interpreted by words, the proposed model is able to discover interpretable topic hierarchies. As a self-contained module, our model can be flexibly adapted to different kinds of topic models to improve their modelling accuracy and interpretability. Extensive experiments on text corpora demonstrate the advantages of the proposed model. Comment: Accepted at NIPS 2018
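    The layer-wise generative process described here lends itself to a short ancestral-sampling sketch; the vocabulary size, layer widths, and concentration beta below are arbitrary choices, not the paper's settings.

```python
# Hedged sketch of a two-layer topic hierarchy: each lower-layer topic is
# drawn from a Dirichlet centred on a mixture of the topics one layer up,
# so every topic in every layer remains an interpretable word distribution.
import numpy as np

rng = np.random.default_rng(0)
V, K_top, K_bottom, beta = 1000, 5, 20, 50.0   # illustrative sizes only

# Top-layer topics: plain Dirichlet draws over the vocabulary.
top_topics = rng.dirichlet(np.full(V, 0.05), size=K_top)      # (K_top, V)

# Mixing weights: how much each lower topic inherits from each upper topic.
mix = rng.dirichlet(np.full(K_top, 1.0), size=K_bottom)       # (K_bottom, K_top)

# Lower-layer topics: Dirichlet centred on the mixed upper-layer topics.
bottom_topics = np.stack(
    [rng.dirichlet(beta * mix[k] @ top_topics) for k in range(K_bottom)]
)
```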

    Advancing Biomedicine with Graph Representation Learning: Recent Progress, Challenges, and Future Directions

    Graph representation learning (GRL) has emerged as a pivotal field that has contributed significantly to breakthroughs in various fields, including biomedicine. The objective of this survey is to review the latest advancements in GRL methods and their applications in the biomedical field. We also highlight key challenges currently faced by GRL and outline potential directions for future research. Comment: Accepted by the 2023 IMIA Yearbook of Medical Informatics

    Probabilistic Models and Natural Language Processing in Health

    The treatment of mental disorders still entails a wide variety of unsolved problems, such as misdiagnosis and delayed diagnosis. In this doctoral thesis we study and develop different models that can serve as potential tools for clinicians in their work. Our proposals follow two main lines of research: Natural Language Processing (NLP) and probabilistic methods.

    In Chapter 2, we begin with a regularization mechanism for language models that is especially effective in Transformer-based architectures, which we call NoRBERT, from Noisy Regularized Bidirectional Representations from Transformers [9], [15]. Reviewing the literature, we found that regularization in NLP is a little-explored field, largely limited to general mechanisms such as dropout [57] or early stopping [58]. Against this background, we propose a novel approach that combines any language model with Variational Auto-Encoders (VAEs) [23]. VAEs are deep generative models that construct a regular latent space, allowing input samples to be reconstructed through encoder and decoder networks. Our VAE uses a mixture-of-Gaussians prior (GMVAE), which gives the model the ability to capture multimodal information. Combining Transformers and GMVAEs, we build an architecture that can impute missing words in a text corpus across a diverse topic space and improve the BLEU score on reconstruction of the dataset; both results depend on the depth of the regularized layer of the Transformer encoder. In essence, the regularization consists of the GMVAE reconstructing the Transformer embeddings at some point in the architecture, adding structured noise that helps the model generalize better. We show improvements for BERT [15], RoBERTa [16] and XLM-R [17], verified on different datasets, and provide explicit examples of sentences reconstructed by Top NoRBERT. In addition, we validate the model's abilities in data augmentation, improving classification accuracy and F1 score in various datasets and scenarios thanks to augmented samples generated by NoRBERT. We also study variations of the model, Top, Deep and Contextual NoRBERT, the latter based on using contextual words to reconstruct the embeddings in the corresponding Transformer layer.

    We continue the Transformer line of research in Chapter 3 with PsyBERT. PsyBERT, as the name suggests, is a BERT-based [15] architecture suitably modified to work on Electronic Health Records (EHRs) of psychiatric patients. It is inspired by BEHRT [19], which also targets EHRs in general health; our model differs in its training methodology and embedding layer. As with NoRBERT, we find it useful to apply a Masked Language Modeling (MLM) policy without any fine-tuning or task-specific layer. Whereas MLM in NoRBERT served to impute missing words, supporting the model's aim of generating new sentences from inputs with missing information, in PsyBERT we first propose using it to fill in missing diagnoses in the EHR and to correct misdiagnosed cases. We then apply PsyBERT to delusional disorder detection; in this scenario we do add a multi-label classification layer, which computes the probability of each diagnosis at the patient's last hospital visit. From these probabilities we analyse delusional cases and propose a tool to detect potential candidates for this mental disorder. In both tasks we make use of several fields from the patient's EHR, such as age, sex, diagnoses and psychiatric treatment history, and propose a method capable of combining heterogeneous data to support diagnosis in mental health. Throughout this work we point out the data-quality problems of EHRs [104], [105] and the great value that medical-assistance tools like ours can provide. We not only solve a classification problem with more than 700 different illnesses, but also offer a model that helps doctors diagnose very complex scenarios involving comorbidity, long periods of patient exploration under traditional methodology, or low-prevalence cases.

    Following the health line of research and its psychiatric application, in Chapter 4 we analyse a probabilistic method for finding behavioral patterns in patients with mental disorders. Here the contribution is not the method itself but its application and the interpretation of the results in collaboration with clinicians. The model, the Sparse Poisson Factorization Model (SPFM) [22], is a non-parametric probabilistic model based on the Indian Buffet Process (IBP) [20], [21]. It is an exploratory method that decomposes the input data into sparse matrices: it places a Poisson distribution on the product of two matrices, Z and B, obtained respectively from the IBP and a Gamma distribution. Z is a binary matrix representing which latent features are active in a patient's data, and B weights the contribution of the data characteristics to the latent features. The data used in the three studies described in this chapter come from e-health questionnaires: the data characteristics are the answers or scores on each question, and the latent features correspond to different behavioral patterns, identified by which features are active in a patient's questionnaires. For example, patient X may present features 1 and 2 while patient Y presents features 1 and 3, resulting in two different behavioral profiles.

    With this procedure we study three scenarios. In the first, we relate the profiles to the diagnoses, finding common patterns among patients and connections between diseases; we also analyse the degree of clinical severity and contrast it with the clinician's judgment via the Clinical Global Impression (CGI). In the second, we pursue a similar study and find connections between disturbed sleeping patterns and clinical markers of a wish to die. We focus this analysis on patients with suicidal thoughts, given the major public-health issue these cases represent [175]; here we vary the questionnaire and the data sample, obtaining different profiles that again carry important information for the psychiatrist to interpret. The main contribution of this work is the provision of a mechanism capable of helping with the detection and prevention of suicide. The third study examines behavioral patterns in mental-health patients before and during the COVID-19 lockdown: we did not want to miss the chance to contribute during the coronavirus outbreak, and present a study of the changes in psychiatric patients during the state of alarm. Analysing the profiles again with the previous e-health questionnaire, we discover that self-reported suicide risk decreased during the lockdown. These results contrast with other studies [237] and may signal an increase in suicidal ideation once the crisis ceases.

    Finally, Chapter 5 proposes a regularization mechanism, based on a theoretical idea from [245], aimed at reducing the variance of the true risk. We interpret the robust regularized risk those authors propose as a two-step mechanism, the minimization of a weighted risk followed by the maximization of a robust objective, and suggest applying this methodology to the selection of samples for the mini-batch in a deep learning setup. We study different variants of repeating the worst-performing samples from the previous mini-batch during training, and show evidence of improved accuracy and faster convergence rates on an image classification problem with different architectures and datasets.

    Doctoral Program in Multimedia and Communications, Universidad Carlos III de Madrid and Universidad Rey Juan Carlos. Committee: Chair, Joaquín Míguez Arenas; Secretary, Francisco Jesús Rodríguez Ruiz; Member, Santiago Ovejero García.
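    A minimal sketch of the Chapter 5 sample-selection idea as summarized above: the highest-loss samples of one mini-batch are repeated in the next. The buffer size and the loss_fn/step_fn hooks are hypothetical stand-ins, not the thesis's implementation.

```python
# Repeat the previous mini-batch's worst-performing (highest-loss)
# samples in the next mini-batch. `loss_fn` returns per-sample losses and
# `step_fn` performs one optimizer step; both are hypothetical hooks.
import numpy as np

def train_with_hard_repeats(data, labels, loss_fn, step_fn,
                            batch_size=64, n_repeat=8, epochs=1):
    rng = np.random.default_rng(0)
    hard_idx = np.array([], dtype=int)        # worst indices from last step
    n = len(data)
    for _ in range(epochs):
        for _ in range(n // batch_size):
            fresh = rng.choice(n, size=batch_size - len(hard_idx), replace=False)
            batch = np.concatenate([fresh, hard_idx])
            losses = np.asarray(loss_fn(data[batch], labels[batch]))
            step_fn(data[batch], labels[batch])
            # Carry the n_repeat highest-loss samples into the next batch.
            hard_idx = batch[np.argsort(losses)[-n_repeat:]]
```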

    Variational Counterfactual Prediction under Runtime Domain Corruption

    To date, various neural methods have been proposed for causal effect estimation based on observational data, where a default assumption is the same distribution and availability of variables at both training and inference (i.e., runtime) stages. However, distribution shift (i.e., domain shift) could happen during runtime, and bigger challenges arise from the impaired accessibility of variables. This is commonly caused by increasing privacy and ethical concerns, which can make arbitrary variables unavailable in the entire runtime data and imputation impractical. We term the co-occurrence of domain shift and inaccessible variables runtime domain corruption, which seriously impairs the generalizability of a trained counterfactual predictor. To counter runtime domain corruption, we subsume counterfactual prediction under the notion of domain adaptation. Specifically, we upper-bound the error w.r.t. the target domain (i.e., runtime covariates) by the sum of source domain error and inter-domain distribution distance. In addition, we build an adversarially unified variational causal effect model, named VEGAN, with a novel two-stage adversarial domain adaptation scheme to reduce the latent distribution disparity between treated and control groups first, and between training and runtime variables afterwards. We demonstrate that VEGAN outperforms other state-of-the-art baselines on individual-level treatment effect estimation in the presence of runtime domain corruption on benchmark datasets.
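    The bound invoked here has the familiar shape of classical domain-adaptation results; a schematic version, in generic notation rather than the paper's, is:

```latex
% Schematic target-error bound of the kind the abstract describes.
% \epsilon_S, \epsilon_T: risks of hypothesis h on the source (training)
% and target (runtime) domains; d: an inter-domain distribution distance;
% \lambda: the error of a jointly optimal predictor. Generic notation,
% not necessarily VEGAN's.
\epsilon_T(h) \;\le\; \epsilon_S(h) + d(\mathcal{D}_S, \mathcal{D}_T) + \lambda
```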

    No Pattern, No Recognition: a Survey about Reproducibility and Distortion Issues of Text Clustering and Topic Modeling

    Extracting knowledge from unlabeled texts using machine learning algorithms can be complex. Document categorization and information retrieval are two applications that may benefit from unsupervised learning (e.g., text clustering and topic modeling), including exploratory data analysis. However, the unsupervised learning paradigm poses reproducibility issues: depending on the machine learning algorithm, initialization can lead to variability. Furthermore, distortions can be misleading with respect to cluster geometry; among the causes, the presence of outliers and anomalies can be a determining factor. Despite the relevance of initialization and outlier issues for text clustering and topic modeling, the authors did not find an in-depth analysis of them. This survey provides a systematic literature review (2011-2022) of these subareas and proposes a common terminology, since similar procedures are referred to by different terms. The authors describe research opportunities, trends, and open issues. The appendices summarize the theoretical background of the text vectorization, factorization, and clustering algorithms that are directly or indirectly related to the reviewed works.
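    The initialization issue is easy to demonstrate; below is a small sketch using k-means on stand-in document vectors, where the data and all parameters are placeholders.

```python
# Two k-means runs on the same data, differing only in random seed, can
# produce different partitions; a low adjusted Rand index (ARI) between
# them exposes the seed-dependence the survey discusses.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

X = np.random.default_rng(0).normal(size=(500, 20))  # stand-in doc vectors

labels = [
    KMeans(n_clusters=5, n_init=1, random_state=seed).fit_predict(X)
    for seed in (0, 1)
]
print(adjusted_rand_score(labels[0], labels[1]))  # 1.0 = identical clusterings
```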

    Machine Learning and Finance: A Review using Latent Dirichlet Allocation Technique (LDA)

    The aim of this paper is to provide a first comprehensive structuring of the literature applying machine learning to finance. We use a probabilistic topic modelling approach to make sense of this diverse body of research, which spans the disciplines of finance, economics, computer science, and decision sciences. Through this approach, the Latent Dirichlet Allocation technique (LDA), we extract 14 coherent research topics that are the focus of the 6,148 academic articles, published during the years 1990-2019, that we analyse. We first describe and structure these topics, and then show how the topic focus has evolved over the last two decades. Our study thus provides a structured topography for finance researchers seeking to integrate machine learning approaches into their exploration of finance phenomena. We also showcase the benefits of probabilistic topic modelling for achieving a deep comprehension of a body of literature, especially when that literature involves diverse, multi-disciplinary actors.
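    A minimal sketch of this kind of LDA pipeline, with a three-document toy corpus standing in for the 6,148 abstracts; the vectorizer settings and topic count are placeholders, not the authors' configuration.

```python
# Fit LDA on a toy corpus and print the top words of each topic; the
# paper performs the analogous extraction of 14 topics from 6,148 articles.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

abstracts = [
    "machine learning for credit risk prediction",
    "deep learning models for option pricing",
    "topic models for reviewing the finance literature",
]
vec = CountVectorizer(stop_words="english")
counts = vec.fit_transform(abstracts)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
words = vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):       # topic-word weights
    print(k, [words[i] for i in topic.argsort()[-3:]])
```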