Appropriate, accessible and appealing probabilistic graphical models
Appropriate - Many multivariate probabilistic models either use independent distributions or dependent Gaussian distributions. Yet many real-world datasets contain count-valued or non-negative skewed data, e.g. bag-of-words text data and biological sequencing data. Thus, we develop novel probabilistic graphical models for count-valued and non-negative data, including Poisson graphical models and multinomial graphical models. We develop a generalization that allows for triple-wise or k-wise interactions, going beyond the usual pairwise formulation. Furthermore, we explore Gaussian-copula graphical models and derive closed-form solutions for the conditional and marginal distributions (both before and after conditioning). Finally, we derive mixture and admixture (i.e., topic model) generalizations of these graphical models to add expressive power and interpretability.
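As a concrete point of reference, the pairwise Poisson graphical model from the graphical-models literature (a standard form assumed here, not necessarily the exact parameterization used in this work) specifies each node conditional as

$$x_s \mid x_{\setminus s} \;\sim\; \mathrm{Poisson}\!\left(\exp\Big(\theta_s + \sum_{t \in N(s)} \theta_{st}\, x_t\Big)\right),$$

which corresponds to the joint density $P(x) \propto \exp\big(\sum_s \theta_s x_s + \sum_{(s,t) \in E} \theta_{st}\, x_s x_t - \sum_s \log x_s!\big)$. The k-wise generalization replaces the pairwise products $x_s x_t$ with higher-order products over cliques of size up to k.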
Accessible - Previous multivariate models, especially for text data, often have complex dependencies without a closed form and require complex inference algorithms with limited theoretical justification. For example, hierarchical Bayesian models often require marginalizing over many latent variables. We show that our novel graphical models (even the k-wise interaction models) have simple and intuitive estimation procedures based on node-wise regressions, which likely enjoy theoretical guarantees similar to previous work on graphical models. For the copula-based graphical models, we show that simple approximations can still provide useful models; these copula models also come with closed-form conditional and marginal distributions, which makes them amenable to exploratory inspection and manipulation. The parameters of these models are easy to interpret and thus may be accessible to a wide audience.
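A minimal sketch of the node-wise regression idea described above, following the standard neighborhood-selection recipe from the graphical-models literature rather than this work's exact procedure: each count variable is regressed on all the others with an l1-penalized Poisson regression, and nonzero coefficients suggest graph edges. All names and constants are illustrative.

```python
# Node-wise l1-penalized Poisson regression (proximal gradient / ISTA).
# A sketch of neighborhood selection for a Poisson graphical model.
import numpy as np

def nodewise_poisson_lasso(X, node, lam=0.1, lr=1e-3, n_iter=2000):
    """Regress column `node` of count matrix X on all other columns.

    Model: x_node | x_rest ~ Poisson(exp(b0 + x_rest @ w)).
    Returns the fitted weight vector w; nonzero entries suggest edges.
    """
    n, p = X.shape
    y = X[:, node].astype(float)
    Z = np.delete(X, node, axis=1).astype(float)
    w = np.zeros(p - 1)
    b0 = np.log(y.mean() + 1e-8)             # intercept initialized at log mean
    for _ in range(n_iter):
        eta = b0 + Z @ w
        mu = np.exp(np.clip(eta, -30, 30))   # Poisson mean, clipped for stability
        grad_w = Z.T @ (mu - y) / n          # gradient of the negative log-likelihood
        grad_b = np.mean(mu - y)
        w -= lr * grad_w
        b0 -= lr * grad_b
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)  # soft-threshold (l1)
    return w

# Toy usage: estimate the neighborhood of variable 0 in a random count matrix.
rng = np.random.default_rng(0)
X = rng.poisson(2.0, size=(500, 10))
print(np.round(nodewise_poisson_lasso(X, node=0), 3))
```

Running this once per node and symmetrizing the resulting neighborhoods (e.g., by AND/OR rules) yields an estimated graph, which is what makes the procedure simple enough to be broadly accessible.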
Appealing - High-level visualization and interpretation of graphical models with even 100 variables has often been difficult even for graphical model experts, despite visualization being one of the original motivations for graphical models. This difficulty is likely due to the lack of collaboration between graphical model experts and visualization experts. To begin bridging this gap, we develop a novel "what if?" interaction that manipulates and leverages the probabilistic power of graphical models. Our approach defines the probabilistic mechanism via conditional probability, the query language that maps text input to a conditional probability query, and the formal underlying probabilistic model. We then propose to visualize these query-specific probabilistic graphical models by combining the intuitiveness of force-directed layouts with the beauty and readability of word clouds, which pack many words into valuable screen space while ensuring words do not overlap via pixel-level collision detection. Although the force-directed layout and pixel-level packing problems are each challenging in their own right, we approximate both simultaneously via adaptive simulated annealing starting from a careful initialization. For visualizing mixture distributions, we also design a meaningful mapping from the properties of the mixture distribution to a color in the perceptually uniform CIELUV color space. Finally, we demonstrate our approach via illustrative visualizations of several real-world datasets.
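A hedged, illustrative sketch of the layout idea (not the authors' implementation): simulated annealing over word positions, trading a force-directed attraction energy against an overlap penalty. Axis-aligned bounding boxes stand in for the paper's pixel-level collision detection, and a simple linear cooling schedule stands in for adaptive annealing; all names and constants are made up for illustration.

```python
# Simulated annealing for a word-cloud-style layout:
# connected words attract, overlapping boxes are penalized.
import math, random

def energy(pos, boxes, edges):
    e = 0.0
    for (i, j, w) in edges:                       # attraction along graph edges
        dx = pos[i][0] - pos[j][0]
        dy = pos[i][1] - pos[j][1]
        e += w * (dx * dx + dy * dy)
    for i in range(len(pos)):                     # penalize overlapping boxes
        for j in range(i + 1, len(pos)):
            ox = max(0.0, min(pos[i][0] + boxes[i][0], pos[j][0] + boxes[j][0])
                     - max(pos[i][0], pos[j][0]))
            oy = max(0.0, min(pos[i][1] + boxes[i][1], pos[j][1] + boxes[j][1])
                     - max(pos[i][1], pos[j][1]))
            e += 50.0 * ox * oy                   # penalty proportional to overlap area
    return e

def anneal(pos, boxes, edges, t0=1.0, steps=5000):
    cur = energy(pos, boxes, edges)
    for s in range(steps):
        t = t0 * (1.0 - s / steps) + 1e-6         # linear cooling schedule
        i = random.randrange(len(pos))
        old = pos[i]
        pos[i] = (old[0] + random.gauss(0, 5), old[1] + random.gauss(0, 5))
        new = energy(pos, boxes, edges)
        if new < cur or random.random() < math.exp((cur - new) / t):
            cur = new                             # accept the move
        else:
            pos[i] = old                          # reject and restore
    return pos

# Toy usage: three words (as width/height boxes) with two edges.
pos = [(0.0, 0.0), (60.0, 10.0), (20.0, 80.0)]
boxes = [(40.0, 12.0), (55.0, 12.0), (30.0, 12.0)]
edges = [(0, 1, 1e-3), (1, 2, 1e-3)]
print(anneal(pos, boxes, edges))
```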
PGLDA: enhancing the precision of topic modelling using Poisson-Gamma (PG) and Latent Dirichlet Allocation (LDA) for text information retrieval
The Poisson document length distribution has been used extensively in the past for modeling topics, with the expectation that its effect will disintegrate by the end of the model definition. This procedure often leads to downplaying the correlation between words and topics and reduces the precision of retrieved documents. Existing document models, such as the Latent Dirichlet Allocation (LDA) model, do not accommodate the semantic representation of words. Therefore, in this thesis, the Poisson-Gamma Latent Dirichlet Allocation (PGLDA) model for capturing word dependencies in topic modeling is introduced. The PGLDA model relaxes the word independence assumption of the existing LDA model by introducing a Gamma distribution that captures the correlation between adjacent words in documents. PGLDA is hybridized with distributed representations of documents (Doc2Vec) and topics (Topic2Vec) to form a new model named PGLDA2Vec. The hybridization is achieved by averaging the Doc2Vec and Topic2Vec vectors to form new representation vectors, combined with the topics of largest estimated probability under PGLDA. Model estimation for PGLDA and PGLDA2Vec is achieved by combining the Laplace approximation of the PGLDA log-likelihood with the feed-forward neural network (FFN) approaches of Doc2Vec and Topic2Vec. The proposed PGLDA and hybrid PGLDA2Vec models were assessed using precision, micro F1 score, perplexity, and coherence score. Empirical results on three real-world datasets (20 Newsgroups, AG News, and Reuters) showed that the hybrid PGLDA2Vec model, with an average precision of 86.6% and an average F1 score of 96.3% across the three datasets, outperformed the other competing models reviewed.
Graphical models beyond standard settings: lifted decimation, labeling, and counting
With increasing complexity and growing problem sizes in AI and Machine Learning, inference and learning remain major issues in Probabilistic Graphical Models (PGMs). At the same time, many problems are specified in such a way that symmetries arise from the underlying model structure. Exploiting these symmetries during inference, referred to as "lifted inference", has led to significant efficiency gains. This thesis provides several enhanced versions of known algorithms that are shown to be liftable as well, thereby applying lifting in "non-standard" settings. In doing so, it extends the understanding of the applicability of lifted inference and of lifting in general. Among various other experiments, it is shown how lifted inference, in combination with an innovative Web-based data-harvesting pipeline, is used to label author-paper pairs with geographic information in online bibliographies. This results in a large-scale transnational bibliography containing affiliation information over time for roughly one million authors. Analyzing this dataset reveals the importance of understanding count data. Although counting is done literally everywhere, mainstream PGMs have largely neglected count data. In cases where the ranges of the random variables are the natural numbers, crude approximations to the true distribution are often made by discretization or a Gaussian assumption. To handle count data, Poisson Dependency Networks (PDNs) are introduced, a new class of non-standard PGMs that naturally handles count data.
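Operationally, a dependency network of this kind can be read as one conditional Poisson model per count variable, from which (pseudo-)Gibbs sampling draws joint configurations. The sketch below assumes a simple log-linear conditional as a stand-in for whatever regressor is actually fitted in a PDN; the class and method names are illustrative, not the thesis API.

```python
# (Pseudo-)Gibbs sampling from a collection of per-node Poisson conditionals.
import numpy as np

class LogLinearPoisson:
    """Tiny stand-in conditional model: rate = exp(b + w . others)."""
    def __init__(self, w, b):
        self.w, self.b = np.asarray(w, float), float(b)
    def predict_rate(self, others):
        return float(np.exp(self.b + self.w @ others))

def pdn_gibbs_sample(models, x0, n_sweeps=100, rng=None):
    """Cycle through nodes, resampling each count from its conditional
    Poisson model given the current values of all the other counts."""
    rng = rng or np.random.default_rng(0)
    x = np.array(x0, dtype=float)
    for _ in range(n_sweeps):
        for i, m in enumerate(models):
            others = np.delete(x, i)              # all counts except node i
            x[i] = rng.poisson(m.predict_rate(others))
    return x

# Toy usage: three mutually inhibiting count variables.
w = -0.1
models = [LogLinearPoisson([w, w], 1.0) for _ in range(3)]
print(pdn_gibbs_sample(models, x0=[1, 1, 1]))
```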