370 research outputs found
Priors for Random Count Matrices Derived from a Family of Negative Binomial Processes
We define a family of probability distributions for random count matrices
with a potentially unbounded number of rows and columns. The three
distributions we consider are derived from the gamma-Poisson, gamma-negative
binomial, and beta-negative binomial processes. Because the models lead to
closed-form Gibbs sampling update equations, they are natural candidates for
nonparametric Bayesian priors over count matrices. A key aspect of our analysis
is the recognition that, although the random count matrices within the family
are defined by a row-wise construction, their columns can be shown to be i.i.d.
This fact is used to derive explicit formulas for drawing all the columns at
once. Moreover, by analyzing these matrices' combinatorial structure, we
describe how to sequentially construct a column-i.i.d. random count matrix one
row at a time, and derive the predictive distribution of a new row count vector
with previously unseen features. We describe the similarities and differences
between the three priors, and argue that the greater flexibility of the gamma-
and beta- negative binomial processes, especially their ability to model
over-dispersed, heavy-tailed count data, makes these well suited to a wide
variety of real-world applications. As an example of our framework, we
construct a naive-Bayes text classifier to categorize a count vector to one of
several existing random count matrices of different categories. The classifier
supports an unbounded number of features, and unlike most existing methods, it
does not require a predefined finite vocabulary to be shared by all the
categories, and needs neither feature selection nor parameter tuning. Both the
gamma- and beta- negative binomial processes are shown to significantly
outperform the gamma-Poisson process for document categorization, with
comparable performance to other state-of-the-art supervised text classification
algorithms.Comment: To appear in Journal of the American Statistical Association (Theory
and Methods). 31 pages + 11 page supplement, 5 figure
Clustering and its Application in Requirements Engineering
Large scale software systems challenge almost every activity in the software development life-cycle, including tasks related to eliciting, analyzing, and specifying requirements. Fortunately many of these complexities can be addressed through clustering the requirements in order to create abstractions that are meaningful to human stakeholders. For example, the requirements elicitation process can be supported through dynamically clustering incoming stakeholders’ requests into themes. Cross-cutting concerns, which have a significant impact on the architectural design, can be identified through the use of fuzzy clustering techniques and metrics designed to detect when a theme cross-cuts the dominant decomposition of the system. Finally, traceability techniques, required in critical software projects by many regulatory bodies, can be automated and enhanced by the use of cluster-based information retrieval methods. Unfortunately, despite a significant body of work describing document clustering techniques, there is almost no prior work which directly addresses the challenges, constraints, and nuances of requirements clustering. As a result, the effectiveness of software engineering tools and processes that depend on requirements clustering is severely limited. This report directly addresses the problem of clustering requirements through surveying standard clustering techniques and discussing their application to the requirements clustering process
Bayesian Hidden Topic Markov Models
Recent developments in topic modeling for text corpora have incorporated Markov models in the latent space to better learn contextual content. Known as the Hidden Topic Markov Model (HTMM), this natural extension of probabilistic mixture models relaxes the bag-of-words assumption of the foundational latent Dirichlet allocation topic model by allowing the discrete latent variables, or topics, to follow a special first-order Markov process. Parameter estimation is performed using an expectation-maximization (EM) algorithm with fixed dimensionality of the topic space (Gruber, Rosen-Zvi, and Weiss 2007). I fully derive the state space and EM algorithm for the HTMM. I then extend the Hidden Topic Markov Model (HTMM) into a fully Bayesian framework using a Gibbs sampler. The necessary full conditional distributions are derived and a Gibbs sampling algorithm proposed. I implement both the HTMM EM algorithm (Gruber, Rosen-Zvi, and Weiss 2007) and the HTMM Gibbs sampling algorithm in the R and C++ programming languages. The performance of both inferential algorithms is evaluated on twelve simulated data sets and on a collection of proceedings from the Conference on Neural Information Processing Systems (NIPS). The results suggest that the Gibbs sampling algorithm provides better recovery of the topic space than a combination of the EM and Viterbi algorithms. Parameter estimation is comparable using point estimates with both algorithms. The convergence of the Gibbs sampler is studied and is reliable for reasonably large data sets. Evaluation of both algorithms on the NIPS corpus suggests that the HTMM is better able to handle polysemy than LDA and provides coherent and contiguous topics. Predictive accuracy measured by perplexity is better on training and test documents using the HTMM than using LDA on the NIPS corpus. Introducing Markovian dynamics in topical space provides better topical segmentation of a corpus and increased predictive accuracy for unseen documents
- …