A Novel Document Generation Process for Topic Detection based on Hierarchical Latent Tree Models
We propose a novel document generation process based on hierarchical latent
tree models (HLTMs) learned from data. An HLTM has a layer of observed word
variables at the bottom and multiple layers of latent variables on top. For
each document, we first sample values for the latent variables layer by layer
via logic sampling, then draw relative frequencies for the words conditioned on
the values of the latent variables, and finally generate words for the document
using the relative word frequencies. The motivation for the work is to take
word counts into consideration with HLTMs. In comparison with LDA-based
hierarchical document generation processes, the new process achieves
drastically better model fit with far fewer parameters. It also yields more
meaningful topics and topic hierarchies. It is the new state of the art for
hierarchical topic detection.
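The three generation steps named in this abstract (sample latent variables layer by layer, draw relative word frequencies, generate words) can be illustrated with a toy sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the two-layer tree, the binary latent variables, the probability tables, and the use of normalized gamma draws for the relative frequencies.

```python
import random

# Hypothetical toy tree: one root latent, two child latents, four words.
P_ROOT_ON = 0.5
P_CHILD_ON = {0: 0.2, 1: 0.8}      # assumed P(child = 1 | root)
WORD_PARENT = {"goal": "z1", "match": "z1", "vote": "z2", "senate": "z2"}
GAMMA_SHAPE = {0: 0.5, 1: 5.0}     # assumed shape of the frequency draw per parent state


def generate_document(n_words, seed):
    rng = random.Random(seed)
    # Step 1: sample latent variables top-down, parents before children.
    root = int(rng.random() < P_ROOT_ON)
    latents = {name: int(rng.random() < P_CHILD_ON[root]) for name in ("z1", "z2")}
    # Step 2: draw a relative frequency for each word given its parent latent.
    raw = {w: rng.gammavariate(GAMMA_SHAPE[latents[p]], 1.0)
           for w, p in WORD_PARENT.items()}
    total = sum(raw.values())
    freqs = {w: x / total for w, x in raw.items()}
    # Step 3: generate the document's words from the relative frequencies.
    vocab = list(freqs)
    words = rng.choices(vocab, weights=[freqs[w] for w in vocab], k=n_words)
    return latents, freqs, words
```

Calling `generate_document(30, seed=1)` returns the sampled latent states, the per-word relative frequencies (which sum to 1), and a list of 30 words. Words whose parent latent is switched on tend to receive larger frequencies and so appear more often.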
Combining Thesaurus Knowledge and Probabilistic Topic Models
In this paper we present an approach for introducing thesaurus knowledge into
probabilistic topic models. The approach rests on the assumption that the
frequencies of semantically related words and phrases that occur in the same
texts should be boosted, which increases their contribution to the topics
found in these texts. We have conducted experiments with several thesauri and
found that domain-specific knowledge is useful for improving topic models. If
a general thesaurus, such as WordNet, is used, the thesaurus-based improvement
of topic models can be achieved by excluding hyponymy relations in the
combined topic models.
Comment: Accepted to AIST-2017 conference (http://aistconf.ru/). The final
publication will be available at link.springer.co
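The frequency-boosting idea in this abstract can be sketched as follows. The abstract does not give the exact weighting, so the additive rule below (each word's count is increased by the counts of its thesaurus relatives found in the same text) is an assumption, and the function name `boost_related` and the `related` mapping are hypothetical.

```python
from collections import Counter


def boost_related(tokens, related):
    """Return per-word counts for one text, boosted by co-occurring relatives.

    tokens  -- list of words in a single text
    related -- hypothetical thesaurus: word -> set of semantically related words
    """
    counts = Counter(tokens)
    boosted = dict(counts)
    for word in counts:
        for other in related.get(word, ()):
            # Only relatives that actually occur in the same text contribute.
            if other != word and other in counts:
                boosted[word] += counts[other]
    return boosted
```

For the text `["car", "car", "automobile", "tree"]` with "car" and "automobile" listed as related, both related words end up with a count of 3 while "tree" keeps its count of 1. The boosted counts would then replace raw counts as input to the topic model.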
Feature LDA: a supervised topic model for automatic detection of Web API documentations from the Web
Web APIs have gained increasing popularity in recent Web service technology development owing to the simplicity of their technology stack and the proliferation of mashups. However, efficiently discovering Web APIs and their documentation on the Web is still a challenging task, even with the best resources available on the Web. In this paper we cast the problem of detecting Web API documentation as a text classification problem: classifying a given Web page as Web API associated or not. We propose a supervised generative topic model called feature latent Dirichlet allocation (feaLDA), which offers a generic probabilistic framework for automatic detection of Web APIs. feaLDA not only captures the correspondence between data and the associated class labels, but also provides a mechanism for incorporating side information, such as labelled features automatically learned from data, that can effectively help improve classification performance. Extensive experiments on our Web API documentation dataset show that the feaLDA model outperforms three strong supervised baselines, including naive Bayes, support vector machines, and the maximum entropy model, by over 3% in classification accuracy. In addition, feaLDA also gives superior performance when compared against other existing supervised topic models
Object Matching in Distributed Video Surveillance Systems by LDA-Based Appearance Descriptors
Establishing correspondences among object instances is still challenging in multi-camera surveillance systems, especially when the cameras’ fields of view are non-overlapping. Spatiotemporal constraints can help in solving the correspondence problem but still leave a wide margin of uncertainty. One way to reduce this uncertainty is to use appearance information about the moving objects in the site. In this paper we present the preliminary results of a new method that can capture salient appearance characteristics at each camera node in the network. A Latent Dirichlet Allocation (LDA) model is created and maintained at each node in the camera network. Each object is encoded in terms of the LDA bag-of-words model for appearance. The encoded appearance is then used to establish probable matching across cameras. Preliminary experiments are conducted on a dataset of 20 individuals and comparison against Madden’s I-MCHR is reported
BDGS: A Scalable Big Data Generator Suite in Big Data Benchmarking
Data generation is a key issue in big data benchmarking that aims to generate
application-specific data sets to meet the 4V requirements of big data.
Specifically, big data generators need to generate scalable data (Volume) of
different types (Variety) under controllable generation rates (Velocity) while
keeping the important characteristics of raw data (Veracity). This gives rise
to various new challenges about how we design generators efficiently and
successfully. To date, most existing techniques can only generate limited types
of data and support specific big data systems such as Hadoop. Hence we develop
a tool, called Big Data Generator Suite (BDGS), to efficiently generate
scalable big data while employing data models derived from real data to
preserve data veracity. The effectiveness of BDGS is demonstrated by developing
six data generators covering three representative data types (structured,
semi-structured and unstructured) and three data sources (text, graph, and
table data)
Nonparametric Hierarchical Clustering of Functional Data
In this paper, we deal with the problem of curve clustering. We propose a
nonparametric method which partitions the curves into clusters and discretizes
the dimensions of the curve points into intervals. The cross-product of these
partitions forms a data-grid which is obtained using a Bayesian model selection
approach while making no assumptions regarding the curves. Finally, a
post-processing technique, aiming at reducing the number of clusters in order
to improve the interpretability of the clustering, is proposed. It consists in
optimally merging the clusters step by step, which corresponds to an
agglomerative hierarchical classification whose dissimilarity measure is the
variation of the criterion. Interestingly, this measure is none other than the
sum of the Kullback-Leibler divergences between cluster distributions before
and after the merges. The practical interest of the approach for functional
data exploratory analysis is presented and compared with an alternative
approach on an artificial and a real world data set
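The merge dissimilarity described in this abstract, a sum of Kullback-Leibler divergences between cluster distributions before and after a merge, can be sketched as below. Two details are assumptions and are not stated in the abstract: each cluster is represented by a discrete distribution over grid cells, and each divergence is weighted by the cluster's size. The function names are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def merge_cost(n1, p1, n2, p2):
    """Cost of merging two clusters with sizes n1, n2 and distributions p1, p2.

    The merged distribution is the size-weighted mixture of p1 and p2
    (size weighting is an assumption of this sketch).
    """
    n = n1 + n2
    merged = [(n1 * a + n2 * b) / n for a, b in zip(p1, p2)]
    return n1 * kl_divergence(p1, merged) + n2 * kl_divergence(p2, merged)
```

Merging two clusters with identical distributions costs 0, and the cost grows as the distributions differ. An agglomerative procedure would repeatedly merge the pair of clusters with the smallest cost.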
Estimating Heterogeneous Consumer Preferences for Restaurants and Travel Time Using Mobile Location Data
This paper analyzes consumer choices over lunchtime restaurants using data
from a sample of several thousand anonymous mobile phone users in the San
Francisco Bay Area. The data is used to identify users' approximate typical
morning location, as well as their choices of lunchtime restaurants. We build a
model where restaurants have latent characteristics (whose distribution may
depend on restaurant observables, such as star ratings, food category, and
price range), each user has preferences for these latent characteristics, and
these preferences are heterogeneous across users. Similarly, each item has
latent characteristics that describe users' willingness to travel to the
restaurant, and each user has individual-specific preferences for those latent
characteristics. Thus, both users' willingness to travel and their base utility
for each restaurant vary across user-restaurant pairs. We use a Bayesian
approach to estimation. To make the estimation computationally feasible, we
rely on variational inference to approximate the posterior distribution, as
well as stochastic gradient descent as a computational approach. Our model
performs better than more standard competing models such as multinomial logit
and nested logit models, in part due to the personalization of the estimates.
We analyze how consumers re-allocate their demand after a restaurant closes to
nearby restaurants versus more distant restaurants with similar
characteristics, and we compare our predictions to actual outcomes. Finally, we
show how the model can be used to analyze counterfactual questions such as what
type of restaurant would attract the most consumers in a given location.
Marie Curie Fellowship from the European Commission (H2020 programme, grant agreement 706760)
A cross-center smoothness prior for variational Bayesian brain tissue segmentation
Suppose one is faced with the challenge of tissue segmentation in MR images,
without annotators at their center to provide labeled training data. One option
is to go to another medical center for a trained classifier. Sadly, tissue
classifiers do not generalize well across centers due to voxel intensity shifts
caused by center-specific acquisition protocols. However, certain aspects of
segmentations, such as spatial smoothness, remain relatively consistent and can
be learned separately. Here we present a smoothness prior that is fit to
segmentations produced at another medical center. This informative prior is
presented to an unsupervised Bayesian model. The model clusters the voxel
intensities, such that it produces segmentations that are similarly smooth to
those of the other medical center. In addition, the unsupervised Bayesian model
is extended to a semi-supervised variant, which needs no visual interpretation
of clusters into tissues.
Comment: 12 pages, 2 figures, 1 table. Accepted to the International
Conference on Information Processing in Medical Imaging (2019)
A Theoretical Analysis of Two-Stage Recommendation for Cold-Start Collaborative Filtering
In this paper, we present a theoretical framework for tackling the cold-start
collaborative filtering problem, where unknown targets (items or users) keep
coming to the system, and there is a limited number of resources (users or
items) that can be allocated and related to them. The solution requires a
trade-off between exploitation and exploration: with limited recommendation
opportunities, we need, on the one hand, to allocate the most relevant
resources right away and, on the other hand, to allocate resources that are
useful for learning the target's properties in order to recommend more
relevant ones in the future. In this paper, we study a
simple two-stage recommendation combining a sequential and a batch solution
together. We first model the problem with the partially observable Markov
decision process (POMDP) and provide an exact solution. Then, through an
in-depth analysis over the POMDP value iteration solution, we identify that an
exact solution can be abstracted as selecting resources that are not only
highly relevant to the target according to the initial-stage information, but
also highly correlated, either positively or negatively, with other potential
resources for the next stage. With this finding, we propose an approximate
solution to ease the intractability of the exact solution. Our initial results
on synthetic data and the MovieLens 100K dataset confirm the performance gains
of our theoretical development and analysis
Skew-Unfolding the Skorokhod Reflection of a Continuous Semimartingale
The Skorokhod reflection of a continuous semimartingale is unfolded, in a
possibly skewed manner, into another continuous semimartingale on an enlarged
probability space according to the excursion-theoretic methodology of Prokaj
(2009). This is done in terms of a skew version of the Tanaka equation, whose
properties are studied in some detail. The result is used to construct a system
of two diffusive particles with rank-based characteristics and skew-elastic
collisions. Unfoldings of conventional reflections are also discussed, as are
examples involving skew Brownian Motions and skew Bessel processes.
Comment: 20 pages. typos corrected, added a remark after Proposition 2.3,
simplified the last part of Example 2.