10,219 research outputs found
Scalable Text and Link Analysis with Mixed-Topic Link Models
Many data sets contain rich information about objects, as well as pairwise
relations between them. For instance, in networks of websites, scientific
papers, and other documents, each node has content consisting of a collection
of words, as well as hyperlinks or citations to other nodes. In order to
perform inference on such data sets, and make predictions and recommendations,
it is useful to have models that are able to capture the processes which
generate the text at each node and the links between them. In this paper, we
combine classic ideas in topic modeling with a variant of the mixed-membership
block model recently developed in the statistical physics community. The
resulting model has the advantage that its parameters, including the mixture of
topics of each document and the resulting overlapping communities, can be
inferred with a simple and scalable expectation-maximization algorithm. We test
our model on three data sets, performing unsupervised topic classification and
link prediction. For both tasks, our model outperforms several existing
state-of-the-art methods, achieving higher accuracy with significantly less
computation, analyzing a data set with 1.3 million words and 44 thousand links
in a few minutes.
Comment: 11 pages, 4 figures
Automatic Bayesian Density Analysis
Making sense of a dataset in an automatic and unsupervised fashion is a
challenging problem in statistics and AI. Classical approaches for exploratory
data analysis are usually not flexible enough to deal with the uncertainty
inherent to real-world data: they are often restricted to fixed latent
interaction models and homogeneous likelihoods; they are sensitive to missing,
corrupt and anomalous data; moreover, their expressiveness generally comes at
the price of intractable inference. As a result, supervision from statisticians
is usually needed to find the right model for the data. However, since domain
experts are not necessarily also experts in statistics, we propose Automatic
Bayesian Density Analysis (ABDA) to make exploratory data analysis accessible
at large. Specifically, ABDA allows for automatic and efficient missing value
estimation, statistical data type and likelihood discovery, anomaly detection
and dependency structure mining, on top of providing accurate density
estimation. Extensive empirical evidence shows that ABDA is a suitable tool for
automatic exploratory analysis of mixed continuous and discrete tabular data.
Comment: In proceedings of the Thirty-Third AAAI Conference on Artificial
Intelligence (AAAI-19)
Customer purchase behavior prediction in E-commerce: a conceptual framework and research agenda
Digital retailers are experiencing an increasing number of online transactions from their consumers, a consequence of the convenience of buying goods via E-commerce platforms. Such interactions compose complex behavioral patterns which can be analyzed through predictive analytics to enable businesses to understand consumer needs. Despite this abundance of big data and of possible tools to analyze it, a systematic review of the literature is missing. Therefore, this paper presents a systematic literature review of recent research dealing with customer purchase prediction in the E-commerce context. The main contributions are a novel analytical framework and a research agenda in the field. The framework reveals three main tasks in this review, namely, the prediction of customer intents, buying sessions, and purchase decisions. Those are followed by their employed predictive methodologies and are analyzed from three perspectives. Finally, the research agenda provides major existing issues for further research in the field of purchase behavior prediction online.
Latent Space Model for Multi-Modal Social Data
With the emergence of social networking services, researchers enjoy the
increasing availability of large-scale heterogeneous datasets capturing online
user interactions and behaviors. Traditional analysis of techno-social systems
data has focused mainly on describing either the dynamics of social
interactions, or the attributes and behaviors of the users. However,
overwhelming empirical evidence suggests that the two dimensions affect one
another, and therefore they should be jointly modeled and analyzed in a
multi-modal framework. The benefits of such an approach include the ability to
build better predictive models, leveraging social network information as well
as user behavioral signals. To this purpose, here we propose the Constrained
Latent Space Model (CLSM), a generalized framework that combines Mixed
Membership Stochastic Blockmodels (MMSB) and Latent Dirichlet Allocation (LDA)
incorporating a constraint that forces the latent space to concurrently
describe the multiple data modalities. We derive an efficient inference
algorithm based on Variational Expectation Maximization that has a
computational cost linear in the size of the network, thus making it feasible
to analyze massive social datasets. We validate the proposed framework on two
problems: prediction of social interactions from user attributes and behaviors,
and behavior prediction exploiting network information. We perform experiments
with a variety of multi-modal social systems, spanning location-based social
networks (Gowalla), social media services (Instagram, Orkut), e-commerce and
review sites (Amazon, Ciao), and finally citation networks (Cora). The results
indicate significant improvement in prediction accuracy over state of the art
methods, and demonstrate the flexibility of the proposed approach for
addressing a variety of different learning problems commonly occurring with
multi-modal social data.
Comment: 12 pages, 7 figures, 2 tables
Exploring Interpretable LSTM Neural Networks over Multi-Variable Data
For recurrent neural networks trained on time series with target and
exogenous variables, in addition to accurate prediction, it is also desired to
provide interpretable insights into the data. In this paper, we explore the
structure of LSTM recurrent neural networks to learn variable-wise hidden
states, with the aim to capture different dynamics in multi-variable time
series and distinguish the contribution of variables to the prediction. With
these variable-wise hidden states, a mixture attention mechanism is proposed to
model the generative process of the target. Then we develop associated training
methods to jointly learn network parameters, variable and temporal importance
w.r.t. the prediction of the target variable. Extensive experiments on real
datasets demonstrate enhanced prediction performance by capturing the dynamics
of different variables. Meanwhile, we evaluate the interpretation results both
qualitatively and quantitatively. The approach shows promise as an end-to-end
framework for both forecasting and knowledge extraction over multi-variable
data.
Comment: Accepted to International Conference on Machine Learning (ICML), 201
Data Cube Approximation and Mining using Probabilistic Modeling
On-line Analytical Processing (OLAP) techniques commonly used in data warehouses allow the exploration of data cubes according to different analysis axes (dimensions) and under different abstraction levels in a dimension hierarchy. However, such techniques are not aimed at mining multidimensional data.
Since data cubes are nothing but multi-way tables, we propose to analyze the potential of two probabilistic modeling techniques, namely non-negative multi-way array factorization and log-linear modeling, with the ultimate objective of compressing and mining aggregate and multidimensional values. With the first technique, we compute the set of components that best fit the initial data set and whose superposition coincides with the original data; with the second technique we identify a parsimonious model (i.e., one with a reduced set of parameters), highlight strong associations among dimensions and discover possible outliers in data cells. A real-life example will be used to (i) discuss the potential benefits of the modeling output on cube exploration and mining, (ii) show how OLAP queries can be answered in an approximate way, and (iii) illustrate the strengths and limitations of these modeling approaches.
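The log-linear idea in the abstract above can be illustrated with a minimal sketch. This is not the authors' method or data: it assumes the simplest log-linear model for a two-way table (independence, log m_ij = u + a_i + b_j), and the table values, dimension meanings, and function names are all hypothetical. The fitted cells are rebuilt from the row and column totals alone, and Pearson residuals flag cells that the compressed model explains poorly.

```python
from math import sqrt


def fit_independence(table):
    """Fit the independence log-linear model to a two-way table of counts.

    Under independence the fitted value of cell (i, j) is
    row_total[i] * col_total[j] / grand_total, so the model is fully
    described by the margins rather than by every cell.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    fitted = [[r * c / grand_total for c in col_totals] for r in row_totals]
    return row_totals, col_totals, grand_total, fitted


def pearson_residuals(table, fitted):
    """Return (observed - expected) / sqrt(expected) for every cell."""
    return [
        [(obs - exp) / sqrt(exp) for obs, exp in zip(obs_row, exp_row)]
        for obs_row, exp_row in zip(table, fitted)
    ]


def most_surprising_cell(residuals):
    """Return the (row, column) index of the largest absolute residual."""
    cells = ((i, j) for i, row in enumerate(residuals) for j in range(len(row)))
    return max(cells, key=lambda ij: abs(residuals[ij[0]][ij[1]]))


# Hypothetical slice of a cube: rows are regions, columns are product lines.
cube = [
    [20, 30, 50],
    [40, 60, 100],
    [10, 15, 125],
]

rows, cols, total, fitted = fit_independence(cube)
residuals = pearson_residuals(cube, fitted)

# Approximate answer to a cell-level query, read from the model, not the cube.
print("approximate value of cell (0, 0):", round(fitted[0][0], 2))
print("cell with the largest residual:", most_surprising_cell(residuals))
```

In this sketch, aggregate queries along a single dimension (row or column totals) are answered exactly, because the independence fit preserves the margins, while individual cells are only approximated. Richer log-linear models add interaction terms to reduce that approximation error at the cost of more parameters.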