10,454 research outputs found
Knowledge-aware Complementary Product Representation Learning
Learning product representations that reflect complementary relationship
plays a central role in e-commerce recommender system. In the absence of the
product relationships graph, which existing methods rely on, there is a need to
detect the complementary relationships directly from noisy and sparse customer
purchase activities. Furthermore, unlike simple relationships such as
similarity, complementariness is asymmetric and non-transitive. Standard usage
of representation learning emphasizes on only one set of embedding, which is
problematic for modelling such properties of complementariness. We propose
using knowledge-aware learning with dual product embedding to solve the above
challenges. We encode contextual knowledge into product representation by
multi-task learning, to alleviate the sparsity issue. By explicitly modelling
with user bias terms, we separate the noise of customer-specific preferences
from the complementariness. Furthermore, we adopt the dual embedding framework
to capture the intrinsic properties of complementariness and provide geometric
interpretation motivated by the classic separating hyperplane theory. Finally,
we propose a Bayesian network structure that unifies all the components, which
also concludes several popular models as special cases. The proposed method
compares favourably to state-of-art methods, in downstream classification and
recommendation tasks. We also develop an implementation that scales efficiently
to a dataset with millions of items and customers
Finding the True Frequent Itemsets
Frequent Itemsets (FIs) mining is a fundamental primitive in data mining. It
requires to identify all itemsets appearing in at least a fraction of
a transactional dataset . Often though, the ultimate goal of
mining is not an analysis of the dataset \emph{per se}, but the
understanding of the underlying process that generated it. Specifically, in
many applications is a collection of samples obtained from an
unknown probability distribution on transactions, and by extracting the
FIs in one attempts to infer itemsets that are frequently (i.e.,
with probability at least ) generated by , which we call the True
Frequent Itemsets (TFIs). Due to the inherently stochastic nature of the
generative process, the set of FIs is only a rough approximation of the set of
TFIs, as it often contains a huge number of \emph{false positives}, i.e.,
spurious itemsets that are not among the TFIs. In this work we design and
analyze an algorithm to identify a threshold such that the
collection of itemsets with frequency at least in
contains only TFIs with probability at least , for some
user-specified . Our method uses results from statistical learning
theory involving the (empirical) VC-dimension of the problem at hand. This
allows us to identify almost all the TFIs without including any false positive.
We also experimentally compare our method with the direct mining of
at frequency and with techniques based on widely-used
standard bounds (i.e., the Chernoff bounds) of the binomial distribution, and
show that our algorithm outperforms these methods and achieves even better
results than what is guaranteed by the theoretical analysis.Comment: 13 pages, Extended version of work appeared in SIAM International
Conference on Data Mining, 201
Object Discovery From a Single Unlabeled Image by Mining Frequent Itemset With Multi-scale Features
TThe goal of our work is to discover dominant objects in a very general
setting where only a single unlabeled image is given. This is far more
challenge than typical co-localization or weakly-supervised localization tasks.
To tackle this problem, we propose a simple but effective pattern mining-based
method, called Object Location Mining (OLM), which exploits the advantages of
data mining and feature representation of pre-trained convolutional neural
networks (CNNs). Specifically, we first convert the feature maps from a
pre-trained CNN model into a set of transactions, and then discovers frequent
patterns from transaction database through pattern mining techniques. We
observe that those discovered patterns, i.e., co-occurrence highlighted
regions, typically hold appearance and spatial consistency. Motivated by this
observation, we can easily discover and localize possible objects by merging
relevant meaningful patterns. Extensive experiments on a variety of benchmarks
demonstrate that OLM achieves competitive localization performance compared
with the state-of-the-art methods. We also evaluate our approach compared with
unsupervised saliency detection methods and achieves competitive results on
seven benchmark datasets. Moreover, we conduct experiments on fine-grained
classification to show that our proposed method can locate the entire object
and parts accurately, which can benefit to improving the classification results
significantly
Generating ordered list of Recommended Items: a Hybrid Recommender System of Microblog
Precise recommendation of followers helps in improving the user experience
and maintaining the prosperity of twitter and microblog platforms. In this
paper, we design a hybrid recommender system of microblog as a solution of KDD
Cup 2012, track 1 task, which requires predicting users a user might follow in
Tencent Microblog. We describe the background of the problem and present the
algorithm consisting of keyword analysis, user taxonomy, (potential)interests
extraction and item recommendation. Experimental result shows the high
performance of our algorithm. Some possible improvements are discussed, which
leads to further study.Comment: 7 page
Query-Driven Sampling for Collective Entity Resolution
Probabilistic databases play a preeminent role in the processing and
management of uncertain data. Recently, many database research efforts have
integrated probabilistic models into databases to support tasks such as
information extraction and labeling. Many of these efforts are based on batch
oriented inference which inhibits a realtime workflow. One important task is
entity resolution (ER). ER is the process of determining records (mentions) in
a database that correspond to the same real-world entity. Traditional pairwise
ER methods can lead to inconsistencies and low accuracy due to localized
decisions. Leading ER systems solve this problem by collectively resolving all
records using a probabilistic graphical model and Markov chain Monte Carlo
(MCMC) inference. However, for large datasets this is an extremely expensive
process. One key observation is that, such exhaustive ER process incurs a huge
up-front cost, which is wasteful in practice because most users are interested
in only a small subset of entities. In this paper, we advocate pay-as-you-go
entity resolution by developing a number of query-driven collective ER
techniques. We introduce two classes of SQL queries that involve ER operators
--- selection-driven ER and join-driven ER. We implement novel variations of
the MCMC Metropolis Hastings algorithm to generate biased samples and
selectivity-based scheduling algorithms to support the two classes of ER
queries. Finally, we show that query-driven ER algorithms can converge and
return results within minutes over a database populated with the extraction
from a newswire dataset containing 71 million mentions
- …