Online Influence Maximization (Extended Version)
Social networks are commonly used for marketing purposes. For example, free
samples of a product can be given to a few influential social network users (or
"seed nodes"), with the hope that they will convince their friends to buy it.
One way to formalize marketers' objective is through influence maximization (or
IM), whose goal is to find the best seed nodes to activate under a fixed
budget, so that the number of people who get influenced in the end is
maximized. Recent solutions to IM rely on the influence probability that a user
influences another one. However, this probability information may be
unavailable or incomplete. In this paper, we study IM in the absence of
complete information on influence probability. We call this problem Online
Influence Maximization (OIM) since we learn influence probabilities at the same
time we run influence campaigns. To solve OIM, we propose a multiple-trial
approach, where (1) some seed nodes are selected based on existing influence
information; (2) an influence campaign is started with these seed nodes; and
(3) users' feedback is used to update influence information. We adopt the
Explore-Exploit strategy, which can select seed nodes using either the current
influence probability estimation (exploit), or the confidence bound on the
estimation (explore). Any existing IM algorithm can be used in this framework.
We also develop an incremental algorithm that can significantly reduce the
overhead of handling users' feedback information. Our experiments show that our
solution is more effective than traditional IM methods on the partial
information.
Comment: 13 pages. To appear in KDD 2015. Extended version.
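The three-step loop described above can be sketched as follows. This is only an illustration under loud assumptions, not the paper's algorithm: the Beta-count edge model, the edge-level feedback, the `explore_rate` parameter, and the `select_seeds` heuristic (a stand-in for "any existing IM algorithm") are all hypothetical choices made for the example.

```python
import random

def select_seeds(graph, prob, budget):
    """Stand-in IM step (assumption): pick the `budget` nodes with the largest
    sum of outgoing edge probabilities. Any real IM algorithm could go here."""
    score = {u: sum(prob[(u, v)] for v in nbrs) for u, nbrs in graph.items()}
    return sorted(score, key=score.get, reverse=True)[:budget]

def run_campaign(graph, true_prob, seeds, rng):
    """Simulate one independent-cascade campaign. Returns the activated nodes
    and edge-level feedback as (edge, success) pairs."""
    active, frontier, feedback = set(seeds), list(seeds), []
    while frontier:
        u = frontier.pop()
        for v in graph[u]:
            if v in active:
                continue  # no activation attempt is observed on this edge
            success = rng.random() < true_prob[(u, v)]
            feedback.append(((u, v), success))
            if success:
                active.add(v)
                frontier.append(v)
    return active, feedback

def oim(graph, true_prob, budget, trials, explore_rate=0.3, seed=0):
    rng = random.Random(seed)
    # Beta(1, 1) prior per edge, stored as [successes + 1, failures + 1].
    counts = {edge: [1, 1] for edge in true_prob}
    total_influenced = 0
    for _ in range(trials):
        explore = rng.random() < explore_rate
        estimate = {}
        for edge, (a, b) in counts.items():
            mean = a / (a + b)
            if explore:
                # Explore: use an optimistic bound (mean plus one std. dev.).
                std = (a * b / ((a + b) ** 2 * (a + b + 1))) ** 0.5
                estimate[edge] = min(1.0, mean + std)
            else:
                # Exploit: use the current probability estimate.
                estimate[edge] = mean
        seeds = select_seeds(graph, estimate, budget)                  # step 1
        active, feedback = run_campaign(graph, true_prob, seeds, rng)  # step 2
        for edge, success in feedback:                                 # step 3
            counts[edge][0 if success else 1] += 1
        total_influenced += len(active)
    return counts, total_influenced

# Toy graph; the true probabilities are hidden from the learner.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
true_prob = {("a", "b"): 0.9, ("a", "c"): 0.8, ("b", "d"): 0.5, ("c", "d"): 0.1}
counts, total_influenced = oim(graph, true_prob, budget=1, trials=50)
```

After the run, `counts` holds the updated per-edge evidence, and the ratio `a / (a + b)` for each edge is the learned influence probability estimate.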
T-Crowd: Effective Crowdsourcing for Tabular Data
Crowdsourcing employs human workers to solve computer-hard problems, such as
data cleaning, entity resolution, and sentiment analysis. When crowdsourcing
tabular data, e.g., the attribute values of an entity set, a worker's answers
on the different attributes (e.g., the nationality and age of a celebrity star)
are often treated independently. This assumption is not always true and can
lead to suboptimal crowdsourcing performance. In this paper, we present the
T-Crowd system, which takes into consideration the intricate relationships
among tasks, in order to converge faster to their true values. Particularly,
T-Crowd integrates each worker's answers on different attributes to effectively
learn his/her trustworthiness and the true data values. The attribute
relationship information is also used to guide task allocation to workers.
Finally, T-Crowd seamlessly supports categorical and continuous attributes,
which are the two main datatypes found in typical databases. Our extensive
experiments on real and synthetic datasets show that T-Crowd outperforms
state-of-the-art methods in terms of truth inference and reducing the cost of
crowdsourcing.
Exploring Communities in Large Profiled Graphs
Given a graph G and a vertex q in G, the community search (CS) problem
aims to efficiently find a subgraph of G whose vertices are closely related
to q. Communities are prevalent in social and biological networks, and can be
used in product advertisement and social event recommendation. In this paper,
we study profiled community search (PCS), where CS is performed on a profiled
graph. This is a graph in which each vertex has labels arranged in a
hierarchical manner. Extensive experiments show that PCS can identify
communities with themes that are common to their vertices, and is more
effective than existing CS approaches. As a naive solution for PCS is highly
expensive, we have also developed a tree index, which facilitates efficient and
online solutions for PCS.
Spatio-temporal flow patterns
Transportation companies and organizations routinely collect huge volumes of
passenger transportation data. By aggregating these data (e.g., counting the
number of passengers going from one place to another in every 30-minute
interval), it becomes possible to analyze the movement behavior of passengers
in a metropolitan area. In this paper, we study the problem of finding
important trends in passenger movements at varying granularities, which is
useful in a wide range of applications such as target marketing, scheduling,
and travel intent prediction. Specifically, we study the extraction of movement
patterns between regions that have significant flow. The huge number of
possible patterns renders their detection computationally hard. We propose
algorithms that greatly reduce the search space and the computational cost of
pattern detection. We study variants of patterns that could be useful to
different problem instances, such as constrained patterns and top-k ranked
patterns.
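The aggregation step mentioned above (counting passengers between two places per 30-minute interval) can be sketched as follows. This is a minimal illustration, not the paper's pattern-mining algorithm: the trip tuple layout, the function names, and the `min_flow` threshold used to stand in for "significant flow" are assumptions made for the example.

```python
from collections import Counter

def od_flows(trips, interval_minutes=30):
    """Count trips per (origin, destination, time slot).

    `trips` is an iterable of (origin, destination, minute_of_day) tuples;
    this layout is an assumption made for the example."""
    flows = Counter()
    for origin, destination, minute in trips:
        flows[(origin, destination, minute // interval_minutes)] += 1
    return flows

def significant_flows(flows, min_flow):
    """Keep only the flows whose count reaches a hypothetical threshold."""
    return {key: n for key, n in flows.items() if n >= min_flow}

trips = [
    ("A", "B", 485),  # 08:05
    ("A", "B", 490),  # 08:10
    ("A", "B", 505),  # 08:25
    ("B", "C", 490),  # 08:10
    ("A", "B", 515),  # 08:35, falls into the next 30-minute slot
]
flows = od_flows(trips)
frequent = significant_flows(flows, min_flow=2)  # {("A", "B", 16): 3}
top_1 = flows.most_common(1)                     # [(("A", "B", 16), 3)]
```

Slot 16 covers 08:00 to 08:29, so three of the four A-to-B trips fall into it; `most_common` gives a simple top-k ranking over the aggregated flows.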
Multi-domain Recommendation with Embedding Disentangling and Domain Alignment
Multi-domain recommendation (MDR) aims to provide recommendations for
different domains (e.g., types of products) with overlapping users/items and is
common for platforms such as Amazon, Facebook, and LinkedIn that host multiple
services. Existing MDR models face two challenges: First, it is difficult to
disentangle knowledge that generalizes across domains (e.g., a user likes cheap
items) and knowledge specific to a single domain (e.g., a user likes blue
clothing but not blue cars). Second, they have limited ability to transfer
knowledge across domains with small overlaps. We propose a new MDR method named
EDDA with two key components, i.e., embedding disentangling recommender and
domain alignment, to tackle the two challenges respectively. In particular, the
embedding disentangling recommender separates both the model and embedding for
the inter-domain part and the intra-domain part, while most existing MDR
methods only focus on model-level disentangling. The domain alignment leverages
random walks from graph processing to identify similar user/item pairs from
different domains and encourages similar user/item pairs to have similar
embeddings, enhancing knowledge transfer. We compare EDDA with 12
state-of-the-art baselines on 3 real datasets. The results show that EDDA
consistently outperforms the baselines on all datasets and domains. All
datasets and code are available at https://github.com/Stevenn9981/EDDA.
Comment: Accepted by CIKM'23 as a Long paper.
Truth Inference in Crowdsourcing: Is the Problem Solved?
Crowdsourcing has emerged as a novel problem-solving paradigm that helps address problems that are hard for computers, e.g., entity resolution and sentiment analysis. However, due to the openness of crowdsourcing, workers may yield low-quality answers. A redundancy-based method is therefore widely employed: it first assigns each task to multiple workers and then infers the correct answer (called the truth) for the task from the answers of the assigned workers. A fundamental problem in this method is Truth Inference, which decides how to effectively infer the truth. Recently, the database community and the data mining community have independently studied this problem and proposed various algorithms. However, these algorithms have not been compared extensively under the same framework, and it is hard for practitioners to select appropriate algorithms. To alleviate this problem, we provide a detailed survey of 17 existing algorithms and perform a comprehensive evaluation using 5 real datasets. We make all code and datasets public for future research. Through experiments we find that existing algorithms are not stable across different datasets and that no algorithm consistently outperforms the others. We believe that the truth inference problem is not fully solved; we identify the limitations of existing algorithms and point out promising research directions.
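The redundancy-based method described above can be illustrated with majority voting, the simplest truth inference rule. This sketch is an assumption-laden example and is not claimed to be one of the surveyed algorithms as implemented by the authors; the `(task, worker, label)` tuple layout and the function name are hypothetical.

```python
from collections import Counter, defaultdict

def majority_vote(answers):
    """Infer one truth per task by taking the most frequent label.

    `answers` is an iterable of (task, worker, label) tuples. Ties are broken
    in favor of the label that was seen first for the task."""
    labels_by_task = defaultdict(list)
    for task, _worker, label in answers:
        labels_by_task[task].append(label)
    return {
        task: Counter(labels).most_common(1)[0][0]
        for task, labels in labels_by_task.items()
    }

# Each task is assigned to three workers (redundancy).
answers = [
    ("t1", "w1", "yes"), ("t1", "w2", "yes"), ("t1", "w3", "no"),
    ("t2", "w1", "no"), ("t2", "w2", "no"), ("t2", "w3", "yes"),
]
truths = majority_vote(answers)  # {"t1": "yes", "t2": "no"}
```

Majority voting treats all workers as equally reliable; methods that also estimate worker quality weight the answers differently.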