11,858 research outputs found
Clustering documents with active learning using Wikipedia
Wikipedia has been applied as a background knowledge base to various text mining problems, but very few attempts have been made to utilize it for document clustering. In this paper we propose to exploit the semantic knowledge in Wikipedia for clustering, enabling the automatic grouping of documents with similar themes. Although clustering is intrinsically unsupervised, recent research has shown that incorporating supervision improves clustering performance, even when limited supervision is provided. The approach presented in this paper applies supervision using active learning. We first utilize Wikipedia to create a concept-based representation of a text document, with each concept associated to a Wikipedia article. We then exploit the semantic relatedness between Wikipedia concepts to find pair-wise instance-level constraints for supervised clustering, guiding clustering towards the direction indicated by the constraints. We test our approach on three standard text document datasets. Empirical results show that our basic document representation strategy yields comparable performance to previous attempts; and adding constraints improves clustering performance further by up to 20%
Latent Space Model for Multi-Modal Social Data
With the emergence of social networking services, researchers enjoy the
increasing availability of large-scale heterogenous datasets capturing online
user interactions and behaviors. Traditional analysis of techno-social systems
data has focused mainly on describing either the dynamics of social
interactions, or the attributes and behaviors of the users. However,
overwhelming empirical evidence suggests that the two dimensions affect one
another, and therefore they should be jointly modeled and analyzed in a
multi-modal framework. The benefits of such an approach include the ability to
build better predictive models, leveraging social network information as well
as user behavioral signals. To this purpose, here we propose the Constrained
Latent Space Model (CLSM), a generalized framework that combines Mixed
Membership Stochastic Blockmodels (MMSB) and Latent Dirichlet Allocation (LDA)
incorporating a constraint that forces the latent space to concurrently
describe the multiple data modalities. We derive an efficient inference
algorithm based on Variational Expectation Maximization that has a
computational cost linear in the size of the network, thus making it feasible
to analyze massive social datasets. We validate the proposed framework on two
problems: prediction of social interactions from user attributes and behaviors,
and behavior prediction exploiting network information. We perform experiments
with a variety of multi-modal social systems, spanning location-based social
networks (Gowalla), social media services (Instagram, Orkut), e-commerce and
review sites (Amazon, Ciao), and finally citation networks (Cora). The results
indicate significant improvement in prediction accuracy over state of the art
methods, and demonstrate the flexibility of the proposed approach for
addressing a variety of different learning problems commonly occurring with
multi-modal social data.Comment: 12 pages, 7 figures, 2 table
Finding co-solvers on Twitter, with a little help from Linked Data
In this paper we propose a method for suggesting potential collaborators for solving innovation challenges online, based on their competence, similarity of interests and social proximity with the user. We rely on Linked Data to derive a measure of semantic relatedness that we use to enrich both user profiles and innovation problems with additional relevant topics, thereby improving the performance of co-solver recommendation. We evaluate this approach against state of the art methods for query enrichment based on the distribution of topics in user profiles, and demonstrate its usefulness in recommending collaborators that are both complementary in competence and compatible with the user. Our experiments are grounded using data from the social networking service Twitter.com
Social Fingerprinting: detection of spambot groups through DNA-inspired behavioral modeling
Spambot detection in online social networks is a long-lasting challenge
involving the study and design of detection techniques capable of efficiently
identifying ever-evolving spammers. Recently, a new wave of social spambots has
emerged, with advanced human-like characteristics that allow them to go
undetected even by current state-of-the-art algorithms. In this paper, we show
that efficient spambots detection can be achieved via an in-depth analysis of
their collective behaviors exploiting the digital DNA technique for modeling
the behaviors of social network users. Inspired by its biological counterpart,
in the digital DNA representation the behavioral lifetime of a digital account
is encoded in a sequence of characters. Then, we define a similarity measure
for such digital DNA sequences. We build upon digital DNA and the similarity
between groups of users to characterize both genuine accounts and spambots.
Leveraging such characterization, we design the Social Fingerprinting
technique, which is able to discriminate among spambots and genuine accounts in
both a supervised and an unsupervised fashion. We finally evaluate the
effectiveness of Social Fingerprinting and we compare it with three
state-of-the-art detection algorithms. Among the peculiarities of our approach
is the possibility to apply off-the-shelf DNA analysis techniques to study
online users behaviors and to efficiently rely on a limited number of
lightweight account characteristics
- ā¦