Event Identification in Social Networks
Social networks enable users to freely communicate with each other and share
their recent news, ongoing activities or views about different topics. As a
result, they can be seen as a potentially viable source of information to
understand the current emerging topics/events. The ability to model emerging
topics is a substantial step to monitor and summarize the information
originating from social sources. Applying traditional event detection methods,
which are often designed for large, formal and structured documents, is less
effective due to the short length, noisiness and informality of social posts.
Recent event detection techniques address
these challenges by exploiting the opportunities behind abundant information
available in social networks. This article provides an overview of the state of
the art in event detection from social networks.
Comment: It will appear in Encyclopedia with Semantic Computing, to be published by World Scientific
Doctoral Advisor or Medical Condition: Towards Entity-specific Rankings of Knowledge Base Properties [Extended Version]
In knowledge bases such as Wikidata, it is possible to assert a large set of
properties for entities, ranging from generic ones such as name and place of
birth to highly profession-specific or background-specific ones such as
doctoral advisor or medical condition. Determining a preference or ranking in
this large set is a challenge in tasks such as prioritisation of edits or
natural-language generation. Most previous approaches to ranking knowledge base
properties are purely data-driven and, as we show, mistake frequency for
interestingness.
In this work, we have developed a human-annotated dataset of 350 preference
judgments among pairs of knowledge base properties for fixed entities. From
this set, we isolate a subset of pairs for which humans show a high level of
agreement (87.5% on average). We show, however, that baseline and
state-of-the-art techniques achieve only 61.3% precision in predicting human
preferences for this subset.
We then analyze what contributes to one property being rated as more
important than another one, and identify that at least three factors play a
role, namely (i) general frequency, (ii) applicability to similar entities and
(iii) semantic similarity between property and entity. We experimentally
analyze the contribution of each factor and show that a combination of
techniques addressing all the three factors achieves 74% precision on the task.
The dataset is available at
www.kaggle.com/srazniewski/wikidatapropertyranking.
Comment: Extended version of an ADMA 2017 conference paper
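The three-factor combination described above can be sketched as a simple weighted score; the equal weights, the [0, 1] normalisation and the example values below are illustrative assumptions, not figures from the paper.

```python
# Illustrative sketch only: a weighted combination of the three factors named
# in the abstract (general frequency, applicability to similar entities, and
# property-entity semantic similarity). Weights and inputs are assumptions.
def combined_score(frequency, applicability, semantic_similarity,
                   weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of three factor scores, each assumed normalised to [0, 1]."""
    w_f, w_a, w_s = weights
    return w_f * frequency + w_a * applicability + w_s * semantic_similarity

# A rare but highly entity-specific property ("doctoral advisor" for a
# professor) can outrank a frequent generic one ("name") under this scoring.
advisor = combined_score(frequency=0.1, applicability=0.9, semantic_similarity=0.9)
name = combined_score(frequency=1.0, applicability=0.3, semantic_similarity=0.2)
```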
Content-based Video Indexing and Retrieval Using Corr-LDA
Existing video indexing and retrieval methods on popular web-based multimedia
sharing websites are based on user-provided sparse tagging. This paper proposes
a very specific way of searching for video clips, based on the content of the
video. We present our work on Content-based Video Indexing and Retrieval using
the Correspondence-Latent Dirichlet Allocation (corr-LDA) probabilistic
framework. This is a model that provides for auto-annotation of videos in a
database with textual descriptors, and brings the added benefit of utilizing
the semantic relations between the content of the video and text. We use the
concept-level matching provided by corr-LDA to build correspondences between
text and multimedia, with the objective of retrieving content with increased
accuracy. In our experiments, we employ only the audio components of the
individual recordings and compare our results with an SVM-based approach.
Comment: 8 pages, updated references, added figure
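corr-LDA itself is not packaged in common libraries, but the topic-model backbone of such auto-annotation can be illustrated with plain LDA on a toy document-term matrix; the data and parameter choices below are invented for illustration.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Toy document-term count matrix: four "videos" described by five textual
# descriptors; the first two share one vocabulary block, the last two another.
X = np.array([
    [3, 2, 1, 0, 0],
    [2, 3, 0, 0, 0],
    [0, 0, 0, 3, 2],
    [0, 1, 0, 2, 3],
])

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)  # per-document topic proportions, rows sum to 1
```

In corr-LDA proper, each annotation word is additionally tied to one of the latent components that generated a media region, which is what yields the concept-level text/media correspondences the abstract describes.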
Temporal Identification of Latent Communities on Twitter
User communities in social networks are usually identified by considering
explicit structural social connections between users. While such communities
can reveal important information about their members such as family or
friendship ties and geographical proximity, they do not necessarily succeed at
pulling together like-minded users who share the same interests. In this
paper, we are interested in identifying communities of users that share similar
topical interests over time, regardless of whether they are explicitly
connected to each other on the social network. More specifically, we tackle the
problem of identifying temporal topic-based communities from Twitter, i.e.,
communities of users who have similar temporal inclination towards the current
emerging topics on Twitter. We model each topic as a collection of highly
correlated semantic concepts observed in tweets, and identify topics by
clustering a time-series representation of each concept built from its
observation frequency over time. Based on the identified emerging
topics in a given time period, we utilize multivariate time series analysis to
model the contributions of each user towards the identified topics, which
allows us to detect latent user communities. Through our experiments on Twitter
data, we demonstrate i) the effectiveness of our topic detection method to
detect real world topics and ii) the effectiveness of our approach compared to
well-established approaches for community detection.
Comment: Submitted to WSDM 201
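The topic-identification step described above (grouping concepts by the shape of their observation-frequency time series) can be sketched on synthetic data; the burst shapes and the k-means settings are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic "concept" frequency series over 30 time steps: two concepts burst
# early, two burst late. Real input would be per-concept tweet frequencies.
rng = np.random.default_rng(0)
t = np.arange(30)
early = np.exp(-0.5 * ((t - 5) / 2.0) ** 2)
late = np.exp(-0.5 * ((t - 22) / 2.0) ** 2)
series = np.stack([
    early + 0.01 * rng.normal(size=30),
    early + 0.01 * rng.normal(size=30),
    late + 0.01 * rng.normal(size=30),
    late + 0.01 * rng.normal(size=30),
])

# Concepts whose bursts co-occur land in the same cluster, i.e. a candidate
# emerging topic for that time period.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(series)
```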
What do Vegans do in their Spare Time? Latent Interest Detection in Multi-Community Networks
Most social network analysis works at the level of interactions between
users. However, the vast growth in size and complexity of social networks
enables us to examine interactions at a larger scale. In this work we use a dataset of 76M
submissions to the social network Reddit, which is organized into distinct
sub-communities called subreddits. We measure the similarity between entire
subreddits both in terms of user similarity and topical similarity. Our goal is
to find community pairs with similar userbases, but dissimilar content; we
refer to this type of relationship as a "latent interest." Detection of latent
interests not only provides a perspective on individual users as they shift
between roles (student, sports fan, political activist) but also gives insight
into the dynamics of Reddit as a whole. Latent interest detection also has
potential applications for recommendation systems and for researchers examining
community evolution.
Comment: NIPS 2015 Network Workshop
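The "latent interest" criterion above (similar userbases, dissimilar content) reduces to comparing two similarity measures per community pair; the community names, user sets, term vectors and thresholds below are invented for illustration.

```python
import numpy as np

def jaccard(a, b):
    """Overlap between two communities' user sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def cosine(u, v):
    """Topical similarity between two communities' term-frequency vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical communities: heavily overlapping users, disjoint vocabularies.
users = {"r/a": {"u1", "u2", "u3", "u4"}, "r/b": {"u2", "u3", "u4", "u5"}}
terms = {"r/a": np.array([5.0, 1.0, 0.0]), "r/b": np.array([0.0, 1.0, 6.0])}

user_sim = jaccard(users["r/a"], users["r/b"])   # high: shared userbase
topic_sim = cosine(terms["r/a"], terms["r/b"])   # low: dissimilar content
latent_interest = user_sim > 0.5 and topic_sim < 0.2
```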
Supervised Laplacian Eigenmaps with Applications in Clinical Diagnostics for Pediatric Cardiology
Electronic health records contain rich textual data which possess critical
predictive information for machine-learning based diagnostic aids. However, many
traditional machine learning methods fail to simultaneously integrate both
vector space data and text. We present a supervised method using Laplacian
eigenmaps to augment existing machine-learning methods with low-dimensional
representations of textual predictors which preserve the local similarities.
The proposed implementation performs alternating optimization using gradient
descent. For the evaluation we applied our method to over 2,000 patient records
from a large single-center pediatric cardiology practice to predict if patients
were diagnosed with cardiac disease. Our method was compared with latent
semantic indexing, latent Dirichlet allocation, and local Fisher discriminant
analysis. The results were assessed using AUC, MCC, specificity, and
sensitivity. Results indicate that supervised Laplacian eigenmaps (SLE) was the
highest performing method in our study, achieving 0.782 and 0.374 for AUC and
MCC respectively. SLE showed an increase of 8.16% in AUC and 20.6% in MCC over
the baseline, which excluded textual data, and increases of 2.69% in AUC and
5.35% in MCC over unsupervised Laplacian eigenmaps. This method allows many
existing machine learning predictors to effectively and efficiently utilize the
potential of textual predictors.
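The unsupervised core of this method, a standard Laplacian eigenmap, can be sketched on a tiny affinity graph; the paper's supervised variant additionally uses diagnosis labels and alternating gradient-descent optimization, which this sketch omits.

```python
import numpy as np

# Symmetric affinity matrix for four items forming two tightly linked pairs.
W = np.array([
    [0.00, 1.00, 0.05, 0.05],
    [1.00, 0.00, 0.05, 0.05],
    [0.05, 0.05, 0.00, 1.00],
    [0.05, 0.05, 1.00, 0.00],
])
D = np.diag(W.sum(axis=1))
L = D - W                        # unnormalized graph Laplacian

vals, vecs = np.linalg.eigh(L)   # eigenvalues in ascending order
embedding = vecs[:, 1:2]         # skip the trivial constant eigenvector

# Items 0 and 1 receive (near-)identical coordinates, well separated from
# items 2 and 3: local similarities are preserved in one dimension.
```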
Image Tag Refinement by Regularized Latent Dirichlet Allocation
Tagging is nowadays the most prevalent and practical way to make images
searchable. However, in reality many manually-assigned tags are irrelevant to
image content and hence are not reliable for applications. A lot of recent
efforts have been conducted to refine image tags. In this paper, we propose to
do tag refinement from the angle of topic modeling and present a novel
graphical model, regularized Latent Dirichlet Allocation (rLDA). In the
proposed approach, tag similarity and tag relevance are jointly estimated in an
iterative manner, so that they can benefit from each other, and the multi-wise
relationships among tags are explored. Moreover, both the statistics of tags
and visual affinities of images in the corpus are explored to help topic
modeling. We also analyze the superiority of our approach from the deep
structure perspective. The experiments on tag ranking and image retrieval
demonstrate the advantages of the proposed method.
Semi-Automatic Terminology Ontology Learning Based on Topic Modeling
Ontologies provide features such as a common vocabulary, reusability and
machine-readable content, and also allow for semantic search, facilitate agent
interaction and support the ordering and structuring of knowledge for Semantic
Web (Web 3.0) applications. However, a challenge in ontology engineering is
automatic learning, i.e., there is still no fully automatic approach for
forming an ontology from a text corpus or a dataset of various topics using
machine learning techniques. In this paper, two topic modeling algorithms are explored,
namely LSI & SVD and Mr.LDA, for learning a topic ontology. The objective is to
determine the statistical relationship between documents and terms to build a
topic ontology and ontology graph with minimal human intervention. Experimental
analysis on building a topic ontology and semantically retrieving the
corresponding topic ontology for a user's query demonstrates the effectiveness
of the proposed approach.
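The LSI step named above can be sketched as a truncated SVD of a toy document-term matrix; the matrix values and the component count are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Toy document-term matrix: documents 0-1 share one vocabulary block,
# documents 2-3 another.
X = np.array([
    [4.0, 3.0, 0.0, 0.0],
    [3.0, 4.0, 0.0, 0.0],
    [0.0, 0.0, 5.0, 4.0],
    [0.0, 0.0, 4.0, 5.0],
])

lsi = TruncatedSVD(n_components=2, random_state=0)  # LSI = truncated SVD
doc_vecs = lsi.fit_transform(X)  # documents in the latent topic space

def cos(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_same = cos(doc_vecs[0], doc_vecs[1])    # same latent topic: near 1
sim_cross = cos(doc_vecs[0], doc_vecs[2])   # different topics: near 0
```

Edges between terms that load strongly on the same latent component are one simple way to seed the ontology graph the abstract describes.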
Self-supervised learning of visual features through embedding images into text topic spaces
End-to-end training from scratch of current deep architectures for new
computer vision problems would require ImageNet-scale datasets, and this is not
always possible. In this paper we present a method that is able to take
advantage of freely available multi-modal content to train computer vision
algorithms without human supervision. We put forward the idea of performing
self-supervised learning of visual features by mining a large scale corpus of
multi-modal (text and image) documents. We show that discriminative visual
features can be learnt efficiently by training a CNN to predict the semantic
context in which a particular image is more likely to appear as an
illustration. For this we leverage the hidden semantic structures discovered in
the text corpus with a well-known topic modeling technique. Our experiments
demonstrate state of the art performance in image classification, object
detection, and multi-modal retrieval compared to recent self-supervised or
natural-supervised approaches.
Comment: Accepted CVPR 2017 paper
How to Become Instagram Famous: Post Popularity Prediction with Dual-Attention
With a growing number of social apps, people have become increasingly willing
to share their everyday photos and events on social media platforms, such as
Facebook, Instagram, and WeChat. In social media data mining, post popularity
prediction has received much attention from both data scientists and
psychologists. Existing research focuses more on exploring post popularity
over a population of users, incorporating comprehensive factors such as temporal
information, user connections, number of comments, and so on. However, these
frameworks are not suitable for guiding a specific user to make a popular post
because the attributes of this user are fixed. Therefore, previous frameworks
can only answer the question "whether a post is popular" rather than "how to
become famous by popular posts". In this paper, we aim at predicting the
popularity of a post for a specific user and mining the patterns behind the
popularity. To this end, we first collect data from Instagram. We then design a
method to figure out the user environment, representing the content that a
specific user is very likely to post. Based on the relevant data, we devise a
novel dual-attention model to incorporate image, caption, and user environment.
The dual-attention model basically consists of two parts, explicit attention
for image-caption pairs and implicit attention for user environment. A
hierarchical structure is devised to concatenate the explicit attention part
and implicit attention part. We conduct a series of experiments to validate the
effectiveness of our model and investigate the factors that can influence the
popularity. The classification results show that our model outperforms the
baselines, and a statistical analysis identifies what kind of pictures or
captions can help the user achieve a relatively high number of "likes".
Comment: 2018 IEEE International Conference on Big Data (IEEE Big Data)