Towards more accurate content categorization of API discussions
Software developers often discuss the usage of various APIs in online forums. Automatically assigning pre-defined semantic categories to API discussions in these forums could help manage forum data and help developers search for useful information. We refer to this process as content categorization of API discussions. To solve this problem, Hou and Mo proposed using multinomial naive Bayes, an effective classification algorithm. In this paper, we propose a Cache-bAsed compoSitE algorithm, abbreviated as CASE, to automatically categorize API discussions. Considering that the content of an API discussion contains both textual description and source code, CASE has 3 components that analyze an API discussion in 3 different ways: text, code, and original. In the text component, CASE only considers the textual description; in the code component, CASE only considers the source code; in the original component, CASE considers the original content of an API discussion, which might include both textual description and source code. Next, for each component, since different terms (i.e., words) have different affinities to different categories, CASE caches a subset of terms which have the highest affinity scores to each category, and builds a classifier based on the cached terms. Finally, CASE combines all 3 classifiers to achieve a better accuracy score. We evaluate the performance of CASE on 3 datasets which contain a total of 1,035 API discussions. The experimental results show that CASE achieves accuracy scores of 0.69, 0.77, and 0.96 for the 3 datasets respectively, outperforming the state-of-the-art method proposed by Hou and Mo by 11%, 10%, and 2%, respectively.
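The cache-and-combine idea described in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not define the affinity score or the combination rule, so the sketch assumes affinity is the fraction of documents containing a term that belong to a category, and assumes a simple majority vote over the three components. All function names are hypothetical.

```python
from collections import Counter, defaultdict


def affinity_scores(docs, labels):
    """Assumed affinity: share of the documents containing a term that carry each category."""
    term_cat = defaultdict(Counter)  # term -> category -> document count
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            term_cat[term][label] += 1
    scores = defaultdict(dict)  # category -> term -> affinity
    for term, counts in term_cat.items():
        total = sum(counts.values())
        for cat, n in counts.items():
            scores[cat][term] = n / total
    return scores


def build_cache(docs, labels, cache_size):
    """Keep only the cache_size highest-affinity terms per category."""
    cache = {}
    for cat, term_scores in affinity_scores(docs, labels).items():
        ranked = sorted(term_scores.items(), key=lambda kv: (-kv[1], kv[0]))
        cache[cat] = dict(ranked[:cache_size])
    return cache


def classify(cache, tokens):
    """Pick the category whose cached terms give the largest summed affinity."""
    seen = set(tokens)
    totals = {cat: sum(terms.get(t, 0.0) for t in seen) for cat, terms in cache.items()}
    # Iterating in sorted order makes ties resolve alphabetically.
    return max(sorted(totals), key=lambda c: totals[c])


def composite(caches, views):
    """Assumed combination rule: majority vote over the per-view classifiers."""
    votes = Counter(classify(caches[name], views[name]) for name in caches)
    return max(sorted(votes), key=lambda c: votes[c])


def make_views(text_tokens, code_tokens):
    """The three views named in the abstract: text only, code only, and both."""
    return {"text": text_tokens, "code": code_tokens, "original": text_tokens + code_tokens}
```

To use the sketch, build one cache per view from labeled discussions with `build_cache`, then pass the caches and the views of a new discussion (from `make_views`) to `composite`.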
Linking Data Across Universities: An Integrated Video Lectures Dataset
This paper presents our work and experience interlinking educational information across universities through the use of Linked Data principles and technologies. More specifically, this paper focuses on selecting, extracting, structuring and interlinking information about video lectures produced by 27 different educational institutions. For this purpose, selected information from several websites and YouTube channels has been scraped and structured according to well-known vocabularies, such as FOAF or the W3C Ontology for Media Resources. To integrate this information, the extracted videos have been categorized under a common classification space, the taxonomy defined by the Open Directory Project. An evaluation of this categorization process has been conducted, obtaining a 98% degree of coverage and an 89% degree of correctness. As a result of this process, a new Linked Data dataset has been released containing more than 14,000 video lectures from 27 different institutions, categorized under a common classification scheme.
Towards Measuring Adversarial Twitter Interactions against Candidates in the US Midterm Elections
Adversarial interactions against politicians on social media such as Twitter
have a significant impact on society. In particular, they disrupt substantive
political discussions online, and may discourage people from seeking public
office. In this study, we measure the adversarial interactions against
candidates for the US House of Representatives during the run-up to the 2018 US
general election. We gather a new dataset consisting of 1.7 million tweets
involving candidates, one of the largest corpora focusing on political
discourse. We then develop a new technique for detecting tweets with toxic
content that are directed at any specific candidate. This technique allows us to
more accurately quantify adversarial interactions towards political candidates.
Further, we introduce an algorithm to induce candidate-specific adversarial
terms to capture more nuanced adversarial interactions that previous techniques
may not consider toxic. Finally, we use these techniques to outline the breadth
of adversarial interactions seen in the election, including offensive
name-calling, threats of violence, posting discrediting information, attacks on
identity, and adversarial message repetition
Online Human-Bot Interactions: Detection, Estimation, and Characterization
Increasing evidence suggests that a growing amount of social media content is
generated by autonomous entities known as social bots. In this work we present
a framework to detect such entities on Twitter. We leverage more than a
thousand features extracted from public data and meta-data about users:
friends, tweet content and sentiment, network patterns, and activity time
series. We benchmark the classification framework by using a publicly available
dataset of Twitter bots. This training data is enriched by a manually annotated
collection of active Twitter users that include both humans and bots of varying
sophistication. Our models yield high accuracy and agreement with each other
and can detect bots of different nature. Our estimates suggest that between 9%
and 15% of active Twitter accounts are bots. Characterizing ties among
accounts, we observe that simple bots tend to interact with bots that exhibit
more human-like behaviors. Analysis of content flows reveals retweet and
mention strategies adopted by bots to interact with different target groups.
Using clustering analysis, we characterize several subclasses of accounts,
including spammers, self promoters, and accounts that post content from
connected applications. Comment: Accepted paper for ICWSM'17, 10 pages, 8 figures, 1 table
On the Role of Social Identity and Cohesion in Characterizing Online Social Communities
Two prevailing theories for explaining social group or community structure
are cohesion and identity. The social cohesion approach posits that social
groups arise out of an aggregation of individuals that have mutual
interpersonal attraction as they share common characteristics. These
characteristics can range from common interests to kinship ties and from social
values to ethnic backgrounds. In contrast, the social identity approach posits
that an individual is likely to join a group based on an intrinsic
self-evaluation at a cognitive or perceptual level. In other words, group
members typically share an awareness of a common category membership.
In this work we seek to understand the role of these two contrasting theories
in explaining the behavior and stability of social communities in Twitter. A
specific focal point of our work is to understand the role of these theories in
disparate contexts ranging from disaster response to socio-political activism.
We extract social identity and social cohesion features-of-interest for large
scale datasets of five real-world events and examine the effectiveness of such
features in capturing behavioral characteristics and the stability of groups.
We also propose a novel measure of social group sustainability based on the
divergence in group discussion. Our main findings are: 1) Sharing of social
identities (especially physical location) among group members has a positive
impact on group sustainability, 2) Structural cohesion (represented by high
group density and low average shortest path length) is a strong indicator of
group sustainability, and 3) Event characteristics play a role in shaping group
sustainability, as social groups in transient events behave differently from
groups in events that last longer
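The two structural cohesion indicators named in this abstract, group density and average shortest path length, can be computed as in the sketch below. It assumes an undirected, unweighted interaction graph stored as an adjacency dictionary and averages path length over reachable pairs only; how the authors built or measured their graphs is not stated in the abstract, so the function names and the graph representation are illustrative.

```python
from collections import deque


def density(adj):
    """Fraction of possible undirected edges that are present."""
    n = len(adj)
    if n < 2:
        return 0.0
    edges = sum(len(neighbors) for neighbors in adj.values()) / 2
    return edges / (n * (n - 1) / 2)


def average_shortest_path_length(adj):
    """Mean hop distance over all reachable ordered pairs, via BFS from each node."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for target, d in dist.items():
            if target != source:
                total += d
                pairs += 1
    return total / pairs if pairs else 0.0


# A fully connected group of three versus a chain of three.
triangle = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
chain = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
```

In this toy comparison the triangle has the higher density and the shorter average path, which is the pattern the abstract associates with stronger structural cohesion.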
YouTube AV 50K: An Annotated Corpus for Comments in Autonomous Vehicles
With one billion monthly viewers, and millions of users discussing and
sharing opinions, comments below YouTube videos are rich sources of data for
opinion mining and sentiment analysis. We introduce the YouTube AV 50K dataset,
a freely available collection of more than 50,000 YouTube comments and
metadata below autonomous vehicle (AV)-related videos. We describe its creation
process, its content and data format, and discuss its possible usages.
In particular, we conduct a case study of the first self-driving car fatality to
evaluate the dataset, and show how we can use this dataset to better understand
public attitudes toward self-driving cars and public reactions to the accident.
Future developments of the dataset are also discussed. Comment: in Proceedings of the Thirteenth International Joint Symposium on
Artificial Intelligence and Natural Language Processing (iSAI-NLP 2018)