iCrawl: Improving the Freshness of Web Collections by Integrating Social Web and Focused Web Crawling
Researchers in the Digital Humanities and journalists need to monitor,
collect and analyze fresh online content regarding current events such as the
Ebola outbreak or the Ukraine crisis on demand. However, existing focused
crawling approaches only consider topical aspects while ignoring temporal
aspects and therefore cannot achieve thematically coherent and fresh Web
collections. Social Media in particular provide a rich source of fresh content,
which is not used by state-of-the-art focused crawlers. In this paper we
address the issues of enabling the collection of fresh and relevant Web and
Social Web content for a topic of interest through seamless integration of Web
and Social Media in a novel integrated focused crawler. The crawler collects
Web and Social Media content in a single system and exploits the stream of
fresh Social Media content for guiding the crawler.
Comment: Published in the Proceedings of the 15th ACM/IEEE-CS Joint Conference
on Digital Libraries 201
Language Use Matters: Analysis of the Linguistic Structure of Question Texts Can Characterize Answerability in Quora
Quora is one of the most popular community Q&A sites of recent times.
However, many questions posted on this Q&A site do not get answered. In
this paper, we quantify various linguistic activities that discriminate an
answered question from an unanswered one. Our central finding is that the way
users use language while writing the question text can be a very effective
means to characterize answerability. This characterization helps us to predict
early if a question remaining unanswered for a specific time period t will
eventually be answered or not and achieve an accuracy of 76.26% (t = 1 month)
and 68.33% (t = 3 months). Notably, features representing the language use
patterns of the users are most discriminative and alone account for an accuracy
of 74.18%. We also compare our method with similar work (Dror et
al., Yang et al.), achieving a maximum improvement of ~39% in terms of accuracy.
Comment: 1 figure, 3 tables, ICWSM 2017 as poste
A Topic-Agnostic Approach for Identifying Fake News Pages
Fake news and misinformation have been increasingly used to manipulate
popular opinion and influence political processes. To better understand fake
news, how it is propagated, and how to counter its effect, it is necessary
to first identify it. Recently, approaches have been proposed to
automatically classify articles as fake based on their content. An important
challenge for these approaches comes from the dynamic nature of news: as new
political events are covered, topics and discourse constantly change and thus,
a classifier trained using content from articles published at a given time is
likely to become ineffective in the future. To address this challenge, we
propose a topic-agnostic (TAG) classification strategy that uses linguistic and
web-markup features to identify fake news pages. We report experimental results
using multiple data sets which show that our approach attains high accuracy in
the identification of fake news, even as topics evolve over time.
Comment: Accepted for publication in the Companion Proceedings of the 2019
World Wide Web Conference (WWW'19 Companion). Presented in the 2019
International Workshop on Misinformation, Computational Fact-Checking and
Credible Web (MisinfoWorkshop2019). 6 page
Readability of Privacy Policies of Healthcare Websites
Health-related personal information is very privacy-sensitive. Online privacy policies inform website users about the ways their personal information is gathered, processed and stored. In light of increasing privacy concerns, privacy policies seem to be an important mechanism for increasing customer loyalty. In practice, however, consumers rarely read privacy policies, possibly due to the common assumption that policies are hard to read. By designing and implementing an automated extraction and readability analysis toolset, we present the first study that provides empirical evidence on the readability of over 5,000 privacy policies of health websites and over 1,000 privacy policies of top e-commerce sites. Our results confirm the difficulty of reading current privacy policies. We further show that health websites' policies are more readable than those of top e-commerce sites, but policies of non-commercial health websites are less readable than those of commercial ones. Our study also provides a solid policy text corpus for further research.