2,543 research outputs found

    iCrawl: Improving the Freshness of Web Collections by Integrating Social Web and Focused Web Crawling

    Researchers in the Digital Humanities and journalists need to monitor, collect and analyze fresh online content regarding current events such as the Ebola outbreak or the Ukraine crisis on demand. However, existing focused crawling approaches only consider topical aspects while ignoring temporal aspects and therefore cannot achieve thematically coherent and fresh Web collections. Social Media in particular provide a rich source of fresh content, which is not used by state-of-the-art focused crawlers. In this paper we address the issue of enabling the collection of fresh and relevant Web and Social Web content for a topic of interest through seamless integration of Web and Social Media in a novel integrated focused crawler. The crawler collects Web and Social Media content in a single system and exploits the stream of fresh Social Media content for guiding the crawler. Comment: Published in the Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries 201

    Language Use Matters: Analysis of the Linguistic Structure of Question Texts Can Characterize Answerability in Quora

    Quora is one of the most popular community Q&A sites of recent times. However, many questions posted on this site never get answered. In this paper, we quantify various linguistic activities that discriminate an answered question from an unanswered one. Our central finding is that the way users use language while writing the question text can be a very effective means to characterize answerability. This characterization helps us to predict early whether a question that has remained unanswered for a specific time period t will eventually be answered, achieving an accuracy of 76.26% (t = 1 month) and 68.33% (t = 3 months). Notably, features representing the language use patterns of the users are the most discriminative and alone account for an accuracy of 74.18%. We also compare our method with similar work (Dror et al., Yang et al.), achieving a maximum improvement of ~39% in terms of accuracy. Comment: 1 figure, 3 tables, ICWSM 2017 as poste

    A Topic-Agnostic Approach for Identifying Fake News Pages

    Fake news and misinformation have been increasingly used to manipulate popular opinion and influence political processes. To better understand fake news, how it is propagated, and how to counter its effect, it is necessary to first identify it. Recently, approaches have been proposed to automatically classify articles as fake based on their content. An important challenge for these approaches comes from the dynamic nature of news: as new political events are covered, topics and discourse constantly change, and thus a classifier trained using content from articles published at a given time is likely to become ineffective in the future. To address this challenge, we propose a topic-agnostic (TAG) classification strategy that uses linguistic and web-markup features to identify fake news pages. We report experimental results using multiple data sets which show that our approach attains high accuracy in the identification of fake news, even as topics evolve over time. Comment: Accepted for publication in the Companion Proceedings of the 2019 World Wide Web Conference (WWW'19 Companion). Presented in the 2019 International Workshop on Misinformation, Computational Fact-Checking and Credible Web (MisinfoWorkshop2019). 6 page

    Readability of Privacy Policies of Healthcare Websites

    Health-related personal information is very privacy-sensitive. Online privacy policies inform website users about the ways their personal information is gathered, processed and stored. In light of increasing privacy concerns, privacy policies seem to be an important mechanism for increasing customer loyalty. However, in practice, consumers only rarely read privacy policies, possibly due to the common assumption that policies are hard to read. By designing and implementing an automated extraction and readability analysis toolset, we present the first study that provides empirical evidence on the readability of over 5,000 privacy policies of health websites and over 1,000 privacy policies of top e-commerce sites. Our results confirm the difficulty of reading current privacy policies. We further show that health websites' policies are more readable than those of top e-commerce sites, but policies of non-commercial health websites are less readable than those of commercial ones. Our study also provides a solid policy text corpus for further research.