Template Mining for Information Extraction from Digital Documents
Published or submitted for publication.
A Topic-Agnostic Approach for Identifying Fake News Pages
Fake news and misinformation have been increasingly used to manipulate
popular opinion and influence political processes. To better understand fake
news, how they are propagated, and how to counter their effect, it is necessary
to first identify them. Recently, approaches have been proposed to
automatically classify articles as fake based on their content. An important
challenge for these approaches comes from the dynamic nature of news: as new
political events are covered, topics and discourse constantly change and thus,
a classifier trained using content from articles published at a given time is
likely to become ineffective in the future. To address this challenge, we
propose a topic-agnostic (TAG) classification strategy that uses linguistic and
web-markup features to identify fake news pages. We report experimental results
using multiple data sets which show that our approach attains high accuracy in
the identification of fake news, even as topics evolve over time.
Comment: Accepted for publication in the Companion Proceedings of the 2019
World Wide Web Conference (WWW'19 Companion). Presented at the 2019
International Workshop on Misinformation, Computational Fact-Checking and
Credible Web (MisinfoWorkshop2019). 6 pages
Information Extraction, Data Integration, and Uncertain Data Management: The State of The Art
Information extraction, data integration, and uncertain data management are distinct research areas that have received considerable attention over the last two decades. Many studies have tackled these areas individually. However, information extraction systems should be integrated with data integration methods in order to make use of the extracted information. Handling uncertainty in the extraction and integration process is important for improving data quality in such integrated systems. This article presents the state of the art in these research areas, identifies their common ground, and shows how to integrate information extraction and data integration under the cover of uncertainty management.
Learning to Extract Keyphrases from Text
Many academic journals ask their authors to provide a list of about five to fifteen key words, to appear on the first page of each article. Since these key words are often phrases of two or more words, we prefer to call them keyphrases. There is a surprisingly wide variety of tasks for which keyphrases are useful, as we discuss in this paper. Recent commercial software, such as Microsoft's Word 97 and Verity's Search 97, includes algorithms that automatically extract keyphrases from documents. In this paper, we approach the problem of automatically extracting keyphrases from text as a supervised learning task. We treat a document as a set of phrases, which the learning algorithm must learn to classify as positive or negative examples of keyphrases. Our first set of experiments applies the C4.5 decision tree induction algorithm to this learning task. The second set of experiments applies the GenEx algorithm to the task. We developed the GenEx algorithm specifically for this task. The third set of experiments examines the performance of GenEx on the task of metadata generation, relative to the performance of Microsoft's Word 97. The fourth and final set of experiments investigates the performance of GenEx on the task of highlighting, relative to Verity's Search 97. The experimental results support the claim that a specialized learning algorithm (GenEx) can generate better keyphrases than a general-purpose learning algorithm (C4.5) and the non-learning algorithms that are used in commercial software (Word 97 and Search 97).
Stop Clickbait: Detecting and Preventing Clickbaits in Online News Media
Most of the online news media outlets rely heavily on the revenues generated
from the clicks made by their readers, and due to the presence of numerous such
outlets, they need to compete with each other for reader attention. To attract
the readers to click on an article and subsequently visit the media site, the
outlets often come up with catchy headlines accompanying the article links,
which lure the readers to click on the link. Such headlines are known as
Clickbaits. While these baits may trick the readers into clicking, in the long
run, clickbaits usually don't live up to the expectation of the readers, and
leave them disappointed.
In this work, we attempt to automatically detect clickbaits and then build a
browser extension which warns the readers of different media sites about the
possibility of being baited by such headlines. The extension also offers each
reader an option to block clickbaits she doesn't want to see. Then, using such
reader choices, the extension automatically blocks similar clickbaits during
her future visits. We run extensive offline and online experiments across
multiple media sites and find that the proposed clickbait detection and the
personalized blocking approaches perform very well, achieving 93% accuracy in
detecting and 89% accuracy in blocking clickbaits.
Comment: 2016 IEEE/ACM International Conference on Advances in Social Networks
Analysis and Mining (ASONAM)