9,261 research outputs found

    BlogForever D2.6: Data Extraction Methodology

    This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform.

    DocTag2Vec: An Embedding Based Multi-label Learning Approach for Document Tagging

    Tagging news articles or blog posts with relevant tags from a collection of predefined ones is referred to as document tagging in this work. Accurate tagging of articles can benefit several downstream applications such as recommendation and search. In this work, we propose a novel yet simple approach called DocTag2Vec to accomplish this task. We substantially extend Word2Vec and Doc2Vec, two popular models for learning distributed representations of words and documents. In DocTag2Vec, we simultaneously learn the representations of words, documents, and tags in a joint vector space during training, and employ a simple k-nearest neighbor search to predict tags for unseen documents. In contrast to previous multi-label learning methods, DocTag2Vec deals directly with raw text instead of a provided feature vector and, in addition, enjoys advantages such as the learning of tag representations and the ability to handle newly created tags. To demonstrate the effectiveness of our approach, we conduct experiments on several datasets and show promising results against state-of-the-art methods. Comment: 10 page
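
    The prediction step described in this abstract (a k-nearest neighbor search over tag vectors in the joint space) might look roughly like the sketch below. It assumes cosine similarity as the distance measure and uses made-up three-dimensional embeddings; the function names `cosine` and `predict_tags` are hypothetical, and neither they nor the toy vectors come from the paper. How the embeddings are learned is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def predict_tags(doc_vec, tag_vecs, k=2):
    """Return the k tags whose vectors are most similar to the document vector."""
    ranked = sorted(
        tag_vecs,
        key=lambda tag: cosine(doc_vec, tag_vecs[tag]),
        reverse=True,
    )
    return ranked[:k]


# Toy embeddings (assumed values, purely for illustration).
tag_vecs = {
    "sports": [1.0, 0.0, 0.0],
    "politics": [0.0, 1.0, 0.0],
    "finance": [0.0, 0.0, 1.0],
}
doc_vec = [0.9, 0.4, 0.1]  # stands in for the embedding of an unseen document

top_tags = predict_tags(doc_vec, tag_vecs, k=2)
print(top_tags)
```

    Because tags live in the same space as documents, a newly created tag only needs a vector of its own to become a candidate in this search, which is one way to read the abstract's remark about handling new tags.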

    A Topic Recommender for Journalists

    The way in which people acquire information on events and form their own opinion on them has changed dramatically with the advent of social media. For many readers, the news gathered from online sources becomes an opportunity to share points of view and information within micro-blogging platforms such as Twitter, mainly aimed at satisfying their communication needs. Furthermore, the need to explore aspects of the news in greater depth stimulates a demand for additional information, which is often met through online encyclopedias such as Wikipedia. This behaviour has also influenced the way in which journalists write their articles, requiring a careful assessment of what actually interests the readers. The goal of this paper is to present a recommender system, What to Write and Why, capable of suggesting to a journalist, for a given event, the aspects still uncovered in news articles on which the readers focus their interest. The basic idea is to characterize an event according to the echo it receives in online news sources and associate it with the corresponding readers' communicative and informative patterns, detected through the analysis of Twitter and Wikipedia, respectively. Our methodology temporally aligns the results of this analysis and recommends the concepts that emerge as topics of interest from Twitter and Wikipedia and are either not covered or poorly covered in the published news articles.

    BlogForever: D3.1 Preservation Strategy Report

    This report describes preservation planning approaches and strategies recommended by the BlogForever project as a core component of a weblog repository design. More specifically, we start by discussing why we would want to preserve weblogs in the first place and what exactly we are trying to preserve. We further present a review of past and present work and highlight why current practices in web archiving do not adequately address the needs of weblog preservation. We make three distinct contributions in this volume: a) we propose transferable practical workflows for applying a combination of established metadata and repository standards in developing a weblog repository, b) we provide an automated approach to identifying significant properties of weblog content that uses the notion of communities, and discuss how this affects previous strategies, c) we propose a sustainability plan that draws upon community knowledge through innovative repository design.

    Argumentation Mining in User-Generated Web Discourse

    The goal of argumentation mining, an evolving research field in computational linguistics, is to design methods capable of analyzing people's argumentation. In this article, we go beyond the state of the art in several ways. (i) We deal with actual Web data and take up the challenges given by the variety of registers, multiple domains, and unrestricted noisy user-generated Web discourse. (ii) We bridge the gap between normative argumentation theories and argumentation phenomena encountered in actual data by adapting an argumentation model tested in an extensive annotation study. (iii) We create a new gold standard corpus (90k tokens in 340 documents) and experiment with several machine learning methods to identify argument components. We offer the data, source code, and annotation guidelines to the community under free licenses. Our findings show that argumentation mining in user-generated Web discourse is a feasible but challenging task. Comment: Cite as: Habernal, I. & Gurevych, I. (2017). Argumentation Mining in User-Generated Web Discourse. Computational Linguistics 43(1), pp. 125-17

    Local Information Diffusion Patterns in Social and Traditional Media: The Estonian Case Study

    Information has become more highly valued among companies and individuals than ever before. With this, interest in how information diffuses among the entities in various structured networks has increased. A number of studies have been published on the diffusion process in real-life networks, such as web service networks, citation networks, and blog networks. Most of this research has focused on one type of network, such as Facebook posts, Twitter tweets, or Blogspot blog entries. A disadvantage of analysing a network containing entities from a single source is that it does not consider outside influence on the diffusion. Recently, some papers have started to incorporate different networks in their study and have thus been able to analyse the effect of outside influence on the diffusion process. This thesis aims to shed further light on the topic of information diffusion in a real-world network containing entities from different sources. This is achieved by detecting relevant local topological and temporal information diffusion patterns. For topological pattern analysis, frequent subgraph mining techniques are used. Temporal patterns are extracted using time series clustering. The dataset used in this thesis was collected from Estonian mainstream online news media (articles and their comments) and from the social media channels Twitter and Facebook. From this dataset the relations between the entities were extracted and a network for the analysis of diffusion patterns was constructed. Temporal patterns reveal the high pace of information diffusion, while topological patterns expose the important role of news media articles and Facebook posts in the information diffusion process. The results of the thesis are applicable in cyber defence, online marketing, campaign management, and information impact estimation, to mention just a few application areas.

    Role of sentiment classification in sentiment analysis: a survey

    Through a survey of the literature, the role of sentiment classification in sentiment analysis has been reviewed. The review identifies the research challenges involved in tackling sentiment classification. A total of 68 articles published during 2015–2017 were reviewed along six dimensions, viz. sentiment classification, feature extraction, cross-lingual sentiment classification, cross-domain sentiment classification, lexica and corpora creation, and multi-label sentiment classification. This study discusses the prominence and effects of sentiment classification in sentiment evaluation and concludes that considerable further research is needed to obtain productive results.