
    BlogForever D2.6: Data Extraction Methodology

    This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction, and extends the inquiry through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform.
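The core idea of pairing RSS feeds with HTML representations can be sketched as follows: the feed already gives clean titles and summaries, so an unsupervised extractor can locate the matching region of the rendered page by text similarity, without hand-written per-blog rules. This is a minimal stdlib-only sketch of that matching step (the block types and similarity measure are illustrative assumptions, not the report's actual algorithm):

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

class BlockCollector(HTMLParser):
    """Collect the text of candidate content blocks (here: <div> and <p>)."""
    def __init__(self):
        super().__init__()
        self.blocks = []
        self._open = []   # one text buffer per currently open block

    def handle_starttag(self, tag, attrs):
        if tag in ("div", "p"):
            self._open.append([])

    def handle_endtag(self, tag):
        if tag in ("div", "p") and self._open:
            text = " ".join(self._open.pop()).strip()
            if text:
                self.blocks.append(text)

    def handle_data(self, data):
        for buf in self._open:      # data belongs to every enclosing block
            buf.append(data.strip())

def locate_post_body(feed_summary: str, html: str) -> str:
    """Return the HTML block whose text best matches the RSS summary."""
    collector = BlockCollector()
    collector.feed(html)
    return max(collector.blocks,
               key=lambda b: SequenceMatcher(None, feed_summary, b).ratio())
```

Once the best-matching block is known for several posts of the same blog, its position in the DOM can be generalised into an extraction rule for that blog, which is where the unsupervised learning described above comes in.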

    Studying Media Events through Spatio-Temporal Statistical Analysis

    This report, written in the context of the ANR GEOMEDIA project, summarises the development of methods for the spatio-temporal statistical analysis of media events (deliverable 3.2). It presents ongoing work on the statistical modelling and statistical inference of the ANR GEOMEDIA corpus, a collection of international RSS news feeds. Central to this project, RSS news feeds are viewed as a representation of the information flow in geopolitical space; as such, they allow us to study media events of global extent and how they affect international relations. We propose hidden Markov models (HMMs) as an adequate modelling framework to study the evolution of media events in time. This class of models respects the characteristic properties of the data, such as temporal dependencies and correlations between feeds, and its structure corresponds well to our conceptualisation of media attention and media events. We specify the general model structure that we use for modelling an ensemble of RSS news feeds. Finally, we apply the proposed models to a case study dedicated to the analysis of media attention for the Ebola epidemic that spread through West Africa in 2014.
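To make the HMM framing concrete: a feed can be modelled as switching between a latent "background" state and an "event" state, each emitting an observed daily coverage level. The sketch below implements the standard forward algorithm for such a two-state model in pure Python; all states, coverage levels, and probabilities are illustrative assumptions, not parameters from the GEOMEDIA corpus:

```python
# Hypothetical two-state HMM of media attention: a feed is either in a
# "background" state or an "event" state, and each day it emits an
# observed coverage level. All probabilities below are illustrative.
STATES = ("background", "event")
INIT   = {"background": 0.9, "event": 0.1}
TRANS  = {"background": {"background": 0.95, "event": 0.05},
          "event":      {"background": 0.10, "event": 0.90}}
EMIT   = {"background": {"low": 0.70, "mid": 0.25, "high": 0.05},
          "event":      {"low": 0.05, "mid": 0.25, "high": 0.70}}

def forward(observations):
    """Forward algorithm: P(observations | model), summed over state paths."""
    alpha = {s: INIT[s] * EMIT[s][observations[0]] for s in STATES}
    for obs in observations[1:]:
        alpha = {s: EMIT[s][obs] * sum(alpha[r] * TRANS[r][s] for r in STATES)
                 for s in STATES}
    return sum(alpha.values())
```

The self-transition probabilities near 1 encode the temporal dependency the abstract mentions: once a feed enters the "event" state, it tends to stay there for several days, which matches how media attention to an epidemic such as Ebola persists rather than flickering day to day.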

    Geospatial database generation from digital newspapers: use case for risk and disaster domains.

    Dissertation submitted in partial fulfilment of the requirements for the Degree of Master of Science in Geospatial Technologies. The generation of geospatial databases is expensive in terms of time and money, and many geospatial users still lack spatial data. Geographic Information Extraction and Retrieval systems can alleviate this problem. This work proposes a method to populate spatial databases automatically from the Web, applying the approach to the risk and disaster domain with digital newspapers as a data source. News stories in digital newspapers contain rich thematic information that can be attached to places. The use case of automating spatial database generation is applied to Mexico using placenames. In Mexico, small and medium disasters occur most years; the facts about them are frequently mentioned in newspapers but rarely stored as records in national databases, making it difficult to estimate the human and material losses of those events. This work presents two ways to extract information from digital news: natural language processing techniques for distilling the text, and national gazetteer codes for achieving placename-attribute disambiguation. Two outputs are presented: a general one that exposes highly relevant news, and another that attaches attributes of interest to placenames. The latter achieved a 75% rate of thematic relevance under qualitative analysis.
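The second output described above, attaching attributes of interest to placenames, can be sketched as a gazetteer lookup over news text: recognised placenames are resolved to gazetteer codes, and hazard terms found in the same article become the attached attributes. The gazetteer entries, codes, and hazard vocabulary below are toy assumptions, not real national gazetteer identifiers:

```python
import re

# Toy gazetteer: placename -> (gazetteer code, state). Codes are invented.
GAZETTEER = {
    "guadalajara": ("MX-JAL-039", "Jalisco"),
    "monterrey":   ("MX-NLE-039", "Nuevo Leon"),
}
DISASTER_TERMS = ("flood", "earthquake", "landslide", "wildfire")

def extract_events(article: str):
    """Attach disaster attributes found in a news article to gazetteer codes."""
    text = article.lower()
    hazards = [t for t in DISASTER_TERMS if t in text]
    events = []
    for name, (code, state) in GAZETTEER.items():
        # Whole-word match avoids firing on substrings of other names.
        if re.search(r"\b" + re.escape(name) + r"\b", text):
            events.append({"place": name, "code": code,
                           "state": state, "hazards": hazards})
    return events
```

Resolving each matched name to a unique gazetteer code is what makes the resulting records loadable into a spatial database: the code, rather than the ambiguous surface string, becomes the key for the attached attributes.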

    Being Omnipresent To Be Almighty: The Importance of The Global Web Evidence for Organizational Expert Finding

    Modern expert finding algorithms are developed under the assumption that all possible expertise evidence for a person is concentrated in the company that currently employs them. Evidence that can be acquired outside of an enterprise is traditionally overlooked. At the same time, the Web is full of personal information that is sufficiently detailed to judge a person's skills and knowledge. In this work, we review various sources of expertise evidence outside of an organization and experiment with rankings built on data acquired from six different sources, accessible through the APIs of two major web search engines. We show that these rankings and their combinations are often more realistic and of higher quality than rankings built on organizational data only.
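Combining rankings from several evidence sources, as described above, is commonly done with rank fusion. As a sketch of the idea (the abstract does not state which combination method was used, so reciprocal rank fusion here is an illustrative assumption), each candidate expert accumulates a score from their rank in every source list:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked candidate lists (best first) into one ranking.

    k=60 is the constant conventionally used for RRF; it damps the
    influence of top ranks from any single source.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, person in enumerate(ranking, start=1):
            scores[person] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A candidate who appears in both a web-evidence ranking and an organizational ranking outranks one who appears, even highly, in only a single source, which is one way combined rankings can beat organizational data alone.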

    Financial news analysis using a semantic web approach

    In this paper we present StockWatcher, an OWL-based web application that enables the extraction of relevant news items from RSS feeds concerning the NASDAQ-100 listed companies. The application's goal is to present a customized, aggregated view of the news, categorized by topic. We distinguish between four relevant news categories: i) news regarding the company itself, ii) news regarding direct competitors of the company, iii) news regarding important people of the company, and iv) news regarding the industry in which the company is active. At the same time, the system is able to rate these news items based on their relevance. We identify three possible effects that a news message can have on the company, and thus on its stock price: i) positive, ii) negative, and iii) neutral. Currently, StockWatcher provides support for the NASDAQ-100 companies. The selection of relevant news items is based on a customizable user portfolio that may consist of one or more of these companies.
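The two classification axes above, a news category and an expected effect on the stock price, can be sketched with simple keyword matching. In StockWatcher the company profile would come from the OWL ontology rather than hand-written dictionaries, so everything below (company, terms, sentiment lexicon) is an illustrative assumption:

```python
# Hypothetical company profile; one entry per news category from the paper.
PROFILE = {
    "company":     {"microsoft"},
    "competitors": {"apple", "google"},
    "people":      {"satya nadella"},
    "industry":    {"software", "cloud computing"},
}
POSITIVE = {"beats", "record", "growth", "surge"}
NEGATIVE = {"lawsuit", "recall", "drop", "layoffs"}

def classify(headline: str):
    """Return (category, effect) for a headline; category is None if no match."""
    text = headline.lower()
    category = next((cat for cat, terms in PROFILE.items()
                     if any(term in text for term in terms)), None)
    words = set(text.split())
    if words & POSITIVE:
        effect = "positive"
    elif words & NEGATIVE:
        effect = "negative"
    else:
        effect = "neutral"
    return category, effect
```

A portfolio view then only needs to filter classified items to the categories and companies the user selected, which is the aggregation step the abstract describes.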
