BlogForever D2.6: Data Extraction Methodology
This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform.
Studying Media Events through Spatio-Temporal Statistical Analysis
This report is written in the context of the ANR Geomedia project and summarises the development of methods for the spatio-temporal statistical analysis of media events (deliverable 3.2). This document presents ongoing work on statistical modelling and statistical inference for the ANR GEOMEDIA corpus, a collection of international RSS news feeds. Central to this project, RSS news feeds are viewed as a representation of the information flow in geopolitical space. As such, they allow us to study media events of global extent and how they affect international relations. Here we propose hidden Markov models (HMMs) as an adequate modelling framework to study the evolution of media events in time. This set of models respects the characteristic properties of the data, such as temporal dependencies and correlations between feeds. Its specific structure corresponds well to our conceptualisation of media attention and media events. We specify the general model structure that we use for modelling an ensemble of RSS news feeds. Finally, we apply the proposed models to a case study dedicated to the analysis of the media attention for the Ebola epidemic which spread through West Africa in 2014.
Geospatial database generation from digital newspapers: use case for risk and disaster domains.
Dissertation submitted in partial fulfilment of the requirements for the Degree of Master of Science in Geospatial Technologies. The generation of geospatial databases is expensive in terms of time and money. Many geospatial users still lack spatial data. Geographic Information Extraction and Retrieval systems can alleviate this problem. This work proposes a method to populate spatial databases automatically from the Web. It applies the approach to the risk and disaster domain, taking digital newspapers as a data source. News stories in digital newspapers contain rich thematic information that can be attached to places. The use case of automating spatial database generation is applied to Mexico using placenames. In Mexico, small and medium disasters occur most years. The facts about these are frequently mentioned in newspapers but rarely stored as records in national databases. Therefore, it is difficult to estimate the human and material losses of those events. This work presents two ways to extract information from digital news, using natural language techniques for distilling the text and the national gazetteer codes to achieve placename-attribute disambiguation. Two outputs are presented: a general one that exposes highly relevant news, and another that attaches attributes of interest to placenames. The latter achieved a 75% rate of thematic relevance under qualitative analysis.
Being Omnipresent To Be Almighty: The Importance of The Global Web Evidence for Organizational Expert Finding
Modern expert finding algorithms are developed under the assumption that all possible expertise evidence for a person is concentrated in the company that currently employs the person. The evidence that can be acquired outside of an enterprise traditionally goes unnoticed. At the same time, the Web is full of personal information that is sufficiently detailed to judge a person's skills and knowledge. In this work, we review various sources of expertise evidence outside of an organization and experiment with rankings built on the data acquired from six different sources, accessible through the APIs of two major web search engines. We show that these rankings and their combinations are often more realistic and of higher quality than rankings built on organizational data only.
Financial news analysis using a semantic web approach
In this paper we present StockWatcher, an OWL-based web application that enables the extraction of relevant news items from RSS feeds concerning the NASDAQ-100 listed companies. The application's goal is to present a customized, aggregated view of the news categorized by different topics. We distinguish between four relevant news categories: i) news regarding the company itself, ii) news regarding direct competitors of the company, iii) news regarding important people of the company, and iv) news regarding the industry in which the company is active. At the same time, the system presented in this chapter is able to rate these news items based on their relevance. We identify three possible effects that a news message can have on the company, and thus on the stock price of that company: i) positive, ii) negative, and iii) neutral. Currently, StockWatcher provides support for the NASDAQ-100 companies. The selection of the relevant news items is based on a customizable user portfolio that may consist of one or more of these companies.