675 research outputs found

    BlogForever D2.6: Data Extraction Methodology

    Get PDF
    This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform

    Text Mining Infrastructure in R

    Get PDF
    During the last decade text mining has become a widely used discipline utilizing statistical and machine learning methods. We present the tm package which provides a framework for text mining applications within R. We give a survey on text mining facilities in R and explain how typical application tasks can be carried out using our framework. We present techniques for count-based analysis methods, text clustering, text classification and string kernels.

    The Early Bird Catches The Term: Combining Twitter and News Data For Event Detection and Situational Awareness

    Full text link
    Twitter updates now represent an enormous stream of information originating from a wide variety of formal and informal sources, much of which is relevant to real-world events. In this paper we adapt existing bio-surveillance algorithms to detect localised spikes in Twitter activity corresponding to real events with a high level of confidence. We then develop a methodology to automatically summarise these events, both by providing the tweets which fully describe the event and by linking to highly relevant news articles. We apply our methods to outbreaks of illness and events strongly affecting sentiment. In both case studies we are able to detect events verifiable by third party sources and produce high quality summaries

    Enhanced information retrieval by exploiting recommender techniques in cluster-based link analysis

    Get PDF
    Inspired by the use of PageRank algorithms in document ranking, we develop and evaluate a cluster-based PageRank algorithm to re-rank information retrieval (IR) output with the objective of improving ad hoc search effectiveness. Unlike existing work, our methods exploit recommender techniques to extract the correlation between documents and apply detected correlations in a cluster-based PageRank algorithm to compute the importance of each document in a dataset. In this study two popular recommender techniques are examined in four proposed PageRank models to investigate the effectiveness of our approach. Comparison of our methods with strong baselines demonstrates the solid performance of our approach. Experimental results are reported on an extended version of the FIRE 2011 personal information retrieval (PIR) data collection which includes topically related queries with click-through data and relevance assessment data collected from the query creators. The search logs of the query creators are categorized based on their different topical interests. The experimental results show the significant improvement of our approach compared to results using standard IR and cluster-based PageRank methods

    A series of case studies to enhance the social utility of RSS

    Get PDF
    RSS (really simple syndication, rich site summary or RDF site summary) is a dialect of XML that provides a method of syndicating on-line content, where postings consist of frequently updated news items, blog entries and multimedia. RSS feeds, produced by organisations or individuals, are often aggregated, and delivered to users for consumption via readers. The semi-structured format of RSS also allows the delivery/exchange of machine-readable content between different platforms and systems. Articles on web pages frequently include icons that represent social media services which facilitate social data. Amongst these, RSS feeds deliver data which is typically presented in the journalistic style of headline, story and snapshot(s). Consequently, applications and academic research have employed RSS on this basis. Therefore, within the context of social media, the question arises: can the social function, i.e. utility, of RSS be enhanced by producing from it data which is actionable and effective? This thesis is based upon the hypothesis that the fluctuations in the keyword frequencies present in RSS can be mined to produce actionable and effective data, to enhance the technology's social utility. To this end, we present a series of laboratory-based case studies which demonstrate two novel and logically consistent RSS-mining paradigms. Our first paradigm allows users to define mining rules to mine data from feeds. The second paradigm employs a semi-automated classification of feeds and correlates this with sentiment. We visualise the outputs produced by the case studies for these paradigms, where they can benefit users in real-world scenarios, varying from statistics and trend analysis to mining financial and sporting data. The contributions of this thesis to web engineering and text mining are the demonstration of the proof of concept of our paradigms, through the integration of an array of open-source, third-party products into a coherent and innovative, alpha-version prototype software implemented in a Java JSP/servlet-based web application architecture

    Multifaceted Geotagging for Streaming News

    Get PDF
    News sources on the Web generate constant streams of information, describing the events that shape our world. In particular, geography plays a key role in the news, and understanding the geographic information present in news allows for its useful spatial browsing and retrieval. This process of understanding is called geotagging, and involves first finding in the document all textual references to geographic locations, known as toponyms, and second, assigning the correct lat/long values to each toponym, steps which are termed toponym recognition and toponym resolution, respectively. These steps are difficult due to ambiguities in natural language: some toponyms share names with non-location entities, and further, a given toponym can have many location interpretations. Removing these ambiguities is crucial for successful geotagging. To this end, geotagging methods are described which were developed for streaming news. First, a spatio-textual search engine named STEWARD, and an interactive map-based news browsing system named NewsStand are described, which feature geotaggers as central components, and served as motivating systems and experimental testbeds for developing geotagging methods. Next, a geotagging methodology is presented that follows a multifaceted approach involving a variety of techniques. First, a multifaceted toponym recognition process is described that uses both rule-based and machine learning–based methods to ensure high toponym recall. Next, various forms of toponym resolution evidence are explored. One such type of evidence is lists of toponyms, termed comma groups, whose toponyms share a common thread in their geographic properties that enables correct resolution. In addition to explicit evidence, authors take advantage of the implicit geographic knowledge of their audiences. Understanding the local places known by an audience, termed its local lexicon, affords great performance gains when geotagging articles from local newspapers, which account for the vast majority of news on the Web. Finally, considering windows of text of varying size around each toponym, termed adaptive context, allows for a tradeoff between geotagging execution speed and toponym resolution accuracy. Extensive experimental evaluations of all the above methods, using existing and two newly-created, large corpora of streaming news, show great performance gains over several competing prominent geotagging methods
    corecore