
    Multifaceted Geotagging for Streaming News

    News sources on the Web generate constant streams of information, describing the events that shape our world. In particular, geography plays a key role in the news, and understanding the geographic information present in news allows for its useful spatial browsing and retrieval. This process of understanding is called geotagging, and involves first finding in the document all textual references to geographic locations, known as toponyms, and second, assigning the correct lat/long values to each toponym, steps which are termed toponym recognition and toponym resolution, respectively. These steps are difficult due to ambiguities in natural language: some toponyms share names with non-location entities, and further, a given toponym can have many location interpretations. Removing these ambiguities is crucial for successful geotagging. To this end, geotagging methods are described which were developed for streaming news. First, a spatio-textual search engine named STEWARD, and an interactive map-based news browsing system named NewsStand are described, which feature geotaggers as central components, and served as motivating systems and experimental testbeds for developing geotagging methods. Next, a geotagging methodology is presented that follows a multifaceted approach involving a variety of techniques. First, a multifaceted toponym recognition process is described that uses both rule-based and machine learning–based methods to ensure high toponym recall. Next, various forms of toponym resolution evidence are explored. One such type of evidence is lists of toponyms, termed comma groups, whose toponyms share a common thread in their geographic properties that enables correct resolution. In addition to explicit evidence, authors take advantage of the implicit geographic knowledge of their audiences. 
Understanding the local places known by an audience, termed its local lexicon, affords great performance gains when geotagging articles from local newspapers, which account for the vast majority of news on the Web. Finally, considering windows of text of varying size around each toponym, termed adaptive context, allows for a tradeoff between geotagging execution speed and toponym resolution accuracy. Extensive experimental evaluations of all the above methods, using existing corpora as well as two newly created, large corpora of streaming news, show great performance gains over several competing prominent geotagging methods.
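The two-step pipeline described above can be sketched as follows. This is a minimal illustration, not the dissertation's actual implementation: the toy gazetteer, the capitalization heuristic for recognition, and the population-based tie-breaking for resolution are all assumptions made for the example.

```python
# Toy geotagging pipeline: toponym recognition, then toponym resolution.
import re

# Hypothetical gazetteer: toponym -> list of (lat, lon, population) interpretations.
GAZETTEER = {
    "Paris": [(48.8566, 2.3522, 2_100_000),    # Paris, France
              (33.6609, -95.5555, 25_000)],    # Paris, Texas
    "London": [(51.5074, -0.1278, 8_900_000),  # London, UK
               (42.9849, -81.2453, 404_000)],  # London, Ontario
}

def recognize_toponyms(text):
    """Toponym recognition: find capitalized tokens that appear in the gazetteer."""
    candidates = re.findall(r"\b[A-Z][a-z]+\b", text)
    return [tok for tok in candidates if tok in GAZETTEER]

def resolve_toponym(name):
    """Toponym resolution: pick one interpretation; here, simply the most populous."""
    interpretations = GAZETTEER[name]
    return max(interpretations, key=lambda t: t[2])[:2]  # -> (lat, lon)

article = "Protests continued in Paris while markets in London stayed calm."
for toponym in recognize_toponyms(article):
    print(toponym, resolve_toponym(toponym))
```

A real system replaces the population heuristic with the evidence the abstract enumerates: comma groups, the audience's local lexicon, and adaptive context windows around each toponym.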

    RSS Feeds, Browsing and End-User Engagement

    Despite the vast amount of research that has been devoted separately to the topics of browsing and Really Simple Syndication (RSS) aggregation architecture, little is known about how end-users engage with RSS feeds and how they browse while using a feed aggregator. This study explores the browsing behaviors end-users exhibit when using RSS and Atom feeds. The researcher analyzed end-users’ browsing experiences and discussed browsing variations. The researcher observed, tested, and interviewed eighteen (N=18) undergraduate students at the University of Tennessee to determine how end-users engage with RSS feeds. This study evaluates browsing using two variations of tasks: (1) an implicit task with no final goal, and (2) an explicit task with a final goal. The researcher observed the participants complete the two tasks and conducted exit interviews, which addressed the end-users’ experiences with Google Reader and provided further explanation of browsing behaviors. The researcher analyzed the browsing behaviors based upon Bates’ (2007) definitions and characteristics of browsing. The results of this exploratory research provide insights into end-user interaction with RSS feeds.

    BlogForever D2.6: Data Extraction Methodology

    This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform.
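The feed-based side of such extraction can be illustrated with a short sketch. This assumes a plain RSS 2.0 document and uses only the standard library; real blog feeds vary widely (Atom, namespaced extensions, microformats in the HTML) and the BlogForever approach layers unsupervised learning on top, so this shows only the structured-feed starting point.

```python
# Minimal sketch: pull (title, link, date) records out of an RSS 2.0 feed.
import xml.etree.ElementTree as ET

rss = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example Blog</title>
  <item><title>Post one</title><link>http://example.com/1</link>
        <pubDate>Mon, 01 Jan 2024 00:00:00 GMT</pubDate></item>
  <item><title>Post two</title><link>http://example.com/2</link>
        <pubDate>Tue, 02 Jan 2024 00:00:00 GMT</pubDate></item>
</channel></rss>"""

def extract_posts(feed_xml):
    """Extract one record per <item> element in the feed."""
    root = ET.fromstring(feed_xml)
    return [
        {
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "date": item.findtext("pubDate"),
        }
        for item in root.iter("item")
    ]

for post in extract_posts(rss):
    print(post["title"], "->", post["link"])
```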

    Query by document

    We are experiencing an unprecedented increase in content contributed by users in forums such as blogs, social networking sites, and microblogging services. Such an abundance of content complements content on web sites and traditional media forums such as newspapers, news and financial streams, and so on. Given such a plethora of information, there is a pressing need to cross-reference information across textual services. For example, we commonly read a news item and wonder whether any blogs report related content, or vice versa. In this paper, we present techniques to automate the process of cross-referencing online information content. We introduce methodologies to extract phrases from a given “query document” to be used as queries to search interfaces, with the goal of retrieving content related to the query document. In particular, we consider two techniques to extract and score key phrases. We also consider techniques to complement extracted phrases with information present in external sources such as Wikipedia, and introduce an algorithm called RelevanceRank for this purpose. We discuss both of these techniques in detail and provide an experimental study utilizing a large number of human judges from Amazon’s Mechanical Turk service. Detailed experiments demonstrate the effectiveness and efficiency of the proposed techniques for the task of automating retrieval of documents related to a query document.
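The phrase-extraction step can be sketched with a deliberately simple scorer. The stop-word list, bigram choice, and frequency scoring below are illustrative assumptions; the paper's actual scoring techniques and its RelevanceRank algorithm are more sophisticated.

```python
# Toy key-phrase extraction for a "query document": score bigrams of
# non-stop-word tokens by frequency and keep the top few as queries.
from collections import Counter
import re

STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "on", "for"}

def extract_phrases(text, top_k=3):
    """Return the top_k most frequent bigrams of non-stop-word tokens."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower())
              if t not in STOPWORDS]
    bigrams = Counter(zip(tokens, tokens[1:]))
    return [" ".join(bg) for bg, _ in bigrams.most_common(top_k)]

doc = ("Social networking sites and microblogging services generate content. "
       "Social networking content complements news streams.")
print(extract_phrases(doc))
```

Each returned phrase would then be submitted to a search interface, with external sources such as Wikipedia used to re-weight candidates.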

    Mind the Gap: Newsfeed Visualisation with Metro Maps


    Web Data Extraction, Applications and Techniques: A Survey

    Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of applications. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad-hoc domains. Other approaches, instead, heavily reuse techniques and algorithms developed in the field of Information Extraction. This survey aims at providing a structured and comprehensive overview of the literature in the field of Web Data Extraction. We provide a simple classification framework in which existing Web Data Extraction applications are grouped into two main classes, namely applications at the Enterprise level and at the Social Web level. At the Enterprise level, Web Data Extraction techniques emerge as a key tool for performing data analysis in Business and Competitive Intelligence systems as well as for business process re-engineering. At the Social Web level, Web Data Extraction techniques allow the gathering of large amounts of structured data continuously generated and disseminated by Web 2.0, Social Media, and Online Social Network users, which offers unprecedented opportunities to analyze human behavior at a very large scale. We also discuss the potential of cross-fertilization, i.e., the possibility of reusing Web Data Extraction techniques originally designed to work in a given domain in other domains.
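A common building block in this literature is the wrapper: a parser keyed to a known page template. The sketch below hard-codes such a rule for a hypothetical listing page whose records live in `<li class="product">` elements; wrapper-induction systems instead learn rules like this from examples.

```python
# Toy hand-written wrapper over a known HTML template, stdlib only.
from html.parser import HTMLParser

class ProductWrapper(HTMLParser):
    """Collect the text of every <li class="product"> element."""
    def __init__(self):
        super().__init__()
        self.records = []
        self._in_record = False

    def handle_starttag(self, tag, attrs):
        if tag == "li" and ("class", "product") in attrs:
            self._in_record = True

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_record = False

    def handle_data(self, data):
        if self._in_record and data.strip():
            self.records.append(data.strip())

page = "<ul><li class='product'>Widget A</li><li class='product'>Widget B</li></ul>"
wrapper = ProductWrapper()
wrapper.feed(page)
print(wrapper.records)
```

The fragility of such hand-written rules under template changes is precisely what motivates the learning-based approaches the survey covers.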

    A Study of Information Fragment Association in Information Management and Retrieval Applications

    As we strive to identify useful information, sifting through the vast number of resources available to us, we often find that the desired information resides in a small section within a larger body of content that does not necessarily contain similar information. This can make such an Information Fragment difficult to find. A Web search engine may not give a good ranking to a page of unrelated content if it contains only a very small yet invaluable piece of relevant information. This means that our processes often fail to bring together related Information Fragments. We can easily conceive of two Information Fragments which, according to a scholar, bear a strong association with each other, yet contain no common keywords enabling them to be collocated by a keyword search. This dissertation attempts to address this issue by determining the benefits of enhancing information management and retrieval applications with the capability of establishing and storing associations between Information Fragments. It estimates the extent to which the efficiency and quality of information retrieval can be improved if users are allowed to capture the mental associations they form while reading Information Fragments and share these associations with others using a functional registry-based design. To test these benefits, three subject groups were recruited and assigned tasks involving Information Fragments. The first two tasks compared the performance and usability of a mainstream social bookmarking tool with a tool enhanced with Information Fragment Association capabilities. The tests demonstrated that the use of Information Fragment Association offers significant advantages in both efficiency of retrieval and user satisfaction. Analysis of the results of the third task demonstrated that a mainstream Web search engine performed poorly at collocating interrelated fragments when a query designed to retrieve one of these fragments was submitted. The fourth task demonstrated that Information Fragment Association improves the precision and recall of searches performed on Information Fragment datasets. The results of this study indicate that mainstream information management and retrieval applications provide inadequate support for Information Fragment retrieval and that enhancing them with Information Fragment Association capabilities would be beneficial.
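The registry-based design can be sketched as a symmetric link store keyed by fragment identifiers, so that related fragments are collocated even when they share no keywords. The class and the `doc#anchor` identifier scheme below are illustrative assumptions, not the dissertation's actual design.

```python
# Sketch of an Information Fragment Association registry: users record
# bidirectional links between fragment IDs, which can then be looked up.
from collections import defaultdict

class FragmentRegistry:
    def __init__(self):
        self._links = defaultdict(set)

    def associate(self, frag_a, frag_b):
        """Record a symmetric association between two fragment IDs."""
        self._links[frag_a].add(frag_b)
        self._links[frag_b].add(frag_a)

    def related(self, frag_id):
        """Return every fragment a user has associated with frag_id."""
        return sorted(self._links[frag_id])

registry = FragmentRegistry()
registry.associate("doc1#para3", "doc2#fig1")
registry.associate("doc1#para3", "doc3#sec2")
print(registry.related("doc1#para3"))
```

Because the links are explicit rather than keyword-derived, a lookup retrieves associated fragments that no keyword query could collocate.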