2,121 research outputs found

    Self-supervised automated wrapper generation for weblog data extraction

    Data extraction from the web is notoriously hard. Of the types of resources available on the web, weblogs are becoming increasingly important due to the continued growth of the blogosphere, but remain poorly explored. Past approaches to data extraction from weblogs have often involved manual intervention and suffer from low scalability. This paper proposes a fully automated information extraction methodology based on the use of web feeds and processing of HTML. The approach includes a model for generating a wrapper that exploits web feeds for deriving a set of extraction rules automatically. Instead of performing a pairwise comparison between posts, the model matches the values of the web feeds against their corresponding HTML elements retrieved from multiple weblog posts. It adopts a probabilistic approach for deriving a set of rules and automating the process of wrapper generation. An evaluation of the model is conducted on a dataset of 2,393 posts and the results (92% accuracy) show that the proposed technique enables robust extraction of weblog properties and can be applied across the blogosphere for applications such as improved information retrieval and more robust web preservation initiatives
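The matching step described in this abstract can be illustrated with a minimal sketch. It is not the paper's model: the abstract does not specify the probabilistic approach, so simple frequency voting is substituted here, and the names `TextByElement` and `derive_rules`, the `(tag, class)` selector form, and the exact-match comparison are all assumptions made for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class TextByElement(HTMLParser):
    """Record the (tag, class) of the innermost open element for each text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open elements as (tag, class)
        self.found = []   # list of ((tag, class), text)

    def handle_starttag(self, tag, attrs):
        self.stack.append((tag, dict(attrs).get("class") or ""))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; tolerates unclosed tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            self.found.append((self.stack[-1], text))


def derive_rules(posts):
    """posts: list of (feed_entry, html), where feed_entry maps field -> value.

    Returns field -> (selector, share of posts in which that selector matched).
    Frequency voting stands in for the paper's unspecified probabilistic model.
    """
    votes = {}
    for entry, html in posts:
        parser = TextByElement()
        parser.feed(html)
        parser.close()
        for field, value in entry.items():
            matched = {sel for sel, text in parser.found if text == value.strip()}
            votes.setdefault(field, Counter()).update(matched)
    rules = {}
    for field, counter in votes.items():
        if counter:
            selector, hits = counter.most_common(1)[0]
            rules[field] = (selector, hits / len(posts))
    return rules
```

Because votes are pooled across posts, a selector that matches a feed value only by coincidence in one post is outvoted by the selector that matches consistently.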

    The Blogosphere at a Glance — Content-Based Structures Made Simple

    A network representation based on a basic word-overlap similarity measure between blogs is introduced. The simplicity of the representation renders it computationally tractable, transparent and insensitive to representation-dependent artifacts. Using Swedish blog data, we demonstrate that the representation, in spite of its simplicity, manages to capture important structural properties of the content in the blogosphere. First, blogs that treat similar subjects are organized in distinct network clusters. Second, the network is hierarchically organized as clusters in turn form higher-order clusters: a compound structure reminiscent of a blog taxonomy
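A similarity network of this kind might be sketched as follows. The abstract does not define the overlap measure, so Jaccard overlap of word sets is assumed here, and `word_overlap`, `build_network`, and the `threshold` parameter are hypothetical names chosen for illustration.

```python
from itertools import combinations


def word_overlap(a: str, b: str) -> float:
    """Jaccard overlap of the two texts' word sets (an assumed measure)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def build_network(blogs: dict, threshold: float = 0.2) -> dict:
    """Return weighted edges between blogs whose overlap reaches the threshold.

    blogs maps a blog name to its text; edges are keyed by sorted name pairs.
    """
    edges = {}
    for x, y in combinations(sorted(blogs), 2):
        weight = word_overlap(blogs[x], blogs[y])
        if weight >= threshold:
            edges[(x, y)] = weight
    return edges
```

Clusters could then be found by running any community-detection method over the resulting edges; the choice of method is left open here because the abstract does not name one.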

    Coping with noise in a real-world weblog crawler and retrieval system

    In this paper we examine the effects of noise when creating a real-world weblog corpus for information retrieval. We focus on the DiffPost (Lee et al. 2008) approach to noise removal from blog pages, examining the difficulties encountered when crawling the blogosphere during the creation of a real-world corpus of blog pages. We introduce and evaluate a number of enhancements to the original DiffPost approach in order to increase the robustness of the algorithm. We then extend DiffPost by looking at the anchor-text to text ratio, and discover that the time-interval between crawls is more important to the successful application of noise-removal algorithms within the blog context than any additional improvements to the removal algorithm itself
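The anchor-text to text ratio mentioned in this abstract can be computed roughly as below. This is only a sketch of the ratio itself, not of DiffPost or the paper's extension; the names `AnchorRatio` and `anchor_text_ratio` and the character-based counting are assumptions.

```python
from html.parser import HTMLParser


class AnchorRatio(HTMLParser):
    """Count text characters inside <a> elements versus all text characters."""

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0
        self.anchor_chars = 0
        self.total_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth:
            self.anchor_depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.total_chars += n
        if self.anchor_depth:
            self.anchor_chars += n


def anchor_text_ratio(html: str) -> float:
    """Share of text characters that sit inside links; 0.0 for text-free input."""
    parser = AnchorRatio()
    parser.feed(html)
    parser.close()
    if parser.total_chars == 0:
        return 0.0
    return parser.anchor_chars / parser.total_chars
```

A block whose ratio is close to 1 consists almost entirely of links, which is typical of navigation bars and blogrolls rather than post content.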

    Topic-dependent sentiment analysis of financial blogs

    While most work in sentiment analysis in the financial domain has focused on the use of content from traditional finance news, in this work we concentrate on a more subjective source of information: blogs. We aim to automatically determine the sentiment of financial bloggers towards companies and their stocks. To do this we develop a corpus of financial blogs, annotated with polarity of sentiment with respect to a number of companies. We conduct an analysis of the annotated corpus, from which we show there is a significant level of topic shift within this collection, and also illustrate the difficulty that human annotators have when annotating certain sentiment categories. To deal with the problem of topic shift within blog articles, we propose text extraction techniques to create topic-specific sub-documents, which we use to train a sentiment classifier. We show that such approaches provide a substantial improvement over full document classification and that word-based approaches perform better than sentence-based or paragraph-based approaches

    Contextualizing the blogosphere: A comparison of traditional and novel user interfaces for the web

    In this paper, we investigate how contextual user interfaces affect blog reading experience. Based on a review of previous research, we argue why and how contextualization may result in (H1) enhanced blog reading experiences. In an eyetracking experiment, we tested 3 different web-based user interfaces for information spaces. The StarTree interface (by Inxight) and the Focus-Metaphor interface are compared with a standard blog interface. Information tasks have been used to evaluate and compare task performance and user satisfaction between these three interfaces. We found that both contextual user interfaces clearly outperformed the traditional blog interface, both in terms of task performance as well as user satisfaction. © 2007 Laqua, S., Ogbechie, N. and Sasse, M. A

    The Tumblarians

    This paper examines the tumblarians as an information community and discusses community membership, information behaviours, and complementary models for a situated understanding of this unique personal-professional community. A review of the literature concerning LIS bloggers is presented as a complement to the tumblarians, who have no in-depth treatment in the research as yet. Characteristics particular to the tumblarians are explored through informal conversation with a community member, and Fisher, Unruh, and Durrance's (2003) information communities model is employed to provide a deeper understanding of the information behaviour of the tumblarians. This paper offers suggestions for future research based on the preliminary findings of the tumblarians as LIS bloggers and a virtual community

    Models of Social Groups in Blogosphere Based on Information about Comment Addressees and Sentiments

    This work concerns the analysis of the number, sizes and other characteristics of groups identified in the blogosphere using a set of models identifying social relations. These models differ in how they identify social relations: they are influenced by the method of classifying the addressee of a comment (either the post author or the author of the comment to which this comment directly replies) and by a sentiment calculated for comments from the statistics of the words present and their connotation. The state of a selected blog portal was analyzed in sequential, partly overlapping time intervals. Groups in each interval were identified using a version of the CPM algorithm; on this basis, stable groups, existing for at least a minimal assumed duration of time, were identified. Comment: Gliwa B., Koźlak J., Zygmunt A., Models of Social Groups in Blogosphere Based on Information about Comment Addressees and Sentiments, in the K. Aberer et al. (Eds.): SocInfo 2012, LNCS 7710, pp. 475-488, Best Paper Awar