Self-supervised automated wrapper generation for weblog data extraction
Data extraction from the web is notoriously hard. Of the types of resources available on the web, weblogs are becoming increasingly important due to the continued growth of the blogosphere, but remain poorly explored. Past approaches to data extraction from weblogs have often involved manual intervention and suffer from low scalability. This paper proposes a fully automated information extraction methodology based on the use of web feeds and processing of HTML. The approach includes a model for generating a wrapper that exploits web feeds for deriving a set of extraction rules automatically. Instead of performing a pairwise comparison between posts, the model matches the values of the web feeds against their corresponding HTML elements retrieved from multiple weblog posts. It adopts a probabilistic approach for deriving a set of rules and automating the process of wrapper generation. An evaluation of the model is conducted on a dataset of 2,393 posts and the results (92% accuracy) show that the proposed technique enables robust extraction of weblog properties and can be applied across the blogosphere for applications such as improved information retrieval and more robust web preservation initiatives
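The feed-to-HTML matching idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the element path format (tag plus class), the exact-text match, and the use of a simple relative frequency in place of the paper's probabilistic model are all assumptions made here, and the names `PathTextCollector` and `derive_rules` are hypothetical.

```python
from collections import Counter, defaultdict
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are kept off the path stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class PathTextCollector(HTMLParser):
    """Collect (path, text) pairs; a path is the chain of open tags (hypothetical format)."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        cls = dict(attrs).get("class")
        self.stack.append(f"{tag}.{cls}" if cls else tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, tolerating sloppy HTML.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].split(".")[0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.pairs.append(("/".join(self.stack), text))


def derive_rules(posts):
    """posts: list of (html, feed_values); feed_values maps a property name to its feed value.

    Returns {property: (best_path, share_of_posts_supporting_it)}.
    The share is a plain frequency, used here as a stand-in for a probability.
    """
    votes = defaultdict(Counter)
    for html, feed_values in posts:
        collector = PathTextCollector()
        collector.feed(html)
        collector.close()
        for prop, value in feed_values.items():
            # Each path gets at most one vote per post.
            paths = {p for p, text in collector.pairs if text == value.strip()}
            for path in paths:
                votes[prop][path] += 1
    rules = {}
    for prop, counter in votes.items():
        path, hits = counter.most_common(1)[0]
        rules[prop] = (path, hits / len(posts))
    return rules


posts = [
    ('<html><body><h1 class="title">Hello</h1><span class="by">Ann</span>'
     '<p>Hello</p></body></html>', {"title": "Hello", "author": "Ann"}),
    ('<html><body><h1 class="title">Second</h1><span class="by">Bob</span>'
     '<p>text</p></body></html>', {"title": "Second", "author": "Bob"}),
]
rules = derive_rules(posts)
```

Voting across several posts is what separates the real title element from a paragraph that merely repeats the title text in one post.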
The Blogosphere at a Glance — Content-Based Structures Made Simple
A network representation based on a basic word-overlap
similarity measure between blogs is introduced.
The simplicity of the representation renders
it computationally tractable, transparent and insensitive
to representation-dependent artifacts. Using
Swedish blog data, we demonstrate that the representation,
in spite of its simplicity, manages to capture
important structural properties of the content
in the blogosphere. First, blogs that treat similar
subjects are organized in distinct network clusters.
Second, the network is hierarchically organized as
clusters in turn form higher-order clusters: a compound
structure reminiscent of a blog taxonomy
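A word-overlap network of this kind can be sketched in a few lines. The abstract does not specify the measure, so Jaccard overlap of word sets is assumed here, as are the tokenization, the threshold value, and the function names.

```python
from itertools import combinations


def word_set(text):
    """Lowercased alphabetic tokens of a blog's text (a deliberately crude tokenizer)."""
    return {w for w in text.lower().split() if w.isalpha()}


def overlap(a, b):
    """Jaccard overlap of two word sets; assumed here, not taken from the paper."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_network(blogs, threshold=0.2):
    """blogs: {name: text}. Returns {(u, v): score} for pairs at or above the threshold."""
    vocab = {name: word_set(text) for name, text in blogs.items()}
    edges = {}
    for u, v in combinations(sorted(vocab), 2):
        score = overlap(vocab[u], vocab[v])
        if score >= threshold:
            edges[(u, v)] = score
    return edges


blogs = {
    "a": "knitting wool yarn pattern",
    "b": "wool yarn pattern needles",
    "c": "football league goals",
}
edges = similarity_network(blogs)
```

Clusters of blogs on similar subjects would then be sought in the resulting weighted graph with any community-detection method.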
Coping with noise in a real-world weblog crawler and retrieval system
In this paper we examine the effects of noise when creating a real-world weblog corpus for information retrieval. We focus on the DiffPost (Lee et al. 2008) approach to noise removal from blog pages, examining the difficulties encountered when crawling the blogosphere during the creation of a real-world corpus of blog pages. We introduce and evaluate a number of enhancements to the original DiffPost approach in order to increase the robustness of the algorithm. We then extend DiffPost by looking at the anchor-text to text ratio, and discover that the time-interval between crawls is more important to the successful application of noise-removal algorithms within the blog context than any additional improvements to the removal algorithm itself
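The anchor-text to text ratio mentioned in this abstract can be computed as in the sketch below. How the authors combine the ratio with DiffPost is not stated in the abstract, so this shows only the ratio itself; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class AnchorRatio(HTMLParser):
    """Count visible characters inside <a> elements versus all visible characters."""

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0
        self.anchor_chars = 0
        self.total_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth:
            self.anchor_depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.total_chars += n
        if self.anchor_depth:
            self.anchor_chars += n


def anchor_text_ratio(fragment):
    """Share of a fragment's text that is link text; 0.0 for a fragment with no text."""
    parser = AnchorRatio()
    parser.feed(fragment)
    parser.close()  # flush any buffered trailing text
    return parser.anchor_chars / parser.total_chars if parser.total_chars else 0.0


menu = '<ul><li><a href="#">Home</a></li><li><a href="#">About</a></li></ul>'
body = '<p>A long post body with <a href="#">one</a> link.</p>'
menu_ratio = anchor_text_ratio(menu)
body_ratio = anchor_text_ratio(body)
```

A navigation block scores close to 1, while ordinary post text scores close to 0, which is why a high ratio is a plausible signal of page noise.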
Topic-dependent sentiment analysis of financial blogs
While most work in sentiment analysis in the financial domain has focused on the use of content from traditional finance news, in this work we concentrate on more subjective sources of information, blogs. We aim to automatically determine the sentiment of financial bloggers towards companies and their stocks. To do this we develop a corpus of financial blogs, annotated with polarity of sentiment with respect to a number of companies. We conduct an analysis of the annotated corpus, from which we show there is a significant level of topic shift within this collection, and also illustrate the difficulty that human annotators have when annotating certain sentiment categories. To deal with the problem of topic shift within blog articles, we propose text extraction techniques to create topic-specific sub-documents, which we use to train a sentiment classifier. We show that such approaches provide a substantial improvement over full document classification and that word-based approaches perform better than sentence-based or paragraph-based approaches
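A word-based sub-document of the kind this abstract describes might be built as in the sketch below. The window size, the punctuation handling, and the function name are assumptions; the abstract does not give the authors' extraction details.

```python
def word_window_subdocument(text, topic_terms, window=5):
    """Keep only the words within `window` positions of any mention of a topic term.

    This is an assumed, simplified form of word-based extraction: mentions are
    matched case-insensitively after stripping surrounding punctuation.
    """
    words = text.split()
    terms = {t.lower() for t in topic_terms}
    keep = set()
    for i, word in enumerate(words):
        if word.lower().strip(".,;:!?()\"'") in terms:
            start = max(0, i - window)
            stop = min(len(words), i + window + 1)
            keep.update(range(start, stop))
    return " ".join(words[i] for i in sorted(keep))


post = ("Markets fell today. Acme shares look strong after earnings. "
        "Meanwhile Globex is struggling badly this quarter.")
acme_doc = word_window_subdocument(post, ["Acme"], window=2)
```

The resulting sub-document, rather than the full post, would then be passed to the sentiment classifier for the company in question.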
Contextualizing the blogosphere: A comparison of traditional and novel user interfaces for the web
In this paper, we investigate how contextual user interfaces affect blog reading experience. Based on a review of previous research, we argue why and how contextualization may result in (H1) enhanced blog reading experiences. In an eyetracking experiment, we tested 3 different web-based user interfaces for information spaces. The StarTree interface (by Inxight) and the Focus-Metaphor interface are compared with a standard blog interface. Information tasks have been used to evaluate and compare task performance and user satisfaction between these three interfaces. We found that both contextual user interfaces clearly outperformed the traditional blog interface in both task performance and user satisfaction. © 2007 Laqua, S., Ogbechie, N. and Sasse, M. A
The Tumblarians
This paper examines the tumblarians as an information community and discusses community membership, information behaviours, and complementary models for a situated understanding of this unique personal-professional community. A review of the literature concerning LIS bloggers is presented as a complement to the tumblarians, who have no in-depth treatment in the research as yet. Characteristics particular to the tumblarians are explored through informal conversation with a community member, and Fisher, Unruh, and Durrance's (2003) information communities model is employed to provide a deeper understanding of the information behaviour of the tumblarians. This paper offers suggestions for future research based on the preliminary findings of the tumblarians as LIS bloggers and a virtual community
Models of Social Groups in Blogosphere Based on Information about Comment Addressees and Sentiments
This work concerns the analysis of number, sizes and other characteristics of
groups identified in the blogosphere using a set of models identifying social
relations. These models differ regarding identification of social relations,
influenced by methods of classifying the addressee of a comment (either the
post author or the author of the comment that this comment directly
addresses) and by a sentiment calculated for comments considering the
statistics of words present and connotation. The state of a selected blog
portal was analyzed in sequential, partly overlapping time intervals. Groups in
each interval were identified using a version of the CPM algorithm; on the
basis of these, stable groups, existing for at least a minimal assumed duration
of time, were identified. Comment: Gliwa B., Koźlak J., Zygmunt A., Models of Social Groups in
Blogosphere Based on Information about Comment Addressees and Sentiments, in
the K. Aberer et al. (Eds.): SocInfo 2012, LNCS 7710, pp. 475-488, Best Paper
Awar
- …