BlogForever D2.4: Weblog spider prototype and associated methodology
The purpose of this document is to present the evaluation of different solutions for capturing blogs and the established methodology, and to describe the developed blog spider prototype
Keeping Up To Date with IP News Services and Blogs: Drowning in a Sea Of Sameness?
It seems like so many IP related Websites you visit invite you to join their free email list to keep you up to date. Sources span a wide spectrum including governmental organizations, non-governmental organizations, educational institutions, consulting services, law firms, commercial publishers and more. These sources span the spectrum from free, to low fee to premium pricing. With all of this information overload and choices, how do you differentiate and choose news sources?
The goals of this article are twofold. Goal one is to present a survey of types and categories of IP news tools available to IP researchers. Since these tools change with time, goal two is to present strategies and approaches to consider when assembling your portfolio of news sources. I use the term researcher to include anyone looking for news, including lawyers, paraprofessionals, academics, students, corporate searchers and more. Some of this material may be yesterday's news for some and breaking news for others. My hope is that you will find value added in some tools and strategies.
Before I present the survey of tools, I want to propose some initial general strategies that might be helpful to apply as the details of the tools unfold
Coping with noise in a real-world weblog crawler and retrieval system
In this paper we examine the effects of noise when creating a real-world weblog corpus for information retrieval. We focus on the DiffPost (Lee et al. 2008) approach to noise removal from blog pages, examining the difficulties encountered when crawling the blogosphere during the creation of a real-world corpus of blog pages. We introduce and evaluate a number of enhancements to the original DiffPost approach in order to increase the robustness of the algorithm. We then extend DiffPost by looking at the anchor-text to text ratio, and discover that the time-interval between crawls is more important to the successful application of noise-removal algorithms within the blog context than any additional improvements to the removal algorithm itself
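As an illustration of the kind of noise removal this abstract refers to, the sketch below assumes the common description of DiffPost: two pages from the same blog are compared line by line, and lines that appear in both (navigation, headers, footers) are treated as template noise. The function name `diffpost_like` and the sample pages are hypothetical, and the paper's own enhancements (such as the anchor-text to text ratio) are not modelled here.

```python
def diffpost_like(page_a: str, page_b: str) -> tuple[list[str], list[str]]:
    """Drop lines shared by two pages of the same blog (assumed template noise).

    This is a simplified sketch, not the published DiffPost implementation.
    """
    # Normalise whitespace and ignore empty lines.
    lines_a = [line.strip() for line in page_a.splitlines() if line.strip()]
    lines_b = [line.strip() for line in page_b.splitlines() if line.strip()]

    # Lines present in both pages are assumed to be boilerplate.
    shared = set(lines_a) & set(lines_b)

    # Keep only the lines unique to each page, preserving their order.
    content_a = [line for line in lines_a if line not in shared]
    content_b = [line for line in lines_b if line not in shared]
    return content_a, content_b


# Hypothetical text renderings of two posts crawled from the same blog.
post_one = "My Blog\nHome | About\nFirst post about crawling\nCopyright 2008"
post_two = "My Blog\nHome | About\nSecond post about retrieval\nCopyright 2008"

clean_one, clean_two = diffpost_like(post_one, post_two)
```

A sketch like this also hints at why crawl timing matters: if both pages are fetched while the same sidebar or "recent posts" list is displayed, that list is shared and removed, whereas pages fetched far apart may no longer share it.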
Design Patterns for Fusion-Based Object Retrieval
We address the task of ranking objects (such as people, blogs, or verticals)
that, unlike documents, do not have direct term-based representations. To be
able to match them against keyword queries, evidence needs to be amassed from
documents that are associated with the given object. We present two design
patterns, i.e., general reusable retrieval strategies, which are able to
encompass most existing approaches from the past. One strategy combines
evidence on the term level (early fusion), while the other does it on the
document level (late fusion). We demonstrate the generality of these patterns
by applying them to three different object retrieval tasks: expert finding,
blog distillation, and vertical ranking. Comment: Proceedings of the 39th
European conference on Advances in Information Retrieval (ECIR '17), 201
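The difference between the two patterns named in this abstract can be shown with a minimal sketch. It assumes a toy scoring function (an add-one smoothed query likelihood) and a hypothetical mapping from objects to their associated documents; the paper's actual retrieval models and weighting schemes are not reproduced here.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    return text.lower().split()


def query_log_likelihood(query_terms: list[str], tokens: list[str]) -> float:
    """Add-one smoothed log-likelihood of the query under a bag of tokens."""
    counts = Counter(tokens)
    denominator = len(tokens) + len(counts) + 1
    return sum(math.log((counts[t] + 1) / denominator) for t in query_terms)


def early_fusion(query: str, docs_by_object: dict[str, list[str]]) -> dict[str, float]:
    """Term-level fusion: pool each object's documents, then score once."""
    q = tokenize(query)
    return {
        obj: query_log_likelihood(q, [t for d in docs for t in tokenize(d)])
        for obj, docs in docs_by_object.items()
    }


def late_fusion(query: str, docs_by_object: dict[str, list[str]]) -> dict[str, float]:
    """Document-level fusion: score each document, then sum per object."""
    q = tokenize(query)
    return {
        obj: sum(math.exp(query_log_likelihood(q, tokenize(d))) for d in docs)
        for obj, docs in docs_by_object.items()
    }


# Hypothetical blogs, each represented only by its posts.
blogs = {
    "blog_a": ["information retrieval evaluation", "retrieval models for blogs"],
    "blog_b": ["cooking pasta recipes", "garden tips"],
}

early = early_fusion("retrieval", blogs)
late = late_fusion("retrieval", blogs)
```

In both cases the objects are ranked by descending score. The scores from the two functions are on different scales (a log-likelihood versus a sum of likelihoods), so they are comparable within a pattern but not across patterns.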
BlogForever: D2.5 Weblog Spam Filtering Report and Associated Methodology
This report is written as a first attempt to define the BlogForever spam detection strategy. It comprises a survey of weblog spam technology and approaches to its detection. While the report was written to help identify possible approaches to spam detection as a component within the BlogForever software, the discussion has been extended to include observations related to the historical, social and practical value of spam, and proposals of other ways of dealing with spam within the repository without necessarily removing it. It contains a general overview of spam types, ready-made anti-spam APIs available for weblogs, possible methods that have been suggested for preventing the introduction of spam into a blog, and research related to spam, focusing on work that addresses the weblog context. It concludes with a proposal for a spam detection workflow that might form the basis for the spam detection component of the BlogForever software
Realization of Semantic Atom Blog
Weblogs are used as a collaborative platform to publish and share
information. The information accumulated in a blog intrinsically contains
knowledge, and knowledge shared by a community of people has intangible value.
The blog is viewed as a multimedia information resource available
on the Internet. In a blog, information in the form of text, image, audio and
video builds up exponentially. The multimedia information contained in an Atom
blog lacks the capability that software processes require in order for
Atom blog content to be accessed, processed and reused over the
Internet. This shortcoming is addressed by exploring OWL knowledge modeling,
semantic annotation and semantic categorization techniques in an Atom blog
sphere. By adopting these techniques, futuristic Atom blogs can be created and
deployed over the Internet
Academics' online presence guidelines: A four step guide to taking control of your visibility
OpenUCT published Academics' online presence guidelines: A four step guide to taking control of your visibility in 2012
BlogForever D2.6: Data Extraction Methodology
This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform