25 research outputs found
Recommended from our members
WIDIT in TREC-2005 HARD, Robust, and SPAM tracks
The Web Information Discovery Tool (WIDIT) Laboratory at the Indiana University School of Library and Information Science participated in the HARD, Robust, and SPAM tracks in TREC-2005. The basic approach of WIDIT is to combine multiple methods as well as to leverage multiple sources of evidence. Our main strategies for the tracks were query expansion and fusion optimization for the HARD and Robust tracks, and a combination of probabilistic, rule-based, pattern-based, and blacklist email filters for the SPAM track.
Web Page Retrieval by Combining Evidence
The participation of the REINA Research Group in WebCLEF 2005 focused on the monolingual mixed task. Queries, or topics, are of two types: named pages and home pages. For both, we first perform a search by thematic content; for the same query, we also search several information elements of every page (title, some meta tags, anchor text) and then combine the results. For queries about home pages, we try to detect them using a method based on certain keywords and their patterns of use. Afterwards, the results of the thematic content retrieval are re-ranked based on PageRank and centrality coefficients.
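The abstract above does not say how the per-element results are combined. As a rough illustration only, the sketch below assumes a weighted sum of min-max normalized scores (a CombSUM-style fusion). The field names, weights, scores, and function names are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def min_max_normalize(scores):
    """Rescale a {doc_id: score} mapping to the [0, 1] range."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: treat every document as equally strong.
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def combine_field_results(field_results, field_weights):
    """Fuse per-field result lists with a weighted sum of normalized scores."""
    fused = defaultdict(float)
    for field, scores in field_results.items():
        weight = field_weights.get(field, 1.0)
        for doc, s in min_max_normalize(scores).items():
            fused[doc] += weight * s
    # Highest fused score first; ties broken by document id.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical retrieval scores for one query, one dict per page element.
field_results = {
    "body": {"p1": 2.0, "p2": 4.0, "p3": 1.0},
    "title": {"p1": 3.0, "p3": 1.0},
    "anchor": {"p1": 1.0, "p2": 0.5},
}
# Assumed weights: title and anchor matches count more than body matches.
field_weights = {"body": 1.0, "title": 2.0, "anchor": 1.5}

ranking = combine_field_results(field_results, field_weights)
for doc, score in ranking:
    print(doc, round(score, 3))
```

A later re-ranking step, such as one based on link-analysis scores, could be applied to `ranking` in the same way by adding another weighted component.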
WIDIT in TREC-2006 Blog track
The Web Information Discovery Integrated Tool (WIDIT) Laboratory at the Indiana University School of Library and Information Science participated in the Blog track's opinion task in TREC-2006. The goal of the opinion task is to "uncover the public sentiment towards a given entity/target", which involves not only retrieving topically relevant blogs but also identifying those that contain opinions about the target. To further complicate the matter, the blog test collection contains a considerable amount of noise, such as blogs with non-English content and non-blog content (e.g., advertisements, navigational text), which may misdirect retrieval systems.
Based on our hypothesis that noise reduction (e.g., exclusion of non-English blogs and navigational text) will improve both on-topic and opinion retrieval performance, we explored various noise reduction approaches that can effectively eliminate the noise in blog data without inadvertently excluding valid content. After creating two separate indexes (with and without noise) to assess the noise reduction effect, we tackled the opinion blog retrieval task by breaking it down into two sequential subtasks: on-topic retrieval followed by opinion classification. Our opinion retrieval approach was to first apply traditional IR methods to retrieve on-topic blogs, and then boost the ranks of opinionated blogs based on opinion scores generated by opinion assessment methods. Our opinion module consists of the Opinion Term Module, which identifies opinions based on the frequency of opinion terms (i.e., terms that occur frequently only in opinion blogs); the Rare Term Module, which uses uncommon/rare terms (e.g., "sooo good") for opinion classification; the IU Module, which uses IU (I and you) collocations; and the Adjective-Verb Module, which uses computational linguistics' distribution similarity approach to learn the subjective language from training data.
WIDIT in TREC-2007 Blog Track: Combining Lexicon-based Methods to Detect Opinionated Blogs
In TREC-2007, Indiana University's WIDIT Lab participated in the Blog track's opinion task and the polarity subtask. For the opinion task, whose goal is to "uncover the public sentiment towards a given entity/target", we focused on combining multiple sources of evidence to detect opinionated blog postings. Since detecting opinionated blogs on a given topic (i.e., entity/target) involves not only retrieving topically relevant blogs but also identifying those that contain opinions about the target, our approach to the opinion finding task consisted of first applying traditional IR methods to retrieve on-topic blogs and then boosting the ranks of opinionated blogs based on combined opinion scores generated by multiple opinion detection methods. The key idea underlying our opinion detection method is to rely on a variety of complementary evidence rather than trying to optimize a single approach. This fusion approach to opinionated blog detection is motivated by our past experience, which suggested that no single approach, whether lexicon-based or classifier-driven, is well suited to the blog opinion retrieval task. To accomplish the polarity subtask, which requires classifying the retrieved blogs as positive or negative in orientation, our opinion detection module was extended to generate polarity scores to be used for polarity determination.
Fusion Approach to Finding Opinionated Blogs
In this paper, we describe a fusion approach to finding opinionated blog postings. Our approach to opinion blog retrieval consisted of first applying traditional IR methods to retrieve on-topic blogs and then boosting the ranks of opinionated blogs based on combined opinion scores generated by multiple assessment methods. Our opinion module is composed of the Opinion Term Module, which identifies opinions based on the frequency of opinion terms (i.e., terms that occur frequently in opinion blogs); the Rare Term Module, which uses uncommon/rare terms (e.g., "sooo good") for opinion classification; the IU Module, which uses IU (I and you) collocations; and the Adjective-Verb Module, which uses computational linguistics' distribution similarity approach to learn the subjective language from training data.
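The two-stage idea described in this abstract (retrieve on-topic blogs, then boost opinionated ones with a combined opinion score) can be sketched as follows. The abstract does not give the fusion formula, so the weighted average of module scores, the linear interpolation with `alpha`, and every score and weight below are assumptions made for illustration, not the authors' method.

```python
def combined_opinion_score(module_scores, module_weights):
    """Weighted average of per-module opinion scores (each assumed in [0, 1])."""
    total_weight = sum(module_weights[name] for name in module_scores)
    if total_weight == 0:
        # No module produced a score for this document.
        return 0.0
    weighted = sum(module_weights[name] * score
                   for name, score in module_scores.items())
    return weighted / total_weight


def boost_ranks(topic_scores, opinion_scores, module_weights, alpha=0.5):
    """Re-rank on-topic results by mixing topic and opinion evidence.

    alpha is a hypothetical mixing weight: 0 keeps the topical ranking,
    1 ranks purely by the combined opinion score.
    """
    reranked = []
    for doc, topic in topic_scores.items():
        opinion = combined_opinion_score(opinion_scores.get(doc, {}),
                                         module_weights)
        reranked.append((doc, (1 - alpha) * topic + alpha * opinion))
    return sorted(reranked, key=lambda item: (-item[1], item[0]))


# Equal weights for the four modules named in the abstract (an assumption).
module_weights = {"opinion_term": 1.0, "rare_term": 1.0,
                  "iu": 1.0, "adjective_verb": 1.0}

# Hypothetical first-stage (on-topic) scores for three blog posts.
topic_scores = {"b1": 0.9, "b2": 0.8, "b3": 0.4}

# Hypothetical per-module opinion scores for the same posts.
opinion_scores = {
    "b1": {"opinion_term": 0.1, "rare_term": 0.0, "iu": 0.1, "adjective_verb": 0.2},
    "b2": {"opinion_term": 0.8, "rare_term": 0.6, "iu": 0.9, "adjective_verb": 0.7},
    "b3": {"opinion_term": 0.5, "rare_term": 0.5, "iu": 0.5, "adjective_verb": 0.5},
}

ranking = boost_ranks(topic_scores, opinion_scores, module_weights)
for doc, score in ranking:
    print(doc, round(score, 3))
```

In this toy data, `b2` overtakes `b1` after boosting because it is nearly as topical but far more opinionated; how aggressively that happens depends entirely on the assumed `alpha`.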
Semi-Supervised Learning For Identifying Opinions In Web Content
Thesis (Ph.D.) - Indiana University, Information Science, 2011
Opinions published on the World Wide Web (Web) offer opportunities for detecting personal attitudes regarding topics, products, and services. The opinion detection literature indicates that both a large body of opinions and a wide variety of opinion features are essential for capturing subtle opinion information. Although a large amount of opinion-labeled data is preferable for opinion detection systems, opinion-labeled data is often limited, especially at sub-document levels, and manual annotation is tedious, expensive, and error-prone. This shortage of opinion-labeled data is less challenging in some domains (e.g., movie reviews) than in others (e.g., blog posts). While a simple method for improving accuracy in challenging domains is to borrow opinion-labeled data from a non-target data domain, this approach often fails because of the domain transfer problem: opinion detection strategies designed for one data domain generally do not perform well in another domain. However, while it is difficult to obtain opinion-labeled data, unlabeled user-generated opinion data are readily available. Semi-supervised learning (SSL) requires only limited labeled data to automatically label unlabeled data and has achieved promising results in various natural language processing (NLP) tasks, including traditional topic classification, but SSL has been applied in only a few opinion detection studies. This study investigates the application of four different SSL algorithms to three types of Web content: edited news articles, semi-structured movie reviews, and the informal and unstructured content of the blogosphere. SSL algorithms are also evaluated for their effectiveness in sparse data situations and domain adaptation. Research findings suggest that, when there is limited labeled data, SSL is a promising approach for opinion detection in Web content. Although the contributions of SSL varied across data domains, significant improvement was demonstrated for the most challenging data domain, the blogosphere, when a domain transfer-based SSL strategy was implemented.
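The abstract does not name the four SSL algorithms that were studied. To make the general idea concrete (a small labeled set is used to label unlabeled data automatically), the sketch below uses self-training, one widely known SSL scheme, with a tiny naive Bayes classifier written from scratch. The classifier, the confidence threshold, and the toy sentences are all assumptions for illustration and are not drawn from the thesis.

```python
import math
from collections import Counter


def train_nb(labeled):
    """Fit a multinomial naive Bayes model with add-one smoothing."""
    word_counts, doc_counts, vocab = {}, Counter(), set()
    for text, label in labeled:
        tokens = text.lower().split()
        word_counts.setdefault(label, Counter()).update(tokens)
        doc_counts[label] += 1
        vocab.update(tokens)
    return word_counts, doc_counts, vocab


def predict_proba(model, text):
    """Return {label: posterior probability}; unseen words are ignored."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for label, counts in word_counts.items():
        total = sum(counts.values())
        score = math.log(doc_counts[label] / total_docs)
        for tok in text.lower().split():
            if tok in vocab:
                score += math.log((counts[tok] + 1) / (total + len(vocab)))
        log_scores[label] = score
    peak = max(log_scores.values())
    exps = {label: math.exp(s - peak) for label, s in log_scores.items()}
    norm = sum(exps.values())
    return {label: v / norm for label, v in exps.items()}


def self_train(labeled, unlabeled, threshold=0.75, max_rounds=10):
    """Repeatedly add confidently pseudo-labeled texts to the training set."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        model = train_nb(labeled)
        confident, remaining = [], []
        for text in pool:
            probs = predict_proba(model, text)
            label = max(probs, key=probs.get)
            if probs[label] >= threshold:
                confident.append((text, label))
            else:
                remaining.append(text)
        if not confident:
            break  # Nothing new was learned in this round.
        labeled.extend(confident)
        pool = remaining
    return train_nb(labeled), labeled


# Toy data: two labeled seeds and three unlabeled sentences (all invented).
seed = [("i love this great movie", "opinion"),
        ("the company reported quarterly earnings", "factual")]
unlabeled = ["i love it",
             "the company reported earnings today",
             "great movie i think"]

model, final_labeled = self_train(seed, unlabeled)
for text, label in final_labeled:
    print(label, "|", text)
```

Self-training can reinforce its own early mistakes, which is one reason a confidence threshold is used here; the value 0.75 is arbitrary.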
Fusion Approach to Finding Opinions in Blogosphere
In this paper, we describe a fusion approach to finding opinions about a given target in blog postings. We tackled the opinion blog retrieval task by breaking it down into two sequential subtasks: on-topic retrieval followed by opinion classification. Our opinion retrieval approach was to first apply traditional IR methods to retrieve on-topic blogs, and then boost the ranks of opinionated blogs using combined opinion scores generated by four opinion assessment methods. Our opinion module consists of the Opinion Term Module, which identifies opinions based on the frequency of opinion terms (i.e., terms that occur frequently only in opinion blogs); the Rare Term Module, which uses uncommon/rare terms (e.g., "sooo good") for opinion classification; the IU Module, which uses IU (I and you) collocations; and the Adjective-Verb Module, which uses computational linguistics' distribution similarity approach to learn the subjective language from training data.
This paper was presented by the author(s) at the International Conference on Weblogs and Social Media on March 27, 2007, in Boulder, Colorado, U.S.A. This paper has also been published as: Yang, K., Yu, N., Valerio, A., Zhang, H., & Ke, W. (2007). Fusion approach to finding opinionated blogs. Proceedings of the American Society for Information Science and Technology, 44(1), 1–14. doi: 10.1002/meet.1450440254
Keywords: Opinion Identification, Method Fusion, Rank-boosting, Dynamic Tunin
Opinion mining: Reviewed from word to document level
International audience
Opinion mining is one of the most challenging tasks in the field of information retrieval. The research community has published a number of articles on this topic, but a significant increase in interest has been observed during the past decade, especially after the launch of several online social networks. In this paper, we provide a detailed overview of related work on opinion mining. The following features distinguish our review from similar works: (1) it presents a different perspective on the opinion mining field by discussing the work at different granularity levels (word, sentence, and document levels), which is much needed; (2) it discusses the related work in terms of the challenges of the field of opinion mining; (3) its document-level discussion of the related work gives an overview of the opinion mining task in the blogosphere, one of the most popular online social networks; and (4) it highlights the importance of online social networks for the opinion mining task and other related sub-tasks.
Effectiveness gain of polarity detection through topic domains
National audience
Most work on polarity detection consists of finding negative or positive words in a document using sentiment lexical resources. Some versions of such approaches have performed well, but most rely only on the prior polarity of words and do not exploit their contextual polarity. The sentiment semantics of a term vary from one domain to another. For example, the word "unpredictable" conveys a positive feeling about a movie plot, but the same word conveys a negative feeling in the context of operating a digital camera. In this work, we demonstrate this aspect of sentiment polarity. We use the TREC Blog 2006 data collection with the topics of TREC Blog 2006 and 2007 for experimentation. The results of our experiments showed an improvement (95%) in polarity detection. The conclusion is that context plays a role in the polarity of each word.
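The contrast between prior and contextual polarity can be shown with a toy lexicon lookup. This is only a sketch of the general idea: the lexicon entries, polarity values, domain names, and function names are invented for illustration, and the abstract does not describe the method actually used.

```python
import re

# Hypothetical prior (domain-independent) polarities: +1 positive, -1 negative.
PRIOR_POLARITY = {"unpredictable": -1, "excellent": 1, "blurry": -1}

# Hypothetical domain-specific overrides, following the abstract's example.
DOMAIN_POLARITY = {
    "movies": {"unpredictable": 1},
    "cameras": {"unpredictable": -1},
}


def word_polarity(word, domain):
    """Prefer the domain-specific polarity; fall back to the prior one."""
    domain_lexicon = DOMAIN_POLARITY.get(domain, {})
    if word in domain_lexicon:
        return domain_lexicon[word]
    return PRIOR_POLARITY.get(word, 0)


def document_polarity(text, domain):
    """Sum word polarities over the lowercase alphabetic tokens of a text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(word_polarity(tok, domain) for tok in tokens)


movie_score = document_polarity("The plot was unpredictable and excellent.", "movies")
camera_score = document_polarity("The autofocus is unpredictable.", "cameras")
print(movie_score, camera_score)
```

With a prior-only lexicon, "unpredictable" would count against the movie review as well; the domain override is what flips its contribution.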
Distinguishing the Popularity Between Topics: A System for Up-to-date Opinion Retrieval and Mining in the Web
The constantly increasing amount of opinionated text found on the Web has had a significant impact on the development of sentiment analysis. So far, the majority of comparative studies in this field have focused on analyzing fixed (offline) collections from certain domains, genres, or topics. In this paper, we present an online system for opinion mining and retrieval that is able to discover up-to-date web pages on given topics using focused crawling agents, extract opinionated textual parts from web pages, and estimate their polarity using opinion mining agents. The evaluation of the system on real-world case studies demonstrates that it is appropriate for opinion comparison between topics, since it provides useful indications of popularity based on a relatively small number of web pages. Moreover, it can produce genre-aware results of opinion retrieval, a valuable option for decision-makers.