Analyzing and Predicting Sentiment of Images on the Social Web
In this paper we study the connection between sentiment of images expressed in metadata and their visual content in the social photo sharing environment Flickr. To this end, we consider the bag-of-visual words representation as well as the color distribution of images, and make use of the SentiWordNet thesaurus to extract numerical values for their sentiment from accompanying textual metadata. We then perform a discriminative feature analysis based on information theoretic methods, and apply machine learning techniques to predict the sentiment of images. Our large-scale empirical study on a set of over half a million Flickr images shows a considerable correlation between sentiment and visual features, and promising results towards estimating the polarity of sentiment in images
Combination Methods for Automatic Document Organization
Automatic document classification and clustering are useful for a wide range of applications such as organizing Web, intranet, or portal pages into topic directories, filtering news feeds or mail, focused crawling on the Web or in intranets, and many more. This thesis presents ensemble-based meta methods for supervised learning (i.e., classification based on a small amount of hand-annotated training documents). In addition, we show how these techniques can be carried forward to clustering based on unsupervised learning (i.e., automatic structuring of document corpora without training data). The algorithms are applied in a restrictive manner, i.e., by leaving out some 'uncertain' documents (rather than assigning them to inappropriate topics or clusters with low confidence). We show how restrictive meta methods can be used to combine different document representations in the context of Web document classification and author recognition. As another application for meta methods we study the combination of different information sources in distributed environments, such as peer-to-peer information systems. Furthermore, we address the problem of semi-supervised classification on document collections using retraining. A possible application is focused Web crawling, which may start with very few manually selected training documents but can be enhanced by automatically adding initially unlabeled, positively classified Web pages for retraining. The results of our systematic evaluation on real-world data show the viability of the proposed approaches.
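The restrictive meta-classification idea in the abstract above can be illustrated with a minimal sketch. It assumes one simple combination rule (return the majority label only when enough base classifiers agree, otherwise abstain); the function name `restrictive_vote` and the `min_agreement` parameter are hypothetical and not taken from the thesis.

```python
from collections import Counter


def restrictive_vote(predictions, min_agreement=1.0):
    """Combine base-classifier labels for one document.

    Returns the most frequent label if its share of the votes is at
    least `min_agreement`; otherwise returns None, i.e. the document
    is left out as 'uncertain'. The agreement rule is an assumption
    made for illustration only.
    """
    if not predictions:
        raise ValueError("need at least one base-classifier prediction")
    label, count = Counter(predictions).most_common(1)[0]
    return label if count / len(predictions) >= min_agreement else None


# Unanimous vote: the document is assigned to the agreed topic.
print(restrictive_vote(["sports", "sports", "sports"]))    # sports
# Disagreement under the default (unanimity) rule: abstain.
print(restrictive_vote(["sports", "politics", "sports"]))  # None
```

Lowering `min_agreement` trades fewer left-out documents for a higher risk of wrong assignments.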
PicAlert!: a system for privacy-aware image classification and retrieval
Photo publishing in Social Networks and other Web 2.0 applications has become very popular due to the pervasive availability of cheap digital cameras, powerful batch upload tools and a huge amount of storage space. A portion of uploaded images are of a highly sensitive nature, disclosing many details of the users’ private life. We have developed a web service which can detect private images within a user’s photo stream and provide support in making privacy decisions in the sharing context. In addition, we present a privacy-oriented image search application which automatically identifies potentially sensitive images in the result set and separates them from the remaining picture
Authors’ Addresses
This technical report addresses the problem of automatically structuring linked document collections by using clustering. In contrast to traditional clustering, we study the clustering problem in the light of available link structure information for the data set (e.g., hyperlinks among web documents or co-authorship among bibliographic data entries). Our approach is based on iterative relaxation of cluster assignments, and can be built on top of any clustering algorithm (e.g., k-means or DBSCAN). These techniques result in higher cluster purity, better overall accuracy, and make self-organization more robust. Our comprehensive experiments on three different real-worl
Authors’ Addresses
This paper addresses the problem of semi-supervised classification on document collections using retraining (also called self-training). A possible application is focused Web crawling, which may start with very few manually selected training documents but can be enhanced by automatically adding initially unlabeled, positively classified Web pages for retraining. Such an approach is by itself not robust and faces tuning problems regarding parameters like the number of selected documents, the number of retraining iterations, and the ratio of positively and negatively classified samples used for retraining. The paper develops methods for automatically tuning these parameters, based on predicting the leave-one-out error for a retrained classifier and on preventing the classifier from being diluted by selecting too many or too weak documents for retraining. Our experiments with three different datasets confirm the practical viability of the approach.
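As a rough illustration of the retraining loop described above, the following sketch runs self-training with a toy nearest-centroid classifier on numeric vectors. The classifier, the names `self_train`, `iterations`, and `per_round`, and the fixed parameter values are all assumptions for illustration. The sketch does not implement the paper's automatic parameter tuning via leave-one-out error prediction.

```python
from statistics import mean


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return tuple(mean(col) for col in zip(*points))


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def self_train(pos, neg, unlabeled, iterations=3, per_round=1):
    """Toy self-training: each round, move the most confidently
    classified unlabeled vectors (at most `per_round` per class)
    into the training sets, then recompute the centroids."""
    if per_round < 1:
        raise ValueError("per_round must be at least 1")
    pos, neg, pool = list(pos), list(neg), list(unlabeled)
    for _ in range(iterations):
        if not pool:
            break
        c_pos, c_neg = centroid(pos), centroid(neg)

        def margin(v):
            # Positive margin: v is closer to the positive centroid.
            return sq_dist(v, c_neg) - sq_dist(v, c_pos)

        ranked = sorted(pool, key=margin)
        new_neg = [v for v in ranked[:per_round] if margin(v) < 0]
        new_pos = [v for v in ranked[-per_round:] if margin(v) > 0]
        chosen = new_pos + new_neg
        if not chosen:
            break
        pos += new_pos
        neg += new_neg
        # Remove the selected vectors from the pool (by identity).
        pool = [v for v in pool if not any(v is c for c in chosen)]
    return pos, neg, pool


seed_pos, seed_neg = [(5.0, 5.0)], [(0.0, 0.0)]
pool = [(4.0, 4.5), (0.5, 0.2), (4.8, 5.1), (1.0, 0.5), (2.6, 2.6)]
pos, neg, rest = self_train(seed_pos, seed_neg, pool, iterations=1)
```

The fixed `iterations` and `per_round` values here are exactly the kind of hand-set parameters that the abstract says make plain retraining hard to tune.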
Dear search engine: what’s your opinion about...?: sentiment analysis for semantic enrichment of web search results
Search engines have become the main entry point to Web content, and a large part of the “visible” Web consists of what they present as top retrieved results. Therefore, it would be desirable if the first few results were a representative sample of the entire result set. This paper provides a preliminary study of opinions contained in search engine results for controversial queries such as “cloning” or “immigration”. To this end, we extract sentiment metadata from web pages, and compare search engine results for several queries. Furthermore, we compare opinions expressed in the top results to those in other retrieved results to examine whether the top-ranked pages are a good sample of all results from an opinion perspective. In a preliminary empirical analysis, we compare up to 50 results from 3 commercial search engines on 14 controversial queries to study the relation between sentiments, topics, and rankings.
Restrictive clustering and metaclustering for self-organizing document collections
This paper addresses the problem of automatically structuring heterogeneous document collections by using clustering methods. In contrast to traditional clustering, we study restrictive methods and ensemble-based meta methods that may decide to leave out some documents rather than assigning them to inappropriate clusters with low confidence. These techniques yield higher cluster purity and better overall accuracy, and make unsupervised self-organization more robust. Our comprehensive experimental studies on three different real-world data collections demonstrate these benefits. The proposed methods seem particularly suitable for automatically substructuring personal email folders or personal Web directories that are populated by focused crawlers, and they can be combined with supervised classification techniques.
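One way to picture restrictive clustering is an assignment step that refuses to label ambiguous documents. The sketch below assumes cluster centroids are already available and uses a hypothetical confidence rule: a document is assigned only if its squared distance to the nearest centroid is at most `max_ratio` times that to the second nearest. The rule and the names `restrictive_assign` and `max_ratio` are illustrative assumptions, not the paper's criterion.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def restrictive_assign(points, centroids, max_ratio=0.5):
    """Return one cluster index per point, or None where the point is
    left out because the nearest centroid is not clearly closer than
    the runner-up. Requires at least two centroids."""
    if len(centroids) < 2:
        raise ValueError("need at least two centroids")
    labels = []
    for p in points:
        ranked = sorted((sq_dist(p, c), i) for i, c in enumerate(centroids))
        (best, idx), (second, _) = ranked[0], ranked[1]
        labels.append(idx if best <= max_ratio * second else None)
    return labels


centroids = [(0.0, 0.0), (10.0, 0.0)]
docs = [(1.0, 0.0), (9.0, 0.0), (5.0, 0.0)]
print(restrictive_assign(docs, centroids))  # [0, 1, None]
```

The point halfway between the two centroids is left unassigned, which is the behavior the abstract credits with higher cluster purity.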
Social recommender systems for web 2.0 folksonomies
The rapidly increasing popularity of Web 2.0 knowledge and content sharing systems and the growing amount of shared data make discovering relevant content and finding contacts a difficult enterprise. Typically, folksonomies provide a rich set of structures and social relationships that can be mined for a variety of recommendation purposes. In this paper we propose a formal model to characterize users, items, and annotations in Web 2.0 environments. Our objective is to construct social recommender systems that predict the utility of items, users, or groups based on the multi-dimensional social environment of a given user. Based on this model we introduce recommendation mechanisms for content sharing frameworks. Our comprehensive evaluation shows the viability of our approach and emphasizes the key role of social meta knowledge for constructing effective recommendations in Web 2.0 applications.