24 research outputs found

    The Role of Cores in Recommender Benchmarking for Social Bookmarking Systems

    Get PDF
    Social bookmarking systems have established themselves as an important part in today’s Web. In such systems, tag recommender systems support users during the posting of a resource by suggesting suitable tags. Tag recommender algorithms have often been evaluated in offline benchmarking experiments. Yet, the particular setup of such experiments has rarely been analyzed. In particular, since the recommendation quality usually suffers from difficulties such as the sparsity of the data or the cold-start problem for new resources or users, datasets have often been pruned to so-called cores (specific subsets of the original datasets), without much consideration of the implications on the benchmarking results. In this article, we generalize the notion of a core by introducing the new notion of a set-core, which is independent of any graph structure, to overcome a structural drawback in the previous constructions of cores on tagging data. We show that problems caused by some types of cores can be eliminated using set-cores. Further, we present a thorough analysis of tag recommender benchmarking setups using cores. To that end, we conduct a large-scale experiment on four real-world datasets, in which we analyze the influence of different cores on the evaluation of recommendation algorithms. We can show that the results of the comparison of different recommendation approaches depends on the selection of core type and level. For the benchmarking of tag recommender algorithms, our results suggest that the evaluation must be set up more carefully and should not be based on one arbitrarily chosen core type and level

    An evaluation of recommendation algorithms for online recipe portals

    Get PDF
    Better models of food preferences are required to realise the oft touted potential of food recommenders to aid with the obesity crisis. Many of the food recommender evaluations in the literature have been performed with small convenience samples, which limits our conidence in the generalisability of the results. In this work we test a range of collaborative iltering (CF) and content-based (CB) recommenders on a large dataset crawled from the web consisting of naturalistic user interaction data over a 15 year period. The results reveal strengths and limitations of diferent approaches. While CF approaches consistently outperform CB approaches when testing on the complete dataset, our experiments show that to improve on CF methods require a large number of users (> 637 when sampling randomly). Moreover the results show diferent facets of recipe content to ofer utility. In particular one of the strongest content related features was a measure of health derived from guidelines from the UK Food Safety Agency. This inding underlines the challenges we face as a community to develop recommender algorithms, which improve the healthfulness of the food people choose to eat.publishedVersio

    Community Detection in Hypergraphen

    Get PDF
    Viele Datensätze können als Graphen aufgefasst werden, d.h. als Elemente (Knoten) und binäre Verbindungen zwischen ihnen (Kanten). Unter dem Begriff der "Complex Network Analysis" sammeln sich eine ganze Reihe von Verfahren, die die Untersuchung von Datensätzen allein aufgrund solcher struktureller Eigenschaften erlauben. "Community Detection" als Untergebiet beschäftigt sich mit der Identifikation besonders stark vernetzter Teilgraphen. Über den Nutzen hinaus, den eine Gruppierung verwandter Element direkt mit sich bringt, können derartige Gruppen zu einzelnen Knoten zusammengefasst werden, was einen neuen Graphen von reduzierter Komplexität hervorbringt, der die Makrostruktur des ursprünglichen Graphen unter Umständen besser hervortreten lässt. Fortschritte im Bereich der "Community Detection" verbessern daher auch das Verständnis komplexer Netzwerke im allgemeinen. Nicht jeder Datensatz lässt sich jedoch angemessen mit binären Relationen darstellen - Relationen höherer Ordnung führen zu sog. Hypergraphen. Gegenstand dieser Arbeit ist die Verallgemeinerung von Ansätzen zur "Community Detection" auf derartige Hypergraphen. Im Zentrum der Aufmerksamkeit stehen dabei "Social Bookmarking"-Datensätze, wie sie von Benutzern von "Bookmarking"-Diensten erzeugt werden. Dabei ordnen Benutzer Dokumenten frei gewählte Stichworte, sog. "Tags" zu. Dieses "Tagging" erzeugt, für jede Tag-Zuordnung, eine ternäre Verbindung zwischen Benutzer, Dokument und Tag, was zu Strukturen führt, die 3-partite, 3-uniforme (im folgenden 3,3-, oder allgemeiner k,k-) Hypergraphen genannt werden. Die Frage, der diese Arbeit nachgeht, ist wie diese Strukturen formal angemessen in "Communities" unterteilt werden können, und wie dies das Verständnis dieser Datensätze erleichtert, die potenziell sehr reich an latenten Informationen sind. Zunächst wird eine Verallgemeinerung der verbundenen Komponenten für k,k-Hypergraphen eingeführt. Die normale Definition verbundener Komponenten weist auf den untersuchten Datensätzen, recht uninformativ, alle Elemente einer einzelnen Riesenkomponente zu. Die verallgemeinerten, so genannten hyper-inzidenten verbundenen Komponenten hingegen zeigen auf den "Social Bookmarking"-Datensätzen eine charakteristische Größenverteilung, die jedoch bspw. von Spam-Verhalten zerstört wird - was eine Verbindung zwischen Verhaltensmustern und strukturellen Eigenschaften zeigt, der im folgenden weiter nachgegangen wird. Als nächstes wird das allgemeine Thema der "Community Detection" auf k,k-Hypergraphen eingeführt. Drei Herausforderungen werden definiert, die mit der naiven Anwendung bestehender Verfahren nicht gemeistert werden können. Außerdem werden drei Familien synthetischer Hypergraphen mit "Community"-Strukturen von steigender Komplexität eingeführt, die prototypisch für Situationen stehen, die ein erfolgreicher Detektionsansatz rekonstruieren können sollte. Der zentrale methodische Beitrag dieser Arbeit besteht aus der im folgenden dargestellten Entwicklung eines multipartiten (d.h. für k,k-Hypergraphen geeigneten) Verfahrens zur Erkennung von "Communities". Es basiert auf der Optimierung von Modularität, einem etablierten Verfahrung zur Erkennung von "Communities" auf nicht-partiten, d.h. "normalen" Graphen. Ausgehend vom einfachst möglichen Ansatz wird das Verfahren iterativ verfeinert, um den zuvor definierten sowie neuen, in der Praxis aufgetretenen Herausforderungen zu begegnen. Am Ende steht die Definition der "ausgeglichenen multi-partiten Modularität". Schließlich wird ein interaktives Werkzeug zur Untersuchung der so gewonnenen "Community"-Zuordnungen vorgestellt. Mithilfe dieses Werkzeugs können die Vorteile der zuvor eingeführten Modularität demonstriert werden: So können komplexe Zusammenhänge beobachtet werden, die den einfacheren Verfahren entgehen. Diese Ergebnisse werden von einer stärker quantitativ angelegten Untersuchung bestätigt: Unüberwachte Qualitätsmaße, die bspw. den Kompressionsgrad berücksichtigen, können über eine größere Menge von Beispielen die Vorteile der ausgeglichenen multi-partiten Modularität gegenüber den anderen Verfahren belegen. Zusammenfassend lassen sich die Ergebnisse dieser Arbeit in zwei Bereiche einteilen: Auf der praktischen Seite werden Werkzeuge zur Erforschung von "Social Bookmarking"-Daten bereitgestellt. Demgegenüber stehen theoretische Beiträge, die für Graphen etablierte Konzepte - verbundene Komponenten und "Community Detection" - auf k,k-Hypergraphen übertragen.Many datasets can be interpreted as graphs, i.e. as elements (nodes) and binary relations between them (edges). Under the label of complex network analysis, a vast array of graph-based methods allows the exploration of datasets purely based on such structural properties. Community detection, as a subfield of network analysis, aims to identify well-connected subparts of graphs. While the grouping of related elements is useful in itself, these groups can furthermore be collapsed into single nodes, creating a new graph of reduced complexity which may better reveal the original graph's macrostructure. Therefore, advances in community detection improve the understanding of complex networks in general. However, not every dataset can be modelled properly with binary relations - higher-order relations give rise to so-called hypergraphs. This thesis explores the generalization of community detection approaches to hypergraphs. In the focus of attention are social bookmarking datasets, created by users of online bookmarking services who assign freely chosen keywords, so-called "tags", to documents. This "tagging" creates, for each tag assignment, a ternary connection between the user, the document, and the tag, inducing particular structures called 3-partite, 3-uniform hypergraphs (henceforth called 3,3- or more generally k,k-hypergraphs). The question pursued here is how to decompose these structures in a formally adequate manner, and how this improves the understanding of these rich datasets. First, a generalization of connected components to k,k-hypergraphs is proposed. The standard definition of connected components here rather uninformatively assigns almost all elements to a single giant component. The generalized so-called hyperincident connected components, however, show a characteristic size distribution on the social bookmarking datasets that is disrupted by, e.g., spamming activity - demonstrating a link between behavioural patterns and structural features that is further explored in the following. Next, the general topic of community detection in k,k-hypergraphs is introduced. Three challenges are posited that are not met by the naive application of standard techniques, and three families of synthetic hypergraphs are introduced containing increasingly complex community setups that a successful detection approach must be able to identify. The main methodical contribution of this thesis consists of the following development of a multi-partite (i.e. suitable for k,k-hypergraphs) community detection algorithm. It is based on modularity optimization, a well-established algorithm to detect communities in non-partite, i.e. "normal" graphs. Starting from the simplest approach possible, the method is successively refined to meet the previously defined as well as empirically encountered challenges, culminating in the definition of the "balanced multi-partite modularity". Finally, an interactive tool for exploring the obtained community assignments is introduced. Using this tool, the benefits of balanced multi-partite modularity can be shown: Intricate patters can be observed that are missed by the simpler approaches. These findings are confirmed by a more quantitative examination: Unsupervised quality measures considering, e.g., compression document the advantages of this approach on a larger number of samples. To conclude, the contributions of this thesis are twofold. It provides practical tools for the analysis of social bookmarking data, complemented with theoretical contributions, the generalization of connected components and modularity from graphs to k,k-hypergraphs

    The Challenges of Big Data - Contributions in the Field of Data Quality and Artificial Intelligence Applications

    Get PDF
    The term "big data" has been characterized by challenges regarding data volume, velocity, variety and veracity. Solving these challenges requires research effort that fits the needs of big data. Therefore, this cumulative dissertation contains five paper aiming at developing and applying AI approaches within the field of big data as well as managing data quality in big data

    Learning Explainable User Sentiment and Preferences for Information Filtering

    Get PDF
    In the last decade, online social networks have enabled people to interact in many ways with each other and with content. The digital traces of such actions reveal people's preferences towards online content such as news or products. These traces often result from interactions such as sharing or liking, but also from interactions in natural language. The continuous growth of the amount of content and of digital traces has led to information overload: surrounded by large volumes of information, people are facing difficulties when searching for information relevant to their interests. To improve user experience, information systems must be able to assist users in achieving their search goals, effectively and efficiently. This thesis is concerned with two important challenges that information systems need to address in order to significantly improve search experience and overcome information overload. First, these systems need to model accurately the variety of user traces, and second, they need to meaningfully explain search results and recommendations to users. To address these challenges, this thesis proposes novel methods based on machine learning to model user sentiment and preferences for information filtering systems, which are effective, scalable, and easily interpretable by humans. We focus on two prominent types of user traces in social networks: on the one hand, user comments accompanied by unary preferences such as likes, and on the other hand, user reviews accompanied by numerical preferences such as star ratings. In both cases, we advocate that by better understanding user text through mining its semantics and modeling its structure, we can not only improve information filtering, but also explain predictions to users. Within this context, we aim to answer three main research questions, namely: (i)~how do item semantics help to predict unary preferences; (ii)~how do sentiments of free-form user texts help to predict unary preferences; and (iii)~how to model fine-grained numerical preferences from user review texts. Our goal is to model and extract from user text the knowledge required to answer these questions, and to obtain insights on how to design better information filtering systems that are more effective and improve user experience. To answer the first question, we formulate the recommendation problem based on unary preferences as a top-N retrieval task and we define an appropriate dataset and metrics for measuring performance. Then, we propose and evaluate several content-based methods based on semantic similarities under presence or absence of preferences. To answer the second question, we propose a sentiment-aware neighborhood model which integrates the sentiment of user comments with unary preferences, either through fixed or through learned mapping functions. For the latter type, we propose a learning algorithm which adapts the sentiment of user comments to unary preferences at collective or individual levels. To answer the third question, we cast the problem of modeling user attitude toward aspects of items as a weakly supervised problem, and we propose a weighted multiple-instance learning method for solving it. Lastly, we show that the learned saliency weights, apart from being easily interpretable, are useful indicators for review segmentation and summarization

    Artificial Intelligence for Online Review Platforms - Data Understanding, Enhanced Approaches and Explanations in Recommender Systems and Aspect-based Sentiment Analysis

    Get PDF
    The epoch-making and ever faster technological progress provokes disruptive changes and poses pivotal challenges for individuals and organizations. In particular, artificial intelligence (AI) is a disruptive technology that offers tremendous potential for many fields such as information systems and electronic commerce. Therefore, this dissertation contributes to AI for online review platforms aiming at enabling the future for consumers, businesses and platforms by unveiling the potential of AI. To achieve this goal, the dissertation investigates six major research questions embedded in the triad of data understanding of online consumer reviews, enhanced approaches and explanations in recommender systems and aspect-based sentiment analysis

    Classification algorithms for Big Data with applications in the urban security domain

    Get PDF
    A classification algorithm is a versatile tool, that can serve as a predictor for the future or as an analytical tool to understand the past. Several obstacles prevent classification from scaling to a large Volume, Velocity, Variety or Value. The aim of this thesis is to scale distributed classification algorithms beyond current limits, assess the state-of-practice of Big Data machine learning frameworks and validate the effectiveness of a data science process in improving urban safety. We found in massive datasets with a number of large-domain categorical features a difficult challenge for existing classification algorithms. We propose associative classification as a possible answer, and develop several novel techniques to distribute the training of an associative classifier among parallel workers and improve the final quality of the model. The experiments, run on a real large-scale dataset with more than 4 billion records, confirmed the quality of the approach. To assess the state-of-practice of Big Data machine learning frameworks and streamline the process of integration and fine-tuning of the building blocks, we developed a generic, self-tuning tool to extract knowledge from network traffic measurements. The result is a system that offers human-readable models of the data with minimal user intervention, validated by experiments on large collections of real-world passive network measurements. A good portion of this dissertation is dedicated to the study of a data science process to improve urban safety. First, we shed some light on the feasibility of a system to monitor social messages from a city for emergency relief. We then propose a methodology to mine temporal patterns in social issues, like crimes. Finally, we propose a system to integrate the findings of Data Science on the citizenry’s perception of safety and communicate its results to decision makers in a timely manner. We applied and tested the system in a real Smart City scenario, set in Turin, Italy
    corecore