187 research outputs found

    Building a semantic search engine with games and crowdsourcing

    Get PDF
    Semantic search engines aim at improving conventional search with semantic information, or meta-data, on the data searched for and/or on the searchers. So far, approaches to semantic search exploit characteristics of the searchers like age, education, or spoken language for selecting and/or ranking search results. Such data allow to build up a semantic search engine as an extension of a conventional search engine. The crawlers of well established search engines like Google, Yahoo! or Bing can index documents but, so far, their capabilities to recognize the intentions of searchers are still rather limited. Indeed, taking into account characteristics of the searchers considerably extend both, the quantity of data to analyse and the dimensionality of the search problem. Well established search engines therefore still focus on general search, that is, "search for all", not on specialized search, that is, "search for a few". This thesis reports on techniques that have been adapted or conceived, deployed, and tested for building a semantic search engine for the very specific context of artworks. In contrast to, for example, the interpretation of X-ray images, the interpretation of artworks is far from being fully automatable. Therefore artwork interpretation has been based on Human Computation, that is, a software-based gathering of contributions by many humans. The approach reported about in this thesis first relies on so called Games With A Purpose, or GWAPs, for this gathering: Casual games provide an incentive for a potentially unlimited community of humans to contribute with their appreciations of artworks. Designing convenient incentives is less trivial than it might seem at first. An ecosystem of games is needed so as to collect the meta-data on artworks intended for. One game generates the data that can serve as input of another game. This results in semantically rich meta-data that can be used for building up a successful semantic search engine. Thus, a first part of this thesis reports on a "game ecosystem" specifically designed from one known game and including several novel games belonging to the following game classes: (1) Description Games for collecting obvious and trivial meta-data, basically the well-known ESP (for extra-sensorial perception) game of Luis von Ahn, (2) the Dissemination Game Eligo generating translations, (3) the Diversification Game Karido aiming at sharpening differences between the objects, that is, the artworks, interpreted and (3) the Integration Games Combino, Sentiment and TagATag that generate structured meta-data. Secondly, the approach to building a semantic search engine reported about in this thesis relies on Higher-Order Singular Value Decomposition (SVD). More precisely, the data and meta-data on artworks gathered with the afore mentioned GWAPs are collected in a tensor, that is a mathematical structure generalising matrices to more than only two dimensions, columns and rows. The dimensions considered are the artwork descriptions, the players, and the artwork themselves. A Higher-Order SVD of this tensor is first used for noise reduction in This thesis reports also on deploying a Higher-Order LSA. The parallel Higher-Order SVD algorithm applied for the Higher-Order LSA and its implementation has been validated on an application related to, but independent from, the semantic search engine for artworks striven for: image compression. This thesis reports on the surprisingly good image compression which can be achieved with Higher-Order SVD. While compression methods based on matrix SVD for each color, the approach reported about in this thesis relies on one single (higher-order) SVD of the whole tensor. This results in both, better quality of the compressed image and in a significant reduction of the memory space needed. Higher-Order SVD is extremely time-consuming what calls for parallel computation. Thus, a step towards automatizing the construction of a semantic search engine for artworks was parallelizing the higher-order SVD method used and running the resulting parallel algorithm on a super-computer. This thesis reports on using Hestenes’ method and R-SVD for parallelising the higher-order SVD. This method is an unconventional choice which is explained and motivated. As of the super-computer needed, this thesis reports on turning the web browsers of the players or searchers into a distributed parallel computer. This is done by a novel specific system and a novel implementation of the MapReduce data framework to data parallelism. Harnessing the web browsers of the players or searchers saves computational power on the server-side. It also scales extremely well with the number of players or searchers because both, playing with and searching for artworks, require human reflection and therefore results in idle local processors that can be brought together into a distributed super-computer.Semantische Suchmaschinen dienen der Verbesserung konventioneller Suche mit semantischen Informationen, oder Metadaten, zu Daten, nach denen gesucht wird, oder zu den Suchenden. Bisher nutzt Semantische Suche Charakteristika von Suchenden wie Alter, Bildung oder gesprochene Sprache für die Auswahl und/oder das Ranking von Suchergebnissen. Solche Daten erlauben den Aufbau einer Semantischen Suchmaschine als Erweiterung einer konventionellen Suchmaschine. Die Crawler der fest etablierten Suchmaschinen wie Google, Yahoo! oder Bing können Dokumente indizieren, bisher sind die Fähigkeiten eher beschränkt, die Absichten von Suchenden zu erkennen. Tatsächlich erweitert die Berücksichtigung von Charakteristika von Suchenden beträchtlich beides, die Menge an zu analysierenden Daten und die Dimensionalität des Such-Problems. Fest etablierte Suchmaschinen fokussieren deswegen stark auf allgemeine Suche, also "Suche für alle", nicht auf spezialisierte Suche, also "Suche für wenige". Diese Arbeit berichtet von Techniken, die adaptiert oder konzipiert, eingesetzt und getestet wurden, um eine semantische Suchmaschine für den sehr speziellen Kontext von Kunstwerken aufzubauen. Im Gegensatz beispielsweise zur Interpretation von Röntgenbildern ist die Interpretation von Kunstwerken weit weg davon gänzlich automatisiert werden zu können. Deswegen basiert die Interpretation von Kunstwerken auf menschlichen Berechnungen, also Software-basiertes Sammeln von menschlichen Beiträgen. Der Ansatz, über den in dieser Arbeit berichtet wird, beruht auf sogenannten "Games With a Purpose" oder GWAPs die folgendes sammeln: Zwanglose Spiele bieten einen Anreiz für eine potenziell unbeschränkte Gemeinde von Menschen, mit Ihrer Wertschätzung von Kunstwerken beizutragen. Geeignete Anreize zu entwerfen in weniger trivial als es zuerst scheinen mag. Ein Ökosystem von Spielen wird benötigt, um Metadaten gedacht für Kunstwerke zu sammeln. Ein Spiel erzeugt Daten, die als Eingabe für ein anderes Spiel dienen können. Dies resultiert in semantisch reichhaltigen Metadaten, die verwendet werden können, um eine erfolgreiche Semantische Suchmaschine aufzubauen. Deswegen berichtet der erste Teil dieser Arbeit von einem "Spiel-Ökosystem", entwickelt auf Basis eines bekannten Spiels und verschiedenen neuartigen Spielen, die zu verschiedenen Spiel-Klassen gehören. (1) Beschreibungs-Spiele zum Sammeln offensichtlicher und trivialer Metadaten, vor allem dem gut bekannten ESP-Spiel (Extra Sensorische Wahrnehmung) von Luis von Ahn, (2) dem Verbreitungs-Spiel Eligo zur Erzeugung von Übersetzungen, (3) dem Diversifikations-Spiel Karido, das Unterschiede zwischen Objekten, also interpretierten Kunstwerken, schärft und (3) Integrations-Spiele Combino, Sentiment und Tag A Tag, die strukturierte Metadaten erzeugen. Zweitens beruht der Ansatz zum Aufbau einer semantischen Suchmaschine, wie in dieser Arbeit berichtet, auf Singulärwertzerlegung (SVD) höherer Ordnung. Präziser werden die Daten und Metadaten über Kunstwerk gesammelt mit den vorher genannten GWAPs in einem Tensor gesammelt, einer mathematischen Struktur zur Generalisierung von Matrizen zu mehr als zwei Dimensionen, Spalten und Zeilen. Die betrachteten Dimensionen sind die Beschreibungen der Kunstwerke, die Spieler, und die Kunstwerke selbst. Eine Singulärwertzerlegung höherer Ordnung dieses Tensors wird zuerst zur Rauschreduktion verwendet nach der Methode der sogenannten Latenten Semantischen Analyse (LSA). Diese Arbeit berichtet auch über die Anwendung einer LSA höherer Ordnung. Der parallele Algorithmus für Singulärwertzerlegungen höherer Ordnung, der für LSA höherer Ordnung verwendet wird, und seine Implementierung wurden validiert an einer verwandten aber von der semantischen Suche unabhängig angestrebten Anwendung: Bildkompression. Diese Arbeit berichtet von überraschend guter Kompression, die mit Singulärwertzerlegung höherer Ordnung erzielt werden kann. Neben Matrix-SVD-basierten Kompressionsverfahren für jede Farbe, beruht der Ansatz wie in dieser Arbeit berichtet auf einer einzigen SVD (höherer Ordnung) auf dem gesamten Tensor. Dies resultiert in beidem, besserer Qualität von komprimierten Bildern und einer signifikant geringeren des benötigten Speicherplatzes. Singulärwertzerlegung höherer Ordnung ist extrem zeitaufwändig, was parallele Berechnung verlangt. Deswegen war ein Schritt in Richtung Aufbau einer semantischen Suchmaschine für Kunstwerke eine Parallelisierung der verwendeten SVD höherer Ordnung auf einem Super-Computer. Diese Arbeit berichtet vom Einsatz der Hestenes’-Methode und R-SVD zur Parallelisierung der SVD höherer Ordnung. Diese Methode ist eine unkonventionell Wahl, die erklärt und motiviert wird. Ab nun wird ein Super-Computer benötigt. Diese Arbeit berichtet über die Wandlung der Webbrowser von Spielern oder Suchenden in einen verteilten Super-Computer. Dies leistet ein neuartiges spezielles System und eine neuartige Implementierung des MapReduce Daten-Frameworks für Datenparallelismus. Das Einspannen der Webbrowser von Spielern und Suchenden spart server-seitige Berechnungskraft. Ebenso skaliert die Berechnungskraft so extrem gut mit der Spieleranzahl oder Suchenden, denn beides, Spiel mit oder Suche nach Kunstwerken, benötigt menschliche Reflektion, was deswegen zu ungenutzten lokalen Prozessoren führt, die zu einem verteilten Super-Computer zusammengeschlossen werden können

    Spam elimination and bias correction : ensuring label quality in crowdsourced tasks.

    Get PDF
    Crowdsourcing is proposed as a powerful mechanism for accomplishing large scale tasks via anonymous workers online. It has been demonstrated as an effective and important approach for collecting labeled data in application domains which require human intelligence, such as image labeling, video annotation, natural language processing, etc. Despite the promises, one big challenge still exists in crowdsourcing systems: the difficulty of controlling the quality of crowds. The workers usually have diverse education levels, personal preferences, and motivations, leading to unknown work performance while completing a crowdsourced task. Among them, some are reliable, and some might provide noisy feedback. It is intrinsic to apply worker filtering approach to crowdsourcing applications, which recognizes and tackles noisy workers, in order to obtain high-quality labels. The presented work in this dissertation provides discussions in this area of research, and proposes efficient probabilistic based worker filtering models to distinguish varied types of poor quality workers. Most of the existing work in literature in the field of worker filtering either only concentrates on binary labeling tasks, or fails to separate the low quality workers whose label errors can be corrected from the other spam workers (with label errors which cannot be corrected). As such, we first propose a Spam Removing and De-biasing Framework (SRDF), to deal with the worker filtering procedure in labeling tasks with numerical label scales. The developed framework can detect spam workers and biased workers separately. The biased workers are defined as those who show tendencies of providing higher (or lower) labels than truths, and their errors are able to be corrected. To tackle the biasing problem, an iterative bias detection approach is introduced to recognize the biased workers. The spam filtering algorithm proposes to eliminate three types of spam workers, including random spammers who provide random labels, uniform spammers who give same labels for most of the items, and sloppy workers who offer low accuracy labels. Integrating the spam filtering and bias detection approaches into aggregating algorithms, which infer truths from labels obtained from crowds, can lead to high quality consensus results. The common characteristic of random spammers and uniform spammers is that they provide useless feedback without making efforts for a labeling task. Thus, it is not necessary to distinguish them separately. In addition, the removal of sloppy workers has great impact on the detection of biased workers, with the SRDF framework. To combat these problems, a different way of worker classification is presented in this dissertation. In particular, the biased workers are classified as a subcategory of sloppy workers. Finally, an ITerative Self Correcting - Truth Discovery (ITSC-TD) framework is then proposed, which can reliably recognize biased workers in ordinal labeling tasks, based on a probabilistic based bias detection model. ITSC-TD estimates true labels through applying an optimization based truth discovery method, which minimizes overall label errors by assigning different weights to workers. The typical tasks posted on popular crowdsourcing platforms, such as MTurk, are simple tasks, which are low in complexity, independent, and require little time to complete. Complex tasks, however, in many cases require the crowd workers to possess specialized skills in task domains. As a result, this type of task is more inclined to have the problem of poor quality of feedback from crowds, compared to simple tasks. As such, we propose a multiple views approach, for the purpose of obtaining high quality consensus labels in complex labeling tasks. In this approach, each view is defined as a labeling critique or rubric, which aims to guide the workers to become aware of the desirable work characteristics or goals. Combining the view labels results in the overall estimated labels for each item. The multiple views approach is developed under the hypothesis that workers\u27 performance might differ from one view to another. Varied weights are then assigned to different views for each worker. Additionally, the ITSC-TD framework is integrated into the multiple views model to achieve high quality estimated truths for each view. Next, we propose a Semi-supervised Worker Filtering (SWF) model to eliminate spam workers, who assign random labels for each item. The SWF approach conducts worker filtering with a limited set of gold truths available as priori. Each worker is associated with a spammer score, which is estimated via the developed semi-supervised model, and low quality workers are efficiently detected by comparing the spammer score with a predefined threshold value. The efficiency of all the developed frameworks and models are demonstrated on simulated and real-world data sets. By comparing the proposed frameworks to a set of state-of-art methodologies, such as expectation maximization based aggregating algorithm, GLAD and optimization based truth discovery approach, in the domain of crowdsourcing, up to 28.0% improvement can be obtained for the accuracy of true label estimation

    Libro de Actas JCC&BD 2018 : VI Jornadas de Cloud Computing & Big Data

    Get PDF
    Se recopilan las ponencias presentadas en las VI Jornadas de Cloud Computing & Big Data (JCC&BD), realizadas entre el 25 al 29 de junio de 2018 en la Facultad de Informática de la Universidad Nacional de La Plata.Universidad Nacional de La Plata (UNLP) - Facultad de Informátic

    Quality of Information in Mobile Crowdsensing: Survey and Research Challenges

    Full text link
    Smartphones have become the most pervasive devices in people's lives, and are clearly transforming the way we live and perceive technology. Today's smartphones benefit from almost ubiquitous Internet connectivity and come equipped with a plethora of inexpensive yet powerful embedded sensors, such as accelerometer, gyroscope, microphone, and camera. This unique combination has enabled revolutionary applications based on the mobile crowdsensing paradigm, such as real-time road traffic monitoring, air and noise pollution, crime control, and wildlife monitoring, just to name a few. Differently from prior sensing paradigms, humans are now the primary actors of the sensing process, since they become fundamental in retrieving reliable and up-to-date information about the event being monitored. As humans may behave unreliably or maliciously, assessing and guaranteeing Quality of Information (QoI) becomes more important than ever. In this paper, we provide a new framework for defining and enforcing the QoI in mobile crowdsensing, and analyze in depth the current state-of-the-art on the topic. We also outline novel research challenges, along with possible directions of future work.Comment: To appear in ACM Transactions on Sensor Networks (TOSN

    CrowdBC: A blockchain-based decentralized framework for crowdsourcing

    Get PDF
    Crowdsourcing systems which utilize the human intelligence to solve complex tasks have gained considerable interest and adoption in recent years. However, the majority of existing crowdsourcing systems rely on central servers, which are subject to the weaknesses of traditional trust-based model, such as single point of failure. They are also vulnerable to distributed denial of service (DDoS) and Sybil attacks due to malicious users involvement. In addition, high service fees from the crowdsourcing platform may hinder the development of crowdsourcing. How to address these potential issues has both research and substantial value. In this paper, we conceptualize a blockchain-based decentralized framework for crowdsourcing named CrowdBC, in which a requester’s task can be solved by a crowd of workers without relying on any third trusted institution, users’ privacy can be guaranteed and only low transaction fees are required. In particular, we introduce the architecture of our proposed framework, based on which we give a concrete scheme. We further implement a software prototype on Ethereum public test network with real-world dataset. Experiment results show the feasibility, usability and scalability of our proposed crowdsourcing system
    • …
    corecore