Search CORE

89 research outputs found

Engineering Crowdsourced Stream Processing Systems

Authors: Carlos Castillo, Crp Henri Tudor, Ioanna Lykourentzou, Muhammad Imran, Yannick Naudet
Publication date: 04/08/2014

A crowdsourced stream processing system (CSP) is a system that incorporates crowdsourced tasks in the processing of a data stream. This can be seen as enabling crowdsourcing work to be applied on a sample of large-scale data at high speed, or equivalently, enabling stream processing to employ human intelligence. It also leads to a substantial expansion of the capabilities of data processing systems. Engineering a CSP system requires the combination of human and machine computation elements. From a general systems theory perspective, this means taking into account inherited as well as emerging properties from both these elements. In this paper, we position CSP systems within a broader taxonomy, outline a series of design principles and evaluation metrics, present an extensible framework for their design, and describe several design patterns. We showcase the capabilities of CSP systems by performing a case study that applies our proposed framework to the design and analysis of a real system (AIDR) that classifies social media messages during time-critical crisis events. Results show that compared to a pure stream processing system, AIDR can achieve a higher data classification accuracy, while compared to a pure crowdsourcing solution, the system makes better use of human workers by requiring much less manual work effort

arXiv.org e-Print Archive

CiteSeerX

Revisiting Prompt Engineering via Declarative Crowdsourcing

Authors: Asawa Parth, Jain Naman, Parameswaran Aditya G., Shankar Shreya, Wang Yujie
Publication date: 07/08/2023

Large language models (LLMs) are incredibly powerful at comprehending and generating data in the form of text, but are brittle and error-prone. There has been an advent of toolkits and recipes centered around so-called prompt engineering, the process of asking an LLM to do something via a series of prompts. However, for LLM-powered data processing workflows in particular, optimizing for quality while keeping cost bounded is a tedious, manual process. We put forth a vision for declarative prompt engineering. We view LLMs like crowd workers and leverage ideas from the declarative crowdsourcing literature (including leveraging multiple prompting strategies, ensuring internal consistency, and exploring hybrid-LLM-non-LLM approaches) to make prompt engineering a more principled process. Preliminary case studies on sorting, entity resolution, and imputation demonstrate the promise of our approach

arXiv.org e-Print Archive


Robust Algorithms for Clustering with Applications to Data Integration

Author: Galhotra Sainyam
Publication venue: ScholarWorks@UMass Amherst
Publication date: 20/10/2021

A growing number of data-based applications are used for decision-making that have far-reaching consequences and significant societal impact. Entity resolution, community detection and taxonomy construction are some of the building blocks of these applications and for these methods, clustering is the fundamental underlying concept. Therefore, the use of accurate, robust and scalable methods for clustering cannot be overstated. We tackle the various facets of clustering with a multi-pronged approach described below. 1. While identification of clusters that refer to different entities is challenging for automated strategies, it is relatively easy for humans. We study the robustness of clustering methods that leverage supervision through an oracle i.e an abstraction of crowdsourcing. Additionally, we focus on scalability to handle web-scale datasets. 2. In community detection applications, a common setback in evaluation of the quality of clustering techniques is the lack of ground truth data. We propose a generative model that considers dependent edge formation and devise techniques for efficient cluster recovery

ScholarWorks@UMass Amherst

Crowdsourcing Research Opportunities: Lessons from Natural Language Processing

Authors: Bontcheva Kalina, Sabou Marta, Scharl Arno
Publication date: 01/01/2012

Although the field has led to promising early results, the use of crowdsourcing as an integral part of science projects is still regarded with skepticism by some, largely due to a lack of awareness of the opportunities and implications of utilizing these new techniques. We address this lack of awareness, firstly by highlighting the positive impacts that crowdsourcing has had on Natural Language Processing research. Secondly, we discuss the challenges of more complex methodologies, quality control, and the necessity to deal with ethical issues. We conclude with future trends and opportunities of crowdsourcing for science, including its potential for disseminating results, making science more accessible, and enriching educational programs

CiteSeerX

webLyzard technology gmbh

Sustaining Glasgow's Urban Networks: the Link Communities of Complex Urban Systems

Author: Itova I.
Publication venue: University of Westminster
Publication date: 01/01/2022

As cities grow in population size and become more crowded (UN DESA, 2018), the main future challenge around the world will remain accommodating the growing urban population while drastically reducing environmental pressure. Contemporary urban agglomerations (large or small) constantly impose a burden on the natural environment by conveying ecosystem services to close and distant places, through coupled human nature [infrastructure] systems (CHANS). Tobler’s first law of geography (1970), which states that “everything is related to everything else, but near things are more related than distant things”, is now challenged by globalization. When this law was first established, the hypothesis referred to geological processes (Campbell and Shin, 2012, p.194) that were predominantly observed in a pre-globalized economy, where freight was costly and mainly localized (Zhang et al., 2018). With the recent advances and modernisation in transport technologies, most of them in sea and air transportation (Zhang et al., 2018), and the growth of cities in population, natural resources and by-products now travel great distances to infiltrate cities (Neuman, 2006) and satisfy human demands. Technical modernisation and the global hyperconnectivity of human interactions and trading have, in the last thirty years alone, resulted in a staggering 94 per cent growth of resource extraction and consumption (Giljum et al., 2015). Local geographies (Kennedy, Cuddihy and Engel-Yan, 2007) will remain affected by global urbanisation (Giljum et al., 2015), and as a corollary, the operational inefficiencies of their local infrastructure networks will contribute even more to the issues of environmental unsustainability on a global scale. Another challenge for future city-regions is the equity of public infrastructure services and the creation of policies that promote it (Neuman and Hull, 2009).
Public infrastructure services refer to services provisioned by networked infrastructure, which are subject to both public obligation and market rules. Therefore, their accessibility to all citizens needs to be safeguarded. The disparity of growth between networked infrastructure and socio-economic dynamics affects the sustainable assimilation of, and equal access to, infrastructure in various districts in cities, rendering it a privilege. Yet, the empirical evidence of whether the place of residence acts as a disadvantage to public service access and use remains rather scarce (Clifton et al., 2016). The European Union recognized (EU, 2011) the issue of equality in accessibility (i.e. equity) as critical for territorial cohesion and sustainable development across districts, municipalities and regions with diverse economic performance. Territorial cohesion, formally incorporated into the Treaty of Lisbon, now steers the policy frameworks of territorial development within the Union. Subsequently, the European Union developed a policy paradigm guided by equal access (Clifton et al., 2016) to public infrastructure services, considering their accessibility as an instrumental aspect in achieving territorial cohesion across and within its member states. A corollary of increasing the equity of public infrastructure services among a growing global population is the potential increase in environmental pressure they can impose, especially if this pressure is not decentralised and surges at an unsustainable rate (Neuman, 2006). This danger varies across countries and continents, and is directly linked to the increase of urban population due to: [1] improved quality of life and increased life expectancy and/or [2] urban in-migration of rural population and/or [3] global political or economic immigration. These three rising urban trends demand new approaches to reimagine planning and design practices that foster infrastructure equity, whilst delivering environmental justice.
Therefore, this research explores in depth the nature of growth of networked infrastructure (Graham and Marvin, 2001) as a complex system and its disparity from the socio-economic growth (or decline) of the Glasgow and Clyde Valley city-region. The results of this research provide new understanding of the potential of using emerging tools from network science for developing an optimization strategy that supports a more decentralized, efficient, fair and (as an outcome) sustainable enlargement of urban infrastructure, to accommodate new and empower current residents of the city. Applying the novel link clustering community detection algorithm (Ahn et al., 2010), in this thesis I have presented the potential for better understanding the complexity behind the urban system of networked infrastructure, through discovering their overlapping communities. As I will show in the literature review (Chapter 2), the long-standing tradition of centralised planning practice relying on zoning and infiltrating infrastructure left us with urban settlements which are failing to respond to the environmental pressure and the socio-economic inequalities. Building on the myriad of knowledge from planners, geographers, sociologists and computer scientists, I developed a new element (i.e. link communities) within the theory of urban studies that defines cities as complex systems. Afterwards, I applied a method borrowed from the study of complex networks to unpack their basic elements. Knowing the link (i.e. functional, or overlapping) communities of metropolitan Glasgow enabled me to evaluate the current level of the communities' interconnectedness and reveal the gaps as well as the potentials for improving the studied system’s performance. The complex urban system in metropolitan Glasgow was represented by its networked infrastructure, which essentially was a system of distinct sub-systems, one of them mapped by a physical and the other one by a social graph.
The conceptual framework for this methodological approach was formalised from the extensively reviewed literature and methods utilising network science tools to detect community structure in complex networks. The literature review led to constructing a hypothesis claiming that the efficiency of the physical network’s topology is achieved through optimizing the number of nodes with high betweenness centrality, while the efficiency of the logical network’s topology is achieved by optimizing the number of links with high edge betweenness. The conclusion from the literature review, presented through the discourse on the primal problem in 7.4.1, led to modelling the two network topologies as separate graphs. The bipartite graph of their primal syntax was mirrored to be symmetrical and converted to dual. From the dual syntax I measured the complete accessibility (i.e. betweenness centrality) of the entire area and not only of the streets. Betweenness centrality of a node measures the number of shortest paths that pass through the node connecting pairs of nodes. The betweenness centrality is the same as the integration of streets in space syntax, where the streets are analysed in their dual syntax representation. Street integration is the number of intersections the street shares with other streets and a high value means high accessibility. Edges with high betweenness are shared between strong communities. Based on the theoretical underpinnings of the network’s modularity and community structure analysed herein, it can be concluded that a complex network that is both robust and efficient (and in urban planning terminology ‘sustainable’) consists of numerous strong communities connected with each other by an optimal number of links with high edge betweenness. To get this insight, the study detected the edge cut-set and vertex cut-set of the complex network. The outcome was a statistical model developed in the open-source software R (Ihaka and Gentleman, 1996).
The model empirically detects the network’s overlapping communities, determining the current sustainability of its physical and logical topologies. Initially, an assumption was that the number of communities within the infrastructure (physical) network layer was different from that in the logical layer. They were detected using the Louvain method, which performs graph partitioning on the hierarchical street structure. Further, the number of communities in the relational network layer (i.e. accessibility to locations) was detected based on the OD accessibility matrix established from the functional dependency between the household locations and predefined points of interest. The communities from the graph of the ‘relational layer’ were discovered with the single-link hierarchical clustering algorithm. The number of communities observed in the physical and the logical topologies of the eight shires significantly deviated
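
The abstract above defines the betweenness centrality of a node through the shortest paths that pass through it. As an illustration only, the following is a minimal Python sketch of that measure for a small unweighted, undirected graph, using Brandes' algorithm. It is not the thesis's R model; the function name `betweenness_centrality`, the adjacency-dictionary input and the toy street graph are assumptions made for this example. Where several shortest paths tie between a pair of nodes, each path contributes a fractional share, which is the usual convention.

```python
from collections import deque


def betweenness_centrality(adjacency):
    """Unnormalised betweenness for an unweighted, undirected graph.

    `adjacency` maps each node to a list of its neighbours (hypothetical
    input format chosen for this sketch). Uses Brandes' algorithm.
    """
    centrality = {node: 0.0 for node in adjacency}
    for source in adjacency:
        # Breadth-first search from `source`, counting shortest paths.
        visit_order = []
        predecessors = {node: [] for node in adjacency}
        path_count = {node: 0 for node in adjacency}
        distance = {node: -1 for node in adjacency}
        path_count[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:
            current = queue.popleft()
            visit_order.append(current)
            for neighbour in adjacency[current]:
                if distance[neighbour] < 0:  # first time reached
                    distance[neighbour] = distance[current] + 1
                    queue.append(neighbour)
                if distance[neighbour] == distance[current] + 1:
                    path_count[neighbour] += path_count[current]
                    predecessors[neighbour].append(current)
        # Accumulate path dependencies from the farthest nodes inward.
        dependency = {node: 0.0 for node in adjacency}
        for node in reversed(visit_order):
            for pred in predecessors[node]:
                share = path_count[pred] / path_count[node]
                dependency[pred] += share * (1.0 + dependency[node])
            if node != source:
                centrality[node] += dependency[node]
    # Each unordered pair was counted from both endpoints.
    return {node: value / 2.0 for node, value in centrality.items()}


# Toy example: four junctions in a line, a - b - c - d.
street_graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
scores = betweenness_centrality(street_graph)
```

In this toy graph the two inner junctions carry every path between the ends, so they score highest, while the end junctions score zero.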

WestminsterResearch

Optimization techniques for human computation-enabled data processing systems

Author: Marcus Adam, Ph. D. Massachusetts Institute of Technology
Publication venue: Massachusetts Institute of Technology
Publication date: 01/01/2012

Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2012. Cataloged from PDF version of thesis. Includes bibliographical references (p. 119-124). Crowdsourced labor markets make it possible to recruit large numbers of people to complete small tasks that are difficult to automate on computers. These marketplaces are increasingly widely used, with projections of over $1 billion being transferred between crowd employers and crowd workers by the end of 2012. While crowdsourcing enables forms of computation that artificial intelligence has not yet achieved, it also presents crowd workflow designers with a series of challenges including describing tasks, pricing tasks, identifying and rewarding worker quality, dealing with incorrect responses, and integrating human computation into traditional programming frameworks. In this dissertation, we explore the systems-building, operator design, and optimization challenges involved in building a crowd-powered workflow management system. We describe a system called Qurk that utilizes techniques from databases such as declarative workflow definition, high-latency workflow execution, and query optimization to aid crowd-powered workflow developers. We study how crowdsourcing can enhance the capabilities of traditional databases by evaluating how to implement basic database operators such as sorts and joins on datasets that could not have been processed using traditional computation frameworks. Finally, we explore the symbiotic relationship between the crowd and query optimization, enlisting crowd workers to perform selectivity estimation, a key component in optimizing complex crowd-powered workflows. by Adam Marcus. Ph.D.

DSpace@MIT

Enhancing knowledge acquisition systems with user generated and crowdsourced resources

Author: Xu Fang
Publication venue: Fakultät 7 - Naturwissenschaftlich-Technische Fakultät II. Fachrichtung 7.4 - Mechatronik
Publication date: 01/01/2012

This thesis is on leveraging knowledge acquisition systems with collaborative data and crowdsourcing work from the internet. We propose two strategies and apply them for building effective entity linking and question answering (QA) systems. The first strategy is on integrating an information extraction system with online collaborative knowledge bases, such as Wikipedia and Freebase. We construct a Cross-Lingual Entity Linking (CLEL) system to connect Chinese entities, such as people and locations, with corresponding English pages in Wikipedia. The main focus is to break the language barrier between Chinese entities and the English KB, and to resolve the synonymy and polysemy of Chinese entities. To address those problems, we create a cross-lingual taxonomy and a Chinese knowledge base (KB). We investigate two methods of connecting the query representation with the KB representation. Based on our CLEL system participating in the TAC KBP 2011 evaluation, we finally propose a simple and effective generative model, which achieved much better performance. The second strategy is on creating annotation for QA systems with the help of crowdsourcing. Crowdsourcing means distributing a task via the internet and recruiting many people to complete it simultaneously. Various annotated data are required to train the data-driven statistical machine learning algorithms for underlying components in our QA system. This thesis demonstrates how to convert the annotation task into crowdsourcing micro-tasks, investigates different statistical methods for enhancing the quality of crowdsourced annotation, and finally uses the enhanced annotation to train learning-to-rank models for passage ranking algorithms for QA.
The subject of this work is harnessing both knowledge acquisition systems and collaboratively created data and work from the internet. Two strategies are proposed, which are applied to build effective entity linking (disambiguation of entity names) and question answering systems. The first strategy is to integrate an information extraction system with collaboratively created online databases. We develop a Cross-Lingual Entity Linking (CLEL) system to link Chinese entities, such as people and places, with the corresponding Wikipedia pages. The main focus is to break the language barrier between Chinese entities and the English database, and to resolve the synonymy and polysemy of the Chinese entities. To address these problems, we create a cross-lingual taxonomy and a Chinese database. We investigate two methods of connecting the representation of the query with the representation of the database. Finally, we present a simple and effective generative model that is based on our system for participation in the TAC KBP 2011 evaluation and achieved considerably better performance. The second strategy is to create annotations for question answering systems with the help of crowdsourcing. Crowdsourcing means distributing a task via the internet to a large number of recruited people who complete it simultaneously. Various annotated data are necessary to train the data-driven statistical learning algorithms that underlie our question answering system. We show how the annotation task can be converted into micro-tasks for crowdsourcing, we investigate various statistical methods for enhancing the quality of the crowdsourced annotation, and finally we use the enhanced annotation to train models for learning rankings of text passages