Engineering Crowdsourced Stream Processing Systems
A crowdsourced stream processing (CSP) system is a system that incorporates
crowdsourced tasks in the processing of a data stream. This can be seen as
enabling crowdsourcing work to be applied on a sample of large-scale data at
high speed, or equivalently, enabling stream processing to employ human
intelligence. It also leads to a substantial expansion of the capabilities of
data processing systems. Engineering a CSP system requires the combination of
human and machine computation elements. From a general systems theory
perspective, this means taking into account inherited as well as emerging
properties from both these elements. In this paper, we position CSP systems
within a broader taxonomy, outline a series of design principles and evaluation
metrics, present an extensible framework for their design, and describe several
design patterns. We showcase the capabilities of CSP systems by performing a
case study that applies our proposed framework to the design and analysis of a
real system (AIDR) that classifies social media messages during time-critical
crisis events. Results show that compared to a pure stream processing system,
AIDR can achieve a higher data classification accuracy, while compared to a
pure crowdsourcing solution, the system makes better use of human workers by
requiring much less manual work effort.
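The human-machine combination described above can be sketched as a simple routing pattern: a machine classifier handles high-confidence items, and only low-confidence items are dispatched to crowd workers. The sketch below is illustrative only, not AIDR's actual implementation; the threshold, the toy classifier, and the crowd-dispatch stub are all hypothetical.

```python
# Illustrative hybrid stream-processing pattern (not AIDR's actual code):
# machine classification with a crowdsourced fallback for uncertain items.

CONFIDENCE_THRESHOLD = 0.8  # hypothetical cut-off

def machine_classify(message):
    """Stand-in for a trained classifier; returns (label, confidence)."""
    # Toy heuristic for illustration only.
    if "urgent" in message.lower():
        return "needs_help", 0.95
    return "other", 0.55

def crowd_classify(message):
    """Stand-in for dispatching the item to human workers; a real system
    would collect and aggregate multiple worker judgements."""
    return "other"

def classify_stream(messages):
    """Classify a stream, counting how many items required manual effort."""
    results, crowd_tasks = [], 0
    for msg in messages:
        label, conf = machine_classify(msg)
        if conf < CONFIDENCE_THRESHOLD:
            label = crowd_classify(msg)
            crowd_tasks += 1
        results.append(label)
    return results, crowd_tasks
```

The point of the pattern is the trade-off the abstract measures: accuracy above a machine-only baseline, at a fraction of the manual effort of a crowd-only baseline.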
Crowdbreaks: Tracking Health Trends using Public Social Media Data and Crowdsourcing
In the past decade, tracking health trends using social media data has shown
great promise, due to a powerful combination of massive adoption of social
media around the world, and increasingly potent hardware and software that
enables us to work with these new big data streams. At the same time, many
challenging problems have been identified. First, there is often a mismatch
between how rapidly online data can change, and how rapidly algorithms are
updated, which means that there is limited reusability for algorithms trained
on past data as their performance decreases over time. Second, much of the work
focuses on specific issues during a specific past period, even
though public health institutions would need flexible tools to assess multiple
evolving situations in real time. Third, most tools providing such capabilities
are proprietary systems with little algorithmic or data transparency, and thus
little buy-in from the global public health and research community. Here, we
introduce Crowdbreaks, an open platform which allows tracking of health trends
by making use of continuous crowdsourced labelling of public social media
content. The system automates the typical workflow of data collection,
filtering, labelling, and training of machine learning classifiers, and can
therefore greatly accelerate the research process in the
public health domain. This work introduces the technical aspects of the
platform and explores its future use cases.
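The workflow the platform automates can be sketched as a minimal pipeline. All function names, the keyword filter, and the labelling rule below are hypothetical stand-ins for illustration, not the actual Crowdbreaks API.

```python
# Minimal sketch of a collect -> filter -> crowd-label -> retrain loop,
# with hypothetical stand-ins for each stage (not the Crowdbreaks API).

def filter_relevant(posts, keyword):
    """Keep only posts mentioning a health-related keyword."""
    return [p for p in posts if keyword in p.lower()]

def crowd_label(posts):
    """Stand-in for crowdsourced labelling; a real system would
    aggregate judgements from multiple human annotators."""
    return [(p, "vaccine" in p.lower()) for p in posts]

def retrain(model, labelled):
    """Fold fresh labels into the training data so the classifier
    keeps up with drifting online content."""
    model.extend(labelled)
    return model

def run_pipeline(stream, model, keyword="flu"):
    posts = filter_relevant(stream, keyword)
    labelled = crowd_label(posts)
    return retrain(model, labelled)
```

Running the loop continuously is what addresses the first challenge the abstract raises: classifiers are retrained on a rolling stream of fresh labels instead of decaying on stale training data.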
QoE-Aware Resource Allocation For Crowdsourced Live Streaming: A Machine Learning Approach
In the last decade, empowered by the technological advancements of mobile devices
and the revolution of wireless mobile network access, the world has witnessed an
explosion in crowdsourced live streaming. Ensuring a stable high-quality playback
experience is essential to maximizing the viewers’ Quality of Experience (QoE) and the
content providers’ profits. This can be achieved by adopting a geo-distributed cloud
infrastructure that allocates multimedia resources as close as possible to viewers, in
order to minimize access delay and video stalls.
Additionally, because of the instability of network conditions and the heterogeneity of
end users’ capabilities, transcoding the original video into multiple bitrates is
required. Video transcoding is a computationally expensive process, where generally a
single cloud instance needs to be reserved to produce one single video bitrate
representation. On-demand renting of resources or inadequate resource reservation
may delay the video playback or force serving viewers at a lower quality. On
the other hand, if resource provisioning far exceeds what is required, the
extra resources are wasted.
In this thesis, we introduce a prediction-driven resource allocation framework, to
maximize the QoE of viewers and minimize the resources allocation cost. First, by
exploiting the viewers’ locations available in our unique dataset, we implement a machine learning model to predict the viewers’ number near each geo-distributed cloud
site. Second, based on the predicted results that showed to be close to the actual values,
we formulate an optimization problem to proactively allocate resources at the viewers’
proximity. Additionally, we present a trade-off between the video access delay and
the cost of resource allocation.
Because our offline optimization is too complex to respond in real time to the
volume of viewing requests, we further extend our work by introducing
a resources forecasting and reservation framework for geo-distributed cloud sites. First,
we formulate an offline optimization problem to allocate transcoding resources at the
viewers’ proximity, while creating a trade-off between the network cost and viewers’
QoE. Second, based on the optimizer’s resource allocation decisions on historical live
videos, we create our time series datasets containing historical records of the optimal
resources needed at each geo-distributed cloud site. Finally, we adopt machine learning
to build distributed time series forecasting models that proactively forecast the exact
transcoding resources needed ahead of time at each geo-distributed cloud site.
The results showed that the predicted number of transcoding resources needed in each
cloud site is close to the optimal number of transcoding resources.
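As an illustration of the forecasting step, a minimal baseline predicts the next period's demand at a site from recent optimizer decisions. The thesis builds distributed machine-learning forecasters; the moving-average rule below is only a hypothetical stand-in for that step.

```python
# Illustrative baseline (not the thesis's model): forecast the transcoding
# resources a cloud site needs from a historical series of optimal
# allocations, using a simple moving average.
import math

def forecast_resources(history, window=3):
    """Predict next-step transcoding resource demand as the mean of the
    last `window` observations, rounded up so the cloud site is never
    under-provisioned relative to the baseline estimate."""
    recent = history[-window:]
    return math.ceil(sum(recent) / len(recent))
```

Reserving the forecast amount ahead of time avoids both failure modes the abstract describes: on-demand shortfalls that stall playback, and over-provisioning that wastes rented instances.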
Equality of Voice: Towards Fair Representation in Crowdsourced Top-K Recommendations
To help their users discover important items at a particular time, major
websites like Twitter, Yelp, TripAdvisor or NYTimes provide Top-K
recommendations (e.g., 10 Trending Topics, Top 5 Hotels in Paris or 10 Most
Viewed News Stories), which rely on crowdsourced popularity signals to select
the items. However, different sections of a crowd may have different
preferences, and there is a large silent majority who do not explicitly express
their opinion. Also, the crowd often consists of actors like bots, spammers, or
people running orchestrated campaigns. Recommendation algorithms today largely
do not consider such nuances, hence are vulnerable to strategic manipulation by
small but hyper-active user groups.
To fairly aggregate the preferences of all users while recommending top-K
items, we borrow ideas from prior research on social choice theory, and
identify a voting mechanism called Single Transferable Vote (STV) as having
many of the fairness properties we desire in top-K item (s)elections. We
develop an innovative mechanism to attribute preferences to the silent majority,
which also makes STV completely operational. We show the generalizability of our
approach by implementing it on two different real-world datasets. Through
extensive experimentation and comparison with state-of-the-art techniques, we
show that our proposed approach provides maximum user satisfaction, and cuts
down drastically on items disliked by most but hyper-actively promoted by a few
users.
Comment: In the proceedings of the Conference on Fairness, Accountability, and
Transparency (FAT* '19). Please cite the conference version.
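The STV mechanism named above proceeds by rounds of counting, election at a quota, and elimination of the weakest candidate. The sketch below is a simplified version with hypothetical ballots, not the paper's implementation: surplus ballots are transferred whole to their next surviving preference, whereas full STV transfers them fractionally.

```python
# Simplified STV for top-K selection (illustrative; full STV transfers
# surplus votes fractionally, which is omitted here for brevity).
from collections import Counter
import math

def stv_topk(ballots, k):
    """Select k items from ranked ballots via quota election and
    iterative elimination of the weakest remaining candidate."""
    quota = math.floor(len(ballots) / (k + 1)) + 1  # Droop quota
    remaining = {c for ballot in ballots for c in ballot}
    elected = []
    while len(elected) < k and remaining:
        # Count each ballot toward its highest-ranked surviving candidate.
        counts = Counter()
        for ballot in ballots:
            for c in ballot:
                if c in remaining:
                    counts[c] += 1
                    break
        if not counts:
            break
        top, votes = counts.most_common(1)[0]
        if votes >= quota or len(remaining) <= k - len(elected):
            elected.append(top)        # meets quota (or must fill seats)
            remaining.discard(top)
        else:
            # No winner this round: eliminate the weakest candidate.
            lowest = min(counts, key=lambda c: counts[c])
            remaining.discard(lowest)
    return elected
```

The fairness property the paper exploits is proportionality: a cohesive minority large enough to reach the quota secures a seat, so a small hyper-active group cannot sweep all K slots.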