On data skewness, stragglers, and MapReduce progress indicators
We tackle the problem of predicting the performance of MapReduce
applications, designing accurate progress indicators that keep programmers
informed on the percentage of completed computation time during the execution
of a job. Through extensive experiments, we show that state-of-the-art progress
indicators (including the one provided by Hadoop) can be seriously harmed by
data skewness, load unbalancing, and straggling tasks. This is mainly due to
their implicit assumption that the running time depends linearly on the input
size. We thus design a novel profile-guided progress indicator, called
NearestFit, that operates without the linearity assumption and exploits
a careful combination of nearest neighbor regression and statistical curve
fitting techniques. Our theoretical progress model requires fine-grained
profile data that can be very difficult to manage in practice. To overcome
this issue, we resort to computing accurate approximations for some of the
quantities used in our model through space- and time-efficient data streaming
algorithms. We implemented NearestFit on top of Hadoop 2.6.0. An extensive
empirical assessment over the Amazon EC2 platform on a variety of real-world
benchmarks shows that NearestFit is practical w.r.t. space and time overheads
and that its accuracy is generally very good, even in scenarios where
competitors incur non-negligible errors and wide prediction fluctuations.
Overall, NearestFit significantly improves the current state of the art in
progress analysis for MapReduce.
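The core idea of profile-guided, non-linear progress estimation can be illustrated with a small sketch. The function and profile data below are hypothetical, not the paper's implementation: a k-nearest-neighbor regression predicts a task's running time from previously observed (input size, runtime) pairs, instead of assuming runtime grows linearly with input size.

```python
# Hedged sketch of nearest-neighbor regression for task-time prediction,
# in the spirit of NearestFit. Names and profile data are illustrative.

def knn_predict_runtime(profile, input_size, k=3):
    """Predict a task's runtime from (input_size, runtime) profile points,
    averaging the k nearest neighbors by input size rather than fitting
    a single linear model."""
    neighbors = sorted(profile, key=lambda p: abs(p[0] - input_size))[:k]
    return sum(t for _, t in neighbors) / len(neighbors)

# A profile with super-linear growth (e.g. a skewed reduce key):
# a linear fit through the origin would badly underestimate large inputs.
profile = [(10, 1.0), (20, 4.0), (40, 16.0), (80, 64.0)]
predicted = knn_predict_runtime(profile, 60, k=2)
```

A linear extrapolation from the small inputs would predict roughly 6 seconds for an input of size 60, while the neighbor-based estimate stays close to the locally observed growth.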
Low latency via redundancy
Low latency is critical for interactive networked applications. But while we
know how to scale systems to increase capacity, reducing latency --- especially
the tail of the latency distribution --- can be much more difficult. In this
paper, we argue that the use of redundancy is an effective way to convert extra
capacity into reduced latency. By initiating redundant operations across
diverse resources and using the first result which completes, redundancy
improves a system's latency even under exceptional conditions. We study the
tradeoff with added system utilization, characterizing the situations in which
replicating all tasks reduces mean latency. We then demonstrate empirically
that replicating all operations can result in significant mean and tail
latency reductions in real-world systems, including DNS queries, database
servers, and packet forwarding within networks.
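The "initiate redundant operations and take the first result" pattern described above can be sketched in a few lines. The replica latencies here are simulated with sleeps; a real deployment would issue the same request to, say, multiple DNS servers or database replicas.

```python
# Hedged sketch of redundancy-for-latency: send the same request to
# several replicas and return whichever answer completes first.
import concurrent.futures
import time

def query_replica(delay, value):
    time.sleep(delay)          # simulated replica latency
    return value

def redundant_query(replicas):
    """replicas: list of (latency_seconds, response) pairs to simulate."""
    with concurrent.futures.ThreadPoolExecutor(len(replicas)) as pool:
        futures = [pool.submit(query_replica, d, v) for d, v in replicas]
        # Block only until the first future completes; laggards are ignored.
        done, _ = concurrent.futures.wait(
            futures, return_when=concurrent.futures.FIRST_COMPLETED)
        return next(iter(done)).result()

# Two replicas: a slow one (0.2 s) and a fast one (0.01 s).
answer = redundant_query([(0.2, "slow"), (0.01, "fast")])
```

The tradeoff the paper studies is visible even here: the slow replica's work is wasted (extra utilization) in exchange for the caller seeing only the fastest replica's latency.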
Modelling e-commerce customer reactions: Exploring online shopping carnivals in China
This research investigates customer reactions by exploring
satisfaction (SAT), customer complaints (CC), and customer loyalty (CL)
in the context of online shopping carnivals (OSCs) in China. The American
Customer Satisfaction Index (ACSI) model is expanded by including
e-commerce corporate image (ECCI) alongside customer expectations (CE),
perceived quality (PQ), and perceived value (PV) as determinants of SAT,
while CC and CL are estimated based on SAT; for estimating CL, ECCI is
also included. 300 valid questionnaires were collected from Chinese
shoppers with OSC experience. The research hypotheses were tested through
Confirmatory Factor Analysis and Structural Equation Modelling. The
results reveal five key paths influencing SAT and CL. No significant
impact on or of CC was identified. ECCI significantly impacted CC, SAT,
and CL. This study provides a new research perspective on customer
reactions in the context of OSCs, centred on satisfaction, emphasising
the role of image on expectations, satisfaction, and loyalty, and
incorporating customer complaints to quantify negative aspects of the
shopping experience in determining customer loyalty. E-commerce companies
should deliver an unforgettable customer experience by building a
long-lasting image, offering consistent quality, and delivering clearly
delineated value, as antecedents of satisfaction and loyalty. The model
can be further expanded by exploring the consequences of customer loyalty
for potential buying behaviour, focusing on purchasing intention and
recommendations.
Random Hyper-parameter Search-Based Deep Neural Network for Power Consumption Forecasting
In this paper, we introduce a deep learning approach, based on feed-forward
neural networks, for big data time series forecasting with arbitrary
prediction horizons. We first propose a random search to tune the multiple
hyper-parameters involved in the method's performance. This search has a
twofold objective: first, to improve the forecasts and, second, to decrease
the learning time. Next, we propose a procedure based on moving averages to
smooth the predictions obtained by the different models considered for each
value of the prediction horizon. We conduct a comprehensive evaluation using
a real-world dataset composed of electricity consumption in Spain,
evaluating accuracy and comparing the performance of the proposed deep
learning approach with a grid search and a random search without applying
smoothing. Reported results show that a random search produces competitive
accuracy results while generating a smaller number of models, and that the
smoothing process reduces the forecasting error.
Ministerio de Economía y Competitividad TIN2017-88209-C2-1-
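The two ingredients described above, random hyper-parameter search and moving-average smoothing of predictions, can be sketched as follows. The parameter grid and the toy scoring function are illustrative stand-ins for training and validating an actual feed-forward network.

```python
# Hedged sketch of random hyper-parameter search plus moving-average
# smoothing. The grid, score function, and prediction series are toys.
import random

def random_search(param_grid, score, n_iter=10, seed=0):
    """Sample n_iter random configurations from the grid and keep the
    best-scoring one; far fewer models than exhaustive grid search."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_iter):
        cfg = {k: rng.choice(v) for k, v in param_grid.items()}
        s = score(cfg)
        if s > best_score:
            best_cfg, best_score = cfg, s
    return best_cfg

def moving_average(preds, window=3):
    """Smooth a per-horizon prediction series with a trailing moving
    average (shorter windows at the start of the series)."""
    out = []
    for i in range(len(preds)):
        lo = max(0, i - window + 1)
        out.append(sum(preds[lo:i + 1]) / (i + 1 - lo))
    return out

grid = {"layers": [1, 2, 3], "units": [16, 32, 64]}
best = random_search(grid, score=lambda c: -(c["layers"] - 2) ** 2)
smoothed = moving_average([10.0, 14.0, 12.0, 20.0], window=2)
```

With a full grid search the toy grid above would mean 9 trained models; the random search caps the count at `n_iter` draws regardless of grid size, which is the learning-time saving the abstract refers to.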
Demonstrate-Search-Predict: Composing retrieval and language models for knowledge-intensive NLP
Retrieval-augmented in-context learning has emerged as a powerful approach
for addressing knowledge-intensive tasks using frozen language models (LM) and
retrieval models (RM). Existing work has combined these in simple
"retrieve-then-read" pipelines in which the RM retrieves passages that are
inserted into the LM prompt. To begin to fully realize the potential of frozen
LMs and RMs, we propose Demonstrate-Search-Predict (DSP), a framework that
relies on passing natural language texts in sophisticated pipelines between an
LM and an RM. DSP can express high-level programs that bootstrap pipeline-aware
demonstrations, search for relevant passages, and generate grounded
predictions, systematically breaking down problems into small transformations
that the LM and RM can handle more reliably. We have written novel DSP programs
for answering questions in open-domain, multi-hop, and conversational settings,
establishing in early evaluations new state-of-the-art in-context learning
results and delivering 37-200%, 8-40%, and 80-290% relative gains against
vanilla LMs, a standard retrieve-then-read pipeline, and a contemporaneous
self-ask pipeline, respectively.
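The Demonstrate-Search-Predict control flow can be sketched with stub components. Everything below is a toy stand-in, not the DSP framework's API: the "RM" is a dictionary lookup and the "LM" is a string formatter, but the three-stage structure (demonstrate, search, predict) mirrors the pipeline the abstract describes.

```python
# Hedged sketch of the DSP control flow with stub LM and RM functions.
# A real pipeline would call a frozen language model and a retriever.

CORPUS = {
    "spark": "Spark is a cluster computing engine.",
    "hadoop": "Hadoop implements MapReduce.",
}

def rm_search(query):
    """Stub retrieval model: return passages whose key occurs in the query."""
    return [text for key, text in CORPUS.items() if key in query.lower()]

def lm_predict(question, passages, demos):
    """Stub language model: 'ground' the prediction in the top passage."""
    context = passages[0] if passages else "no passage found"
    return f"Q: {question} | grounded in: {context}"

def dsp(question, demos=()):
    # Demonstrate: pipeline-aware demonstrations (stubbed as a tuple).
    # Search: the RM retrieves passages relevant to the question.
    passages = rm_search(question)
    # Predict: the LM generates an answer grounded in the passages.
    return lm_predict(question, passages, demos)

answer = dsp("What is Spark?")
```

The point of the structure is the one the abstract makes: each stage is a small transformation the LM or RM can handle reliably, rather than a single retrieve-then-read call.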
Hybrid, Optical and Wireless Near-Gigabit Communications System
This paper presents the study and the realization of a hybrid 60 GHz wireless
communications system. As the 60 GHz radio link operates only in a single-room
configuration, an additional Radio over Fibre (RoF) link is used to ensure the
communications in all the rooms of a residential environment. A single carrier
architecture is adopted. The system uses low complexity baseband processing
modules. A byte/frame synchronization technique is designed to provide a
high preamble detection probability and a very low false alarm
probability. A conventional RS (255, 239) encoder and decoder are used to
correct errors in the transmission channel. Results of Bit Error Rate (BER)
measurements are presented for various antenna configurations.
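The tradeoff between detection probability and false alarm probability in preamble-based synchronization can be illustrated with a small sketch. The preamble pattern and threshold below are illustrative, not the paper's actual design: a known bit pattern is slid over the received stream, and a frame start is declared where the match count crosses a threshold.

```python
# Hedged sketch of preamble-based frame synchronization. Raising the
# threshold lowers the false alarm probability at the cost of detection
# probability under bit errors; the values here are toys.

def detect_preamble(bits, preamble, threshold):
    """Return the first offset where at least `threshold` preamble bits
    match the received stream, or -1 if no offset qualifies."""
    for off in range(len(bits) - len(preamble) + 1):
        matches = sum(b == p for b, p in zip(bits[off:], preamble))
        if matches >= threshold:
            return off
    return -1

preamble = [1, 0, 1, 1, 0, 1]
rx = [0, 0, 1, 0, 1, 1, 0, 1, 0, 1]   # preamble starts at offset 2
offset = detect_preamble(rx, preamble, threshold=6)
```

Setting the threshold below the full preamble length tolerates channel bit errors (higher detection probability) but makes spurious matches in random data more likely (higher false alarm probability).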
Modeling ring current ion and electron dynamics and plasma instabilities during a high‐speed stream driven storm
Peer Reviewed. http://deepblue.lib.umich.edu/bitstream/2027.42/94593/1/jgra21840.pd
Discretized streams: A fault-tolerant model for scalable stream processing
Many "big data" applications need to act on data arriving in real time. However, current programming models for distributed stream processing are relatively low-level, often leaving the user to worry about consistency of state across the system and fault recovery. Furthermore, the models that provide fault recovery do so in an expensive manner, requiring either hot replication or long recovery times. We propose a new programming model, discretized streams (D-Streams), that offers a high-level functional API, strong consistency, and efficient fault recovery. D-Streams support a new recovery mechanism that improves efficiency over the traditional replication and upstream backup schemes in streaming databases, namely parallel recovery of lost state, and, unlike previous systems, also mitigate stragglers. We implement D-Streams as an extension to the Spark cluster computing engine that lets users seamlessly intermix streaming, batch, and interactive queries. Our system can process over 60 million records/second at sub-second latency on 100 nodes.
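The discretization idea can be sketched in miniature: chop a timestamped record stream into short intervals (micro-batches) and run a deterministic batch computation on each, so that lost state can be recomputed from its input batch rather than restored from a hot replica. The interval length and the word-count computation below are illustrative, not Spark Streaming's API.

```python
# Hedged sketch of the D-Streams model: micro-batching plus a
# deterministic per-batch computation.

def discretize(records, interval=1.0):
    """Group (timestamp, value) records into per-interval micro-batches,
    ordered by interval."""
    batches = {}
    for ts, value in records:
        batches.setdefault(int(ts // interval), []).append(value)
    return [batches[k] for k in sorted(batches)]

def word_count(batch):
    """Deterministic per-batch computation: count occurrences of each
    value. Determinism is what makes parallel recomputation possible."""
    counts = {}
    for value in batch:
        counts[value] = counts.get(value, 0) + 1
    return counts

stream = [(0.1, "a"), (0.5, "b"), (1.2, "a"), (1.9, "a")]
results = [word_count(b) for b in discretize(stream)]
```

Because each batch result is a pure function of its input interval, a failed or straggling partition can be recomputed in parallel on other nodes, which is the recovery mechanism the abstract contrasts with hot replication and upstream backup.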