Automatically Leveraging MapReduce Frameworks for Data-Intensive Applications
MapReduce is a popular programming paradigm for developing large-scale,
data-intensive computation. Many frameworks that implement this paradigm have
recently been developed. To leverage these frameworks, however, developers must
become familiar with their APIs and rewrite existing code. Casper is a new tool
that automatically translates sequential Java programs into the MapReduce
paradigm. Casper identifies potential code fragments to rewrite and translates
them in two steps: (1) Casper uses program synthesis to search for a program
summary (i.e., a functional specification) of each code fragment. The summary
is expressed using a high-level intermediate language resembling the MapReduce
paradigm and verified to be semantically equivalent to the original using a
theorem prover. (2) Casper generates executable code from the summary, using
either the Hadoop, Spark, or Flink API. We evaluated Casper by automatically
converting real-world, sequential Java benchmarks to MapReduce. The resulting
benchmarks perform up to 48.2x faster compared to the original.
Comment: 12 pages, additional 4 pages of references and appendi
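To make the idea of rewriting a sequential loop in the MapReduce paradigm concrete, here is a minimal hand-written sketch in Python. It is not Casper's output, its intermediate language, or any Hadoop, Spark, or Flink API; the word-count task and the function names `word_count_sequential` and `word_count_mapreduce` are assumptions chosen only for illustration.

```python
from collections import defaultdict
from functools import reduce


def word_count_sequential(lines):
    """Sequential loop of the kind a translator would take as input."""
    counts = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts


def word_count_mapreduce(lines):
    """The same computation expressed as map, shuffle, and reduce phases."""
    # Map: emit a (key, value) pair for every word.
    pairs = [(word, 1) for line in lines for word in line.split()]

    # Shuffle: group the emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)

    # Reduce: combine each key's values with an associative operator.
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}


if __name__ == "__main__":
    sample = ["to be or not to be", "to do"]
    print(word_count_sequential(sample) == word_count_mapreduce(sample))
```

The two functions return equal dictionaries for the same input, which is the equivalence the abstract says is checked formally with a theorem prover rather than by testing.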
A MapReduce architecture for web site user behaviour monitoring in real time
Monitoring the behaviour of large numbers of web site users in real time poses significant performance challenges, due to the decentralised location and volume of generated data. This paper proposes a MapReduce-style architecture where the processing of event series from the Web users is performed by a number of cascading mappers, reducers and rereducers, local to the event origin. With the use of static analysis and a prototype implementation, we show how this architecture is capable of carrying out time series analysis in real time for very large web data sets, based on the actual events, instead of resorting to sampling or other extrapolation techniques.
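The cascade of mappers, reducers and rereducers can be sketched as follows. This is a minimal Python illustration under assumed details, not the paper's prototype: the event shape (a dict with a `page` field), the per-origin grouping, and the function names are all hypothetical.

```python
from collections import Counter


def map_event(event):
    """Mapper: turn one raw event into a (key, value) pair."""
    return (event["page"], 1)


def reduce_local(pairs):
    """Reducer: aggregate the pairs produced at a single event origin."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return counts


def rereduce(partials):
    """Rereducer: merge partial aggregates coming from several origins."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts key by key
    return total


def page_views(events_by_origin):
    """Run the cascade: map and reduce near each origin, then rereduce."""
    partials = [
        reduce_local(map_event(event) for event in events)
        for events in events_by_origin.values()
    ]
    return rereduce(partials)


if __name__ == "__main__":
    events_by_origin = {
        "edge-a": [{"page": "/home"}, {"page": "/cart"}],
        "edge-b": [{"page": "/home"}],
    }
    print(page_views(events_by_origin))
```

Because each reducer forwards only a compact partial aggregate, the rereducer sees every event's contribution without the raw events being shipped to one place, which is the property that avoids sampling.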
DEMO: integrating MPC in big data workflows
Secure multi-party computation (MPC) allows multiple parties to perform a joint computation without disclosing their private inputs. Many real-world joint computation use cases, however, involve data analyses on very large data sets, and are implemented by software engineers who lack MPC knowledge. Moreover, the collaborating parties -- e.g., several companies -- often deploy different data analytics stacks internally. These restrictions hamper the real-world usability of MPC. To address these challenges, we combine existing MPC frameworks with data-parallel analytics frameworks by extending the Musketeer big data workflow manager [4]. Musketeer automatically generates code for both the sensitive parts of a workflow, which are executed in MPC, and the remainder of the computation, which runs on scalable, widely deployed analytics systems. In a prototype use case, we compute the Herfindahl-Hirschman Index (HHI), an index of market concentration used in antitrust regulation, on an aggregate 156GB of taxi trip data over five transportation companies. Our implementation computes the HHI in about 20 minutes using a combination of Hadoop and VIFF [1], while even "mixed mode" MPC with VIFF alone would have taken many hours. Finally, we discuss future research questions that we seek to address using our approach.
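For reference, the HHI is the sum of the squared market shares of all firms, conventionally with shares expressed in percentage points so that a monopoly scores 10,000. The sketch below computes it in the clear in plain Python; it involves no MPC, Hadoop, VIFF, or Musketeer code, and the per-company totals, company names, and function name are hypothetical. Which quantity the paper uses as the basis for market share is not stated in the abstract, so a generic per-company total is assumed.

```python
def herfindahl_hirschman_index(totals):
    """Return the HHI for a mapping of company name to its total.

    Shares are taken in percentage points, so the result lies in (0, 10000].
    """
    grand_total = sum(totals.values())
    if grand_total <= 0:
        raise ValueError("totals must sum to a positive number")
    return sum((100.0 * value / grand_total) ** 2
               for value in totals.values())


if __name__ == "__main__":
    # Hypothetical per-company totals; in the paper's setting each company
    # would keep its own figure private.
    totals = {"A": 1, "B": 1, "C": 1, "D": 1, "E": 1}
    print(herfindahl_hirschman_index(totals))
```

In the workflow the abstract describes, the large per-company aggregation is the part suited to a data-parallel system, while combining the private totals into shares is the sensitive step that would run under MPC.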