MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That's Not a Nail!
Hadoop is currently the large-scale data analysis "hammer" of choice, but
there exist classes of algorithms that aren't "nails", in the sense that they
are not particularly amenable to the MapReduce programming model. To address
this, researchers have proposed MapReduce extensions or alternative programming
models in which these algorithms can be elegantly expressed. This essay
espouses a very different position: that MapReduce is "good enough", and that
instead of trying to invent screwdrivers, we should simply get rid of
everything that's not a nail. To be more specific, much discussion in the
literature surrounds the fact that iterative algorithms are a poor fit for
MapReduce: the simple solution is to find alternative non-iterative algorithms
that solve the same problem. This essay captures my personal experiences as an
academic researcher as well as a software engineer in a "real-world" production
analytics environment. From this combined perspective I reflect on the current
state and future of "big data" research.
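As a hedged illustration of the essay's thesis (not an example taken from the essay itself): many computations that look iterative can be recast as a single MapReduce pass over combinable sufficient statistics. The sketch below, in plain Python standing in for a Hadoop job, computes per-key means from (sum, count) pairs in one pass; the data and key names are invented.

```python
# Hypothetical sketch, not from the essay: per-key means computed in a single
# MapReduce pass over combinable (sum, count) statistics, i.e. no iteration.
from collections import defaultdict

def map_phase(records):
    # Emit (key, (value, 1)) pairs; partial sums can be combined anywhere.
    for key, value in records:
        yield key, (value, 1)

def reduce_phase(pairs):
    # Sum the (value, count) pairs per key, then derive the mean.
    acc = defaultdict(lambda: [0.0, 0])
    for key, (total, count) in pairs:
        acc[key][0] += total
        acc[key][1] += count
    return {key: s / n for key, (s, n) in acc.items()}

if __name__ == "__main__":
    records = [("a", 2.0), ("b", 3.0), ("a", 4.0)]
    print(reduce_phase(map_phase(records)))  # {'a': 3.0, 'b': 3.0}
```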
Reconciling Synthesis and Decomposition: A Composite Approach to Capability Identification
Stakeholders' expectations and technology constantly evolve during the
lengthy development cycles of a large-scale computer based system.
Consequently, the traditional approach of baselining requirements results in an
unsatisfactory system because it is ill-equipped to accommodate such change. In
contrast, systems constructed on the basis of Capabilities are more
change-tolerant; Capabilities are functional abstractions that are neither as
amorphous as user needs nor as rigid as system requirements. Rather,
Capabilities are aggregates that capture desired functionality from the users'
needs, and are designed to exhibit desirable software engineering
characteristics of high cohesion, low coupling and optimum abstraction levels.
To formulate these functional abstractions we develop and investigate two
algorithms for Capability identification: Synthesis and Decomposition. The
synthesis algorithm aggregates detailed rudimentary elements of the system to
form Capabilities. In contrast, the decomposition algorithm determines
Capabilities by recursively partitioning the overall mission of the system into
more detailed entities. Empirical analysis on a small computer based library
system reveals that neither approach is sufficient by itself. However, a
composite algorithm based on a complementary approach reconciling the two polar
perspectives results in a more feasible set of Capabilities. In particular, the
composite algorithm formulates Capabilities using the cohesion and coupling
measures as defined by the decomposition algorithm and the abstraction level as
determined by the synthesis algorithm.
Comment: This paper appears in the 14th Annual IEEE International Conference and Workshop on the Engineering of Computer Based Systems (ECBS); 10 pages, 9 figures
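A hedged sketch of the composite idea, with entirely illustrative cohesion and coupling measures (the paper's own definitions are not reproduced here): a candidate partition of rudimentary elements into Capabilities is scored by intra-group dependency density (cohesion) minus the fraction of dependencies that cross group boundaries (coupling). The element names and dependencies below are invented.

```python
# Hypothetical sketch; the cohesion/coupling measures are illustrative,
# not the paper's definitions.
from itertools import combinations

def cohesion(group, deps):
    # Fraction of element pairs inside a group that are linked by a dependency.
    pairs = list(combinations(group, 2))
    if not pairs:
        return 1.0
    linked = sum((a, b) in deps or (b, a) in deps for a, b in pairs)
    return linked / len(pairs)

def coupling(groups, deps):
    # Fraction of dependencies that cross a group boundary.
    cross = sum(1 for a, b in deps
                if any(a in g and b not in g for g in groups))
    return cross / max(len(deps), 1)

def score(groups, deps):
    # Favour high cohesion and low coupling, as in the composite approach.
    avg_cohesion = sum(cohesion(g, deps) for g in groups) / len(groups)
    return avg_cohesion - coupling(groups, deps)

# Invented library-system elements and dependencies.
deps = {("search", "index"), ("search", "rank"), ("loan", "member")}
groups = [{"search", "index", "rank"}, {"loan", "member"}]
print(round(score(groups, deps), 2))
```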
Agnostic cosmology in the CAMEL framework
Cosmological parameter estimation is traditionally performed in the Bayesian
context. By adopting an "agnostic" statistical point of view, we show the
value of confronting the Bayesian results with a frequentist approach based on
profile likelihoods. To this end, we have developed the Cosmological
Analysis with a Minuit Exploration of the Likelihood ("CAMEL") software.
Written from scratch in pure C++, CAMEL emphasizes a clean and carefully
designed project into which new data and/or cosmological computations can
be easily included.
CAMEL incorporates the latest cosmological likelihoods and gives access from
the very same input file to several estimation methods: (i) a high-quality
Maximum Likelihood Estimate (a.k.a. "best fit") using MINUIT; (ii) profile
likelihoods; (iii) a new implementation of an Adaptive Metropolis MCMC
algorithm that relieves the burden of reconstructing the proposal distribution.
We present these various statistical techniques and roll out a full
use-case that can then be used as a tutorial. We revisit the ΛCDM
parameter determination with the latest Planck data and give results with both
methodologies. Furthermore, by comparing the Bayesian and frequentist
approaches, we discuss a "likelihood volume effect" that affects the
reionization optical depth when analyzing the high-multipole part of the Planck data.
The software, used in several Planck data analyses, is available from
http://camel.in2p3.fr. Using it does not require advanced C++ skills.
Comment: Typeset in Authorea. Online version available at: https://www.authorea.com/users/90225/articles/104431/_show_articl
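To illustrate the frequentist side of the comparison, here is a minimal profile-likelihood sketch in Python (CAMEL itself is C++ and uses MINUIT): for each fixed value of the parameter of interest, the chi-square is minimised over a nuisance parameter, and the resulting Delta chi-square curve gives confidence intervals without the volume effect of marginalisation. The toy Gaussian model and parameter names are assumptions for illustration only.

```python
# Minimal profile-likelihood sketch (illustrative toy model, not CAMEL code).
import numpy as np
from scipy.optimize import minimize

def chi2(theta, nu, data):
    # -2 ln L for a toy model: prediction = theta + 0.5 * nu,
    # with a Gaussian constraint of width 0.5 on the nuisance nu.
    model = theta + 0.5 * nu
    return float(np.sum((data - model) ** 2) + (nu / 0.5) ** 2)

def profile(theta_grid, data):
    # For each fixed theta, minimise the chi2 over the nuisance parameter.
    prof = np.array([minimize(lambda nu: chi2(theta, nu[0], data), x0=[0.0]).fun
                     for theta in theta_grid])
    return prof - prof.min()        # Delta chi2 relative to the global best fit

data = np.random.default_rng(0).normal(1.0, 1.0, size=50)
grid = np.linspace(0.0, 2.0, 41)
delta = profile(grid, data)
inside = grid[delta < 1.0]          # approximate 68% CL interval on theta
print(f"theta in [{inside.min():.2f}, {inside.max():.2f}] at 68% CL")
```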
Synthesizing and tuning chemical reaction networks with specified behaviours
We consider how to generate chemical reaction networks (CRNs) from functional
specifications. We propose a two-stage approach that combines synthesis by
satisfiability modulo theories and Markov chain Monte Carlo based optimisation.
First, we identify candidate CRNs that have the possibility to produce correct
computations for a given finite set of inputs. We then optimise the reaction
rates of each CRN using a combination of stochastic search techniques applied
to the chemical master equation, simultaneously improving the probability of correct
behaviour and ruling out spurious solutions. In addition, we use techniques
from continuous time Markov chain theory to study the expected termination time
for each CRN. We illustrate our approach by identifying CRNs for majority
decision-making and division computation, which includes the identification of
both known and unknown networks.
Comment: 17 pages, 6 figures; appeared in the proceedings of the 21st conference on DNA Computing and Molecular Programming, 201
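A hedged sketch of the rate-tuning stage only (the SMT-based synthesis step is not shown): a random-walk Metropolis search over log reaction rates, where each candidate is scored by the estimated probability of correct output from repeated Gillespie simulations of a toy approximate-majority CRN. The network, rates, and scoring temperature below are illustrative assumptions, not taken from the paper.

```python
# Sketch of the rate-tuning stage; the toy CRN, rates, and pseudo-likelihood
# temperature are illustrative assumptions, not the paper's.
import numpy as np

rng = np.random.default_rng(1)

# Toy approximate-majority CRN: X+Y->2B, X+B->2X, Y+B->2Y. State = (X, Y, B).
REACTANTS = np.array([[1, 1, 0], [1, 0, 1], [0, 1, 1]])
CHANGES = np.array([[-1, -1, 2], [1, 0, -1], [0, 1, -1]])

def simulate(rates, init, t_max=20.0):
    """One Gillespie SSA trajectory; returns the final (X, Y, B) counts."""
    state, t = np.array(init, dtype=float), 0.0
    while t < t_max:
        # Unit-stoichiometry propensities: rate * product of reactant counts.
        props = rates * np.prod(np.where(REACTANTS > 0, state, 1.0), axis=1)
        total = props.sum()
        if total == 0.0:
            break                         # no reaction can fire; CRN has settled
        t += rng.exponential(1.0 / total)
        state += CHANGES[rng.choice(len(rates), p=props / total)]
    return state

def p_correct(rates, init=(15, 10, 0), runs=20):
    """Estimated probability that the initial majority species X wins."""
    finals = [simulate(rates, init) for _ in range(runs)]
    return sum(s[0] > s[1] for s in finals) / runs

def tune(steps=50):
    """Random-walk Metropolis over log-rates, maximising P(correct output)."""
    log_rates = np.zeros(3)               # start from rates (1, 1, 1)
    score = p_correct(np.exp(log_rates))
    best, best_score = log_rates.copy(), score
    for _ in range(steps):
        proposal = log_rates + rng.normal(0.0, 0.3, size=3)
        new = p_correct(np.exp(proposal))
        # Accept with Metropolis probability on the pseudo-likelihood exp(10 * P).
        if rng.random() < np.exp(10.0 * (new - score)):
            log_rates, score = proposal, new
            if score > best_score:
                best, best_score = log_rates.copy(), score
    return np.exp(best), best_score

rates, prob = tune()
print("tuned rates:", np.round(rates, 2), " estimated P(correct):", prob)
```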
TensorFlow Estimators: Managing Simplicity vs. Flexibility in High-Level Machine Learning Frameworks
We present a framework for specifying, training, evaluating, and deploying
machine learning models. Our focus is on simplifying cutting edge machine
learning for practitioners in order to bring such technologies into production.
Recognizing the fast evolution of the field of deep learning, we make no
attempt to capture the design space of all possible model architectures in a
domain-specific language (DSL) or similar configuration language. We allow
users to write code to define their models, but provide abstractions that guide
developers to write models in ways conducive to productionization. We also
provide a unifying Estimator interface, making it possible to write downstream
infrastructure (e.g. distributed training, hyperparameter tuning) independent
of the model implementation. We balance the competing demands for flexibility
and simplicity by offering APIs at different levels of abstraction, making
common model architectures available out of the box, while providing a library
of utilities designed to speed up experimentation with model architectures. To
make out of the box models flexible and usable across a wide range of problems,
these canned Estimators are parameterized not only over traditional
hyperparameters, but also using feature columns, a declarative specification
describing how to interpret input data. We discuss our experience in using this
framework in research and production environments, and show the impact on
code health, maintainability, and development speed.
Comment: 8 pages; appeared at KDD 2017, August 13--17, 2017, Halifax, NS, Canada
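A minimal sketch of the canned-Estimator pattern described above, using the tf.estimator and tf.feature_column APIs as they existed in TF 1.x and early TF 2.x (they have since been deprecated and removed in the newest releases); the feature names, vocabulary, and toy data are invented for illustration.

```python
# Canned Estimator + feature columns, with made-up features and toy data.
import tensorflow as tf

# Feature columns: a declarative description of how to interpret input data.
feature_columns = [
    tf.feature_column.numeric_column("age"),
    tf.feature_column.indicator_column(
        tf.feature_column.categorical_column_with_vocabulary_list(
            "country", ["US", "CA", "DE"])),
]

# A canned Estimator: the model architecture comes out of the box; only
# hyperparameters and feature columns are supplied.
estimator = tf.estimator.DNNClassifier(
    feature_columns=feature_columns,
    hidden_units=[32, 16],
    n_classes=2)

def input_fn():
    # Input pipelines are ordinary functions returning a tf.data.Dataset,
    # keeping them independent of the model implementation.
    features = {"age": [25.0, 40.0, 31.0], "country": ["US", "DE", "CA"]}
    labels = [0, 1, 0]
    return tf.data.Dataset.from_tensor_slices((features, labels)).repeat(10).batch(2)

estimator.train(input_fn=input_fn)
print(estimator.evaluate(input_fn=input_fn))
```

The same input_fn and downstream infrastructure (distributed training, hyperparameter tuning) work unchanged if the canned DNNClassifier is swapped for a custom model_fn, which is the flexibility/simplicity trade-off the paper describes.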