
    MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That's Not a Nail!

    Hadoop is currently the large-scale data analysis "hammer" of choice, but there exist classes of algorithms that aren't "nails", in the sense that they are not particularly amenable to the MapReduce programming model. To address this, researchers have proposed MapReduce extensions or alternative programming models in which these algorithms can be elegantly expressed. This essay espouses a very different position: that MapReduce is "good enough", and that instead of trying to invent screwdrivers, we should simply get rid of everything that's not a nail. To be more specific, much discussion in the literature surrounds the fact that iterative algorithms are a poor fit for MapReduce: the simple solution is to find alternative non-iterative algorithms that solve the same problem. This essay captures my personal experiences as an academic researcher as well as a software engineer in a "real-world" production analytics environment. From this combined perspective I reflect on the current state and future of "big data" research.

    Reconciling Synthesis and Decomposition: A Composite Approach to Capability Identification

    Stakeholders' expectations and technology constantly evolve during the lengthy development cycles of a large-scale computer-based system. Consequently, the traditional approach of baselining requirements results in an unsatisfactory system because it is ill-equipped to accommodate such change. In contrast, systems constructed on the basis of Capabilities are more change-tolerant; Capabilities are functional abstractions that are neither as amorphous as user needs nor as rigid as system requirements. Rather, Capabilities are aggregates that capture desired functionality from the users' needs, and are designed to exhibit the desirable software engineering characteristics of high cohesion, low coupling and optimum abstraction levels. To formulate these functional abstractions we develop and investigate two algorithms for Capability identification: Synthesis and Decomposition. The synthesis algorithm aggregates detailed rudimentary elements of the system to form Capabilities. In contrast, the decomposition algorithm determines Capabilities by recursively partitioning the overall mission of the system into more detailed entities. Empirical analysis on a small computer-based library system reveals that neither approach is sufficient by itself. However, a composite algorithm based on a complementary approach reconciling the two polar perspectives results in a more feasible set of Capabilities. In particular, the composite algorithm formulates Capabilities using the cohesion and coupling measures as defined by the decomposition algorithm and the abstraction level as determined by the synthesis algorithm. Comment: This paper appears in the 14th Annual IEEE International Conference and Workshop on the Engineering of Computer Based Systems (ECBS); 10 pages, 9 figures.
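    The composite idea above can be made concrete with a small sketch. The Python snippet below only illustrates scoring candidate Capabilities by cohesion, coupling and abstraction level; the specific measures, weights and toy library-system data are assumptions made for the example, not the paper's definitions.

from itertools import combinations

# Illustrative (assumed) measures: these are simplified stand-ins for the
# cohesion, coupling and abstraction-level criteria discussed in the abstract.
def cohesion(cap, deps):
    """Fraction of possible element pairs inside `cap` that actually interact."""
    pairs = list(combinations(sorted(cap), 2))
    if not pairs:
        return 1.0
    linked = sum(1 for a, b in pairs if (a, b) in deps or (b, a) in deps)
    return linked / len(pairs)

def coupling(cap, deps):
    """Fraction of the dependencies touching `cap` that cross its boundary."""
    crossing = sum(1 for a, b in deps if (a in cap) != (b in cap))
    internal = sum(1 for a, b in deps if a in cap and b in cap)
    total = crossing + internal
    return crossing / total if total else 0.0

def composite_score(cap, deps, abstraction_level, target_level=2):
    """Favour high cohesion, low coupling and an abstraction level near the target."""
    return cohesion(cap, deps) - coupling(cap, deps) - 0.1 * abs(abstraction_level - target_level)

# Toy example: functional elements of a small library system and their dependencies.
deps = {("search", "catalogue"), ("checkout", "catalogue"), ("checkout", "billing")}
candidates = {
    frozenset({"search", "catalogue"}): 2,   # candidate Capability -> abstraction level
    frozenset({"checkout", "billing"}): 2,
    frozenset({"search", "billing"}): 3,
}
best = max(candidates, key=lambda c: composite_score(c, deps, candidates[c]))
print(sorted(best))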

    Agnostic cosmology in the CAMEL framework

    Cosmological parameter estimation is traditionally performed in the Bayesian context. By adopting an "agnostic" statistical point of view, we show the interest of confronting the Bayesian results with a frequentist approach based on profile likelihoods. To this purpose, we have developed the Cosmological Analysis with a Minuit Exploration of the Likelihood ("CAMEL") software. Written from scratch in pure C++, CAMEL emphasizes a clean and carefully designed project where new data and/or cosmological computations can be easily included. CAMEL incorporates the latest cosmological likelihoods and gives access, from the very same input file, to several estimation methods: (i) a high-quality Maximum Likelihood Estimate (a.k.a. "best fit") using MINUIT; (ii) profile likelihoods; (iii) a new implementation of an Adaptive Metropolis MCMC algorithm that relieves the burden of reconstructing the proposal distribution. We present here those various statistical techniques and roll out a full use case that can then be used as a tutorial. We revisit the determination of the ΛCDM parameters with the latest Planck data and give results with both methodologies. Furthermore, by comparing the Bayesian and frequentist approaches, we discuss a "likelihood volume effect" that affects the reionization optical depth when analyzing the high-multipole part of the Planck data. The software, used in several Planck data analyses, is available from http://camel.in2p3.fr. Using it does not require advanced C++ skills. Comment: Typeset in Authorea. Online version available at: https://www.authorea.com/users/90225/articles/104431/_show_articl
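    As a rough illustration of the profile-likelihood method mentioned above (not CAMEL itself, which is written in C++ around MINUIT), the Python sketch below profiles a toy Gaussian likelihood with scipy: for each fixed value of the parameter of interest, -2 log L is minimised over the remaining nuisance parameter. The toy data and grid are assumptions made for the example.

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
data = rng.normal(loc=1.2, scale=0.8, size=200)   # toy data set standing in for a real likelihood

def neg2logL(params):
    """-2 log L of a Gaussian model with parameters (mu, log_sigma)."""
    mu, log_sigma = params
    return -2.0 * norm.logpdf(data, loc=mu, scale=np.exp(log_sigma)).sum()

def profile(mu_grid):
    """Profile -2 log L: minimise over log_sigma for each fixed mu."""
    out = []
    for mu in mu_grid:
        res = minimize(lambda p: neg2logL([mu, p[0]]), x0=[0.0], method="Nelder-Mead")
        out.append(res.fun)
    return np.array(out)

mu_grid = np.linspace(0.8, 1.6, 41)
chi2 = profile(mu_grid)
# Approximate 68% interval: grid points within 1 unit of the minimum of -2 log L.
inside = mu_grid[chi2 - chi2.min() < 1.0]
print(f"best-fit mu ~ {mu_grid[chi2.argmin()]:.3f}, 68% interval ~ [{inside.min():.3f}, {inside.max():.3f}]")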

    Synthesizing and tuning chemical reaction networks with specified behaviours

    We consider how to generate chemical reaction networks (CRNs) from functional specifications. We propose a two-stage approach that combines synthesis by satisfiability modulo theories and Markov chain Monte Carlo based optimisation. First, we identify candidate CRNs that have the possibility to produce correct computations for a given finite set of inputs. We then optimise the reaction rates of each CRN using a combination of stochastic search techniques applied to the chemical master equation, simultaneously improving the probability of correct behaviour and ruling out spurious solutions. In addition, we use techniques from continuous-time Markov chain theory to study the expected termination time for each CRN. We illustrate our approach by identifying CRNs for majority decision-making and division computation, which includes the identification of both known and unknown networks. Comment: 17 pages, 6 figures; appeared in the proceedings of the 21st conference on DNA Computing and Molecular Programming, 2015.
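    A minimal sketch of the rate-tuning half of such a pipeline is given below, assuming a toy approximate-majority CRN: the probability of reaching the correct output is estimated by Gillespie stochastic simulation, and the rates are perturbed by a crude random search. The network, rates and search strategy are illustrative assumptions; the paper itself works with SMT synthesis and the chemical master equation.

import random

# Toy approximate-majority network: (reactants, products); rates are tuned below.
REACTIONS = [
    ({"X": 1, "Y": 1}, {"B": 2}),   # X + Y -> 2B
    ({"X": 1, "B": 1}, {"X": 2}),   # X + B -> 2X
    ({"Y": 1, "B": 1}, {"Y": 2}),   # Y + B -> 2Y
]

def propensity(rate, reactants, state):
    """Mass-action propensity for a reaction with unit stoichiometric coefficients."""
    a = rate
    for sp in reactants:
        a *= state.get(sp, 0)
    return a

def ssa_correct(rates, init, winner="X", t_max=50.0):
    """One Gillespie run; True if only the expected winner species survives."""
    state, t = dict(init), 0.0
    while t < t_max:
        props = [propensity(r, rea, state) for r, (rea, _) in zip(rates, REACTIONS)]
        total = sum(props)
        if total == 0:
            break                                   # no reaction can fire: consensus reached
        t += random.expovariate(total)
        rea, prod = random.choices(REACTIONS, weights=props)[0]
        for sp, n in rea.items():
            state[sp] -= n
        for sp, n in prod.items():
            state[sp] = state.get(sp, 0) + n
    return state.get("Y", 0) == 0 and state.get("B", 0) == 0 and state.get(winner, 0) > 0

def p_correct(rates, runs=200):
    """Monte Carlo estimate of the probability of correct majority output."""
    init = {"X": 30, "Y": 20, "B": 0}
    return sum(ssa_correct(rates, init) for _ in range(runs)) / runs

rates = [1.0, 1.0, 1.0]
best = p_correct(rates)
for _ in range(20):                                 # crude random search over the rates
    cand = [max(1e-3, r * random.uniform(0.5, 2.0)) for r in rates]
    p = p_correct(cand)
    if p > best:
        rates, best = cand, p
print(f"estimated P(correct) = {best:.2f} with rates {rates}")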

    TensorFlow Estimators: Managing Simplicity vs. Flexibility in High-Level Machine Learning Frameworks

    We present a framework for specifying, training, evaluating, and deploying machine learning models. Our focus is on simplifying cutting-edge machine learning for practitioners in order to bring such technologies into production. Recognizing the fast evolution of the field of deep learning, we make no attempt to capture the design space of all possible model architectures in a domain-specific language (DSL) or similar configuration language. We allow users to write code to define their models, but provide abstractions that guide developers to write models in ways conducive to productionization. We also provide a unifying Estimator interface, making it possible to write downstream infrastructure (e.g. distributed training, hyperparameter tuning) independent of the model implementation. We balance the competing demands for flexibility and simplicity by offering APIs at different levels of abstraction, making common model architectures available out of the box, while providing a library of utilities designed to speed up experimentation with model architectures. To make out-of-the-box models flexible and usable across a wide range of problems, these canned Estimators are parameterized not only over traditional hyperparameters, but also using feature columns, a declarative specification describing how to interpret input data. We discuss our experience in using this framework in research and production environments, and show the impact on code health, maintainability, and development speed. Comment: 8 pages; appeared at KDD 2017, August 13-17, 2017, Halifax, NS, Canada.
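    A minimal usage sketch of the canned-Estimator-plus-feature-columns pattern described above, written against the TensorFlow 1.x-era tf.estimator API, is shown below; the toy data, feature column and hyperparameters are illustrative assumptions.

import numpy as np
import tensorflow as tf

# Declarative description of how to interpret the input features.
feature_columns = [tf.feature_column.numeric_column("x", shape=[4])]

# Canned Estimator: the model architecture comes "out of the box".
classifier = tf.estimator.DNNClassifier(
    feature_columns=feature_columns,
    hidden_units=[16, 16],
    n_classes=3,
)

def train_input_fn():
    # Random stand-in data; a real pipeline would read from files via tf.data.
    features = {"x": np.random.rand(120, 4).astype(np.float32)}
    labels = np.random.randint(0, 3, size=120)
    dataset = tf.data.Dataset.from_tensor_slices((features, labels))
    return dataset.shuffle(120).repeat().batch(32)

classifier.train(input_fn=train_input_fn, steps=200)
print(classifier.evaluate(input_fn=train_input_fn, steps=10))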