
    PlinyCompute: A Platform for High-Performance, Distributed, Data-Intensive Tool Development

    This paper describes PlinyCompute, a system for the development of high-performance, data-intensive, distributed computing tools and libraries. In the large, PlinyCompute presents the programmer with a very high-level, declarative interface, relying on automatic, relational-database-style optimization to figure out how to stage distributed computations. However, in the small, PlinyCompute presents the capable systems programmer with a persistent object data model and API (the "PC object model") and an associated memory management system that has been designed from the ground up for high-performance, distributed, data-intensive computing. This contrasts with most other Big Data systems, which are constructed on top of the Java Virtual Machine (JVM), and hence must at least partially cede performance-critical concerns such as memory management (including layout and de/allocation) and virtual method/function dispatch to the JVM. This hybrid approach (declarative in the large, trusting the programmer's ability to utilize the PC object model efficiently in the small) results in a system that is ideal for the development of reusable, data-intensive tools and libraries. Through extensive benchmarking, we show that implementing complex object manipulation and non-trivial, library-style computations on top of PlinyCompute can result in a speedup of 2x to more than 50x compared to equivalent implementations on Spark. Comment: 48 pages, including references and Appendi

    Evolving Spatially Aggregated Features from Satellite Imagery for Regional Modeling

    Satellite imagery and remote sensing provide explanatory variables at relatively high resolutions for modeling geospatial phenomena, yet regional summaries are often desirable for analysis and actionable insight. In this paper, we propose a novel method of inducing spatial aggregations as a component of the machine learning process, yielding regional model features whose construction is driven by model prediction performance rather than prior assumptions. Our results demonstrate that Genetic Programming is particularly well suited to this type of feature construction because it can automatically synthesize appropriate aggregations, as well as better incorporate them into predictive models compared to other regression methods we tested. In our experiments we consider a specific problem instance and real-world dataset relevant to predicting snow properties in high-mountain Asia.

    Approaches to integrated strategic/tactical forest planning

    Traditionally, forest planning is divided into a hierarchy of planning phases. Strategic planning is conducted to make decisions about sustainable harvest levels while taking into account legislation and policy issues. Within the frame of the strategic plan, the purpose of tactical planning is to schedule harvest operations to specific areas over the next few years and on a finer time scale than in the strategic plan. The operative phase focuses on scheduling harvest crews on a monthly or weekly basis, truck scheduling and choosing bucking instructions. Decisions at each level are to a varying degree supported by computerized tools. A problem that may arise when planning is divided into levels, and that is noted in the literature on decision support tools, is that solutions at one level may be inconsistent with the results of another level. When moving from the strategic plan to the tactical plan, three sources of inconsistency are often present: spatial discrepancies, temporal discrepancies and discrepancies due to different levels of constraint. The models used in the papers presented in this thesis address two of these discrepancies. To address the spatial discrepancies, the same spatial resolution, i.e., stands, has been used at both levels. Temporal discrepancies are addressed by modelling the tactical and strategic issues simultaneously. Integrated approaches can yield large models. One way of circumventing this is to aggregate time and/or space. The first paper addresses the consequences of temporal aggregation in the strategic part of a mixed integer programming integrated strategic/tactical model. For reference, linear programming based strategic models are also used. The results of the first paper provide information on what temporal resolutions could be used and indicate that outputs from strategic and integrated plans are not particularly affected by the number of equal-length strategic periods when more than five periods, i.e., a period length of about 20 years, are used. The approach used in the first paper could produce models that are very large, and the second paper provides a two-stage procedure that can reduce the number of variables and preserve the allocation of stands to the first 10 years provided by a linear programming based strategic plan, while concentrating tactical harvest activities using a penalty concept in a mixed integer programming formulation. Results show that it is possible to use the approach to concentrate harvest activities at the tactical level in a full-scale forest management scenario. In the case study, the effects of concentration on strategic outputs were small, and the number of harvest tracts declined towards a minimum level. Furthermore, the discrepancies between the two planning levels were small.

    Problem-driven scenario generation: an analytical approach for stochastic programs with tail risk measure

    Scenario generation is the construction of a discrete random vector to represent parameters of uncertain values in a stochastic program. Most approaches to scenario generation are distribution-driven, that is, they attempt to construct a random vector which captures the uncertainty well in a probabilistic sense. On the other hand, a problem-driven approach may be able to exploit the structure of a problem to provide a more concise representation of the uncertainty. In this paper we propose an analytic approach to problem-driven scenario generation. This approach applies to stochastic programs where a tail risk measure, such as conditional value-at-risk, is applied to a loss function. Since tail risk measures only depend on the upper tail of a distribution, standard methods of scenario generation, which typically spread their scenarios evenly across the support of the random vector, struggle to adequately represent tail risk. Our scenario generation approach works by targeting the construction of scenarios in areas of the distribution corresponding to the tails of the loss distributions. We provide conditions under which our approach is consistent with sampling, and as a proof of concept demonstrate how our approach could be applied to two classes of problem, namely network design and portfolio selection. Numerical tests on the portfolio selection problem demonstrate that our approach yields better and more stable solutions compared to standard Monte Carlo sampling.
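The claim that a tail risk measure depends only on the upper tail of the loss distribution can be checked numerically. The following is a minimal sketch, not the authors' scenario generation method: the function name `cvar`, the toy loss values, and the level `beta = 0.8` are all assumptions chosen for illustration. It computes conditional value-at-risk for a discrete loss distribution using the standard Rockafellar-Uryasev expression, then shows that merging all scenarios below the tail into a single scenario leaves the value unchanged.

```python
def cvar(losses, probs, beta):
    """Conditional value-at-risk at level beta of a discrete loss distribution.

    Uses CVaR = VaR + E[(L - VaR)+] / (1 - beta), where VaR is the
    beta-quantile of the losses. All inputs here are illustrative.
    """
    pairs = sorted(zip(losses, probs))
    cum = 0.0
    var = pairs[-1][0]
    for loss, p in pairs:
        cum += p
        # Small tolerance guards against floating-point round-off in the sum.
        if cum >= beta - 1e-12:
            var = loss
            break
    tail_excess = sum(p * max(loss - var, 0.0) for loss, p in pairs)
    return var + tail_excess / (1.0 - beta)


# Ten equally likely scenarios (hypothetical losses).
full = cvar([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], [0.1] * 10, beta=0.8)

# Same tail, but the seven smallest losses merged into one scenario.
coarse = cvar([4, 8, 9, 10], [0.7, 0.1, 0.1, 0.1], beta=0.8)

print(full, coarse)  # both are approximately 9.5
```

The four-scenario set gives the same risk value as the ten-scenario set, which suggests why concentrating scenarios in the tail region can give a more concise representation for this kind of problem.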

    Temporal Feature Selection with Symbolic Regression

    Building and discovering useful features when constructing machine learning models is the central task for the machine learning practitioner. Good features are useful not only in increasing the predictive power of a model but also in illuminating the underlying drivers of a target variable. In this research we propose a novel feature learning technique in which symbolic regression is endowed with a "Range Terminal" that allows it to explore functions of the aggregate of variables over time. We test the Range Terminal on a synthetic data set and a real-world data set in which we predict seasonal greenness using satellite-derived temperature and snow data over a portion of the Arctic. On the synthetic data set we find symbolic regression with the Range Terminal outperforms standard symbolic regression and Lasso regression. On the Arctic data set we find it outperforms standard symbolic regression, fails to beat the Lasso regression, but finds useful features describing the interaction between Land Surface Temperature, Snow, and seasonal vegetative growth in the Arctic.
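One way to picture a terminal that aggregates a variable over time is sketched below. This is an assumption about the general idea only: the abstract does not specify the Range Terminal's parameters, so the function `range_terminal`, its `start`/`end` window arguments, the choice of aggregate, and the sample values are all hypothetical.

```python
from statistics import mean


def range_terminal(series, start, end, agg=mean):
    """Aggregate a time series over the half-open window [start, end).

    Hypothetical sketch: in a symbolic regression run, start, end and agg
    would be evolved along with the rest of the expression.
    """
    window = series[start:end]
    if not window:
        raise ValueError("empty time window")
    return agg(window)


# Hypothetical weekly land surface temperatures for one location.
temperature = [-12.0, -8.0, -3.0, 1.0, 4.0, 9.0]

# A candidate feature: mean temperature over weeks 2 to 4.
feature = range_terminal(temperature, 2, 5)
# Another candidate: the maximum over the first three weeks.
early_max = range_terminal(temperature, 0, 3, agg=max)
```

A feature such as `feature` could then appear as a leaf in a larger evolved expression, alongside ordinary variables and constants.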

    Monotonicity preserving approximation of multivariate scattered data

    This paper describes a new method of monotone interpolation and smoothing of multivariate scattered data. It is based on the assumption that the function to be approximated is Lipschitz continuous. The method provides the optimal approximation in the worst-case scenario and tight error bounds. Smoothing of noisy data subject to monotonicity constraints is converted into a quadratic programming problem. Estimation of the unknown Lipschitz constant from the data by sample splitting and cross-validation is described. An extension of the method to locally Lipschitz functions is presented.
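A worst-case optimal approximation of a Lipschitz function can be sketched as the midpoint between the tightest upper and lower bounds that the data allow. The sketch below is an illustration under stated assumptions, not the paper's algorithm: it assumes a non-decreasing function, a known Lipschitz constant `M` in the Euclidean norm, and noise-free data; the function name and the sample points are hypothetical. For such a function, f(x) is at most y_k + M·‖(x − x_k)₊‖ and at least y_k − M·‖(x_k − x)₊‖ for every data point (x_k, y_k), where (·)₊ is the componentwise positive part.

```python
import math


def monotone_lipschitz_fit(points, values, M):
    """Return g(x) -> (estimate, worst-case error bound).

    Assumes the unknown function is non-decreasing in every coordinate and
    Lipschitz with constant M (Euclidean norm). Illustrative sketch only.
    """
    def pos_norm(u, v):
        # Euclidean norm of the positive part of u - v.
        return math.sqrt(sum(max(a - b, 0.0) ** 2 for a, b in zip(u, v)))

    def g(x):
        upper = min(y + M * pos_norm(x, p) for p, y in zip(points, values))
        lower = max(y - M * pos_norm(p, x) for p, y in zip(points, values))
        # The midpoint minimises the worst-case error; half the gap bounds it.
        return 0.5 * (upper + lower), 0.5 * (upper - lower)

    return g


# Hypothetical data consistent with f(x1, x2) = (x1 + x2) / 2 and M = 1.
g = monotone_lipschitz_fit([(0.0, 0.0), (1.0, 1.0)], [0.0, 1.0], M=1.0)
estimate, error_bound = g((0.5, 0.5))
```

Both bounds are non-decreasing in x, so their midpoint is too, and at a data point the two bounds coincide, so the sketch interpolates consistent data.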