
    A Domain Specific Approach to High Performance Heterogeneous Computing

    Users of heterogeneous computing systems face two problems: firstly, understanding the trade-off relationships between the observable characteristics of their applications, such as latency and quality of the result; and secondly, exploiting knowledge of these characteristics to allocate work to distributed computing platforms efficiently. A domain specific approach addresses both of these problems. By considering a subset of operations or functions, models of the observable characteristics or domain metrics may be formulated in advance, and populated at run-time for task instances. These metric models can then be used to express the allocation of work as a constrained integer program, which can be solved using heuristics, machine learning or Mixed Integer Linear Programming (MILP) frameworks. These claims are illustrated using the example domain of derivatives pricing in computational finance, with the domain metrics of workload latency or makespan and pricing accuracy. For a large, varied workload of 128 Black-Scholes and Heston model-based option pricing tasks, running on a diverse array of 16 multicore CPU, GPU and FPGA platforms, predictions made by models of both the makespan and accuracy are generally within 10% of the run-time performance. When these models are used as inputs to machine learning and MILP-based workload allocation approaches, latency improvements of up to 24 and 270 times respectively over the heuristic approach are seen.
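    As a rough illustration of the allocation step described above, the sketch below assigns tasks to platforms using predicted per-task latencies with a simple makespan-oriented greedy heuristic; the latency values, function name and greedy strategy are illustrative assumptions, not the paper's metric models or MILP formulation.

```python
# Minimal sketch (not the paper's implementation): greedy makespan-oriented
# allocation of tasks to heterogeneous platforms using predicted latencies.
# In the paper such latency predictions would come from the domain metric
# models populated at run time; here they are made-up numbers.

def greedy_allocate(predicted_latency):
    """predicted_latency[t][p] = predicted latency of task t on platform p."""
    num_platforms = len(predicted_latency[0])
    finish_time = [0.0] * num_platforms          # running load per platform
    allocation = []
    # Longest-processing-time-first is a classic makespan heuristic.
    order = sorted(range(len(predicted_latency)),
                   key=lambda t: -min(predicted_latency[t]))
    for t in order:
        # Place the task where it finishes earliest given current loads.
        best = min(range(num_platforms),
                   key=lambda p: finish_time[p] + predicted_latency[t][p])
        finish_time[best] += predicted_latency[t][best]
        allocation.append((t, best))
    return allocation, max(finish_time)

# Example: 4 hypothetical tasks on 3 hypothetical platforms (CPU, GPU, FPGA).
latencies = [[4.0, 1.5, 0.9],
             [3.0, 1.2, 1.1],
             [2.5, 2.0, 0.8],
             [5.0, 1.8, 1.0]]
plan, makespan = greedy_allocate(latencies)
print(plan, makespan)
```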

    Seeing Shapes in Clouds: On the Performance-Cost trade-off for Heterogeneous Infrastructure-as-a-Service

    In the near future FPGAs will be available by the hour; however, this new Infrastructure as a Service (IaaS) usage mode presents both an opportunity and a challenge. The opportunity is that programmers can potentially trade resources for performance on a much larger scale, and for much shorter periods of time, than before. The challenge is in finding and traversing the trade-off for heterogeneous IaaS that guarantees that increased resources result in the greatest possible increase in performance. Such a trade-off is Pareto optimal. The Pareto optimal trade-off for clusters of heterogeneous resources can be found by solving multiple, multi-objective optimisation problems, resulting in an optimal allocation of tasks to the available platforms. These optimisation problems can be solved using simple heuristic approaches or formal Mixed Integer Linear Programming (MILP) techniques. When pricing 128 financial options using a Monte Carlo algorithm upon a heterogeneous cluster of multicore CPU, GPU and FPGA platforms, the MILP approach produces a trade-off that is up to 110% faster than a heuristic approach, and over 50% cheaper. These results suggest that high quality performance-resource trade-offs of heterogeneous IaaS are best realised through a formal optimisation approach. (Presented at the Second International Workshop on FPGAs for Software Programmers, FSP 2015, arXiv:1508.06320.)
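    The following minimal sketch, using made-up candidate allocations, shows how a Pareto-optimal performance-cost front can be extracted from a set of (makespan, cost) points; it is an assumed illustration of the trade-off notion above, not the paper's MILP formulation.

```python
# Minimal sketch, not the paper's MILP: extract the Pareto-optimal
# performance-cost trade-off from candidate allocations, each summarised as
# (makespan_seconds, cost_per_run). All values are illustrative.

def pareto_front(points):
    """Return the points not dominated in both makespan and cost (lower is better)."""
    front = []
    for p in points:
        dominated = any(q[0] <= p[0] and q[1] <= p[1] and q != p for q in points)
        if not dominated:
            front.append(p)
    return sorted(front)

candidates = [(120.0, 3.2), (95.0, 4.0), (95.0, 5.5), (200.0, 1.0), (140.0, 3.5)]
print(pareto_front(candidates))   # only the non-dominated trade-off points remain
```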

    Fast American Basket Option Pricing on a multi-GPU Cluster

    This article presents a multi-GPU adaptation of a specific Monte Carlo and classification based method for pricing American basket options, due to Picazo. The first part relates how to combine fine- and coarse-grained parallelization to price American basket options. A dynamic strategy of kernel calibration is proposed. With this approach, our implementation on a cluster of reasonable size (18 GPUs) prices a high-dimensional option (dimension 40) in less than one hour, against almost eight hours observed for runs we conducted in the past using a 64-core cluster (composed of quad-core AMD Opteron 2356 processors). In order to benefit from different GPU device types, we detail the dynamic strategy we have used to load balance the GPU computations, which greatly improves the overall pricing time. An analysis of possible bottleneck effects demonstrates that there is a sequential bottleneck due to the training phase, which relies upon the AdaBoost classification method; this prevents the implementation from being fully scalable, and so prevents the pricing time from being reduced further, down to a handful of minutes. To address this, we propose using the Random Forests classification method: it is naturally divisible over a cluster and, like AdaBoost, is available as a black box from the popular Weka machine learning library. However, our experimental tests show that its use is costly.
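    As a hedged illustration of the dynamic load-balancing idea, the sketch below splits Monte Carlo paths across devices in proportion to a calibrated throughput figure; the device rates, path count and function name are hypothetical, not the article's actual kernel-calibration strategy.

```python
# Minimal sketch of throughput-proportional load balancing: split Monte Carlo
# paths across heterogeneous GPUs according to their measured rate
# (paths/second from a short calibration run). All numbers are made up.

def split_paths(total_paths, throughput_per_device):
    total_rate = sum(throughput_per_device)
    shares = [int(total_paths * r / total_rate) for r in throughput_per_device]
    shares[-1] += total_paths - sum(shares)   # give the rounding remainder to the last device
    return shares

# e.g. a cluster mixing two GPU generations with different calibrated speeds
calibrated = [2.1e6, 2.1e6, 3.4e6, 3.4e6]     # hypothetical paths/second per device
print(split_paths(10_000_000, calibrated))
```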

    Massively Parallel Computation Using Graphics Processors with Application to Optimal Experimentation in Dynamic Control

    The rapid increase in the performance of graphics hardware, coupled with recent improvements in its programmability, has led to its adoption in many non-graphics applications, including a wide variety of scientific computing fields. At the same time, a number of important dynamic optimal policy problems in economics are starved of the computing power needed to overcome the dual curses of complexity and dimensionality. We investigate whether computational economics may benefit from these new tools through a case study of an imperfect-information dynamic programming problem with a learning and experimentation trade-off, that is, a choice between controlling the policy target and learning the system parameters. Specifically, we use a model of active learning and control of a linear autoregression with unknown slope, which has appeared in a variety of macroeconomic policy and other contexts. The endogeneity of posterior beliefs makes the problem difficult, in that the value function need not be convex and the policy function need not be continuous. This complication makes the problem a suitable target for massively parallel computation using graphics processors. Our findings are cautiously optimistic, in that the new tools let us easily achieve a factor-of-15 performance gain relative to an implementation targeting single-core processors, and thus establish a better reference point on the computational speed versus coding complexity trade-off frontier. While further gains and wider applicability may lie behind a steep learning barrier, we argue that the future of many computations belongs to parallel algorithms anyway. Keywords: Graphics Processing Units, CUDA programming, Dynamic programming, Learning, Experimentation.
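    To make the learning step concrete, the sketch below shows the standard conjugate-normal update of beliefs about an unknown regression slope, the kind of posterior update that makes beliefs endogenous in the problem described above; it is a textbook formula with made-up numbers, not the authors' GPU dynamic programming solver.

```python
# Minimal sketch (not the paper's solver): conjugate-normal belief update for
# the unknown slope beta in y_t = beta * x_t + eps_t, with eps_t ~ N(0, sigma2)
# and prior beta ~ N(mu, tau2). The paper solves the full value function on a
# GPU; only the single-observation belief update is shown here.

def update_belief(mu, tau2, x, y, sigma2):
    """Posterior mean and variance of beta after observing one (x, y) pair."""
    post_prec = 1.0 / tau2 + x * x / sigma2
    post_var = 1.0 / post_prec
    post_mean = post_var * (mu / tau2 + x * y / sigma2)
    return post_mean, post_var

# Example: prior beta ~ N(0.5, 1.0), noise variance 0.25, one observation.
print(update_belief(0.5, 1.0, x=2.0, y=1.6, sigma2=0.25))
```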

    Accelerating Reconfigurable Financial Computing

    This thesis proposes novel approaches to the design, optimisation, and management of reconfigurable computer accelerators for financial computing. There are three contributions. First, we propose novel reconfigurable designs for derivative pricing using both Monte Carlo and quadrature methods. Such designs involve exploring techniques such as control variate optimisation for Monte Carlo, and multi-dimensional analysis for quadrature methods. Significant speedups and energy savings are achieved using our Field-Programmable Gate Array (FPGA) designs over both Central Processing Unit (CPU) and Graphics Processing Unit (GPU) designs. Second, we propose a framework for distributing computing tasks on multi-accelerator heterogeneous clusters. In this framework, different computational devices including FPGAs, GPUs and CPUs work collaboratively on the same financial problem based on a dynamic scheduling policy. The trade-off in speed and in energy consumption of different accelerator allocations is investigated. Third, we propose a mixed precision methodology for optimising Monte Carlo designs, and a reduced precision methodology for optimising quadrature designs. These methodologies enable us to optimise the throughput of reconfigurable designs by using datapaths with minimised precision, while maintaining the same accuracy of the results as in the original designs.
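    For readers unfamiliar with the control variate technique mentioned in the first contribution, the sketch below shows the standard approach for a European call under Black-Scholes, using the discounted terminal asset price (whose risk-neutral mean is S0) as the control; it is a generic CPU illustration with assumed parameters, not the thesis's FPGA design.

```python
# Minimal sketch of Monte Carlo pricing with a control variate: the discounted
# terminal price exp(-rT)*S_T has known mean S0 under the risk-neutral measure,
# so it can absorb variance from the call payoff estimator.
import numpy as np

def mc_call_control_variate(S0, K, r, sigma, T, n_paths, seed=0):
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n_paths)
    ST = S0 * np.exp((r - 0.5 * sigma**2) * T + sigma * np.sqrt(T) * z)
    payoff = np.exp(-r * T) * np.maximum(ST - K, 0.0)
    control = np.exp(-r * T) * ST                 # known expectation: S0
    cov = np.cov(payoff, control)
    beta = cov[0, 1] / cov[1, 1]                  # optimal control-variate weight
    adjusted = payoff - beta * (control - S0)
    return adjusted.mean(), adjusted.std(ddof=1) / np.sqrt(n_paths)

print(mc_call_control_variate(S0=100.0, K=105.0, r=0.02, sigma=0.3, T=1.0,
                              n_paths=100_000))
```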

    CPU-GPU hybrid parallel binomial American option pricing

    We present in this paper a novel parallel binomial algorithm that computes the price of an American option. The algorithm partitions a binomial tree constructed for the pricing into blocks of multiple levels of nodes, and assigns each such block to multiple processors. Each of the processors then computes the option's values at its assigned nodes in two phases. The algorithm is implemented and tested on a heterogeneous system consisting of an Intel multi-core processor and an NVIDIA GPU. The whole task is split between the CPU and the GPU so that the computations are performed on the two processors simultaneously. In the hybrid processing, the GPU is always assigned the last part of a block, and makes use of a couple of buffers in the on-chip shared memory to reduce the number of accesses to the off-chip device memory. The performance of the hybrid processing is compared with an optimised CPU serial code, a CPU parallel implementation and a GPU standalone program.
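    For context, the sketch below is a plain serial Cox-Ross-Rubinstein backward-induction pricer for an American put, i.e. the baseline computation whose tree levels the paper partitions into CPU/GPU blocks; the block partitioning and hybrid scheduling themselves are not reproduced here, and all parameters are illustrative.

```python
# Minimal serial sketch of binomial American option pricing: build the terminal
# payoffs, then roll back level by level, taking the better of continuation
# value and immediate exercise at each node.
import math

def binomial_american_put(S0, K, r, sigma, T, steps):
    dt = T / steps
    u = math.exp(sigma * math.sqrt(dt))
    d = 1.0 / u
    p = (math.exp(r * dt) - d) / (u - d)          # risk-neutral up probability
    disc = math.exp(-r * dt)
    # option values at maturity (level `steps`); node j has j up-moves
    values = [max(K - S0 * u**j * d**(steps - j), 0.0) for j in range(steps + 1)]
    # backward induction with an early-exercise check at every node
    for level in range(steps - 1, -1, -1):
        for j in range(level + 1):
            cont = disc * (p * values[j + 1] + (1.0 - p) * values[j])
            exercise = max(K - S0 * u**j * d**(level - j), 0.0)
            values[j] = max(cont, exercise)
    return values[0]

print(binomial_american_put(S0=100.0, K=100.0, r=0.05, sigma=0.2, T=1.0, steps=500))
```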

    Massively Parallelized Monte Carlo Simulation and Its Applications in Finance

    In this paper, we propose, develop and implement a tool that increases the computational speed of exotic derivatives pricing at a fraction of the cost of traditional methods. Our paper focuses on investigating the computing efficiencies of GPU systems. We utilize the GPU’s natural parallelization capabilities to price financial instruments. We outline our implementation, solutions to practical complications arising during implementation, and how much faster GPU systems are. Each step that we explore has a significant impact on the efficiency and performance of GPU pricing. Rather than speaking in theoretical, abstract terms, we detail each step in an attempt to give the reader a clear sense of what is going on. Efficiency is one of the pillars of financial calculations. With the volume of risk calculations mandated by prudent risk management practices, even moderate improvements in calculation efficiency can translate into material changes in trading limits or savings in regulatory capital. This can make the difference between a growing, successful trading operation and an also-ran. Unfortunately, a decent algorithm written in VBA cannot calculate option prices at the same speed as a farm of computers, particularly if we must price the trade in less than 150 milliseconds using 10 million simulation paths. Fast forward from one trade to a book of several hundred thousand trades, many of which are exotic products. Not only is it necessary to price each trade, but we must do so in each of thousands of different market scenarios in order to calculate even basic risk measures such as Greeks and Value-at-Risk (VaR). At the end of the paper, we discuss how GPUs are currently used in the industry and their various advantages, including cost, time, accuracy and calculation frequency. In addition, we discuss the implementation challenges of GPU systems and the attention to detail that is required for memory allocation.
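    As a small illustration of the scenario-revaluation workload described above, the sketch below reads a Value-at-Risk figure from a hypothetical distribution of scenario revaluations; the base price, scenario model and confidence level are assumptions for illustration only, not the paper's GPU pipeline.

```python
# Minimal sketch of the risk workflow motivated by the abstract: revalue a
# position under many market scenarios and take a Value-at-Risk quantile of
# the resulting loss distribution. Scenario values here are synthetic.
import numpy as np

def scenario_var(base_price, scenario_prices, confidence=0.99):
    """VaR at `confidence` from the loss distribution (base value minus scenario value)."""
    losses = base_price - np.asarray(scenario_prices)
    return np.quantile(losses, confidence)

rng = np.random.default_rng(1)
base = 12.3                                            # hypothetical base option price
scenarios = base + rng.normal(0.0, 1.5, size=10_000)   # hypothetical scenario revaluations
print(scenario_var(base, scenarios))
```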