
    Analysis of scratch-pad and data-cache performance using statistical methods

    An effectively designed and efficiently used memory hierarchy, composed of scratch-pads or caches, is seen today as the key to obtaining energy and performance gains in data-dominated embedded applications. However, an unsolved problem is how to make the right choice between the scratch-pad and the data-cache for different classes of applications. Recent studies show that applications with regular and manifest data access patterns (e.g. matrix multiplication) perform better on the scratch-pad than on the cache. For dynamic applications with irregular and non-manifest access patterns, however, it is commonly and intuitively believed that the cache would perform better. In this paper, we show by theoretical analysis and empirical results that this intuition can sometimes be misleading. When access probabilities remain fixed, we prove that the scratch-pad, with an optimal mapping, will always outperform the cache. We also demonstrate how to map dynamic applications efficiently to the scratch-pad or the cache and, additionally, how to accurately predict their performance.
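
    The core claim lends itself to a small worked illustration. Below is a minimal sketch, not the paper's algorithm: under fixed access probabilities, an optimal scratch-pad mapping simply keeps the most probable blocks on-chip, and the expected access cost can be computed directly. The block names, latencies, and probabilities are invented for illustration.

        # Hedged sketch: probability-driven scratch-pad mapping.
        # With fixed access probabilities, keeping the hottest blocks
        # on-chip minimizes expected access latency (latencies assumed).
        SPM_LATENCY = 1      # cycles, assumed scratch-pad access cost
        DRAM_LATENCY = 100   # cycles, assumed off-chip access cost

        def map_to_scratchpad(block_probs, spm_capacity):
            """block_probs: {block: access probability}; equal-sized blocks."""
            hottest = sorted(block_probs, key=block_probs.get, reverse=True)
            on_chip = set(hottest[:spm_capacity])
            expected_cost = sum(
                p * (SPM_LATENCY if b in on_chip else DRAM_LATENCY)
                for b, p in block_probs.items()
            )
            return on_chip, expected_cost

        blocks = {"A": 0.5, "B": 0.3, "C": 0.15, "D": 0.05}
        print(map_to_scratchpad(blocks, spm_capacity=2))   # ({'A', 'B'}, 20.8)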

    The Melbourne Shuffle: Improving Oblivious Storage in the Cloud

    We present a simple, efficient, and secure data-oblivious randomized shuffle algorithm. This is the first secure data-oblivious shuffle that is not based on sorting. Our method can be used to improve previous oblivious storage solutions for network-based outsourcing of data.
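
    As a very rough sketch of what a distribute-then-clean oblivious shuffle looks like, the toy simulation below permutes items through fixed-capacity padded buckets. It is plaintext only and omits the encryption, the exact bucket sizes, and the security analysis of the actual Melbourne Shuffle, so all parameters here are illustrative.

        import math, random

        # Toy, plaintext simulation of a distribute/clean-up shuffle shape;
        # not the real Melbourne Shuffle (no encryption, simplified sizes).
        def oblivious_shuffle(items, pad_factor=4):
            n = len(items)
            b = max(1, math.isqrt(n))           # ~sqrt(n) buckets
            perm = random.sample(range(n), n)   # secret target positions
            cap = pad_factor * (n // b + 1)     # fixed, padded batch size

            # Phase 1: distribute each item to the bucket owning its
            # target slot; pad every bucket to the same capacity.
            buckets = [[] for _ in range(b)]
            for i, item in enumerate(items):
                buckets[perm[i] % b].append((perm[i], item))
            for bucket in buckets:
                if len(bucket) > cap:           # overflow: retry with fresh randomness
                    return oblivious_shuffle(items, pad_factor)
                bucket += [None] * (cap - len(bucket))

            # Phase 2: clean up, dropping dummies and placing items.
            out = [None] * n
            for bucket in buckets:
                for entry in bucket:
                    if entry is not None:
                        out[entry[0]] = entry[1]
            return out

        print(oblivious_shuffle(list("abcdefgh")))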

    Exploring Processor and Memory Architectures for Multimedia

    Multimedia has become one of the cornerstones of our 21st-century society and, when combined with mobility, has enabled a tremendous evolution of that society. However, joining these two concepts introduces many technical challenges, ranging from having sufficient performance for handling multimedia content to having the battery stamina for acceptable mobile usage. Looking ahead, these issues become ever more challenging with increased mobility as well as advances in multimedia content, such as the introduction of stereoscopic 3D and augmented reality. The increased performance needed for handling multimedia comes not only from an ongoing step-up in resolution, from QVGA (320x240) to Full HD (1920x1080), a 27x increase in less than half a decade, but also from codec evolution (MPEG-2 to H.264 AVC), which adds to the computational load. To meet these performance challenges there have been processing and memory architecture advances in the mobile domain (SIMD, out-of-order superscalar execution, multicore processing and heterogeneous multilevel memories), in conjunction with ever-increasing operating frequencies (200MHz to 2GHz) and on-chip memory sizes (128KB to 2-3MB). At the same time, requirements for mobility are increasing, placing higher demands on battery-powered systems despite the steady increase in battery capacity (500 to 2000mAh). This leaves a negative net result in terms of battery capacity versus performance advances. In order to make optimal use of these architectural advances and to meet the power limitations of mobile systems, an overall approach to utilizing these systems is needed; the right trade-off between performance and power is crucial. On top of these constraints, the flexibility aspects of the system need to be addressed. All this makes it very important to reach the right architectural balance in the system. The first goal of this thesis is to examine multimedia applications and propose a flexible solution that can meet the architectural requirements of a mobile system. The second is to propose an automated methodology for optimally mapping multimedia data and instructions to a heterogeneous multilevel memory subsystem. The proposed methodology uses constraint programming to solve a multidimensional optimization problem. Results from this work indicate that today's most advanced mobile processor technology, together with a multi-level heterogeneous on-chip memory subsystem, can meet the performance requirements for handling multimedia. By employing the automated optimal memory mapping method presented in this thesis, lower total power consumption can be achieved while performance for multimedia applications is improved, through reduced external accesses and better reuse of memory objects. The automatic method shows high accuracy, up to 90%, in predicting multimedia memory accesses for a given architecture.
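
    The memory-mapping step can be pictured with a much smaller stand-in for the thesis's constraint-programming model: choosing which memory objects to place on-chip under a capacity budget so that the estimated energy saving is maximal. The object names, sizes, and savings below are invented, and a simple 0/1 knapsack stands in for the full multidimensional CP formulation.

        # Hedged stand-in for the CP-based mapping: a 0/1 knapsack over
        # memory objects (name, size in KB, energy saved if on-chip).
        def map_objects(objects, capacity_kb):
            best = {0: (0, [])}                 # used KB -> (saving, chosen)
            for name, size, saving in objects:
                for used, (val, chosen) in list(best.items()):
                    if used + size <= capacity_kb:
                        cand = (val + saving, chosen + [name])
                        if cand[0] > best.get(used + size, (-1, []))[0]:
                            best[used + size] = cand
            return max(best.values())

        objs = [("luma_buf", 96, 40), ("mc_cache", 64, 35), ("bitstream", 48, 10)]
        print(map_objects(objs, capacity_kb=128))  # (45, ['mc_cache', 'bitstream'])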

    PYDAC: A DISTRIBUTED RUNTIME SYSTEM AND PROGRAMMING MODEL FOR A HETEROGENEOUS MANY-CORE ARCHITECTURE

    Heterogeneous many-core architectures that consist of big, fast cores and small, energy-efficient cores are very promising for future high-performance computing (HPC) systems. These architectures offer a good balance between single-threaded performance and multithreaded throughput. Such systems impose challenges on the design of the programming model and runtime system. Specifically, these challenges include (a) how to fully utilize the chip's performance, (b) how to manage heterogeneous, unreliable hardware resources, and (c) how to generate and manage a large number of parallel tasks. This dissertation proposes and evaluates a Python-based programming framework called PyDac. PyDac supports a two-level programming model: at the high level, a programmer creates a very large number of tasks using the divide-and-conquer strategy; at the low level, tasks are written in an imperative programming style. The runtime system seamlessly manages the parallel tasks, system resilience, and inter-task communication with architectural support. PyDac has been implemented both on a field-programmable gate array (FPGA) emulation of an unconventional heterogeneous architecture and on a conventional multicore microprocessor. To evaluate the performance, resilience, and programmability of the proposed system, several micro-benchmarks were developed. We found that (a) PyDac abstracts away task communication and achieves programmability, (b) the micro-benchmarks are scalable on the hardware prototype, although (predictably) serial operation limits some of them, and (c) the degree of protection versus speed can be varied in redundant threading transparently to programmers.
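
    The abstract does not give PyDac's API, but the two-level style it describes can be sketched with the standard library: a high-level divide-and-conquer layer that generates many small tasks, and low-level imperative task bodies executed by a worker pool. The names and threshold below are illustrative, not PyDac's.

        from concurrent.futures import ProcessPoolExecutor

        THRESHOLD = 10_000

        def leaf_sum(lo, hi):          # low level: plain imperative task body
            return sum(range(lo, hi))

        def split(lo, hi):             # high level: divide-and-conquer task creation
            if hi - lo <= THRESHOLD:
                yield (lo, hi)
            else:
                mid = (lo + hi) // 2
                yield from split(lo, mid)
                yield from split(mid, hi)

        if __name__ == "__main__":
            with ProcessPoolExecutor() as pool:
                futs = [pool.submit(leaf_sum, lo, hi) for lo, hi in split(0, 1_000_000)]
                print(sum(f.result() for f in futs))   # 499999500000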

    Heterogeneous Highly Parallel Implementation of Matrix Exponentiation Using GPU

    The vision of a supercomputer on every desk can be realized by powerful and highly parallel CPUs, GPUs, or APUs. Graphics processors, once specialized for graphics applications only, are now used for highly compute-intensive general-purpose applications. GFLOPS and TFLOPS performance, once very expensive, has become very cheap with GPGPUs. The current work focuses mainly on a highly parallel implementation of matrix exponentiation. Matrix exponentiation is widely used in many areas of the scientific community, ranging from highly critical flight and CAD simulations to financial and statistical applications. The proposed solution for matrix exponentiation uses OpenCL to exploit the hyper-parallelism offered by many-core GPGPUs. It employs many general GPU optimizations as well as architecture-specific optimizations. The experimentation covers optimizations targeted specifically at scientific graphics cards (Tesla C2050). The heterogeneous highly parallel matrix exponentiation method has been tested for matrices of different sizes and with different powers. The devised kernel has shown a 1000x speedup, and a 44-fold speedup over the naive GPU kernel. Comment: 15 pages, 12 figures, International Journal of Distributed and Parallel Systems (IJDPS), ISSN: 0976-9757 [Online]; 2229-3957 [Print]
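
    The standard trick underlying any fast matrix-power kernel is exponentiation by squaring, which reduces A^k from k-1 multiplications to O(log k). A CPU-side sketch follows; the paper's OpenCL kernels would perform the multiplications on the GPU, a mapping assumed here rather than taken from the paper.

        import numpy as np

        # Exponentiation by squaring: O(log k) matrix multiplications.
        def mat_pow(a, k):
            result = np.eye(a.shape[0], dtype=a.dtype)
            while k > 0:
                if k & 1:              # multiply in this power if the bit is set
                    result = result @ a
                a = a @ a              # square for the next binary digit
                k >>= 1
            return result

        fib = np.array([[1, 1], [1, 0]], dtype=np.int64)
        print(mat_pow(fib, 10))        # [[89 55] [55 34]] -- Fibonacci numbers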

    Grid computing for the numerical reconstruction of digital holograms

    Digital holography has the potential to greatly extend holography's applications and move it from the lab into the field: a single CCD or other solid-state sensor can capture any number of holograms, while numerical reconstruction within a computer eliminates the need for chemical processing and readily allows further processing and visualisation of the holographic image. The steady increase in sensor pixel count and resolution leads to the possibility of larger sample volumes and of higher-spatial-resolution sampling, enabling the practical use of digital off-axis holography. However, this increase in pixel count also drives a corresponding expansion of the computational effort needed to numerically reconstruct such holograms, to the extent that the reconstruction process for a single depth slice takes significantly longer than the capture process for a single hologram. Grid computing, a recent innovation in large-scale distributed processing, provides a convenient means of harnessing significant computing resources in an ad-hoc fashion that can match the field deployment of a holographic instrument. In this paper we consider the computational needs of digital holography and discuss the deployment of numerical reconstruction software over an existing Grid testbed. The analysis of marine organisms is used as an exemplar for workflow and job execution in in-line digital holography.
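
    The per-slice cost the paper is concerned with comes from propagation kernels of the following kind. The sketch uses the angular spectrum method, one standard choice for numerical reconstruction; the paper does not specify its kernel, and the wavelength, pixel pitch, and depth below are illustrative.

        import numpy as np

        # One depth-slice reconstruction via the angular spectrum method:
        # an FFT pair plus a frequency-domain transfer function per slice.
        def reconstruct_slice(hologram, wavelength, pitch, z):
            ny, nx = hologram.shape
            fx = np.fft.fftfreq(nx, d=pitch)
            fy = np.fft.fftfreq(ny, d=pitch)
            FX, FY = np.meshgrid(fx, fy)
            arg = 1 - (wavelength * FX) ** 2 - (wavelength * FY) ** 2
            kz = 2 * np.pi / wavelength * np.sqrt(np.maximum(arg, 0))
            transfer = np.exp(1j * kz * z) * (arg > 0)   # drop evanescent part
            return np.fft.ifft2(np.fft.fft2(hologram) * transfer)

        holo = np.random.rand(512, 512)                  # stand-in for a capture
        img = reconstruct_slice(holo, wavelength=633e-9, pitch=6.8e-6, z=0.05)
        print(np.abs(img).max())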

    Embedded Systems Energy Characterization using non-Intrusive Instrumentation

    Research Report RR2006-37, LIP - ENS Lyon. This research report presents a non-intrusive methodology for building energy consumption models of embedded systems. The method is based on measurements on real hardware, in order to obtain a quantitative approach that takes the full architecture into account. Based on these measurements, data are grouped into classes of instructions and events. These classes can then be reused in software simulators and in high-level source-code transformation cost functions for optimizing compilers. The computed power model is much simpler than previous power models while remaining accurate at the platform level. The methodology is illustrated with experimental results obtained on an ARM Integrator platform, for which an accurate full-system energy model is built.
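
    The resulting model is essentially linear in the per-class counts: total energy is each class's measured cost times its occurrence count. A minimal sketch follows, with invented class names and costs standing in for the report's measured values.

        # Linear per-class energy model (costs are invented placeholders;
        # the report derives them from measurements on real hardware).
        ENERGY_PER_EVENT_NJ = {
            "alu":        0.8,
            "load_store": 2.1,
            "branch":     1.0,
            "cache_miss": 15.0,
        }

        def estimate_energy_nj(event_counts):
            return sum(ENERGY_PER_EVENT_NJ[c] * n for c, n in event_counts.items())

        trace = {"alu": 120_000, "load_store": 40_000,
                 "branch": 20_000, "cache_miss": 1_500}
        print(f"{estimate_energy_nj(trace) / 1e3:.1f} uJ")   # 222.5 uJ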

    Safe measurement-based WCET estimation

    This paper explores the issues to be addressed to provide safe worst-case execution time (WCET) estimation methods based on measurements. We suggest using structural testing for the exhaustive exploration of paths in a program. Since test-data generation is in general too complex to be used in practice for most real-size programs, we propose to generate test data for program segments only, using program clustering. Moreover, to be able to combine the execution times of program segments and obtain the WCET of the whole program, we advocate the use of compiler techniques to reduce (ideally eliminate) the timing variability of program segments and to make the times of program segments independent of one another.
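
    Once each segment has a stable, measured WCET, the composition step reduces to a longest-path computation over the segment graph. The sketch below assumes an acyclic segment graph with invented segment times.

        from functools import lru_cache

        # Program WCET as the longest path through a DAG of segments
        # (segment WCETs would come from measurements; these are invented).
        segment_wcet = {"entry": 5, "parse": 40, "fast": 10, "slow": 75, "emit": 20}
        successors = {"entry": ["parse"], "parse": ["fast", "slow"],
                      "fast": ["emit"], "slow": ["emit"], "emit": []}

        @lru_cache(maxsize=None)
        def wcet_from(seg):
            tails = [wcet_from(s) for s in successors[seg]]
            return segment_wcet[seg] + (max(tails) if tails else 0)

        print(wcet_from("entry"))   # 5 + 40 + 75 + 20 = 140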

    A composable, energy-managed, real-time MPSOC platform

    Multi-processor system-on-chip (MPSOC) platforms have emerged in embedded systems as hardware solutions to support the continuously increasing functionality and performance demands in this domain. Such a platform has to execute a mix of applications with diverse performance and timing constraints, i.e., real-time or non-real-time, so different application schedulers should co-exist on an MPSOC. Moreover, applications share many MPSOC resources, so their timing depends on the arbitration at these resources. Arbitration may create inter-application dependencies; e.g., the timing of a low-priority application depends on the timing of all higher-priority ones. Application inter-dependencies make functional and timing verification, and the integration process, harder. This is especially problematic for real-time applications, for which fulfilling the timing constraints should be guaranteed by construction. Moreover, energy and power management, commonly employed in embedded systems, makes this verification even more difficult. Typically, energy and power management involves scaling the operating points of resources, which has a direct impact on resource performance and thus influences application timing behaviour. Finally, a small change in one application leads to the need to re-verify all other applications, incurring a large effort. Composability is a property meant to ease the verification and integration process: a system is composable if the functionality and the timing behaviour of each application are independent of the other applications mapped on the same platform. Composability is achieved by using arbiters that ensure the independence of applications. In this paper we present the concepts behind a composable, scalable, energy-managed MPSOC platform able to support different real-time and non-real-time schedulers concurrently, and discuss its advantages and limitations.
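
    A time-division multiplexing (TDM) table is the archetypal composable arbiter of the kind the paper relies on: each application owns fixed slots, so its service is independent of what the others do. A minimal sketch with an invented slot table:

        # TDM arbitration: the grant depends only on the cycle number,
        # never on the applications' actual traffic (table is illustrative).
        TDM_TABLE = ["app_A", "app_B", "app_A", "idle"]   # repeats forever

        def grant(cycle):
            return TDM_TABLE[cycle % len(TDM_TABLE)]

        def service_cycles(app, start, end):
            return sum(1 for c in range(start, end) if grant(c) == app)

        # app_A always gets exactly 2 of every 4 cycles, whatever app_B does:
        print(service_cycles("app_A", 0, 40))   # 20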