9,508 research outputs found

    Fuzzy memoization for floating-point multimedia applications

    Get PDF
    Instruction memoization is a promising technique to reduce the power consumption and increase the performance of future low-end/mobile multimedia systems. Power and performance efficiency can be improved by reusing instances of an already executed operation. Unfortunately, this technique may not always be worth the effort due to the power consumption and area impact of the tables required to leverage an adequate level of reuse. In this paper, we introduce and evaluate a novel way of understanding multimedia floating-point operations based on the fuzzy computation paradigm: performance and power consumption can be improved at the cost of small precision losses in computation. By exploiting this implicit characteristic of multimedia applications, we propose a new technique called tolerant memoization. This technique expands the capabilities of classic memoization by associating entries with similar inputs to the same output. We evaluate this new technique by measuring the effect of tolerant memoization for floating-point operations in a low-power multimedia processor and discuss the trade-offs between performance and quality of the media outputs. We report energy improvements of 12 percent for a set of key multimedia applications with small LUT of 6 Kbytes, compared to 3 percent obtained using previously proposed techniques.Peer ReviewedPostprint (published version

    Computational and Energy Costs of Cryptographic Algorithms on Handheld Devices

    Get PDF
    Networks are evolving toward a ubiquitous model in which heterogeneous devices are interconnected. Cryptographic algorithms are required for developing security solutions that protect network activity. However, the computational and energy limitations of network devices jeopardize the actual implementation of such mechanisms. In this paper, we perform a wide analysis on the expenses of launching symmetric and asymmetric cryptographic algorithms, hash chain functions, elliptic curves cryptography and pairing based cryptography on personal agendas, and compare them with the costs of basic operating system functions. Results show that although cryptographic power costs are high and such operations shall be restricted in time, they are not the main limiting factor of the autonomy of a device

    Fabric defect detection using the wavelet transform in an ARM processor

    Get PDF
    Small devices used in our day life are constructed with powerful architectures that can be used for industrial applications when requiring portability and communication facilities. We present in this paper an example of the use of an embedded system, the Zeus epic 520 single board computer, for defect detection in textiles using image processing. We implement the Haar wavelet transform using the embedded visual C++ 4.0 compiler for Windows CE 5. The algorithm was tested for defect detection using images of fabrics with five types of defects. An average of 95% in terms of correct defect detection was obtained, achieving a similar performance than using processors with float point arithmetic calculations

    Caching in real-time and embedded systems and Benchmarking the ARM Cortex-M3 and Quark x1000 proccessors

    Get PDF
    The general goal is to compare performance of two processors for the low-end embedded market: Intel Quark x1000 vs. ARM Cortex M3, with special emphasis in the memory hierarchy. To do that, first we will assess the cache potential varying sizes, associativities and line sizes by means of CACTI, a cache modeling tool. Then we will review relevant research literature to conclude about the importance and possibilities of the memory hierarchy in real-time embedded systems. Finally, we will write an specific benchmark suite, using it to test the two referenced processors

    Development of an oceanographic application in HPC

    Get PDF
    High Performance Computing (HPC) is used for running advanced application programs efficiently, reliably, and quickly. In earlier decades, performance analysis of HPC applications was evaluated based on speed, scalability of threads, memory hierarchy. Now, it is essential to consider the energy or the power consumed by the system while executing an application. In fact, the High Power Consumption (HPC) is one of biggest problems for the High Performance Computing (HPC) community and one of the major obstacles for exascale systems design. The new generations of HPC systems intend to achieve exaflop performances and will demand even more energy to processing and cooling. Nowadays, the growth of HPC systems is limited by energy issues Recently, many research centers have focused the attention on doing an automatic tuning of HPC applications which require a wide study of HPC applications in terms of power efficiency. In this context, this paper aims to propose the study of an oceanographic application, named OceanVar, that implements Domain Decomposition based 4D Variational model (DD-4DVar), one of the most commonly used HPC applications, going to evaluate not only the classic aspects of performance but also aspects related to power efficiency in different case of studies. These work were realized at Bsc (Barcelona Supercomputing Center), Spain within the Mont-Blanc project, performing the test first on HCA server with Intel technology and then on a mini-cluster Thunder with ARM technology. In this work of thesis it was initially explained the concept of assimilation date, the context in which it is developed, and a brief description of the mathematical model 4DVAR. After this problem’s close examination, it was performed a porting from Matlab description of the problem of data-assimilation to its sequential version in C language. Secondly, after identifying the most onerous computational kernels in order of time, it has been developed a parallel version of the application with a parallel multiprocessor programming style, using the MPI (Message Passing Interface) protocol. The experiments results, in terms of performance, have shown that, in the case of running on HCA server, an Intel architecture, values of efficiency of the two most onerous functions obtained, growing the number of process, are approximately equal to 80%. In the case of running on ARM architecture, specifically on Thunder mini-cluster, instead, the trend obtained is labeled as "SuperLinear Speedup" and, in our case, it can be explained by a more efficient use of resources (cache memory access) compared with the sequential case. In the second part of this paper was presented an analysis of the some issues of this application that has impact in the energy efficiency. After a brief discussion about the energy consumption characteristics of the Thunder chip in technological landscape, through the use of a power consumption detector, the Yokogawa Power Meter, values of energy consumption of mini-cluster Thunder were evaluated in order to determine an overview on the power-to-solution of this application to use as the basic standard for successive analysis with other parallel styles. Finally, a comprehensive performance evaluation, targeted to estimate the goodness of MPI parallelization, is conducted using a suitable performance tool named Paraver, developed by BSC. Paraver is such a performance analysis and visualisation tool which can be used to analyse MPI, threaded or mixed mode programmes and represents the key to perform a parallel profiling and to optimise the code for High Performance Computing. A set of graphical representation of these statistics make it easy for a developer to identify performance problems. Some of the problems that can be easily identified are load imbalanced decompositions, excessive communication overheads and poor average floating operations per second achieved. Paraver can also report statistics based on hardware counters, which are provided by the underlying hardware. This project aimed to use Paraver configuration files to allow certain metrics to be analysed for this application. To explain in some way the performance trend obtained in the case of analysis on the mini-cluster Thunder, the tracks were extracted from various case of studies and the results achieved is what expected, that is a drastic drop of cache misses by the case ppn (process per node) = 1 to case ppn = 16. This in some way explains a more efficient use of cluster resources with an increase of the number of processes
    corecore