97 research outputs found

    Main memory and cache performance of Intel Sandy

    Application performance on multicore processors is seldom constrained by the speed of floating point or integer units. Much more often, limitations are caused by the memory subsystem, particularly shared resources such as last level caches or memory controllers. Measuring, predicting and modeling memory performance becomes a steeper challenge with each new processor generation due to the growing complexity and core count. We tackle the important aspect of measuring and understanding undocumented memory performance numbers in order to gain insight into microprocessor details. For this, we build upon a set of sophisticated benchmarks that support latency and bandwidth measurements to arbitrary locations in the memory subsystem. These benchmarks are extended to support AVX instructions for bandwidth measurements and to integrate the coherence states (O)wned and (F)orward. We then use these benchmarks to perform an in-depth analysis of current ccNUMA multiprocessor systems with Intel (Sandy Bridge-EP) and AMD (Bulldozer) processors. Using our benchmarks we present fundamental memory performance data and illustrate performance-relevant architectural properties of both designs.

    Low level conditional move optimization

    Although high-level optimizations are becoming more and more sophisticated, the importance of low-level optimizations should not be underestimated. Due to changes in the internal architecture of modern processors, some optimization techniques may become more or less effective. Existing techniques need to be reconsidered from time to time, and new techniques targeting these modern architectures may emerge. Because the instruction pipelines of modern processors keep growing, recovering from branch mispredictions is becoming more expensive, and avoiding them is becoming more critical. In this paper we introduce a novel approach to branch elimination using conditional move operations, namely the CMOVcc instruction group. Inappropriate use of these instructions may cause a noticeable performance regression, but in many cases they outperform the sequence of a conditional jump and an unconditional move instruction. Our goal is to analyze the usage of CMOVcc in different contexts on modern processors and, based on these results, to propose a technique that automatically decides whether the conditional move or the sequence of a conditional jump and an unconditional move should be used in a given situation.
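
The contrast between a conditional jump and a conditional move can be sketched at the data-flow level. The snippet below is only an illustration in Python, which does not emit CMOVcc itself; the function names `select_branchy`, `select_branchless`, and `max_branchless` are hypothetical and do not come from the paper. The branch-free variant computes both candidates and picks one with a bit mask, which is roughly the effect a conditional move has on a register.

```python
def select_branchy(cond, a, b):
    """Select with an explicit branch (conditional jump + move)."""
    if cond:
        return a
    return b


def select_branchless(cond, a, b):
    """Select without a branch, mimicking the effect of a conditional move.

    mask is -1 (all bits set) when cond is true and 0 otherwise, so exactly
    one of the two operands survives the AND and reaches the OR.
    """
    mask = -int(bool(cond))
    return (a & mask) | (b & ~mask)


def max_branchless(values):
    """Running maximum in which each update is a select, not a jump."""
    best = values[0]
    for v in values[1:]:
        best = select_branchless(v > best, v, best)
    return best
```

Whether the branch-free form is faster on real hardware depends on how predictable the condition is, which is the decision the abstract proposes to automate.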

    Analysis of Intel's Haswell Microarchitecture Using The ECM Model and Microbenchmarks

    This paper presents an in-depth analysis of Intel's Haswell microarchitecture for streaming loop kernels. Among the new features examined are the dual-ring Uncore design, Cluster-on-Die mode, Uncore Frequency Scaling, core improvements such as new and improved execution units, and improvements throughout the memory hierarchy. The Execution-Cache-Memory diagnostic performance model is used together with a generic set of microbenchmarks to quantify the efficiency of the microarchitecture. The set of microbenchmarks is chosen such that it can serve as a blueprint for other streaming loop kernels.
    Comment: arXiv admin note: substantial text overlap with arXiv:1509.0311

    Parallelizing remote sensing image geometric correction

    The spatial, spectral, and temporal resolutions of remote sensing images, acquired at a reasonable size, yield images that can be processed to represent large areas of terrain with a level of spatial detail that is very attractive for observation and management as well as for scientific activities. With Moore's law still in force, more and more parallelism is introduced with each new generation across all computing platforms. Parallelism is present at every level of integration and in programming, with the aim of achieving higher performance and energy efficiency. Since geometric calibration is one of the most computationally expensive and time-consuming processes when working with remote sensing images, the aim of this work is to accelerate it by taking advantage of new computing technologies and architectures, with particular emphasis on exploiting shared-memory parallel hardware. Using OpenMP directives, the most expensive and slowest stage of the geometric correction process for remote sensing images has been parallelized. This work compares the performance of the original serial application with the parallelized version through tests on different multiprocessor systems, and proposes several approaches for choosing the most suitable hardware for optimal execution.

    Assessing malware detection using hardware performance counters

    Despite the use of modern anti-virus (AV) software, malware is a prevailing threat to today's computing systems. AV software cannot cope with the increasing number of evasive malware, calling for more robust malware detection techniques. Out of the many proposed methods for malware detection, researchers have suggested microarchitecture-based mechanisms for detection of malicious software in a system. For example, Intel embeds a shadow stack in their modern architectures that maintains the integrity between function calls and their returns by tracking the function's return address. Any malicious program that exploits an application to overflow the return addresses can be restrained using the shadow stack. Researchers also propose the use of Hardware Performance Counters (HPCs). HPCs are counters embedded in modern computing architectures that count the occurrence of architectural events, such as cache hits, clock cycles, and integer instructions. Malware detectors that leverage HPCs create a profile of an application by reading the counter values periodically. Subsequently, researchers use supervised machine learning-based (ML) classification techniques to differentiate malicious profiles amongst benign ones. It is important to note that HPCs count the occurrence of microarchitectural events during execution of the program. However, whether a program is malicious or benign is the high-level behavior of a program. Since HPCs do not surveil the high-level behavior of an application, we hypothesize that the counters may fail to capture the difference in the behavioral semantics of a malicious and benign software. To investigate whether HPCs capture the behavioral semantics of the program, we recreate the experimental setup from the previously proposed systems. To this end, we leverage HPCs to profile applications such as MS-Office and Chrome as benign applications and known malware binaries as malicious applications. 
    Standard ML classifiers demand a normally distributed dataset, where the variance is independent of the mean of the data points. To transform the profiles into a more normal-like distribution and to avoid over-fitting the machine learning models, we apply a power transform to the profiles of the applications. Moreover, HPCs can monitor a broad range of hardware-based events. We use Principal Component Analysis (PCA) to select the top performance events that show maximum variation in the fewest features across all the applications profiled. Finally, we train twelve supervised machine learning classifiers, such as Support Vector Machines (SVMs) and MultiLayer Perceptrons (MLPs), on the profiles from the applications. We model each classifier as a binary classifier, where the two classes are 'Benignware' and 'Malware.' Our results show that for the 'Malware' class, the average recall and F2-score across the twelve classifiers are 0.22 and 0.70 respectively. The low recall score shows that the ML classifiers tag malware as benignware. Even though we take a statistical approach to selecting our features, the classifiers are not able to distinguish between malware and benignware based on the hardware-based events monitored by the HPCs. The inability of HPC profiles to capture the behavioral characteristics of an application forces us to question the use of HPCs as malware detectors.
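
For readers unfamiliar with the two metrics quoted above, the following minimal sketch shows how recall and an F-beta score (F2 when beta is 2) are conventionally computed from confusion-matrix counts for a single class. The helper names and the counts in the usage lines are hypothetical illustrations, not data or code from the thesis.

```python
def precision(tp, fp):
    """Fraction of samples flagged as malware that really are malware."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Fraction of actual malware samples that the classifier flagged."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_beta(p, r, beta=2.0):
    """Weighted harmonic mean of precision p and recall r.

    beta > 1 weights recall more heavily than precision; beta = 2 gives F2.
    """
    if p == 0.0 and r == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * p * r / (b2 * p + r)


# Hypothetical counts for the 'Malware' class (made up for illustration).
tp, fp, fn = 30, 10, 70
p, r = precision(tp, fp), recall(tp, fn)
f2 = f_beta(p, r, beta=2.0)
```

With these made-up counts, precision is 0.75 while recall is only 0.30, so most malware samples would be missed even though the flagged samples are mostly correct.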

    Implementing BLAKE with AVX, AVX2, and XOP

    In 2013 Intel will release the AVX2 instructions, which introduce 256-bit single-instruction multiple-data (SIMD) integer arithmetic. This will enable desktop and server processors from this vendor to support 4-way SIMD computation of 64-bit add-rotate-xor algorithms, as well as 8-way 32-bit SIMD computations. AVX2 also includes interesting instructions for cryptographic functions, like any-to-any permute and vectorized table-lookup. In this paper, we explore the potential of AVX2 to speed up the SHA-3 finalist BLAKE, and present the first working assembly implementations of BLAKE-256 and BLAKE-512 with AVX2. We then investigate the potential of the recent AVX and XOP instructions to accelerate BLAKE, and report new speed records on Sandy Bridge and Bulldozer microarchitectures (7.47 and 11.64 cycles per byte for BLAKE-256, 5.71 and 6.95 for BLAKE-512).
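
To make "add-rotate-xor" concrete, here is a small sketch of a 32-bit ARX mixing step applied lane by lane, the kind of computation a SIMD unit performs on all lanes at once. It is not the paper's assembly code: the function names are hypothetical, the inputs `m0` and `m1` stand in for already-combined message and constant words, and the rotation counts 16, 12, 8, 7 are assumed here to match those commonly cited for BLAKE-256.

```python
MASK32 = 0xFFFFFFFF


def rotr32(x, n):
    """Rotate the 32-bit word x right by n bits (0 < n < 32)."""
    return ((x >> n) | (x << (32 - n))) & MASK32


def arx_mix(a, b, c, d, m0, m1):
    """One add-rotate-xor mixing step on four 32-bit words.

    Only additions mod 2**32, XORs, and fixed rotations are used, which is
    why such steps map directly onto SIMD integer instructions.
    """
    a = (a + b + m0) & MASK32
    d = rotr32(d ^ a, 16)
    c = (c + d) & MASK32
    b = rotr32(b ^ c, 12)
    a = (a + b + m1) & MASK32
    d = rotr32(d ^ a, 8)
    c = (c + d) & MASK32
    b = rotr32(b ^ c, 7)
    return a, b, c, d


def arx_mix_lanes(A, B, C, D, M0, M1):
    """Apply arx_mix independently to each lane, as a SIMD unit would."""
    return [arx_mix(*lane) for lane in zip(A, B, C, D, M0, M1)]
```

Each lane is independent of the others, so eight 32-bit lanes fill one 256-bit register, matching the 8-way figure in the abstract.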

    Optimization of molecular dynamics simulation code and applications to biomolecular systems

    Doctoral thesis, Biochemistry, Faculdade de Ciências e Tecnologia, Universidade do Algarve, 2015.
    The performance of molecular dynamics (MD) software such as GROMACS is limited by the software's ability to perform force calculations. The largest part of this work is for nonbonded interactions, such as those between water molecules and between water molecules and solute. The determination of nonbonded interactions may account for over 90% of the total computation and real time of a simulation. The objective of this project is to greatly improve the performance of nonbonded force calculations on a single core/processor. Doing so makes it possible to raise the bar on all simulations that can be performed by GROMACS (single-core, multi-core or MPI). The resulting modifications then need to be verified to determine that the software still works and is still 'good enough' for performing molecular dynamics simulations.
    Virtual Strategy, Inc., Boston, M