Main memory and cache performance of Intel Sandy

Author: Daniel Hackenberg
Daniel Molka
Robert Schöne
Publication date: 01/01/2014

Application performance on multicore processors is seldom constrained by the speed of floating point or integer units. Much more often, limitations are caused by the memory subsystem, particularly shared resources such as last level caches or memory controllers. Measuring, predicting and modeling memory performance becomes a steeper challenge with each new processor generation due to the growing complexity and core count. We tackle the important aspect of measuring and understanding undocumented memory performance numbers in order to create valuable insight into microprocessor details. For this, we build upon a set of sophisticated benchmarks that support latency and bandwidth measurements to arbitrary locations in the memory subsystem. These benchmarks are extended to support AVX instructions for bandwidth measurements and to integrate the coherence states (O)wned and (F)orward. We then use these benchmarks to perform an in-depth analysis of current ccNUMA multiprocessor systems with Intel (Sandy Bridge-EP) and AMD (Bulldozer) processors. Using our benchmarks, we present fundamental memory performance data and illustrate performance-relevant architectural properties of both designs.

CiteSeerX

Low level conditional move optimization

Author: Antyipin Artyom
Góbi Attila
Kozsik Tamás
Publication date: 01/01/2013

Although high-level optimizations are becoming more and more sophisticated, the importance of low-level optimizations should not be underestimated. Due to changes in the inner architecture of modern processors, some optimization techniques may become more or less effective. Existing techniques need to be reconsidered from time to time, and new techniques targeting these modern architectures may emerge. Due to the growing instruction pipeline of modern processors, recovering from branch mispredictions is becoming more expensive, and so avoiding them is becoming more critical. In this paper we introduce a novel approach to branch elimination using conditional move operations, namely the CMOVcc instruction group. Inappropriate use of these instructions may result in a noticeable performance regression, but in many cases they outperform the sequence of a conditional jump and an unconditional move instruction. Our goal is to analyze the usage of CMOVcc in different contexts on modern processors and, based on these results, to propose a technique that automatically decides whether the conditional move or the sequence of a conditional jump and an unconditional move should be used in a given situation.
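The trade-off described in this abstract can be illustrated with a toy cost model. The sketch below is not the paper's decision technique; it only assumes that a branch costs a fixed number of cycles plus a penalty weighted by its misprediction rate, while a conditional move has a fixed cost. The function names and every cycle count are hypothetical placeholders, not measurements.

```python
def expected_branch_cost(mispredict_rate, branch_cost, mispredict_penalty):
    """Expected cycles for a conditional jump followed by an unconditional move.

    All inputs are hypothetical model parameters, not measured values.
    """
    return branch_cost + mispredict_rate * mispredict_penalty


def prefer_cmov(mispredict_rate, branch_cost=1.0, mispredict_penalty=15.0,
                cmov_cost=2.0):
    """Return True when the modeled CMOVcc cost is below the expected branch cost."""
    branch = expected_branch_cost(mispredict_rate, branch_cost, mispredict_penalty)
    return cmov_cost < branch


if __name__ == "__main__":
    # A well-predicted branch versus a data-dependent, poorly predicted one.
    for rate in (0.01, 0.30):
        choice = "cmov" if prefer_cmov(rate) else "branch"
        print(f"misprediction rate {rate:.2f}: prefer {choice}")
```

Under these assumed numbers the model keeps the branch when it is almost always predicted correctly and switches to the conditional move when mispredictions are frequent. A real decision would also have to account for the data dependency that a conditional move introduces.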

University of Szeged

ELTE Digital Institutional Repository (EDIT)

Analysis of Intel's Haswell Microarchitecture Using The ECM Model and Microbenchmarks

Author: Eitzinger Jan
Fey Dietmar
Hager Georg
Hofmann Johannes
Wellein Gerhard
Publication date: 13/11/2015

This paper presents an in-depth analysis of Intel's Haswell microarchitecture for streaming loop kernels. Among the new features examined are the dual-ring Uncore design, Cluster-on-Die mode, Uncore Frequency Scaling, core improvements such as new and improved execution units, and improvements throughout the memory hierarchy. The Execution-Cache-Memory diagnostic performance model is used together with a generic set of microbenchmarks to quantify the efficiency of the microarchitecture. The set of microbenchmarks is chosen such that it can serve as a blueprint for other streaming loop kernels. Comment: arXiv admin note: substantial text overlap with arXiv:1509.0311
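As a rough illustration of how an Execution-Cache-Memory (ECM) style prediction is assembled, the sketch below uses the formulation commonly given for Intel cores: in-core time that overlaps with data transfers competes, via a maximum, with the sum of non-overlapping in-core time and all inter-level transfer times. The abstract does not state this formula, so treat it as an assumption; the function name and the example cycle counts are hypothetical.

```python
def ecm_prediction(t_ol, t_nol, transfers):
    """Predict cycles per unit of work for data residing in each memory level.

    t_ol      -- in-core cycles that can overlap with data transfers
    t_nol     -- in-core cycles that cannot overlap (e.g. L1 loads/stores)
    transfers -- cycles per inter-level transfer, ordered from L1-L2 outward

    Returns one prediction per level: L1 first, then one per transfer added.
    Assumes transfers between levels do not overlap with each other.
    """
    predictions = [max(t_ol, t_nol)]  # data already in L1
    data_cycles = t_nol
    for t in transfers:
        data_cycles += t  # each further level adds its transfer time
        predictions.append(max(t_ol, data_cycles))
    return predictions


if __name__ == "__main__":
    # Hypothetical inputs: L1-L2, L2-L3, and L3-memory transfer cycles.
    print(ecm_prediction(t_ol=4, t_nol=2, transfers=[2, 4, 9]))
```

With these placeholder inputs the prediction stays core-bound while the data fits in the inner cache levels and becomes transfer-bound further out.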

arXiv.org e-Print Archive

Crossref

Parallelizing remote sensing image geometric correction

Author: Bernabeu i Altayó Gerard
Universitat Autònoma de Barcelona. Departament d'Arquitectura de Computadors i Sistemes Operatius
Universitat Autònoma de Barcelona. Escola d'Enginyeria
Publication date: 01/01/2012

The spatial, spectral, and temporal resolutions of remote sensing images, acquired at a reasonable size, result in images that can be processed to represent large areas of terrain with a level of spatial detail that is very attractive for monitoring and management as well as for scientific activities. With Moore's law still holding, more and more parallelism is introduced with each new generation across all computing platforms. Parallelism is present at every level of integration and in programming, in order to obtain higher performance and energy efficiency. Since geometric calibration is one of the most computationally expensive and time-consuming processes when working with remote sensing images, the goal of this work is to accelerate it by taking advantage of new computing technologies and architectures, with particular emphasis on exploiting shared-memory parallel hardware. The most expensive and slowest stage of the geometric correction process for remote sensing images has been parallelized using OpenMP directives. This work compares the performance of the original serial application with the parallelized version through tests on different multiprocessor systems, and proposes several approaches for choosing the most suitable hardware for optimal execution.

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Diposit Digital de Documents de la UAB

Assessing malware detection using hardware performance counters

Author: Gupta Anmol Brijesh
Publication date: 02/11/2017

Despite the use of modern anti-virus (AV) software, malware is a prevailing threat to today's computing systems. AV software cannot cope with the increasing number of evasive malware, calling for more robust malware detection techniques. Out of the many proposed methods for malware detection, researchers have suggested microarchitecture-based mechanisms for detection of malicious software in a system. For example, Intel embeds a shadow stack in their modern architectures that maintains the integrity between function calls and their returns by tracking the function's return address. Any malicious program that exploits an application to overflow the return addresses can be restrained using the shadow stack. Researchers also propose the use of Hardware Performance Counters (HPCs). HPCs are counters embedded in modern computing architectures that count the occurrence of architectural events, such as cache hits, clock cycles, and integer instructions. Malware detectors that leverage HPCs create a profile of an application by reading the counter values periodically. Subsequently, researchers use supervised machine learning-based (ML) classification techniques to differentiate malicious profiles from benign ones. It is important to note that HPCs count the occurrence of microarchitectural events during execution of the program. However, whether a program is malicious or benign is a high-level behavior of the program. Since HPCs do not surveil the high-level behavior of an application, we hypothesize that the counters may fail to capture the difference in the behavioral semantics of malicious and benign software. To investigate whether HPCs capture the behavioral semantics of the program, we recreate the experimental setup from the previously proposed systems. To this end, we leverage HPCs to profile applications such as MS-Office and Chrome as benign applications and known malware binaries as malicious applications.
Standard ML classifiers demand a normally distributed dataset, where the variance is independent of the mean of the data points. To transform the profile into a more normal-like distribution and to avoid over-fitting the machine learning models, we employ a power transform on the profiles of the applications. Moreover, HPCs can monitor a broad range of hardware-based events. We use Principal Component Analysis (PCA) for selecting the top performance events that show maximum variation in the least number of features amongst all the applications profiled. Finally, we train twelve supervised machine learning classifiers, such as Support Vector Machines (SVMs) and MultiLayer Perceptrons (MLPs), on the profiles from the applications. We model each classifier as a binary classifier, where the two classes are 'Benignware' and 'Malware.' Our results show that for the 'Malware' class, the average recall and F2-score across the twelve classifiers are 0.22 and 0.70 respectively. The low recall score shows that the ML classifiers tag malware as benignware. Even though we exercise a statistical approach for selecting our features, the classifiers are not able to distinguish between malware and benignware based on the hardware-based events monitored by the HPCs. The inability of the HPC profiles to capture the behavioral characteristics of an application forces us to question the use of HPCs as malware detectors.
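For readers unfamiliar with the two reported metrics, the sketch below computes recall and the F-beta score (F2 when beta is 2) from confusion-matrix counts using their standard definitions. The counts in the usage example are hypothetical and are not taken from the thesis.

```python
def recall(tp, fn):
    """Fraction of actual positives (e.g. malware samples) that were detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp, fp):
    """Fraction of predicted positives that were actually positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def f_beta(tp, fp, fn, beta=2.0):
    """F-beta score; beta > 1 weights recall more heavily than precision."""
    p = precision(tp, fp)
    r = recall(tp, fn)
    if p == 0.0 and r == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * p * r / (b2 * p + r)


if __name__ == "__main__":
    # Hypothetical counts for a 'Malware' class: most malware is missed.
    tp, fp, fn = 20, 5, 80
    print("recall:", recall(tp, fn))
    print("F2:", f_beta(tp, fp, fn))
```

A low recall here means that most samples in the positive class were labeled as the negative class, which is the failure mode the abstract describes.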

Boston University Institutional Repository (OpenBU)

Implementing BLAKE with AVX, AVX2, and XOP

Author: Jean-Philippe Aumasson
Samuel Neves
Publication venue: International Association for Cryptologic Research (IACR)
Publication date: 29/05/2012

In 2013 Intel will release the AVX2 instructions, which introduce 256-bit single-instruction multiple-data (SIMD) integer arithmetic. This will enable desktop and server processors from this vendor to support 4-way SIMD computation of 64-bit add-rotate-xor algorithms, as well as 8-way 32-bit SIMD computations. AVX2 also includes interesting instructions for cryptographic functions, like any-to-any permute and vectorized table-lookup. In this paper, we explore the potential of AVX2 to speed up the SHA-3 finalist BLAKE, and present the first working assembly implementations of BLAKE-256 and BLAKE-512 with AVX2. We then investigate the potential of the recent AVX and XOP instructions to accelerate BLAKE, and report new speed records on Sandy Bridge and Bulldozer microarchitectures (7.47 and 11.64 cycles per byte for BLAKE-256, 5.71 and 6.95 for BLAKE-512).
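To make "add-rotate-xor" concrete, the following scalar sketch shows one G step of BLAKE-256 on 32-bit words, with the rotation distances 16, 12, 8, and 7. It is not the paper's assembly implementation. The inputs `x` and `y` stand in for the message-word-XOR-constant values that the real function selects through its permutation tables, which are omitted here. The helper `g256_lanes` is a hypothetical name that only mimics how a SIMD unit would apply the same step to independent lanes.

```python
MASK32 = 0xFFFFFFFF


def rotr32(x, n):
    """Rotate a 32-bit word right by n bits (0 < n < 32)."""
    return ((x >> n) | (x << (32 - n))) & MASK32


def g256(a, b, c, d, x, y):
    """One BLAKE-256 G step: only additions mod 2^32, rotations, and XORs.

    x and y are assumed to be precomputed (message word XOR round constant).
    """
    a = (a + b + x) & MASK32
    d = rotr32(d ^ a, 16)
    c = (c + d) & MASK32
    b = rotr32(b ^ c, 12)
    a = (a + b + y) & MASK32
    d = rotr32(d ^ a, 8)
    c = (c + d) & MASK32
    b = rotr32(b ^ c, 7)
    return a, b, c, d


def g256_lanes(lanes):
    """Apply G to each (a, b, c, d, x, y) lane, as SIMD would do in lockstep."""
    return [g256(*lane) for lane in lanes]


if __name__ == "__main__":
    # Eight independent lanes with arbitrary example inputs.
    lanes = [(i, i + 1, i + 2, i + 3, 0x243F6A88, 0x85A308D3) for i in range(8)]
    for out in g256_lanes(lanes):
        print(" ".join(f"{w:08x}" for w in out))
```

Because every operation in the step is lane-independent, the same instruction sequence can run on four 64-bit or eight 32-bit lanes of a 256-bit register, which is the property the abstract relies on.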

Cryptology ePrint Archive

Optimization of molecular dynamics simulation code and applications to biomolecular systems

Author: Bowman David M.
Publication date: 01/01/2015

Doctoral thesis, Biochemistry, Faculdade de Ciências e Tecnologia, Universidade do Algarve, 2015. The performance of molecular dynamics (MD) software such as GROMACS is limited by the software’s ability to perform force calculations. The largest part of this cost comes from nonbonded interactions, such as those between water molecules and between water molecules and solute. The determination of nonbonded interactions may account for over 90% of the total computation and real time of a simulation. The objective of this project is to greatly improve the performance of force calculations for nonbonded interactions on a single core/processor. By doing this it is possible to raise the bar on all simulations that can be performed by GROMACS (single, multi-core or MPI). The resulting modifications then need to be verified to determine that the software still works, that is, that it is still ‘good enough’ for performing molecular dynamics simulations. Virtual Strategy, Inc., Boston, M
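To show why nonbonded interactions dominate the cost, here is a minimal all-pairs sketch of one common nonbonded term, the Lennard-Jones force. It is not GROMACS code and not the thesis's optimization. It assumes reduced units and omits cutoffs, neighbor lists, periodic boundaries, and electrostatics. The function name and parameters are hypothetical.

```python
def lj_forces(positions, epsilon=1.0, sigma=1.0):
    """Return per-particle force vectors from an all-pairs Lennard-Jones sum.

    positions -- list of (x, y, z) tuples; the double loop visits every
                 pair once, so the work grows quadratically with N.
    """
    n = len(positions)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Displacement vector pointing from particle j to particle i.
            d = [positions[i][k] - positions[j][k] for k in range(3)]
            r2 = sum(c * c for c in d)
            sr6 = (sigma * sigma / r2) ** 3  # (sigma / r)^6
            # F_i = 24 * eps * (2 * (sigma/r)^12 - (sigma/r)^6) / r^2 * d
            scale = 24.0 * epsilon * (2.0 * sr6 * sr6 - sr6) / r2
            for k in range(3):
                forces[i][k] += scale * d[k]
                forces[j][k] -= scale * d[k]  # Newton's third law
    return forces


if __name__ == "__main__":
    atoms = [(0.0, 0.0, 0.0), (1.1, 0.0, 0.0), (0.0, 1.2, 0.0)]
    for f in lj_forces(atoms):
        print(f)
```

The inner loop body is what a production code would vectorize and restrict to nearby pairs. This sketch only shows the arithmetic and the quadratic pair count.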

Sapientia