
    Computing the fast Fourier transform on SIMD microprocessors

    This thesis describes how to compute the fast Fourier transform (FFT) of a power-of-two length signal on single-instruction, multiple-data (SIMD) microprocessors faster than, or very close to, the speed of state-of-the-art libraries such as FFTW (“Fastest Fourier Transform in the West”), SPIRAL and Intel Integrated Performance Primitives (IPP). The conjugate-pair algorithm has advantages in terms of memory bandwidth, and three implementations of this algorithm, which incorporate latency and spatial-locality optimizations, are automatically vectorized at the algorithm level of abstraction. Performance results on 2-way, 4-way and 8-way SIMD machines show that the performance scales much better than FFTW or SPIRAL. The implementations presented in this thesis are compiled into a high-performance FFT library called SFFT (“Streaming Fast Fourier Transform”), and benchmarked against FFTW, SPIRAL, Intel IPP and Apple Accelerate on sixteen x86 machines and two ARM NEON machines. SFFT is shown to be, in many cases, faster than these state-of-the-art libraries, but without having to perform extensive machine-specific calibration, thus demonstrating that there are good heuristics for predicting the performance of the FFT on SIMD microprocessors (i.e., the need for empirical optimization may be overstated).

    SIMD-Swift: Improving Performance of Swift Fault Detection

    The general tendency in modern hardware is an increase in fault rates, caused by decreasing operating voltages and feature sizes. Previously, the issue of hardware faults was mainly approached only in high-availability enterprise servers and in safety-critical applications, such as the transport and aerospace domains. These fields generally have very tight requirements, but also higher budgets. However, as fault rates increase, fault-tolerance solutions are starting to be required in applications with much smaller profit margins as well. This brings to the fore the idea of software-implemented hardware fault tolerance, that is, the ability to detect and tolerate hardware faults using software-based techniques in commodity CPUs, which makes it possible to obtain resilience almost for free. Current solutions, however, are lacking in performance, even though they show quite good fault-tolerance results. This thesis explores the idea of using Single Instruction Multiple Data (SIMD) technology to execute all of a program's operations on two copies of the same data. The idea is based on the observation that SIMD is ubiquitous in modern CPUs and is usually an underutilized resource. It allows us to detect bit-flips in hardware by a simple comparison of the two copies, under the assumption that only one copy is affected by a fault. We implemented this idea as a source-to-source compiler which hardens a program at the source-code level. The evaluation of our several implementations shows that the approach is beneficial for applications dominated by arithmetic or logical operations, whereas those with more control-flow or memory operations actually perform better with regular instruction replication. For example, we obtained only 15% performance overhead on a Fast Fourier Transform benchmark, which is dominated by arithmetic instructions, but the memory-access-dominated Dijkstra algorithm showed a high overhead of 200%.

    Near Deterministic Signal Processing Using GPU, DPDK, and MKL

    RÉSUMÉ In software-defined radio, digital signal processing requires real-time processing of data and signals. Moreover, in the development of wireless communication systems based on the Long Term Evolution (LTE) standard, real-time operation and low computation latency are essential to obtain a good user experience. Since computation latency is critical in LTE processing, we want to explore whether graphics processing units (GPUs) can be used to accelerate LTE processing. To this end, we explore NVIDIA GPU technology using the Compute Unified Device Architecture (CUDA) programming model to reduce the computation time associated with LTE processing. We briefly present the CUDA architecture and GPU parallel processing under MATLAB, then compare computation times between MATLAB and CUDA. We conclude that CUDA and MATLAB accelerate the computation of functions that are based on parallel processing algorithms and operate on the same type of data, but that this acceleration varies strongly with the implemented algorithm. Intel has proposed its Data Plane Development Kit (DPDK) to facilitate the development of high-performance software for telecommunication processing. In this project, we explore its use, together with operating-system isolation, to reduce the variability of the computation times of LTE processes. Specifically, we use DPDK with the Math Kernel Library (MKL) to compute the fast Fourier transform (FFT) associated with LTE processing and measure its computation time.
We evaluate four cases: 1) FFT code on a slave core without CPU isolation, 2) FFT code on a slave core with CPU isolation, 3) FFT code using MKL without DPDK, and 4) baseline FFT code. We combine DPDK and MKL for cases 1 and 2 and evaluate which case is the most deterministic and reduces the latency of LTE processes the most. We show that the mean computation time of the baseline FFT is about 100 times larger, while its standard deviation is about 20 times higher. We find that MKL offers excellent performance, but since it is not scalable by itself in the cloud domain, combining it with DPDK is a very promising alternative. DPDK improves performance and memory management and makes MKL scalable.----------ABSTRACT In software defined radio, digital signal processing requires strict real-time processing of data and signals. Specifically, in the development of the Long Term Evolution (LTE) standard, real-time operation and low latency of computation processes are essential to obtain a good user experience. As low-latency computation is critical in real-time processing of LTE, we explore the possibility of using Graphics Processing Units (GPUs) to accelerate its functions. As the first contribution of this thesis, we adopt NVIDIA GPU technology using the Compute Unified Device Architecture (CUDA) programming model in order to reduce the computation times of LTE. Furthermore, we investigate the efficiency of using MATLAB for parallel computing on GPUs. This allows us to evaluate the MATLAB and CUDA programming paradigms and provide a comprehensive comparison between them for parallel computing of LTE processes on GPUs. We conclude that CUDA and MATLAB accelerate the processing of structured basic algorithms, but that the acceleration is variable and depends on which algorithm is involved.
Intel has proposed its Data Plane Development Kit (DPDK) as a tool to develop high-performance software for processing telecommunication data. As the second contribution of this thesis, we explore the possibility of using DPDK and operating-system isolation to reduce the variability of the computation times of LTE processes. Specifically, we use DPDK along with the Math Kernel Library (MKL) provided by Intel to calculate Fast Fourier Transforms (FFTs) associated with LTE processes and measure their computation times. We study the computation times in different scenarios where the FFT calculation is done with and without the isolation of processing units, along with the use of DPDK. Our experimental analysis shows that when DPDK and MKL are used simultaneously and the processing units are isolated, the resulting processing times of the FFT calculation are reduced and have a near-deterministic characteristic. Explicitly, using DPDK and MKL along with the isolation of processing units reduces the mean and standard deviation of processing times for FFT calculation by 100 times and 20 times, respectively. Moreover, we conclude that although MKL reduces the computation time of FFTs, it does not offer a scalable solution by itself, but combining it with DPDK is a promising avenue.
