10 research outputs found

    Performance analysis of a millimeter wave MIMO channel estimation method in an embedded multi-core processor

    The emerging Multi-Processor System-on-Chip (MPSoC) technology, which combines heterogeneous computing with the high performance of field programmable gate arrays (FPGA), is a promising platform for a large number of applications, including wireless communications and vehicular technology. In this application context, when multiple-input multiple-output (MIMO) scenarios are considered, the system usually has to manage a large number of communication links among sensors and antennas involving different vehicles and users. Millimeter wave (mmWave) communications are one of the key technology enablers toward achieving high data rates in beyond-5G (B5G) systems. Communication at these frequency bands usually involves large antenna arrays, which often demand high computational resources. One candidate platform able to manage a very large number of communication links is the Xilinx Zynq UltraScale+ EG Heterogeneous MPSoC, which is composed of a dual-core Cortex-R5, a quad-core ARM Cortex-A53, a graphics processing unit (GPU) and a high-end FPGA. This work analyzes the computational performance required by a recent mmWave MIMO channel estimation algorithm on a platform of this kind. As a first approach, we focus on the performance that can be achieved on the quad-core ARM Cortex-A53. To this end, we use the BLAS and LAPACK numerical linear algebra libraries. The results show that our reference implementation is able to manage a large MIMO communication system with 256 antennas without exhausting platform resources.
    Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature. Thanks to Grant PID2020-113785RB-100 funded by MCIN/AEI/10.13039/501100011033 and the Ramón y Cajal Grant RYC-2017-22101. The work has also been supported by the Spanish Ministry of Science and Innovation under Grants RTI2018-097045-B-C21, PID2019-106455GB-C21 and PID2020-113656RB-C21, as well as by the Regional Government of Madrid through the projects MIMACUHSPACE-CM-UC3M (2022/00024/001) and PEJD-2019-PRE/TIC-16327.

    Hybrid CPU-GPU implementation of the transformed spatial domain channel estimation algorithm for mmWave MIMO systems

    Hybrid platforms combining multicore central processing units (CPU) with manycore hardware accelerators such as graphics processing units (GPU) can be exploited to provide efficient parallel implementations of wireless communication algorithms for Fifth Generation (5G) and beyond systems. Massive multiple-input multiple-output (MIMO) systems are a key element of the 5G standard, involving several tens or hundreds of antenna elements for communication. Such a high number of antennas has a direct impact on the computational complexity of some MIMO signal processing algorithms. In this work, we focus on the channel estimation stage. In particular, we develop a parallel implementation of a recently proposed MIMO channel estimation algorithm. Its performance in terms of execution time is evaluated both on a multicore CPU and on a GPU. The results show that some computation blocks of the algorithm are more suitable for multicore implementation, whereas other parts are more efficiently implemented on the GPU, indicating that a hybrid CPU-GPU implementation would achieve the best performance in practical applications based on the tested platform.

    Parallel SUMIS Soft Detector for Large MIMO Systems on Multicore and GPU

    The number of transmit and receive antennas is an important factor that affects the performance and complexity of a MIMO system. A MIMO system with a very large number of antennas is a promising candidate technology for next generations of wireless systems. However, the vast majority of the methods proposed for conventional MIMO systems are not suitable for large dimensions. In this context, the use of high-performance computing systems, such as multicore CPUs and graphics processing units, has become attractive for the efficient implementation of parallel signal processing algorithms with high computational requirements. In the present work, two practical parallel approaches to the Subspace Marginalization with Interference Suppression (SUMIS) detector for large MIMO systems are proposed. Both approaches have been evaluated and compared with other detectors in terms of performance and complexity for different system parameters.
    This work has been partially supported by the Spanish MINECO Grant RACHEL TEC2013-47141-C4-4-R, the PROMETEO FASE II 2014/003 Project and FPU AP-2012/71274.
    Ramiro Sánchez, C.; Simarro, MA.; Gonzalez, A.; Vidal Maciá, AM. (2019). Parallel SUMIS Soft Detector for Large MIMO Systems on Multicore and GPU. The Journal of Supercomputing. 75(3):1256-1267. https://doi.org/10.1007/s11227-018-2403-9

    FlexCore: Massively Parallel and Flexible Processing for Large MIMO Access Points

    Large MIMO base stations remain among wireless network designers’ best tools for increasing wireless throughput while serving many clients, but current system designs sacrifice throughput by using simple linear MIMO detection algorithms. Higher-performance detection techniques are known, but remain off the table because these systems parallelize their computation at the level of a whole OFDM subcarrier, which suffices only for the less demanding linear detection approaches they opt for. This paper presents FlexCore, the first computational architecture capable of parallelizing the detection of large numbers of mutually-interfering information streams at a granularity below individual OFDM subcarriers, in a nearly-embarrassingly parallel manner while utilizing any number of available processing elements. For 12 clients sending 64-QAM symbols to a 12-antenna base station, our WARP testbed evaluation shows similar network throughput to the state of the art while using an order of magnitude fewer processing elements. For the same scenario, our combined WARP-GPU testbed evaluation demonstrates a 19x computational speedup, with 97% increased energy efficiency when compared with the state of the art. Finally, for the same scenario, an FPGA-based comparison between FlexCore and the state of the art shows that FlexCore can achieve up to 96% better energy efficiency, and can offer up to 32x the processing throughput.

    Fully Parallel GPU Implementation of a Fixed-Complexity Soft-Output MIMO Detector

    Multicore and graphics processing units (GPUs) can be combined to efficiently implement signal-processing algorithms for communication systems, due to their parallel processing capabilities. This paper proposes a fully parallel fixed-complexity soft-output detector, which is suitable for GPU implementation and allows a considerable decrease in the computational time required for the data detection stage in multiple-input-multiple-output (MIMO) systems. A novel channel matrix preprocessing stage, based on column-norm ordering, is developed to efficiently match the multicore architecture. The throughput of the implementation is shown to outperform other recent implementations and to support some of the configurations in the Long-Term Evolution (LTE) standard.
    This work was supported in part by the PROMETEO/2009/013 and TEC2009-13741 Projects and in part by the AP2007-01417 FPU Ph.D. grant. The review of this paper was coordinated by Prof. Y. Su.
    Roger Varea, S.; Ramiro Sánchez, C.; González Salvador, A.; Almenar Terré, V.; Vidal Maciá, AM. (2012). Fully Parallel GPU Implementation of a Fixed-Complexity Soft-Output MIMO Detector. IEEE Transactions on Vehicular Technology. 61(8):3796-3800. https://doi.org/10.1109/TVT.2012.2210576

    Near Deterministic Signal Processing Using GPU, DPDK, and MKL

    SUMMARY In software-defined radio, digital signal processing requires real-time processing of data and signals. Moreover, in the development of wireless communication systems based on the Long Term Evolution (LTE) standard, real-time operation and low latency of the computation processes are essential for a good user experience. Since computation latency is a key factor in LTE processing, we explore whether graphics processing units (GPUs) can be used to accelerate it. To this end, we explore NVIDIA GPU technology using the Compute Unified Device Architecture (CUDA) programming model to reduce the computation time associated with LTE processing. We briefly present the CUDA architecture and parallel GPU processing in Matlab, then compare the computation times obtained with Matlab and CUDA. We conclude that CUDA and Matlab accelerate the computation of functions that are based on parallel processing algorithms and operate on a single data type, but that this acceleration varies strongly with the algorithm implemented. Intel has proposed the Data Plane Development Kit (DPDK) to facilitate the development of high-performance software for processing telecommunication functions. In this project, we explore its use, together with operating system isolation, to reduce the variability of the computation times of LTE processes. More specifically, we use DPDK with the Math Kernel Library (MKL) to compute the fast Fourier transform (FFT) associated with LTE processing and we measure the computation times. We evaluate four cases: 1) FFT code on the slave core without CPU isolation, 2) FFT code on the slave core with CPU isolation, 3) FFT code using MKL without DPDK, and 4) baseline FFT code. We combine DPDK and MKL for cases 1 and 2 and evaluate which case is more deterministic and reduces the latency of LTE processes the most. We show that the mean computation time for the baseline FFT is about 100 times larger, while the standard deviation is about 20 times higher. We find that MKL offers excellent performance, but since it is not scalable by itself in a cloud environment, combining it with DPDK is a very promising alternative. DPDK improves performance and memory management, and makes MKL scalable.
    ABSTRACT In software defined radio, digital signal processing requires strict real-time processing of data and signals. Specifically, in the development of the Long Term Evolution (LTE) standard, real-time operation and low latency of computation processes are essential to obtain a good user experience. As low-latency computation is critical in real-time processing of LTE, we explore the possibility of using Graphics Processing Units (GPUs) to accelerate its functions. As the first contribution of this thesis, we adopt NVIDIA GPU technology using the Compute Unified Device Architecture (CUDA) programming model in order to reduce the computation times of LTE. Furthermore, we investigate the efficiency of using MATLAB for parallel computing on GPUs. This allows us to evaluate the MATLAB and CUDA programming paradigms and provide a comprehensive comparison between them for parallel computing of LTE processes on GPUs. We conclude that CUDA and MATLAB accelerate processing of structured basic algorithms but that the acceleration is variable and depends on which algorithm is involved. Intel has proposed its Data Plane Development Kit (DPDK) as a tool to develop high-performance software for processing of telecommunication data. As the second contribution of this thesis, we explore the possibility of using DPDK and operating system isolation to reduce the variability of the computation times of LTE processes. Specifically, we use DPDK along with the Math Kernel Library (MKL) provided by Intel to calculate Fast Fourier Transforms (FFT) associated with LTE processes and measure their computation times. We study the computation times in different scenarios where FFT calculation is done with and without the isolation of processing units along with the use of DPDK. Our experimental analysis shows that when DPDK and MKL are used simultaneously and the processing units are isolated, the resulting processing times of FFT calculation are reduced and have a near-deterministic characteristic. Explicitly, using DPDK and MKL along with the isolation of processing units reduces the mean and standard deviation of processing times for FFT calculation by 100 times and 20 times, respectively. Moreover, we conclude that although MKL reduces the computation time of FFTs, it does not offer a scalable solution, but combining it with DPDK is a promising avenue.

    Design and Implementation of Efficient Algorithms for Wireless MIMO Communication Systems

    Over the last decade, one of the most important technological advances leading to the new generation of wireless broadband has been communication via multiple-input multiple-output (MIMO) systems. MIMO technologies have been adopted by many wireless standards such as LTE, WiMAX and WLAN. This is mainly due to their ability to increase the maximum transmission rate, together with the reliability and coverage achieved by current wireless communications, without requiring extra bandwidth or additional transmit power. However, the advantages provided by MIMO systems come at the expense of a substantial increase in the cost of deploying multiple antennas and in receiver complexity, which has a large impact on power consumption. For this reason, the design of low-complexity receivers is an important topic that is addressed throughout this thesis. First, the use of MIMO channel matrix preprocessing techniques is investigated, either to reduce the computational cost of optimal decoders or to improve the performance of suboptimal linear, SIC or tree-search detectors. A detailed description is given of two widely used preprocessing techniques: the Lenstra, Lenstra, Lovasz (LLL) method for lattice reduction (LR) and the VBLAST ZF-DFE algorithm. Both the complexity and the performance of the two methods have been evaluated and compared. In addition, a low-cost implementation of the VBLAST ZF-DFE algorithm is proposed and included in the evaluation. Second, a low-complexity tree-search MIMO detector has been developed, called the variable-breadth K-Best (VB K-Best) detector. The main idea of this method is to exploit the impact of the channel matrix condition number on data detection in order to reduce the complexity of the systems
    Roger Varea, S. (2012). Design and Implementation of Efficient Algorithms for Wireless MIMO Communication Systems [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/16562

    A GPU implementation of an iterative receiver for energy saving MIMO ID-BICM systems

    Iterative detection and decoding in communication systems with multiple transmitter and receiver antennas suffer from a significant increase in computational cost and energy consumption. Nowadays, the application of specific high-performance computing techniques to signal processing in communication systems is receiving considerable attention. In this paper, we present an accelerated and efficient iterative receiver, which has been implemented following two strategies. First, we reduce the computational cost using parallelized algorithms executed on a graphics processing unit. In addition, our receiver allows selection between two types of detectors with different complexity and performance. The selection can be made to meet a given trade-off between bit error rate and power consumption.
    This work has been supported by European Union ERDF and Spanish Government through TEC2012-38142-C04 project and Generalitat Valenciana through PROMETEO/2009/013 project.
    Ramiro Sánchez, C.; Simarro Haro, MDLA.; Martínez Zaldívar, FJ.; Vidal Maciá, AM.; Gonzalez, A. (2014). A GPU implementation of an iterative receiver for energy saving MIMO ID-BICM systems. The Journal of Supercomputing. 70(2):541-551. https://doi.org/10.1007/s11227-013-1081-x

    MIMOPack: a high-performance computing library for MIMO communication systems

    This paper presents MIMOPack, a set of optimized functions to perform some of the most complex stages in multiple-input multiple-output (MIMO) communication systems, such as channel coding, preprocessing, precoding and detection. These functions are optimized to run on a wide range of architectures, increasing the portability of scientific codes between different computing environments. MIMOPack aims to become a useful library for the research community, making it easier for programmers to develop adaptable parallel applications and speeding up the simulation platforms used to assess different technologies proposed by several companies involved in standardization processes.
    This work has been supported by SP20120646 project of Universitat Politecnica de Valencia, by ISIC/2012/006 and PROMETEO FASE II 2014/003 projects of Generalitat Valenciana; and has been supported by European Union ERDF and Spanish Government through TEC2012-38142-C04-01.
    Ramiro Sánchez, C.; Vidal Maciá, AM.; Gonzalez, A. (2015). MIMOPack: a high-performance computing library for MIMO communication systems. The Journal of Supercomputing. 71(2):751-760. https://doi.org/10.1007/s11227-014-1328-1