High performance lattice reduction on heterogeneous computing platform

Abstract

The final publication is available at Springer via http://dx.doi.org/10.1007/s11227-014-1201-2

The lattice reduction (LR) technique has become very important in many engineering fields. However, its high complexity makes it difficult to use in real-time applications, especially those that deal with large matrices. As a solution, the modified block LLL (MB-LLL) algorithm was introduced, where several levels of parallelism were exploited: (a) fine-grained parallelism was achieved through the cost-reduced all-swap LLL (CR-AS-LLL) algorithm introduced together with the MB-LLL by Józsa et al. (Proceedings of the tenth international symposium on wireless communication systems, 2013) and (b) coarse-grained parallelism was achieved by applying the block-reduction concept presented by Wetzel (Algorithmic number theory. Springer, New York, pp 323-337, 1998). In this paper, we present the cost-reduced MB-LLL (CR-MB-LLL) algorithm, which significantly reduces the computational complexity of the MB-LLL in two ways: by relaxing the first LLL condition while executing the LR of submatrices, which delays the update of the Gram-Schmidt coefficients, and by using less costly procedures during the boundary checks. The effects of complexity reduction and implementation details are analyzed and discussed for several architectures. A mapping of the CR-MB-LLL on a heterogeneous platform is proposed and compared with implementations running on a dynamic-parallelism-enabled GPU and a multi-core CPU. The proposed mapping allows dynamic scheduling of kernels, where the overhead introduced is hidden by the use of several CUDA streams.
Results show that the CR-MB-LLL algorithm runs faster on the heterogeneous platform than on the multi-core CPU and is more efficient than the CR-AS-LLL algorithm for large matrices.

Financial support for this study was provided by grants TAMOP-4.2.1./B-11/2/KMR-2011-0002 and TAMOP-4.2.2/B-10/1-2010-0014 from the Pazmany Peter Catholic University, the European Union ERDF, the Spanish Government through the TEC2012-38142-C04-01 project, and the Generalitat Valenciana through the PROMETEO/2009/013 project.

Jozsa, CM.; Domene Oltra, F.; Vidal Maciá, AM.; Piñero Sipán, MG.; González Salvador, A. (2014). High performance lattice reduction on heterogeneous computing platform. Journal of Supercomputing. 70(2):772-785. https://doi.org/10.1007/s11227-014-1201-2

Józsa CM, Domene F, Piñero G, González A, Vidal AM (2013) Efficient GPU implementation of lattice-reduction-aided multiuser precoding. In: Proceedings of the tenth international symposium on wireless communication systems (ISWCS 2013)
Wetzel S (1998) An efficient parallel block-reduction algorithm. In: Buhler JP (ed) Algorithmic number theory. Lecture notes in computer science, vol 1423. Springer, Berlin, Heidelberg, pp 323–337
Wubben D, Seethaler D, Jaldén J, Matz G (2011) Lattice reduction. Signal Process Mag IEEE 28(3):70–91
Lenstra AK, Lenstra HW, Lovász L (1982) Factoring polynomials with rational coefficients. Math Ann 261(4):515–534
Bremner MR (2012) Lattice basis reduction: an introduction to the LLL algorithm and its applications. CRC Press, USA
Wu D, Eilert J, Liu D (2008) A programmable lattice-reduction aided detector for MIMO-OFDMA. In: 4th IEEE international conference on circuits and systems for communications (ICCSC 2008), pp 293–297
Barbero LG, Milliner DL, Ratnarajah T, Barry JR, Cowan C (2009) Rapid prototyping of Clarkson’s lattice reduction for MIMO detection. In: IEEE international conference on communications (ICC’09), pp 1–5
Gestner B, Zhang W, Ma X, Anderson D (2011) Lattice reduction for MIMO detection: from theoretical analysis to hardware realization. IEEE Trans Circ Syst I Regul Pap 58(4):813–826
Shabany M, Youssef A, Gulak G (2013) High-throughput 0.13-μm CMOS lattice reduction core supporting 880 Mb/s detection. IEEE Trans Very Large Scale Integr (VLSI) Syst 21(5):848–861
Luo Y, Qiao S (2011) A parallel LLL algorithm. In: Proceedings of the fourth international C* conference on computer science and software engineering, pp 93–101
Backes W, Wetzel S (2011) Parallel lattice basis reduction - the road to many-core. In: IEEE 13th international conference on high performance computing and communications (HPCC)
Ahmad U, Amin A, Li M, Pollin S, Van der Perre L, Catthoor F (2011) Scalable block-based parallel lattice reduction algorithm for an SDR baseband processor. In: 2011 IEEE international conference on communications (ICC)
Villard G (1992) Parallel lattice basis reduction. In: Papers from the international symposium on symbolic and algebraic computation (ISSAC’92). ACM, New York
Domene F, Józsa CM, Vidal AM, Piñero G, Gonzalez A (2013) Performance analysis of a parallel lattice reduction algorithm on many-core architectures. In: Proceedings of the 13th international conference on computational and mathematical methods in science and engineering
Gestner B, Zhang W, Ma X, Anderson DV (2008) VLSI implementation of a lattice reduction algorithm for low-complexity equalization. In: 4th IEEE international conference on circuits and systems for communications (ICCSC 2008), pp 643–647
Burg A, Seethaler D, Matz G (2007) VLSI implementation of a lattice-reduction algorithm for multi-antenna broadcast precoding. In: IEEE international symposium on circuits and systems (ISCAS 2007), pp 673–676
Bruderer L, Studer C, Wenk M, Seethaler D, Burg A (2010) VLSI implementation of a low-complexity LLL lattice reduction algorithm for MIMO detection. In: Proceedings of 2010 IEEE international symposium on circuits and systems (ISCAS
