16 research outputs found

    New FFT/IFFT Factorizations with Regular Interconnection Pattern Stage-to-Stage Subblocks

    Get PDF
    Les factoritzacions de la FFT (Fast Fourier Transform) que presenten un patró d’interconnexió regular entre factors o etapes son conegudes com algorismes paral·lels, o algorismes de Pease, ja que foren originalment proposats per Pease. En aquesta contribució s’han desenvolupat noves factoritzacions amb blocs que presenten el patró d’interconnexió regular de Pease. S’ha mostrat com aquests blocs poden ser obtinguts a una escala prèviament seleccionada. Les noves factoritzacions per ambdues FFT i IFFT (Inverse FFT) tenen dues classes de factors: uns pocs factors del tipus Cooley-Tukey i els nous factors que proporcionen la mateix patró d’interconnexió de Pease en blocs. Per a una factorització donada, els blocs comparteixen dimensions, el patró d’interconnexió etapa a etapa i a més cada un d’ells pot ser calculat independentment dels altres.FFT (Fast Fourier Transform) factorizations presenting a regular interconnection pattern between factors or stages are known as parallel algorithms, or Pease algorithms since were first proposed by Pease. In this paper, new FFT/IFFT (Inverse FFT) factorizations with blocks that exhibit regular Pease interconnection pattern are derived. It is shown these blocks can be obtained at a previously selected scale. The new factorizations for both the FFT and IFFT have two kinds of factors: a few Cooley-Tukey type factors and new factors providing the same Pease interconnection pattern property in blocks. For a given factorization, these blocks share dimensions, the interconnection pattern stage-to-stage, and all of them can be calculated independently from one another.Las factoritzaciones de la FFT (Fast Fourier Transform) que presentan un patrón de interconexiones regular entre factores o etapas son conocidas como algoritmos paralelos, o algoritmos de Pease, puesto que fueron originalmente propuestos por Pease. En esta contribución se han desarrollado nuevas factoritzaciones en subbloques que presentan el patrón de interconexión regular de Pease. Se ha mostrado como estos bloques pueden ser obtenidos a una escalera previamente seleccionada. Las nuevas factoritzaciones para ambas FFT y IFFT (Inverse FFT) tienen dos clases de factores: unos pocos factores del tipo Cooley-Tukey y los nuevos factores que proporcionan el mismo patrón de interconexión de Pease en bloques. Para una factoritzación dada, los bloques comparten dimensiones, patrón d’interconexión etapa a etapa y además cada uno de ellos puede ser calculado independientemente de los otros

    Desenvolvimento em VHDL da camada física de um transmissor 4G

    Get PDF
    The LTE and LTE-Advanced technologies are standards to the fourth mobile generation, or 4G. The planned successor of this mobile generation is 5G, which will be based on 5G-New Radio (5G-NR) standard. The 5G technology is on an initial phase of deployment. One of its features that are essential in this initial phase is the support for 4G communications, because many of the mobile devices currently in use do not have support for 5G communications. This support is made possible if there is an implementation where 4G and 5G networks both coexist with each other. In the future, with the increasing usage of mobile devices with 5G support, there will be a gradual migration of 4G networks to 5G, releasing frequency spectrums currently reserved for 4G so that those can be occupied by 5G. The data transmissions in 4G require quite a lot of the processing capacity of all systems within the mobile network. For 5G, the data transmissions, in terms of traffic volume and speed, are larger than 4G transmissions, requiring new systems to be implemented, to allow the processing of larger quantities of data. Implementation in hardware of a 4G Uplink transmission chain, at the physical layer level PHY-Low, will allow the optimization of certain processes that a CPU could handle, reducing CPU usage and time spent on processing. The use of FPGAs makes this possible, as FPGAs can perform parallel tasks simultaneously and perform digital signal processing. The purpose of this dissertation is the modelling of a 4G LTE Uplink transmitter, at the physical layer level. Then, synthesizable VHDL code is generated from the modeled system, which can be eventually implemented in FPGAs. The modelling of the system is made in Simulink, a tool inside the MATLAB software, which allows for modelling, simulating and analyzing systems in a graphic environment and has applications in control systems and digital signal processing. The VHDL code is generated from HDL Coder, another tool in MATLAB software, generating synthesizable Verilog and VHDL code, from the MATLAB functions and Simulink models. The results obtained of processed data from the system are analyzed and validated, comparing the reference data generated from Wireless Waveform Generator toolbox in MATLAB.A tecnologia LTE e LTE-Advanced são standards da quarta geração de comunicações moveis atuais, ou 4G. Futuramente, o 5G marca a próxima geração de comunicações moveis, segundo o standard 5G-New Radio (5GNR). A tecnologia 5G encontra-se numa fase inicial de implementação, sendo que nessa fase uma das suas características fundamentais é o suporte para comunicações 4G, pois muitos dos dispositivos moveis usados atualmente não possuem suporte para comunicações 5G. Este suporte para 4G é tornado possível, se for feita uma implementação onde as redes 4G e 5G se encontrem em coexistência. No futuro, com o aumento do uso de dispositivos moveis com suporte para 5G, haverá uma migração gradual de redes 4G para 5G, libertando os espectros de frequências reservados atualmente para o 4G para serem ocupados pelo 5G. As transmissões de dados no 4G exigem bastante da capacidade de processamento de todos os sistemas da rede movel. Para o 5G, as transmissões de dados tem volumes de tráfego e velocidades maiores do que as transmissões de dados 4G, fazendo com que novos sistemas tenham de ser implementados para poder processar maiores quantidades de dados. A implementação em hardware da cadeia de transmissão 4G Uplink, ao nível da camada física PHY-Low, permitirá a otimização de certos processos que um CPU poderia lidar, diminuindo o uso do CPU e o tempo gasto em processamento. O uso de FPGAs torna isto possível, tendo em conta que podem realizar tarefas em paralelo, em modo simultâneo, e fazer processamento digital de sinal. O objetivo desta dissertação assenta na modelação de um transmissor 4G LTE Uplink, ao nível da camada física. Depois, é gerado código VHDL sintetizável a partir do sistema modelado, que eventualmente será implementada em FPGAs. A modelação do sistema é feito em Simulink, uma ferramenta no software do MATLAB, que permite modelar, simular e analisar sistemas num ambiente gráfico e tem aplicações para sistemas de controlo e processamento digital de sinal. O código VHDL é gerado a partir do HDL Coder, uma outra ferramenta no software do MATLAB, que gera Verilog e VHDL sintetizáveis, a partir de funções MATLAB e de modelos Simulink. Os resultados obtidos dos dados processados pelo sistema são analisados e validados, comparando com os dados de referência obtidos a partir da toolbox Wireless Waveform Generator do MATLAB.Mestrado em Engenharia Eletrónica e Telecomunicaçõe

    FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores

    Full text link
    Convolution models with long filters have demonstrated state-of-the-art reasoning abilities in many long-sequence tasks but lag behind the most optimized Transformers in wall-clock time. A major bottleneck is the Fast Fourier Transform (FFT)--which allows long convolutions to run in O(NlogN)O(N logN) time in sequence length NN but has poor hardware utilization. In this paper, we study how to optimize the FFT convolution. We find two key bottlenecks: the FFT does not effectively use specialized matrix multiply units, and it incurs expensive I/O between layers of the memory hierarchy. In response, we propose FlashFFTConv. FlashFFTConv uses a matrix decomposition that computes the FFT using matrix multiply units and enables kernel fusion for long sequences, reducing I/O. We also present two sparse convolution algorithms--1) partial convolutions and 2) frequency-sparse convolutions--which can be implemented simply by skipping blocks in the matrix decomposition, enabling further opportunities for memory and compute savings. FlashFFTConv speeds up exact FFT convolutions by up to 7.93×\times over PyTorch and achieves up to 4.4×\times speedup end-to-end. Given the same compute budget, FlashFFTConv allows Hyena-GPT-s to achieve 2.3 points better perplexity on the PILE and M2-BERT-base to achieve 3.3 points higher GLUE score--matching models with twice the parameter count. FlashFFTConv also achieves 96.1% accuracy on Path-512, a high-resolution vision task where no model had previously achieved better than 50%. Furthermore, partial convolutions enable longer-sequence models--yielding the first DNA model that can process the longest human genes (2.3M base pairs)--and frequency-sparse convolutions speed up pretrained models while maintaining or improving model quality

    Optimization Techniques for Mapping Algorithms and Applications onto CUDA GPU Platforms and CPU-GPU Heterogeneous Platforms

    Get PDF
    An emerging trend in processor architecture seems to indicate the doubling of the number of cores per chip every two years with same or decreased clock speed. Of particular interest to this thesis is the class of many-core processors, which are becoming more attractive due to their high performance, low cost, and low power consumption. The main goal of this dissertation is to develop optimization techniques for mapping algorithms and applications onto CUDA GPUs and CPU-GPU heterogeneous platforms. The Fast Fourier transform (FFT) constitutes a fundamental tool in computational science and engineering, and hence a GPU-optimized implementation is of paramount importance. We first study the mapping of the 3D FFT onto the recent, CUDA GPUs and develop a new approach that minimizes the number of global memory accesses and overlaps the computations along the different dimensions. We obtain some of the fastest known implementations for the computation of multi-dimensional FFT. We then present a highly multithreaded FFT-based direct Poisson solver that is optimized for the recent NVIDIA GPUs. In addition to the massive multithreading, our algorithm carefully manages the multiple layers of the memory hierarchy so that all global memory accesses are coalesced into 128-bytes device memory transactions. As a result, we have achieved up to 375GFLOPS with a bandwidth of 120GB/s on the GTX 480. We further extend our methodology to deal with CPU-GPU based heterogeneous platforms for the case when the input is too large to fit on the GPU global memory. We develop optimization techniques for memory-bound, and computation-bound application. The main challenge here is to minimize data transfer between the CPU memory and the device memory and to overlap as much as possible these transfers with kernel execution. For memory-bounded applications, we achieve a near-peak effective PCIe bus bandwidth, 9-10GB/s and performance as high as 145 GFLOPS for multi-dimensional FFT computations and for solving the Poisson equation. We extend our CPU-GPU based software pipeline to a computation-bound application-DGEMM, and achieve the illusion of a memory of the CPU memory size and a computation throughput similar to a pure GPU

    Digital and Mixed Domain Hardware Reduction Algorithms and Implementations for Massive MIMO

    Get PDF
    Emerging 5G and 6G based wireless communications systems largely rely on multiple-input-multiple-output (MIMO) systems to reduce inherently extensive path losses, facilitate high data rates, and high spatial diversity. Massive MIMO systems used in mmWave and sub-THz applications consists of hundreds perhaps thousands of antenna elements at base stations. Digital beamforming techniques provide the highest flexibility and better degrees of freedom for phased antenna arrays as compared to its analog and hybrid alternatives but has the highest hardware complexity. Conventional digital beamformers at the receiver require a dedicated analog to digital converter (ADC) for every antenna element, leading to ADCs for elements. The number of ADCs is the key deterministic factor for the power consumption of an antenna array system. The digital hardware consists of fast Fourier transform (FFT) cores with a multiplier complexity of (N log2N) for an element system to generate multiple beams. It is required to reduce the mixed and digital hardware complexities in MIMO systems to reduce the cost and the power consumption, while maintaining high performance. The well-known concept has been in use for ADCs to achieve reduced complexities. An extension of the architecture to multi-dimensional domain is explored in this dissertation to implement a single port ADC to replace ADCs in an element system, using the correlation of received signals in the spatial domain. This concept has applications in conventional uniform linear arrays (ULAs) as well as in focal plane array (FPA) receivers. Our analysis has shown that sparsity in the spatio-temporal frequency domain can be exploited to reduce the number of ADCs from N to where . By using the limited field of view of practical antennas, multiple sub-arrays are combined without interferences to achieve a factor of K increment in the information carrying capacity of the ADC systems. Applications of this concept include ULAs and rectangular array systems. Experimental verifications were done for a element, 1.8 - 2.1 GHz wideband array system to sample using ADCs. This dissertation proposes that frequency division multiplexing (FDM) receiver outputs at an intermediate frequency (IF) can pack multiple (M) narrowband channels with a guard band to avoid interferences. The combined output is then sampled using a single wideband ADC and baseband channels are retrieved in the digital domain. Measurement results were obtained by employing a element, 28 GHz antenna array system to combine channels together to achieve a 75% reduction of ADC requirement. Implementation of FFT cores in the digital domain is not always exact because of the finite precision. Therefore, this dissertation explores the possibility of approximating the discrete Fourier transform (DFT) matrix to achieve reduced hardware complexities at an allowable cost of accuracy. A point approximate DFT (ADFT) core was implemented on digital hardware using radix-32 to achieve savings in cost, size, weight and power (C-SWaP) and synthesized for ASIC at 45-nm technology

    Algorithms and Circuits for Analog-Digital Hybrid Multibeam Arrays

    Get PDF
    Fifth generation (5G) and beyond wireless communication systems will rely heavily on larger antenna arrays combined with beamforming to mitigate the high free-space path-loss that prevails in millimeter-wave (mmW) and above frequencies. Sharp beams that can support wide bandwidths are desired both at the transmitter and the receiver to leverage the glut of bandwidth available at these frequency bands. Further, multiple simultaneous sharp beams are imperative for such systems to exploit mmW/sub-THz wireless channels using multiple reflected paths simultaneously. Therefore, multibeam antenna arrays that can support wider bandwidths are a key enabler for 5G and beyond systems. In general, N-beam systems using N-element antenna arrays will involve circuit complexities of the order of N2. This dissertation investigates new analog, digital and hybrid low complexity multibeam beamforming algorithms and circuits for reducing the associated high size, weight, and power (SWaP) complexities in larger multibeam arrays. The research efforts on the digital beamforming aspect propose the use of a new class of discrete Fourier transform (DFT) approximations for multibeam generation to eliminate the need for digital multipliers in the beamforming circuitry. For this, 8-, 16- and 32-beam multiplierless multibeam algorithms have been proposed for uniform linear array applications. A 2.4 GHz 16-element array receiver setup and a 5.8 GHz 32-element array receiver system which use field programmable gate arrays (FPGAs) as digital backend have been built for real-time experimental verification of the digital multiplierless algorithms. The multiplierless algorithms have been experimentally verified by digitally measuring beams. It has been shown that the measured beams from the multiplierless algorithms are in good agreement with the exact counterpart algorithms. Analog realizations of the proposed approximate DFT transforms have also been investigated leading to low-complex, high bandwidth circuits in CMOS. Further, a novel approach for reducing the circuit complexity of analog true-time delay (TTD) N-beam beamforming networks using N-element arrays has been proposed for wideband squint-free operation. A sparse factorization of the N-beam delay Vandermonde beamforming matrix is used to reduce the total amount of TTD elements that are needed for obtaining N number of beams in a wideband array. The method has been verified using measured responses of CMOS all-pass filters (APFs). The wideband squint-free multibeam algorithm is also used to propose a new low-complexity hybrid beamforming architecture targeting future 5G mmW systems. Apart from that, the dissertation also explores multibeam beamforming architectures for uniform circular arrays (UCAs). An algorithm having N log N circuit complexity for simultaneous generation of N-beams in an N-element UCA is explored and verified

    Implementation and Evaluation of Algorithmic Skeletons: Parallelisation of Computer Algebra Algorithms

    Get PDF
    This thesis presents design and implementation approaches for the parallel algorithms of computer algebra. We use algorithmic skeletons and also further approaches, like data parallel arithmetic and actors. We have implemented skeletons for divide and conquer algorithms and some special parallel loops, that we call ‘repeated computation with a possibility of premature termination’. We introduce in this thesis a rational data parallel arithmetic. We focus on parallel symbolic computation algorithms, for these algorithms our arithmetic provides a generic parallelisation approach. The implementation is carried out in Eden, a parallel functional programming language based on Haskell. This choice enables us to encode both the skeletons and the programs in the same language. Moreover, it allows us to refrain from using two different languages—one for the implementation and one for the interface—for our implementation of computer algebra algorithms. Further, this thesis presents methods for evaluation and estimation of parallel execution times. We partition the parallel execution time into two components. One of them accounts for the quality of the parallelisation, we call it the ‘parallel penalty’. The other is the sequential execution time. For the estimation, we predict both components separately, using statistical methods. This enables very confident estimations, although using drastically less measurement points than other methods. We have applied both our evaluation and estimation approaches to the parallel programs presented in this thesis. We haven also used existing estimation methods. We developed divide and conquer skeletons for the implementation of fast parallel multiplication. We have implemented the Karatsuba algorithm, Strassen’s matrix multiplication algorithm and the fast Fourier transform. The latter was used to implement polynomial convolution that leads to a further fast multiplication algorithm. Specially for our implementation of Strassen algorithm we have designed and implemented a divide and conquer skeleton basing on actors. We have implemented the parallel fast Fourier transform, and not only did we use new divide and conquer skeletons, but also developed a map-and-transpose skeleton. It enables good parallelisation of the Fourier transform. The parallelisation of Karatsuba multiplication shows a very good performance. We have analysed the parallel penalty of our programs and compared it to the serial fraction—an approach, known from literature. We also performed execution time estimations of our divide and conquer programs. This thesis presents a parallel map+reduce skeleton scheme. It allows us to combine the usual parallel map skeletons, like parMap, farm, workpool, with a premature termination property. We use this to implement the so-called ‘parallel repeated computation’, a special form of a speculative parallel loop. We have implemented two probabilistic primality tests: the Rabin–Miller test and the Jacobi sum test. We parallelised both with our approach. We analysed the task distribution and stated the fitting configurations of the Jacobi sum test. We have shown formally that the Jacobi sum test can be implemented in parallel. Subsequently, we parallelised it, analysed the load balancing issues, and produced an optimisation. The latter enabled a good implementation, as verified using the parallel penalty. We have also estimated the performance of the tests for further input sizes and numbers of processing elements. Parallelisation of the Jacobi sum test and our generic parallelisation scheme for the repeated computation is our original contribution. The data parallel arithmetic was defined not only for integers, which is already known, but also for rationals. We handled the common factors of the numerator or denominator of the fraction with the modulus in a novel manner. This is required to obtain a true multiple-residue arithmetic, a novel result of our research. Using these mathematical advances, we have parallelised the determinant computation using the Gauß elimination. As always, we have performed task distribution analysis and estimation of the parallel execution time of our implementation. A similar computation in Maple emphasised the potential of our approach. Data parallel arithmetic enables parallelisation of entire classes of computer algebra algorithms. Summarising, this thesis presents and thoroughly evaluates new and existing design decisions for high-level parallelisations of computer algebra algorithms

    Performance Primitives for Artificial Neural Networks

    Get PDF
    Optimized software implementations of artificial neural networks leverage primitives from performance libraries, such as the BLAS. However, these primitives were prototyped decades ago, and do not necessarily reflect the patterns of computations in neural networks. I propose modifications to common primitives provided by performance libraries to make them better building blocks for artificial neural networks, with a focus on inference, i.e. evaluation of a pre-trained artificial neural network. I suggest three classes of performance primitives for the convolutional operators and two optimized building blocks for softmax operators. High-intensity convolutional operators with large kernel sizes and unit stride benefit from asymptotically fast convolution algorithms based on Winograd transform and Fast Fourier transform. I jointly consider Fourier or Winograd transform and the matrix-matrix multiplication of blocks of transformed coefficients and suggest tuple-GEMM primitive which balance the number of irregular memory writes in the transformation with sufficient register blocking and instruction-level parallelism in the matrix-matrix multiplication part. Tuple-GEMM primitive can be thought of as a batched GEMM with a fixed architecture-dependent batch size and can be efficiently implemented as a modification of the Goto matrix-matrix multiplication algorithm. I additionally analyze small 2D Fast Fourier transforms, and suggest options that work best for modern wide-SIMD processors. Lower-intensity convolutional operators with small kernel sizes, non-unit strides, or dilation do not benefit from the fast convolution algorithms and require a different set of optimizations. To accelerate these cases I suggest replacing the traditional GEMM primitive with a novel Indirect GEMM primitive. Indirect GEMM primitive is a slight modification of GEMM and can leverage the extensive research on efficient GEMM implementations. I further introduce the Indirect Convolution algorithm which builds on top of the Indirect GEMM primitive, eliminates the runtime overhead of patch-building memory transformations and substantially reduce the memory complexity in convolutional operators compared to the traditional GEMM-based algorithms. Pointwise, or 1x1, convolutional operators directly map to matrix-matrix multiplication, and prompt yet another approach to optimization. I demonstrate that neural networks heavy on pointwise convolutions can greatly benefit from sparsification of the weights tensor and representing the operation as a sparse-matrix-dense-matrix multiplication (SpMM) and introduce neural network-optimized SpMM primitives. While SpMM primitives in Sparse BLAS libraries target problems with extremely high sparsity (commonly 99+% sparsity) and non-random sparsity patterns, the proposed SpMM primitive is demonstrated to work well with moderate sparsity in the 70-95% range and unpredictable sparsity patterns. Softmax operator is light on elementary floating-point operations, but involves evaluation of the exponential function, which in many implementations becomes the bottleneck. I demonstrate that with the high-throughput vector exponential function the softmax computation saturates the memory bandwidth and can be further improved only by reducing the number of memory access operations. I then constructively prove that it is possible to replace the traditional three-pass softmax algorithms with a novel two-pass algorithm for up to 28% runtime reduction. I implemented the proposed ideas in the open source NNPACK, QNNPACK, and XNNPACK libraries for acceleration of neural networks on CPUs, which at the time of release delivered state-of-the-art performance on mobile, server, and Web platforms.Ph.D

    Methods for Estimating The Diagonal of Matrix Functions

    Get PDF
    Many applications such as path integral evaluation in Lattice Quantum Chromodynamics (LQCD), variance estimation of least square solutions and spline ts, and centrality measures in network analysis, require computing the diagonal of a function of a matrix, Diag(f(A)) where A is sparse matrix, and f is some function. Unfortunately, when A is large, this can be computationally prohibitive. Because of this, many applications resort to Monte Carlo methods. However, Monte Carlo methods tend to converge slowly. One method for dealing with this shortcoming is probing. Probing assumes that nodes that have a large distance between them in the graph of A, have only a small weight connection in f(A). to determine the distances between nodes, probing forms Ak. Coloring the graph of this matrix will group nodes that have a high distance between them together, and thus a small connection in f(A). This enables the construction of certain vectors, called probing vectors, that can capture the diagonals of f(A). One drawback of probing is in many cases it is too expensive to compute and store A^k for the k that adequately determines which nodes have a strong connection in f(A). Additionally, it is unlikely that the set of probing vectors required for A^k is a subset of the probing vectors needed for Ak+1. This means that if more accuracy in the estimation is required, all previously computed work must be discarded. In the case where the underlying problem arises from a discretization of a partial dierential equation (PDE) onto a lattice, we can make use of our knowledge of the geometry of the lattice to quickly create hierarchical colorings for the graph of A^k. A hierarchical coloring is one in which colors for A^{k+1} are created by splitting groups of nodes sharing a color in A^k. The hierarchical property ensures that the probing vectors used to estimate Diag(f(A)) are nested subsets, so if the results are inaccurate the estimate can be improved without discarding the previous work. If we do not have knowledge of the intrinsic geometry of the matrix, we propose two new classes of methods that improve on the results of probing. One method seeks to determine structural properties of the matrix f(A) by obtaining random samples of the columns of f(A). The other method leverages ideas arising from similar problems in graph partitioning, and makes use of the eigenvectors of f(A) to form effective hierarchical colorings. Our methods have thus far seen successful use in computational physics, where they have been applied to compute observables arising in LQCD. We hope that the renements presented in this work will enable interesting applications in many other elds
    corecore