89 research outputs found

    High-Level Annotation of Routing Congestion for Xilinx Vivado HLS Designs

    Ever since transistor cost stopped decreasing, customized programmable platforms, such as field-programmable gate arrays (FPGAs), became a major way to improve software execution performance and energy consumption. While software developers can use high-level synthesis (HLS) to speed up register-transfer level (RTL) code generation from C++ or OpenCL source code, placement and routing issues, such as congestion, can still prevent achieving an FPGA programming bitstream or dramatically reduce the FPGA implementation performance. Congestion reports from physical design tools refer to thousands of RTL signal names instead of developer-accessible identifiers and statements, considerably complicating the developer understanding and resolution of the issues at the source level. We propose a high-level back-annotation flow that summarizes the routing congestion issues at the source level by analyzing the reports from the FPGA physical design tools and the internal debugging files of the HLS tools. Our flow describes congestion using comments back-annotated on the source code and identifies if the congestion causes are the on-chip memories or the DSP units (multipliers/adders), which are the shared resources very often associated with routing problems on FPGAs. We demonstrate on realistic large designs how the information provided by our flow helps to quickly spot congestion causes at the source level and to solve them using appropriate HLS directives

    Adaptive Baseband Processing and Configurable Hardware for Wireless Communication

    The world of information is literally at one’s fingertips, allowing access to previously unimaginable amounts of data, thanks to advances in wireless communication. The growing demand for high speed data has necessitated the use of wider bandwidths, and wireless technologies such as Multiple-Input Multiple-Output (MIMO) have been adopted to increase spectral efficiency. These advanced communication technologies require sophisticated signal processing, often leading to higher power consumption and reduced battery life. Therefore, increasing energy efficiency of baseband hardware for MIMO signal processing has become extremely vital. High Quality of Service (QoS) requirements invariably lead to a larger number of computations and a higher power dissipation. However, recognizing the dynamic nature of the wireless communication medium in which only some channel scenarios require complex signal processing, and that not all situations call for high data rates, allows the use of an adaptive channel aware signal processing strategy to provide a desired QoS. Information such as interference conditions, coherence bandwidth and Signal to Noise Ratio (SNR) can be used to reduce algorithmic computations in favorable channels. Hardware circuits which run these algorithms need flexibility and easy reconfigurability to switch between multiple designs for different parameters. These parameters can be used to tune the operations of different components in a receiver based on feedback from the digital baseband. This dissertation focuses on the optimization of digital baseband circuitry of receivers which use feedback to trade power and performance. A co-optimization approach, where designs are optimized starting from the algorithmic stage through the hardware architectural stage to the final circuit implementation, is adopted to realize energy efficient digital baseband hardware for mobile 4G devices.
These concepts are also extended to the next generation 5G systems where the energy efficiency of the base station is improved. This work includes six papers that examine digital circuits in MIMO wireless receivers. Several key blocks in these receivers include analog circuits that have residual non-linearities, leading to signal intermodulation and distortion. Paper-I introduces a digital technique to detect such non-linearities and calibrate analog circuits to improve signal quality. The concept of a digital nonlinearity tuning system developed in Paper-I is implemented and demonstrated in hardware. The performance of this implementation is tested with an analog channel select filter, and results are presented in Paper-II. MIMO systems such as the ones used in 4G may employ QR Decomposition (QRD) processors to simplify the implementation of tree search based signal detectors. However, the small form factor of the mobile device increases spatial correlation, which is detrimental to signal multiplexing. Consequently, a QRD processor capable of handling high spatial correlation is presented in Paper-III. The algorithm and hardware implementation are optimized for carrier aggregation, which increases requirements on signal processing throughput, leading to higher power dissipation. Paper-IV presents a method to perform channel-aware processing with a simple interpolation strategy to adaptively reduce QRD computation count. Channel properties such as coherence bandwidth and SNR are used to reduce multiplications by 40% to 80%. These concepts are extended to use time domain correlation properties, and a full QRD processor for 4G systems fabricated in 28 nm FD-SOI technology is presented in Paper-V. The design is implemented with a configurable architecture and measurements show that circuit tuning results in a highly energy efficient processor, requiring 0.2 nJ to 1.3 nJ for each QRD.
Finally, these adaptive channel-aware signal processing concepts are examined in the scope of the next generation of communication systems. Massive MIMO systems increase spectral efficiency by using a large number of antennas at the base station. Consequently, the signal processing at the base station has a high computational count. Paper-VI presents a configurable detection scheme which reduces this complexity by using techniques such as selective user detection and interpolation based signal processing. Hardware is optimized for resource sharing, resulting in a highly reconfigurable and energy efficient uplink signal detector

    Efficient Algorithms for Solving Structured Eigenvalue Problems Arising in the Description of Electronic Excitations

    Matrices arising in linear-response time-dependent density functional theory and many-body perturbation theory, in particular in the Bethe-Salpeter approach, show a 2 × 2 block structure. The motivation to devise new algorithms, instead of using general purpose eigenvalue solvers, comes from the need to solve large problems on high performance computers. This requires parallelizable and communication-avoiding algorithms and implementations. We point out various novel directions for diagonalizing structured matrices. These include the solution of skew-symmetric eigenvalue problems in ELPA, as well as structure preserving spectral divide-and-conquer schemes employing generalized polar decompositions

    Algorithms in Lattice QCD

    The enormous computing resources that large-scale simulations in Lattice QCD require will continue to test the limits of even the largest supercomputers into the foreseeable future. The efficiency of such simulations will therefore concern practitioners of lattice QCD for some time to come. I begin with an introduction to those aspects of lattice QCD essential to the remainder of the thesis, and follow with a description of the Wilson fermion matrix M, an object which is central to my theme. The principal bottleneck in Lattice QCD simulations is the solution of linear systems involving M, and this topic is treated in depth. I compare some of the more popular iterative methods, including Minimal Residual, Conjugate Gradient on the Normal Equation, Bi-Conjugate Gradient, QMR, BiCGSTAB and BiCGSTAB2, and then turn to a study of block algorithms, a special class of iterative solvers for systems with multiple right-hand sides. Included in this study are two block algorithms which had not previously been applied to lattice QCD. The next chapters are concerned with a generalised Hybrid Monte Carlo algorithm (GHMC) for QCD simulations involving dynamical quarks. I focus squarely on the efficient and robust implementation of GHMC, and describe some tricks to improve its performance. A limited set of results from HMC simulations at various parameter values is presented. A treatment of the non-hermitian Lanczos method and its application to the eigenvalue problem for M rounds off the theme of large-scale matrix computations

    Proceedings of the Fifth NASA/NSF/DOD Workshop on Aerospace Computational Control

    The Fifth Annual Workshop on Aerospace Computational Control was one in a series of workshops sponsored by NASA, NSF, and the DOD. The purpose of these workshops is to address computational issues in the analysis, design, and testing of flexible multibody control systems for aerospace applications. The intention in holding these workshops is to bring together users, researchers, and developers of computational tools in aerospace systems (spacecraft, space robotics, aerospace transportation vehicles, etc.) for the purpose of exchanging ideas on the state of the art in computational tools and techniques

    High performance selected inversion methods for sparse matrices: direct and stochastic approaches to selected inversion

    The explicit evaluation of selected entries of the inverse of a given sparse matrix is an important process in various application fields and has been gaining visibility in recent years. While a standard inversion process would require the computation of the whole inverse, which is, in general, a dense matrix, state-of-the-art solvers perform a selected inversion process instead. Such an approach allows specific entries of the inverse, e.g., the diagonal, to be extracted while avoiding the standard inversion steps, thereby reducing time and memory requirements. Despite the complexity reduction already achieved, the natural direction for the development of selected inversion software is the parallelization and distribution of the computation, exploiting multinode implementations of the algorithms. In this work we introduce parallel, high performance selected inversion algorithms suitable for both the computation and estimation of the diagonal of the inverse of large, sparse matrices. The first approach is built on top of a sparse factorization method and a distributed computation of the Schur-complement, and is specifically designed for the parallel treatment of large, dense matrices including a sparse block. The second is based on the stochastic estimation of the matrix diagonal using a stencil-based, matrix-free Krylov subspace iteration. We implement the two solvers and demonstrate their excellent performance on Cray supercomputers, focusing on both the multinode scalability and the numerical accuracy. Finally, we include the solvers into two distinct frameworks designed for the solution of selected inversion problems in real-life applications. First, we present a parallel, scalable framework for the log-likelihood maximization in genomic prediction problems including marker by environment effects. Then, we apply the matrix-free estimator to the treatment of large-scale three-dimensional nanoelectronic device simulations with open boundary conditions
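
The stochastic, matrix-free estimation of the diagonal of an inverse mentioned in this abstract can be illustrated with a small sketch. This is not the authors' solver. It assumes a Hutchinson-type estimator with random ±1 probe vectors, a plain conjugate gradient iteration as the Krylov method, and a hypothetical 1D three-point stencil (4 on the diagonal, -1 next to it) standing in for the sparse matrix. All function names are illustrative.

```python
import random


def apply_stencil(x, diag=4.0, off=-1.0):
    """Matrix-free product y = A x for a 1D three-point stencil (assumed test operator)."""
    n = len(x)
    y = [diag * x[i] for i in range(n)]
    for i in range(n):
        if i > 0:
            y[i] += off * x[i - 1]
        if i < n - 1:
            y[i] += off * x[i + 1]
    return y


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def cg(matvec, b, tol=1e-12, max_iter=200):
    """Conjugate gradient for a symmetric positive definite operator given as a function."""
    x = [0.0] * len(b)
    r = list(b)          # residual b - A x, with x = 0 initially
    p = list(r)          # search direction
    rs = dot(r, r)
    for _ in range(max_iter):
        if rs <= tol * tol:
            break
        ap = matvec(p)
        alpha = rs / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x


def estimate_inverse_diagonal(matvec, n, num_probes, seed=0):
    """Average v * (A^{-1} v) elementwise over random +/-1 probe vectors v."""
    rng = random.Random(seed)
    acc = [0.0] * n
    for _ in range(num_probes):
        v = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        x = cg(matvec, v)            # x = A^{-1} v without ever forming A
        for i in range(n):
            acc[i] += v[i] * x[i]
    return [a / num_probes for a in acc]
```

Each probe costs one linear solve, and the estimate for entry i carries a statistical error that shrinks like one over the square root of the number of probes, so the approach trades exactness for the ability to work with operators that are only available as a matrix-vector product.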

    Turbo Bayesian Compressed Sensing

    Compressed sensing (CS) theory specifies a new signal acquisition approach, potentially allowing the acquisition of signals at a much lower data rate than the Nyquist sampling rate. In CS, the signal is not directly acquired but reconstructed from a few measurements. One of the key problems in CS is how to recover the original signal from measurements in the presence of noise. This dissertation addresses signal reconstruction problems in CS. First, a feedback structure and signal recovery algorithm, orthogonal pruning pursuit (OPP), is proposed to exploit the prior knowledge to reconstruct the signal in the noise-free situation. To handle the noise, a noise-aware signal reconstruction algorithm based on Bayesian Compressed Sensing (BCS) is developed. Moreover, a novel Turbo Bayesian Compressed Sensing (TBCS) algorithm is developed for joint signal reconstruction by exploiting both spatial and temporal redundancy. Then, the TBCS algorithm is applied to a UWB positioning system for achieving mm-accuracy with low sampling rate ADCs. Finally, hardware implementation of BCS signal reconstruction on FPGAs and GPUs is investigated. Implementation on GPUs and FPGAs of parallel Cholesky decomposition, which is a key component of BCS, is explored. Simulation results on software and hardware have demonstrated that OPP and TBCS outperform previous approaches, with UWB positioning accuracy improved by 12.8x. The accelerated computation helps enable real-time application of this work

    Real-time power system dynamic simulation

    The present day digital computing resources are overburdened by the amount of calculation necessary for power system dynamic simulation. Although the hardware has improved significantly, the expansion of the interconnected systems, and the requirement for more detailed models with frequent solutions have increased the need for simulating these systems in real time. To achieve this, more effort has been devoted to developing and improving the application of numerical methods and computational techniques such as sparsity-directed approaches and network decomposition to power system dynamic studies. This project is a modest contribution towards solving this problem. It consists of applying a very efficient sparsity technique to the power system dynamic simulator under a wide range of events. The method used was first developed by Zollenkopf [117]. Following the structure of the linear equations related to power system dynamic simulator models, the original algorithm which was conceived for scalar calculation has been modified to use sets of 2 × 2 sub-matrices for both the dynamic and algebraic equations. The realisation of real-time simulators also requires the simplification of the power system models and the adoption of a few assumptions such as neglecting short time constants. Most of the network components are simulated. The generating units include synchronous generators and their local controllers, and the simulated network is composed of transmission lines and transformers with tap-changing and phase-shifting, non-linear static loads, shunt compensators and simplified protection. The simulator is capable of handling some of the severe events which occur in power systems such as islanding, island re-synchronisation and generator start-up and shut-down. To avoid the stiffness problem and ensure the numerical stability of the system at long time steps at a reasonable accuracy, the implicit trapezoidal rule is used for discretising the dynamic equations.
The algebraisation of differential equations requires an iterative process. Also the non-linear network models are generally better solved by the Newton-Raphson iterative method which has an efficient quadratic rate of convergence. This has favoured the adoption of the simultaneous technique over the classical partitioned method. In this case the algebraised differential equations and the non-linear static equations are solved as one set of algebraic equations. Another way of speeding-up centralised simulators is the adoption of distributed techniques. In this case the simulated networks are subdivided into areas which are computed by a multi-task machine (Perkin Elmer PE3230). A coordinating subprogram is necessary to synchronise and control the computation of the different areas, and perform the overall solution of the system. In addition to this decomposed algorithm the developed technique is also implemented in the parallel simulator running on the Array Processor FPS 5205 attached to a Perkin Elmer PE 3230 minicomputer, and a centralised version run on the host computer. Testing these simulators on three networks under a range of events would allow for the assessment of the algorithm and the selection of the best candidate hardware structure to be used as a dedicated machine to support the dynamic simulator. The results obtained from this dynamic simulator are very impressive. Great speed-up is realised, stable solutions under very severe events are obtained showing the robustness of the system, and accurate long-term results are obtained. Therefore, the present simulator provides a realistic test bed to the Energy Management System. It can also be used for other purposes such as operator training
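
The combination of the implicit trapezoidal rule with a Newton-Raphson iteration described in this abstract can be sketched for a single scalar equation x' = f(x). This is only an illustration under simplifying assumptions: the thesis works with coupled differential and algebraic network equations in 2 × 2 blocks, whereas the function names and the scalar setting here are hypothetical.

```python
def trapezoidal_step(f, dfdx, x, h, tol=1e-12, max_iter=20):
    """One implicit trapezoidal step for x' = f(x).

    Solves g(y) = y - x - (h/2) * (f(x) + f(y)) = 0 for the new value y
    with Newton-Raphson, using g'(y) = 1 - (h/2) * f'(y).
    """
    fx = f(x)
    y = x  # initial guess: the previous value
    for _ in range(max_iter):
        g = y - x - 0.5 * h * (fx + f(y))
        dg = 1.0 - 0.5 * h * dfdx(y)
        step = g / dg
        y -= step
        if abs(step) <= tol * max(1.0, abs(y)):
            break
    return y


def integrate(f, dfdx, x0, h, steps):
    """Repeatedly apply the trapezoidal step and return the whole trajectory."""
    xs = [x0]
    for _ in range(steps):
        xs.append(trapezoidal_step(f, dfdx, xs[-1], h))
    return xs
```

For the stiff linear test equation x' = -1000 x with step h = 0.1, each trapezoidal step multiplies x by (1 - 50) / (1 + 50) = -49/51, so the solution decays in magnitude, whereas an explicit Euler step would multiply it by 1 - 100 = -99 and diverge. This is the stability at long time steps that motivates the implicit rule.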

    Intelligent Numerical Software for MIMD Computers

    For most scientific and engineering problems simulated on computers, solving problems of computational mathematics with approximately given initial data constitutes an intermediate or final stage. Basic problems of computational mathematics include investigating and solving linear algebraic systems, evaluating eigenvalues and eigenvectors of matrices, solving systems of non-linear equations, and numerical integration of initial-value problems for systems of ordinary differential equations