622 research outputs found

    VLSI design of high-speed adders for digital signal processing applications.

    Get PDF

    A New Optimization Technique for Improving Resource Exploitation and Critical Path Minimization

    Get PDF
    This paper presents a novel approach to algebraic optimization of data-flow graphs in the domain of computationally intensive applications. The presented approach is based upon the paradigm of simulated evolution which has been proven to be a powerful method for solving large non-linear optimization problems. We introduce a genetic algorithm with a new chromosomal representation of data-flow graphs that serves as a basis for preserving the correctness of algebraic transformations and allows an efficient implementation of the genetic operators. Furthermore, we introduce a new class of hardware-related transformation rules which for the first time allow to take existing component libraries into account. The efficiency of our method is demonstrated by encouraging experimental results for several standard benchmarks

    FlexCore: Massively Parallel and Flexible Processing for Large MIMO Access Points

    Get PDF
    Large MIMO base stations remain among wireless network designers’ best tools for increasing wireless throughput while serving many clients, but current system designs, sacrifice throughput with simple linear MIMO detection algorithms. Higher-performance detection techniques are known, but remain off the table because these systems parallelize their computation at the level of a whole OFDM subcarrier, sufficing only for the less demanding linear detection approaches they opt for. This paper presents FlexCore, the first computational architecture capable of parallelizing the detection of large numbers of mutually-interfering information streams at a granularity below individual OFDM subcarriers, in a nearly-embarrassingly parallel manner while utilizing any number of available processing elements. For 12 clients sending 64-QAM symbols to a 12-antenna base station, our WARP testbed evaluation shows similar network throughput to the state-of-the-art while using an order of magnitude fewer processing elements. For the same scenario, our combined WARP-GPU testbed evaluation demonstrates a 19x computational speedup, with 97% increased energy efficiency when compared with the state of the art. Finally, for the same scenario, an FPGA-based comparison between FlexCore and the state of the art shows that FlexCore can achieve up to 96% better energy efficiency, and can offer up to 32x the processing throughput

    Efficient Mapping of Neural Network Models on a Class of Parallel Architectures.

    Get PDF
    This dissertation develops a formal and systematic methodology for efficient mapping of several contemporary artificial neural network (ANN) models on k-ary n-cube parallel architectures (KNC\u27s). We apply the general mapping to several important ANN models including feedforward ANN\u27s trained with backpropagation algorithm, radial basis function networks, cascade correlation learning, and adaptive resonance theory networks. Our approach utilizes a parallel task graph representing concurrent operations of the ANN model during training. The mapping of the ANN is performed in two steps. First, the parallel task graph of the ANN is mapped to a virtual KNC of compatible dimensionality. This involves decomposing each operation into its atomic tasks. Second, the dimensionality of the virtual KNC architecture is recursively reduced through a sequence of transformations until a desired metric is optimized. We refer to this process as folding the virtual architecture. The optimization criteria we consider in this dissertation are defined in terms of the iteration time of the algorithm on the folded architecture. If necessary, the mapping scheme may utilize a subset of the processors of a given KNC architecture if it results in the most efficient simulation. A unique feature of our mapping is that it systematically selects an appropriate degree of parallelism leading to a highly efficient realization of the ANN model on KNC architectures. A novel feature of our work is its ability to efficiently map unit-allocating ANN\u27s. These networks possess a dynamic structure which grows during training. We present a highly efficient scheme for simulating such networks on existing KNC parallel architectures. We assume an upper bound on size of the neural network We perform the folding such that the iteration time of the largest network is minimized. We show that our mapping leads to near-optimal simulation of smaller instances of the neural network. In addition, based on our mapping no data migration or task rescheduling is needed as the size of network grows

    Emerging Technologies - NanoMagnets Logic (NML)

    Get PDF
    In the last decades CMOS technology has ruled the electronic scenario thanks to the constant scaling of transistor sizes. With the reduction of transistor sizes circuit area decreases, clock frequency increases and power consumption decreases accordingly. However CMOS scaling is now approaching its physical limits and many believe that CMOS technology will not be able to reach the end of the Roadmap. This is mainly due to increasing difficulties in the fabrication process, that is becoming very expensive, and to the unavoidable impact of leakage losses, particularly thanks to gate tunnel current. In this scenario many alternative technologies are studied to overcome the limitations of CMOS transistors. Among these possibilities, magnetic based technologies, like NanoMagnet Logic (NML) are among the most interesting. The reason of this interest lies in their magnetic nature, that opens up entire new possibilities in the design of logic circuits, like the possibility to mix logic and memory in the same device. Moreover they have no standby power consumption and potentially a much lower power consumption of CMOS transistors. In literature NML logic is well studied and theoretical and experimental proofs of concept were already found. However two important points are not enough considered in the analysis approach followed by most of the work in literature. First of all, no complex circuits are analyzed. NML logic is very different from CMOS technologies, so to completely understand the potential of this technology it is mandatory to investigate complex architectures. Secondly, most of the solutions proposed do not take into account the constraints derived from fabrication process, making them unrealistic and difficult to be fabricated experimentally. This thesis focuses therefore on NML logic keeping into account these two important limitations in the research approach followed in literature. The aim is to obtain a complete and accurate overview of NML logic, finding realistic circuital solutions and trying to improve at the same time their performance. After a brief and complete introduction (Chapter 1), the thesis is divided in two parts, which cover the two fundamental points followed in this three years of research: A circuits architecture analysis and a technological analysis. In the architecture analysis first an innovative VHDL model is described in Chapter 2. This model is extensively used in the analysis because it allows fast simulation of complex circuits, with, at the same time, the possibility to estimate circuit per- formance, like area and power consumption. In Chapter 3 the problem of signals synchronization in complex NML circuits is analyzed and solved, using as benchmark a simple but complete NML microprocessor. Different solutions based on asynchronous logic are studied and a new asynchronous solution, specifically designed to exploit the potential of NML logic, is developed. In Chapter 4 the layout of NML circuits is studied on a more physical level, considering the limitations of fabrication processes. The layout of NML circuits is therefore changed accordingly to these constraints. Secondly CMOS circuits architectures are compared to more simple architectures, evaluating therefore which one is more suited for NML logic. Finally the problem of interconnections in NML technology is analyzed and solutions to improve it are found. In Chapter 5 the problem of feedback signals in heavy pipelined technologies, like NML, is studied. Solutions to improve performances and synchronize signals are developed. Systolic arrays are then analyzed as possible candidate to exploit NML potential. Finally in Chapter 6 ToPoliNano, a simulator dedicated to NML and other emerging technologies, that we are developing, is described. This simulator allows to follow the same top-down approach followed for CMOS technology. The layout generator and the simulation engine are detailed described. In the first chapter of the technological analysis (Chapter 7), the performance of NML logic is explored throughout low level simulations. The aim is to understand if these circuits can be fabricated with optical lithography, allowing therefore the commercial development of NML logic. Basic logic gates and the clock system are there analyzed from a low level perspective. In Chapter 8 an innovative electric clock system for NML technology is shown and the first experimental results are reported. This clock system allows to achieve true low power for NML technology, obtaining a reduction of power consumption of 20 times considering the best CMOS transistors available. This power consumption takes into account all the losses, also the clock system losses. Moreover the solution presented can be fabricated with current technological processes. The research work behind this thesis represents an important breakthrough in NML logic. The solutions here presented allow the design and fabrication of complex NML circuits, considering the particular characteristics of this technology and considerably improving the performance. Moreover the technological solutions here presented allow the design and fabrication of circuits with available fabrication process with a considerable advantage over CMOS in terms of power consumption. This thesis represents therefore a considerable step froward in the study and development of NML technolog

    A Practical Hierarchical Model of Parallel Computation ll: Binary Tree and FFT Algorithms

    Get PDF
    A companion paper has introduced the Hierarchical PRAM (H-PRAM) model of parallel computation, which achieves a good balance between simplicity of usage and reflectivity of realistic parallel computers. In this paper, we demonstrate the usage of the model by designing and analyzing various algorithms for computing the complete binary tree, and the FFT/butterfly graph. By concentrating on two problems, we are able to demonstrate the results of different combinations of organizational strategies and different types of sub-models of the H-PRAM. The philosophy in algorithm design is to maximize the number of processors P that are efficiently usable with respect to an input size N, and to minimize the inefficiency when efficiency is not possible (when P is too large with respect to N). This can be done because of the H-PRAM\u27s representation of general locality, i.e. both strict and neighborhood locality, and results in algorithms that can efficiently employ more processors (and are thus faster) than algorithms for models that only represent strict locality

    Characterization and Avoidance of Critical Pipeline Structures in Aggressive Superscalar Processors

    Get PDF
    In recent years, with only small fractions of modern processors now accessible in a single cycle, computer architects constantly fight against propagation issues across the die. Unfortunately this trend continues to shift inward, and now the even most internal features of the pipeline are designed around communication, not computation. To address the inward creep of this constraint, this work focuses on the characterization of communication within the pipeline itself, architectural techniques to avoid it when possible, and layout co-design for early detection of problems. I present work in creating a novel detection tool for common case operand movement which can rapidly characterize an applications dataflow patterns. The results produced are suitable for exploitation as a small number of patterns can describe a significant portion of modern applications. Work on dynamic dependence collapsing takes the observations from the pattern results and shows how certain groups of operations can be dynamically grouped, avoiding unnecessary communication between individual instructions. This technique also amplifies the efficiency of pipeline data structures such as the reorder buffer, increasing both IPC and frequency. I also identify the same sets of collapsible instructions at compile time, producing the same benefits with minimal hardware complexity. This technique is also done in a backward compatible manner as the groups are exposed by simple reordering of the binarys instructions. I present aggressive pipelining approaches for these resources which avoids the critical timing often presumed necessary in aggressive superscalar processors. As these structures are designed for the worst case, pipelining them can produce greater frequency benefit than IPC loss. I also use the observation that the dynamic issue order for instructions in aggressive superscalar processors is predictable. Thus, a hardware mechanism is introduced for caching the wakeup order for groups of instructions efficiently. These wakeup vectors are then used to speculatively schedule instructions, avoiding the dynamic scheduling when it is not necessary. Finally, I present a novel approach to fast and high-quality chip layout. By allowing architects to quickly evaluate what if scenarios during early high-level design, chip designs are less likely to encounter implementation problems later in the process.Ph.D.Committee Chair: Scott Wills; Committee Member: David Schimmel; Committee Member: Gabriel Loh; Committee Member: Hsien-Hsin Lee; Committee Member: Yorai Ward
    • …
    corecore