13 research outputs found

    Data-driven control scheme for linear arrays: Application to a stable insertion sorter

    No full text
    Abstract: We present a strategy for designing stable insertion sorters based on linear arrays with data-driven control. The novelty of our approach lies in each data item carrying a control tag to specify how it is to be operated upon by a receiving cell and in performing two parallel comparisons within each cell. To ensure first-in/first-out handling of equal key values, some data items must be marked to reflect their past histories. Such marking is conveniently carried out by modifying the data item's control tag. It is the combination of the above features that allows us to derive the first single-cycle priority queue that operates in fully pipelined mode, with no broadcasting of data values or control signals. Performing more than two parallel comparisons in each cell can reduce the VLSI implementation cost of our stable sorter. We show that highly cost-effective designs can be obtained by selecting an optimal cell size in terms of the number of comparators it contains.
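    The stability requirement described above (first-in/first-out handling of equal keys) can be illustrated with a small sequential model. This is only a behavioral sketch: it does not model the linear array, the control tags, or the per-cell comparators, and the class name `StableInsertionSorter` and its methods are hypothetical names chosen for illustration.

```python
import bisect


class StableInsertionSorter:
    """Sequential behavioral model of a stable priority queue.

    Hypothetical illustration only: the hardware design in the abstract
    uses a pipelined linear array, which is not modeled here.
    """

    def __init__(self):
        self._keys = []   # sorted keys, kept parallel to _items
        self._items = []  # (key, value) pairs in sorted, stable order

    def insert(self, key, value):
        # bisect_right places the new item after every stored item with
        # an equal key, so equal keys leave in arrival (FIFO) order.
        pos = bisect.bisect_right(self._keys, key)
        self._keys.insert(pos, key)
        self._items.insert(pos, (key, value))

    def extract_min(self):
        # Remove and return the item with the smallest key.
        self._keys.pop(0)
        return self._items.pop(0)

    def __len__(self):
        return len(self._items)


queue = StableInsertionSorter()
for key, value in [(3, "a"), (1, "b"), (3, "c"), (1, "d")]:
    queue.insert(key, value)
drained = [queue.extract_min() for _ in range(len(queue))]
```

    In this model, `drained` lists the two key-1 items in arrival order, followed by the two key-3 items in arrival order.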

    A unified formulation of honeycomb and diamond networks

    No full text
    Abstract: Honeycomb and diamond networks have been proposed as alternatives to mesh and torus architectures for parallel processing. When wraparound links are included in honeycomb and diamond networks, the resulting structures can be viewed as having been derived via a systematic pruning scheme applied to the links of 2D and 3D tori, respectively. The removal of links, which is performed along a diagonal pruning direction, preserves the network's node-symmetry and diameter, while reducing its implementation complexity and VLSI layout area. In this paper, we prove that honeycomb and diamond networks are special subgraphs of complete 2D and 3D tori, respectively, and show this viewpoint to hold important implications for their physical layouts and routing schemes. Because pruning reduces the node degree without increasing the network diameter, the pruned networks have an advantage when the degree-diameter product is used as a figure of merit. Additionally, if the reduced node degree is used as an opportunity to increase the link bandwidths to equalize the costs of pruned and unpruned networks, a gain in communication performance may result. Index Terms: Cayley graph, k-ary n-cube, network topology, processor array, pruned torus network, VLSI layout.

    Incomplete k-ary n-cube and Its Derivatives

    No full text
    Incomplete or pruned k-ary n-cube, n ≥ 3, is derived as follows. All links of dimension n − 1 are left in place and links of the remaining n − 1 dimensions are removed, except for one, which is chosen periodically from the remaining dimensions along the intact dimension n − 1. This leads to a node degree of 4 instead of the original 2n and results in regular networks that are Cayley graphs, provided that n − 1 divides k. For n = 3 (n = 5), the preceding restriction is not problematic, as it only requires that k be even (a multiple of 4). In other cases, changes to the basis network to be pruned, or to the pruning algorithm, can mitigate the problem. Incomplete k-ary n-cube maintains a number of desirable topological properties of its unpruned counterpart despite having fewer links. It is maximally connected, has diameter and fault diameter very close to those of k-ary n-cube, and an average internode distance that is only slightly greater. Hence, the cost/performance tradeoffs offered by our pruning scheme can in fact lead to useful, and practically realizable, parallel architectures. We study pruned k-ary n-cubes in general and offer some additional results for the special case n = 3.
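    The pruning rule in this abstract can be sketched in a few lines of Python. The sketch assumes one particular reading of "chosen periodically": a node whose coordinate along the last dimension is x keeps its links in the last dimension plus those in dimension x mod (n − 1). That selection rule, the 0-based dimension numbering, and the function names are assumptions made for illustration, not details confirmed by the abstract.

```python
from collections import deque
from itertools import product


def pruned_kary_ncube(k, n):
    """Adjacency sets of a pruned k-ary n-cube (assumed pruning rule)."""
    if n < 3 or k % (n - 1) != 0:
        raise ValueError("requires n >= 3 and (n - 1) dividing k")
    adj = {node: set() for node in product(range(k), repeat=n)}
    for node in adj:
        # The last dimension is always kept; one further dimension is
        # selected by the node's coordinate along the last dimension.
        kept = (n - 1, node[n - 1] % (n - 1))
        for dim in kept:
            for step in (1, -1):
                nbr = list(node)
                nbr[dim] = (nbr[dim] + step) % k  # wraparound link
                adj[node].add(tuple(nbr))
    return adj


def reachable(adj, start):
    """Set of nodes reachable from start by breadth-first search."""
    seen = {start}
    frontier = deque([start])
    while frontier:
        u = frontier.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                frontier.append(v)
    return seen


net = pruned_kary_ncube(4, 3)            # pruned 4-ary 3-cube, 64 nodes
degrees = {len(nbrs) for nbrs in net.values()}
connected = len(reachable(net, (0, 0, 0))) == len(net)
```

    For k = 4 and n = 3, every node in this model has degree 4 (rather than 2n = 6) and the network remains connected, which matches the node degree stated in the abstract.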

    Parallel Architectures and Adaptation Algorithms for Programmable FIR Digital Filters With Fully Pipelined Data and Control Flows

    No full text
    Previous designs of programmable finite impulse response (FIR) digital filters have demonstrated that the use of broadcast input data and control can lead to a high performance-to-cost ratio. As the technology moves into deeper submicron regimes, this approach should be reexamined by paying greater attention to the effect of interconnects. In this paper, we quantify the contribution of interconnect delay to the cycle time and demonstrate its negative effects on both scalability and cost-effectiveness of such broadcast designs. We further show how speed and density improvements secured through technology scaling can be maintained by a fully pipelined design in which both data and control signals are restricted to local connections. One important feature of our design is that the data input port is reused for delivering the new coefficients. Consequently, coefficients can be loaded in bit-parallel form with no increase in the number of input pins, thereby facilitating and speeding up run-time adaptation to the application environment. Another feature is that variable-precision coefficients can be accommodated easily and flexibly, with no speed penalty. Because the inner-product computation at the heart of a FIR filter occurs in many other signal processing applications, our design methods and conclusions are widely applicable to the design of application-specific and embedded parallel architectures
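    The inner-product computation that the abstract places at the heart of a FIR filter can be written out as a short sequential sketch. It models only the arithmetic, y[t] = sum over i of c[i] * x[t − i] with a zero initial state; it does not model the pipelined data and control flows, coefficient loading through the data port, or variable-precision coefficients. The function name `fir_filter` is a hypothetical name for illustration.

```python
def fir_filter(coeffs, samples):
    """Direct-form FIR filter: each output is an inner product of the
    coefficient vector with the most recent input samples."""
    taps = [0] * len(coeffs)  # delay line, newest sample first
    out = []
    for x in samples:
        # Shift the new sample into the delay line, dropping the oldest.
        taps = ([x] + taps[:-1])[:len(coeffs)]
        # Inner product of coefficients and delayed samples.
        out.append(sum(c * s for c, s in zip(coeffs, taps)))
    return out


impulse_response = fir_filter([1, 2, 3], [1, 0, 0, 0])
moving_average = fir_filter([0.5, 0.5], [2, 4, 6])
```

    Feeding a unit impulse returns the coefficients themselves followed by zeros, which is a quick sanity check on any FIR implementation.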

    Periodically regular chordal rings

    No full text

    Comparing Four Classes of Torus-Based Parallel Architectures: Network Parameters and Communication Performance

    No full text
    Abstract: The relative communication performance of low- versus high-dimensional torus networks (k-ary n-cubes) has been extensively studied under various assumptions about communication patterns and technological constraints. In this paper, we extend the comparison to torus networks with incomplete, but regular, connectivities. Taking an nD torus as the basis, we show that a simple pruning scheme can be used to reduce the node degree from 2n to 4, while preserving many of the desirable properties of the intact network. Orienting the torus links (removing half of the channels) provides a second form of pruning that leads to (multidimensional) Manhattan street networks. Finally, combined pruning and orientation yields the fourth class of toroidal networks studied here. We compare the static performance parameters of these networks and evaluate their dynamic communication performance under the assumptions of virtual cut-through switching and constant pin count. The 3D case, leading to networks that are efficiently realizable with current technology, is used to demonstrate and quantify the performance benefits. Our results reinforce, extend, and complement previous studies that have demonstrated the performance advantages of low-dimensional k-ary n-cubes over higher-dimensional ones. For example, pruned 3D tori provide additional design points that fall between 2D and 3D tori in terms of implementation complexity but can outperform both of these standard architectures. Thus, from a practical standpoint, pruning introduces additional flexibility in implementation options and trade-offs available to designers. © 2004 Elsevier Ltd. All rights reserved. Keywords: Analytic performance evaluation, Incomplete torus, k-ary n-cube, Manhattan stree

    Scalable Linear Array Architecture with Data-Driven Control for Ultrahigh-Speed Vector Quantization

    No full text
    Abstract. Current and future requirements for adaptive real-time image compression challenge even the capabilities of highly parallel realizations in terms of hardware performance. Previously proposed linear array structures for full-search vector quantization do not offer scalability and adaptivity in this context, because they require separate data/control pins for dynamically updating the codevectors and complicated interlock mechanisms to ensure that the regular data flow is not corrupted as a result of updates. We explore the design space for full-search vector quantizers and propose a novel linear processor array architecture in which global wiring is limited to clock and power supply distribution, thus allowing high-speed processing in spite of only limited communication with the host via the boundary processors. The resulting fully pipelined design is not only area-efficient for VLSI implementation but is also readily scalable and offers extremely high performance
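    Full-search vector quantization, the operation this architecture accelerates, compares an input vector against every codevector and reports the closest one. The sketch below is a plain sequential version, assuming squared Euclidean distance as the distortion measure (the abstract does not name one); the function name `full_search_vq` is hypothetical, and nothing here models the processor array or dynamic codevector updates.

```python
def full_search_vq(vector, codebook):
    """Return (index, squared distance) of the nearest codevector.

    Assumes squared Euclidean distance; every codevector is examined,
    which is what makes the search 'full'.
    """
    best_index, best_dist = -1, float("inf")
    for index, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vector, code))
        if dist < best_dist:  # strict '<' keeps the lowest index on ties
            best_index, best_dist = index, dist
    return best_index, best_dist


codebook = [(0, 0), (10, 10), (3, 4)]
match = full_search_vq((2, 3), codebook)
```

    In a linear array realization, each distance computation would be assigned to a separate processor, with the running minimum passed along the array; this sketch simply performs those comparisons one after another.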