1,765 research outputs found
A taxonomy of parallel sorting
TR 84-601In this paper, we propose a taxonomy of parallel sorting that includes a broad range of array
and file sorting algorithms. We analyze the evolution of research on parallel sorting, from the
earliest sorting networks to the shared memory algorithms and the VLSI sorters. In the context
of sorting networks, we describe two fundamental parallel merging schemes - the odd-even and
the bitonic merge. Sorting algorithms have been derived from these merging algorithms for parallel
computers where processors communicate through interconnection networks such as the perfect
shuffle, the mesh and a number of other sparse networks. After describing the network sorting
algorithms, we show that, with a shared memory model of parallel computation, faster algorithms
have been derived from parallel enumeration sorting schemes, where keys are first ranked and
then rearranged according to their rank
On the average running time of odd-even merge sort
This paper is concerned with the average running time of Batcher's odd-even merge sort when implemented on a collection of processors. We consider the case where , the size of the input, is an arbitrary multiple of the number of processors used. We show that Batcher's odd-even merge (for two sorted lists of length each) can be implemented to run in time on the average, and that odd-even merge sort can be implemented to run in time on the average. In the case of merging (sorting), the average is taken over all possible outcomes of the merging (all possible permutations of elements). That means that odd-even merge and odd-even merge sort have an optimal average running time if . The constants involved are also quite small
Energy-Efficient Algorithms
We initiate the systematic study of the energy complexity of algorithms (in
addition to time and space complexity) based on Landauer's Principle in
physics, which gives a lower bound on the amount of energy a system must
dissipate if it destroys information. We propose energy-aware variations of
three standard models of computation: circuit RAM, word RAM, and
transdichotomous RAM. On top of these models, we build familiar high-level
primitives such as control logic, memory allocation, and garbage collection
with zero energy complexity and only constant-factor overheads in space and
time complexity, enabling simple expression of energy-efficient algorithms. We
analyze several classic algorithms in our models and develop low-energy
variations: comparison sort, insertion sort, counting sort, breadth-first
search, Bellman-Ford, Floyd-Warshall, matrix all-pairs shortest paths, AVL
trees, binary heaps, and dynamic arrays. We explore the time/space/energy
trade-off and develop several general techniques for analyzing algorithms and
reducing their energy complexity. These results lay a theoretical foundation
for a new field of semi-reversible computing and provide a new framework for
the investigation of algorithms.Comment: 40 pages, 8 pdf figures, full version of work published in ITCS 201
Algorithm Development and VLSI Implementation of Energy Efficient Decoders of Polar Codes
With its low error-floor performance, polar codes attract significant attention as the potential standard error correction code (ECC) for future communication and data storage. However, the VLSI implementation complexity of polar codes decoders is largely influenced by its nature of in-series decoding. This dissertation is dedicated to presenting optimal decoder architectures for polar codes. This dissertation addresses several structural properties of polar codes and key properties of decoding algorithms that are not dealt with in the prior researches. The underlying concept of the proposed architectures is a paradigm that simplifies and schedules the computations such that hardware is simplified, latency is minimized and bandwidth is maximized.
In pursuit of the above, throughput centric successive cancellation (TCSC) and overlapping path list successive cancellation (OPLSC) VLSI architectures and express journey BP (XJBP) decoders for the polar codes are presented.
An arbitrary polar code can be decomposed by a set of shorter polar codes with special characteristics, those shorter polar codes are referred to as constituent polar codes. By exploiting the homogeneousness between decoding processes of different constituent polar codes, TCSC reduces the decoding latency of the SC decoder by 60% for codes with length n = 1024. The error correction performance of SC decoding is inferior to that of list successive cancellation decoding. The LSC decoding algorithm delivers the most reliable decoding results; however, it consumes most hardware resources and decoding cycles. Instead of using multiple instances of decoding cores in the LSC decoders, a single SC decoder is used in the OPLSC architecture. The computations of each path in the LSC are arranged to occupy the decoder hardware stages serially in a streamlined fashion. This yields a significant reduction of hardware complexity. The OPLSC decoder has achieved about 1.4 times hardware efficiency improvement compared with traditional LSC decoders. The hardware efficient VLSI architectures for TCSC and OPLSC polar codes decoders are also introduced.
Decoders based on SC or LSC algorithms suffer from high latency and limited throughput due to their serial decoding natures. An alternative approach to decode the polar codes is belief propagation (BP) based algorithm. In BP algorithm, a graph is set up to guide the beliefs propagated and refined, which is usually referred to as factor graph. BP decoding algorithm allows decoding in parallel to achieve much higher throughput. XJBP decoder facilitates belief propagation by utilizing the specific constituent codes that exist in the conventional factor graph, which results in an express journey (XJ) decoder. Compared with the conventional BP decoding algorithm for polar codes, the proposed decoder reduces the computational complexity by about 40.6%. This enables an energy-efficient hardware implementation. To further explore the hardware consumption of the proposed XJBP decoder, the computations scheduling is modeled and analyzed in this dissertation. With discussions on different hardware scenarios, the optimal scheduling plans are developed. A novel memory-distributed micro-architecture of the XJBP decoder is proposed and analyzed to solve the potential memory access problems of the proposed scheduling strategy. The register-transfer level (RTL) models of the XJBP decoder are set up for comparisons with other state-of-the-art BP decoders. The results show that the power efficiency of BP decoders is improved by about 3 times
Algorithm Development and VLSI Implementation of Energy Efficient Decoders of Polar Codes
With its low error-floor performance, polar codes attract significant attention as the potential standard error correction code (ECC) for future communication and data storage. However, the VLSI implementation complexity of polar codes decoders is largely influenced by its nature of in-series decoding. This dissertation is dedicated to presenting optimal decoder architectures for polar codes. This dissertation addresses several structural properties of polar codes and key properties of decoding algorithms that are not dealt with in the prior researches. The underlying concept of the proposed architectures is a paradigm that simplifies and schedules the computations such that hardware is simplified, latency is minimized and bandwidth is maximized.
In pursuit of the above, throughput centric successive cancellation (TCSC) and overlapping path list successive cancellation (OPLSC) VLSI architectures and express journey BP (XJBP) decoders for the polar codes are presented.
An arbitrary polar code can be decomposed by a set of shorter polar codes with special characteristics, those shorter polar codes are referred to as constituent polar codes. By exploiting the homogeneousness between decoding processes of different constituent polar codes, TCSC reduces the decoding latency of the SC decoder by 60% for codes with length n = 1024. The error correction performance of SC decoding is inferior to that of list successive cancellation decoding. The LSC decoding algorithm delivers the most reliable decoding results; however, it consumes most hardware resources and decoding cycles. Instead of using multiple instances of decoding cores in the LSC decoders, a single SC decoder is used in the OPLSC architecture. The computations of each path in the LSC are arranged to occupy the decoder hardware stages serially in a streamlined fashion. This yields a significant reduction of hardware complexity. The OPLSC decoder has achieved about 1.4 times hardware efficiency improvement compared with traditional LSC decoders. The hardware efficient VLSI architectures for TCSC and OPLSC polar codes decoders are also introduced.
Decoders based on SC or LSC algorithms suffer from high latency and limited throughput due to their serial decoding natures. An alternative approach to decode the polar codes is belief propagation (BP) based algorithm. In BP algorithm, a graph is set up to guide the beliefs propagated and refined, which is usually referred to as factor graph. BP decoding algorithm allows decoding in parallel to achieve much higher throughput. XJBP decoder facilitates belief propagation by utilizing the specific constituent codes that exist in the conventional factor graph, which results in an express journey (XJ) decoder. Compared with the conventional BP decoding algorithm for polar codes, the proposed decoder reduces the computational complexity by about 40.6%. This enables an energy-efficient hardware implementation. To further explore the hardware consumption of the proposed XJBP decoder, the computations scheduling is modeled and analyzed in this dissertation. With discussions on different hardware scenarios, the optimal scheduling plans are developed. A novel memory-distributed micro-architecture of the XJBP decoder is proposed and analyzed to solve the potential memory access problems of the proposed scheduling strategy. The register-transfer level (RTL) models of the XJBP decoder are set up for comparisons with other state-of-the-art BP decoders. The results show that the power efficiency of BP decoders is improved by about 3 times
- …