PULP-HD: Accelerating Brain-Inspired High-Dimensional Computing on a Parallel Ultra-Low Power Platform
Computing with high-dimensional (HD) vectors, also referred to as
hypervectors, is a brain-inspired alternative to computing with
scalars. Key properties of HD computing include a well-defined set of
arithmetic operations on hypervectors, generality, scalability, robustness,
fast learning, and ubiquitous parallel operations. HD computing is about
manipulating and comparing large patterns (binary hypervectors with 10,000
dimensions), making its efficient realization on minimalistic ultra-low-power
platforms challenging. This paper describes HD computing's acceleration and its
optimization of memory accesses and operations on a silicon prototype of the
PULPv3 4-core platform (1.5 mm², 2 mW), surpassing the state-of-the-art
classification accuracy (on average 92.4%) with a simultaneous 3.7×
end-to-end speed-up and 2× energy saving compared to its single-core
execution. We further explore the scalability of our accelerator by increasing
the number of inputs and the classification window on a new generation of the PULP
architecture featuring bit-manipulation instruction extensions and a larger
number of 8 cores. These together enable a near-ideal 18.4× speed-up
compared to the single-core PULPv3.
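The arithmetic the abstract refers to can be illustrated with a minimal NumPy sketch of binary hypervector operations (binding, bundling, and Hamming-distance comparison). This is an illustrative toy under our own naming, not the paper's PULP implementation:

```python
import numpy as np

D = 10_000  # hypervector dimensionality used in the paper
rng = np.random.default_rng(0)

def random_hv():
    """Random dense binary hypervector."""
    return rng.integers(0, 2, D, dtype=np.uint8)

def bind(a, b):
    """Binding: component-wise XOR (invertible, dissimilar to both inputs)."""
    return a ^ b

def bundle(hvs):
    """Bundling: component-wise majority vote (similar to all inputs)."""
    return (np.sum(hvs, axis=0) > len(hvs) / 2).astype(np.uint8)

def hamming(a, b):
    """Normalized Hamming distance, the comparison used for classification."""
    return np.count_nonzero(a != b) / D

a, b, c = random_hv(), random_hv(), random_hv()
proto = bundle([a, b, c])
# A bundled prototype stays close to its members but far from unrelated vectors
assert hamming(proto, a) < hamming(proto, random_hv())
```

Classification then amounts to bundling training examples into class prototypes and returning the class whose prototype has the smallest Hamming distance to a query vector.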
Design and Analysis of an Estimation of Distribution Approximation Algorithm for Single Machine Scheduling in Uncertain Environments
In the current work we introduce a novel estimation of distribution algorithm
to tackle a hard combinatorial optimization problem, namely the single-machine
scheduling problem, with uncertain delivery times. The majority of the existing
research coping with optimization problems in uncertain environment aims at
finding a single sufficiently robust solution so that random noise and
unpredictable circumstances would have the least possible detrimental effect on
the quality of the solution. The measures of robustness are usually based on
various kinds of empirically designed averaging techniques. In contrast to the
previous work, our algorithm aims at finding a collection of robust schedules
that allow for a more informative decision making. The notion of robustness is
measured quantitatively in terms of the classical mathematical notion of a norm
on a vector space. We provide a theoretical insight into the relationship
between the properties of the probability distribution over the uncertain
delivery times and the robustness quality of the schedules produced by the
algorithm after a polynomial runtime, expressed in terms of approximation ratios.
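The idea described above (sample schedules from an explicit probability distribution, score them over sampled delivery-time scenarios, and shift the distribution toward the elite schedules) can be sketched as a toy univariate EDA. All parameters, the position-probability model, and the tardiness objective below are illustrative assumptions, not the authors' algorithm:

```python
import numpy as np

rng = np.random.default_rng(1)
n_jobs = 5
proc = rng.uniform(1, 5, n_jobs)                 # hypothetical processing times
due_mean = np.sort(rng.uniform(5, 20, n_jobs))   # mean uncertain delivery times

def sample_schedule(P):
    """Sample a job permutation from a position-probability matrix P."""
    sched, avail = [], list(range(n_jobs))
    for pos in range(n_jobs):
        p = P[pos, avail]
        j = rng.choice(avail, p=p / p.sum())
        sched.append(j)
        avail.remove(j)
    return sched

def robust_cost(sched, samples=30, noise=1.0):
    """Average total tardiness over sampled delivery-time scenarios."""
    cost = 0.0
    for _ in range(samples):
        due = due_mean + rng.normal(0, noise, n_jobs)
        t, tard = 0.0, 0.0
        for j in sched:
            t += proc[j]
            tard += max(0.0, t - due[j])
        cost += tard
    return cost / samples

P = np.full((n_jobs, n_jobs), 1.0 / n_jobs)      # uniform initial distribution
for gen in range(30):
    pop = [sample_schedule(P) for _ in range(40)]
    pop.sort(key=robust_cost)
    elite = pop[:10]
    freq = np.zeros_like(P)
    for s in elite:
        for pos, j in enumerate(s):
            freq[pos, j] += 1
    P = 0.7 * P + 0.3 * freq / len(elite)        # smooth update toward elite stats
```

In the paper's setting the final population, rather than a single best schedule, is the output: the retained distribution and elite set give the decision maker a collection of robust alternatives.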
Second-generation PLINK: rising to the challenge of larger and richer datasets
PLINK 1 is a widely used open-source C/C++ toolset for genome-wide
association studies (GWAS) and research in population genetics. However, the
steady accumulation of data from imputation and whole-genome sequencing studies
has exposed a strong need for even faster and more scalable implementations of
key functions. In addition, GWAS and population-genetic data now frequently
contain probabilistic calls, phase information, and/or multiallelic variants,
none of which can be represented by PLINK 1's primary data format.
To address these issues, we are developing a second-generation codebase for
PLINK. The first major release from this codebase, PLINK 1.9, introduces
extensive use of bit-level parallelism, O(sqrt(n))-time/constant-space
Hardy-Weinberg equilibrium and Fisher's exact tests, and many other algorithmic
improvements. In combination, these changes accelerate most operations by 1-4
orders of magnitude, and allow the program to handle datasets too large to fit
in RAM. This will be followed by PLINK 2.0, which will introduce (a) a new data
format capable of efficiently representing probabilities, phase, and
multiallelic variants, and (b) extensions of many functions to account for the
new types of information.
The second-generation versions of PLINK will offer dramatic improvements in
performance and compatibility. For the first time, users without access to
high-end computing resources can perform several essential analyses of the
feature-rich and very large genetic datasets coming into use.
Comment: 2 figures, 1 additional file
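The kind of bit-level parallelism the abstract credits for PLINK 1.9's speedups can be illustrated with a hypothetical sketch: genotypes stored 2 bits each, so one machine word holds many samples and a per-genotype count reduces to a few mask-and-popcount operations. The genotype codes and function names here are our assumptions, not PLINK's actual on-disk encoding:

```python
# Hypothetical 2-bit genotype codes (NOT PLINK's real .bed encoding)
GENO_HOM_REF, GENO_MISSING, GENO_HET, GENO_HOM_ALT = 0b00, 0b01, 0b10, 0b11

def pack(genotypes):
    """Pack a list of 2-bit genotype codes into one integer, 2 bits each."""
    word = 0
    for i, g in enumerate(genotypes):
        word |= g << (2 * i)
    return word

def count_het(word, n):
    """Count heterozygous calls (code 0b10) across n packed genotypes.

    One mask extracts the low bit of every 2-bit field, another the high
    bit; a het call has high=1, low=0, so the count is a single popcount.
    """
    mask = int("01" * n, 2)          # low bit of each 2-bit field
    lo = word & mask
    hi = (word >> 1) & mask
    return bin(hi & ~lo & mask).count("1")
```

In C, the same idea runs over 64-bit words with hardware popcount, processing 32 genotypes per instruction instead of one; this is where the orders-of-magnitude speedups on dense genotype matrices come from.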
Exploration of the scalability of SIMD processing for software defined radio
The idea of software defined radio (SDR) describes a signal processing system for wireless
communications that allows performing major parts of the physical layer processing in
software. SDR systems are more flexible and have lower development costs than traditional
systems based on application-specific integrated circuits (ASICs). Yet, SDR requires
programmable processor architectures that can meet the throughput and energy efficiency
requirements of current third generation (3G) and future fourth generation (4G) wireless
standards for mobile devices.
Single instruction, multiple data (SIMD) processors operate on long data vectors in parallel
data lanes and can achieve a good ratio of computing power to energy consumption. Hence,
SIMD processors could be the basis of future SDR systems. Yet, SIMD processors only
achieve a high efficiency if all parallel data lanes can be utilized.
This thesis investigates the scalability of SIMD processing for algorithms required in 4G
wireless systems; i.e., the scaling of performance and energy consumption with increasing
SIMD vector lengths is explored. The basis of the exploration is a scalable SIMD processor
architecture, which also supports long instruction word (LIW) execution and can be
configured with four different permutation networks for vector element permutations.
Radix-2 and mixed-radix fast Fourier transform (FFT) algorithms, sphere decoding for
multiple input, multiple output (MIMO) systems, and the decoding of quasi-cyclic
low-density parity-check (LDPC) codes have been examined, as these are key algorithms for
4G wireless systems. The results show that the performance of all algorithms scales with
the SIMD vector length, yet there are different constraints on the ratios between algorithm
and architecture parameters. The radix-2 FFT algorithm allows close to linear speedups
if the FFT size is at least twice the SIMD vector length, while the mixed-radix FFT algorithm
requires the FFT size to be a multiple of the squared SIMD width. The performance of
the implemented sphere decoding algorithm scales linearly with the SIMD vector length.
The scalability of LDPC decoding is determined by the expansion factor of the quasi-cyclic
code. Wider SIMD processors offer better performance and also require less energy
than processors with a shorter vector length for all considered algorithms. The results for
different permutation networks show that a simple permutation network is sufficient for
most applications.
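The radix-2 constraint reported above (near-linear speedup once the FFT size is at least twice the SIMD vector length) can be made concrete with a simplified lane-occupancy model. This is our own sketch, not a model taken from the thesis:

```python
def simd_utilization(fft_size, vlen):
    """Fraction of SIMD lanes kept busy per radix-2 FFT stage.

    Simplified model: each stage of an fft_size-point radix-2 FFT has
    fft_size/2 independent butterflies, mapped onto vlen-wide vector
    operations; partially filled vector ops waste lanes.
    """
    butterflies = fft_size // 2
    vec_ops = -(-butterflies // vlen)      # ceiling division: ops per stage
    return butterflies / (vec_ops * vlen)

# With both sizes powers of two, lanes are fully utilized exactly when the
# FFT size is at least twice the vector length, matching the stated constraint.
assert simd_utilization(64, 8) == 1.0
assert simd_utilization(8, 8) == 0.5
```

The same style of counting argument explains the other constraints: the mixed-radix FFT and quasi-cyclic LDPC mappings only fill all lanes when the problem dimension (FFT size, expansion factor) is a suitable multiple of the SIMD width.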
The "MIND" Scalable PIM Architecture
MIND (Memory, Intelligence, and Network Device) is an advanced parallel computer architecture for high performance computing and scalable embedded processing. It is a
Processor-in-Memory (PIM) architecture integrating both DRAM bit cells and CMOS logic devices on the same silicon die. MIND is multicore with multiple memory/processor nodes on
each chip and supports global shared memory across systems of MIND components. MIND is distinguished from other PIM architectures in that it incorporates mechanisms for efficient support of a global parallel execution model based on the semantics of message-driven multithreaded split-transaction processing. MIND is designed to operate either in conjunction with other conventional microprocessors or in standalone arrays of like devices. It also incorporates mechanisms for fault tolerance, real time execution, and active power management. This paper describes the major elements and operational methods of the MIND
architecture.
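The message-driven execution model mentioned above can be sketched at a very high level: work arrives at a memory/processor node as a message naming an action and a local address, and the operation runs where the data lives instead of shipping the data to a remote CPU. The class and names below are illustrative assumptions, not MIND's actual interface:

```python
from collections import deque

class MemoryNode:
    """Toy model of a PIM memory/processor node driven by incoming messages."""

    def __init__(self, memory):
        self.memory = memory      # node-local DRAM contents
        self.inbox = deque()      # pending messages ("parcels")

    def send(self, action, addr, value):
        """Enqueue a message for this node to process."""
        self.inbox.append((action, addr, value))

    def run(self):
        """Drain the inbox, performing each action at local memory."""
        while self.inbox:
            action, addr, value = self.inbox.popleft()
            if action == "inc":
                # Read-modify-write happens at the memory, with no
                # round trip through a conventional processor.
                self.memory[addr] += value
            elif action == "store":
                self.memory[addr] = value
```

A real PIM node would spawn a short thread per message and support split-transaction replies; the sketch only shows the data-stays-put structure of the model.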
goSLP: Globally Optimized Superword Level Parallelism Framework
Modern microprocessors are equipped with single instruction multiple data
(SIMD) or vector instruction sets which allow compilers to exploit superword
level parallelism (SLP), a type of fine-grained parallelism. Current SLP
auto-vectorization techniques use heuristics to discover vectorization
opportunities in high-level language code. These heuristics are fragile, local
and typically only present one vectorization strategy that is either accepted
or rejected by a cost model. We present goSLP, a novel SLP auto-vectorization
framework which solves the statement packing problem in a pairwise optimal
manner. Using an integer linear programming (ILP) solver, goSLP searches the
entire space of statement packing opportunities for a whole function at a time,
while limiting total compilation time to a few minutes. Furthermore, goSLP
optimally solves the vector permutation selection problem using dynamic
programming. We implemented goSLP in the LLVM compiler infrastructure,
achieving a geometric mean speedup of 7.58% on SPEC2017fp, 2.42% on SPEC2006fp
and 4.07% on NAS benchmarks compared to LLVM's existing SLP auto-vectorizer.
Comment: Published at OOPSLA 201
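The statement-packing decision that goSLP solves optimally with an ILP solver can be illustrated with a toy exhaustive search: choose disjoint compatible statement pairs to pack into 2-wide vector instructions, trading saved scalar work against packing overhead. The costs and names below are made up for illustration and are far simpler than the paper's formulation or LLVM's cost model:

```python
from itertools import combinations

def best_packing(statements, compat, pack_cost=1, scalar_cost=2, vector_cost=1):
    """Exhaustively pick the set of disjoint compatible pairs minimizing cost."""
    def total(pairs):
        packed = {s for p in pairs for s in p}
        cost = len(pairs) * (vector_cost + pack_cost)        # vector op + overhead
        cost += (len(statements) - len(packed)) * scalar_cost  # leftovers stay scalar
        return cost

    best, best_cost = [], total([])
    all_pairs = [p for p in combinations(statements, 2) if p in compat]
    for k in range(1, len(all_pairs) + 1):
        for choice in combinations(all_pairs, k):
            used = [s for p in choice for s in p]
            if len(used) == len(set(used)) and total(choice) < best_cost:
                best, best_cost = list(choice), total(choice)
    return best, best_cost
```

This brute force is exponential, which is exactly why goSLP encodes the whole-function problem as an ILP instead: the solver searches the same global space of packing choices but within a bounded compilation-time budget.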