2,893 research outputs found
Scalability of broadcast performance in wireless network-on-chip
Networks-on-Chip (NoCs) are currently the paradigm of choice to interconnect the cores of a chip multiprocessor. However, conventional NoCs may not suffice to fulfill the on-chip communication requirements of processors with hundreds or thousands of cores. The main reason is that the performance of such networks drops as the number of cores grows, especially in the presence of multicast and broadcast traffic. This not only limits the scalability of current multiprocessor architectures, but also sets a performance wall that prevents the development of architectures that generate moderate-to-high levels of multicast. In this paper, a Wireless Network-on-Chip (WNoC) where all cores share a single broadband channel is presented. Such a design is conceived to provide low latency and ordered delivery for multicast/broadcast traffic, in an attempt to complement a wireline NoC that will transport the rest of the communication flows. To assess the feasibility of this approach, the network performance of WNoC is analyzed as a function of the system size and the channel capacity, and then compared to that of wireline NoCs with embedded multicast support. Based on this evaluation, preliminary results on the potential performance of the proposed hybrid scheme are provided, together with guidelines for the design of MAC protocols for WNoC. Peer Reviewed. Postprint (published version
Asynchronous Graph Pattern Matching on Multiprocessor Systems
Pattern matching on large graphs is the foundation for a variety of
application domains. Strict latency requirements and continuously increasing
graph sizes demand the usage of highly parallel in-memory graph processing
engines that need to consider non-uniform memory access (NUMA) and concurrency
issues to scale up on modern multiprocessor systems. To tackle these aspects,
graph partitioning becomes increasingly important. Hence, we present a
technique to process graph pattern matching on NUMA systems in this paper. As a
scalable pattern matching processing infrastructure, we leverage a
data-oriented architecture that preserves data locality and minimizes
concurrency-related bottlenecks on NUMA systems. We show in detail how graph
pattern matching can be asynchronously processed on a multiprocessor system. Comment: 14 Pages, Extended version for ADBIS 201
A partitioning strategy for nonuniform problems on multiprocessors
The partitioning of a problem on a domain with unequal work estimates in different subdomains is considered in a way that balances the work load across multiple processors. Such a problem arises, for example, in solving partial differential equations using an adaptive method that places extra grid points in certain subregions of the domain. A binary decomposition of the domain is used to partition it into rectangles requiring equal computational effort. The communication costs of mapping this partitioning onto different multiprocessors (a mesh-connected array, a tree machine and a hypercube) are then studied. The communication cost expressions can be used to determine the optimal depth of the above partitioning.
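The binary decomposition described in this abstract can be illustrated with a minimal sketch. The abstract does not give the exact split rule, so the choices here are assumptions: the domain is a 2D grid of per-cell work estimates, each rectangle is cut across its longer side, and the cut is placed where the two halves carry the most nearly equal work. The names `region_work`, `split_rect` and `binary_decompose` are hypothetical.

```python
def region_work(work, r0, r1, c0, c1):
    """Total work estimate inside the half-open rectangle [r0, r1) x [c0, c1)."""
    return sum(work[r][c] for r in range(r0, r1) for c in range(c0, c1))


def split_rect(work, rect):
    """Cut one rectangle into two pieces of (nearly) equal work.

    Assumption: the cut runs across the longer side of the rectangle.
    A single cell cannot be cut and is returned unchanged.
    """
    r0, r1, c0, c1 = rect
    along_rows = (r1 - r0) >= (c1 - c0)
    lo, hi = (r0, r1) if along_rows else (c0, c1)
    if hi - lo < 2:
        return [rect]

    total = region_work(work, r0, r1, c0, c1)
    best_cut, best_gap = None, None
    for cut in range(lo + 1, hi):
        if along_rows:
            first = region_work(work, r0, cut, c0, c1)
        else:
            first = region_work(work, r0, r1, c0, cut)
        gap = abs(2 * first - total)  # imbalance between the two pieces
        if best_gap is None or gap < best_gap:
            best_cut, best_gap = cut, gap

    if along_rows:
        return [(r0, best_cut, c0, c1), (best_cut, r1, c0, c1)]
    return [(r0, r1, c0, best_cut), (r0, r1, best_cut, c1)]


def binary_decompose(work, depth):
    """Apply `depth` rounds of bisection, giving up to 2**depth rectangles."""
    rects = [(0, len(work), 0, len(work[0]))]
    for _ in range(depth):
        rects = [piece for rect in rects for piece in split_rect(work, rect)]
    return rects


# Example: the top row is three times as expensive as the other rows,
# as might happen where an adaptive method refines the grid.
work = [
    [3, 3, 3, 3],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
]
rects = binary_decompose(work, depth=2)
loads = [region_work(work, *rect) for rect in rects]
print(rects)
print(loads)  # [6, 6, 6, 6]
```

In this example the first cut isolates the expensive top row, so the rectangles differ in area but carry equal work. On a discrete grid the balance is only approximate in general, and the sketch ignores the communication costs that the abstract uses to choose the partitioning depth.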
A Domain Decomposition Strategy for Alignment of Multiple Biological Sequences on Multiprocessor Platforms
Multiple Sequences Alignment (MSA) of biological sequences is a fundamental
problem in computational biology due to its critical significance in wide
ranging applications including haplotype reconstruction, sequence homology,
phylogenetic analysis, and prediction of evolutionary origins. The MSA problem
is considered NP-hard and known heuristics for the problem do not scale well
with an increasing number of sequences. On the other hand, with the advent of a new
breed of fast sequencing techniques it is now possible to generate thousands of
sequences very quickly. For rapid sequence analysis, it is therefore desirable
to develop fast MSA algorithms that scale well with the increase in the dataset
size. In this paper, we present a novel domain decomposition based technique to
solve the MSA problem on multiprocessing platforms. The domain decomposition
based technique, in addition to yielding better quality, gives enormous
advantage in terms of execution time and memory requirements. The proposed
strategy allows the time complexity of any known heuristic of
O(N)^x complexity to be decreased by a factor of O(1/p)^x, where N is the number of sequences,
x depends on the underlying heuristic approach, and p is the number of
processing nodes. In particular, we propose a highly scalable algorithm,
Sample-Align-D, for aligning biological sequences using the Muscle system as the
underlying heuristic. The proposed algorithm has been implemented on a cluster
of workstations using the MPI library. Experimental results for different problem
sizes are analyzed in terms of quality of alignment, execution time and
speed-up. Comment: 36 pages, 17 figures, Accepted manuscript in Journal of Parallel and
Distributed Computing (JPDC
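The complexity claim in this abstract can be checked with a small arithmetic sketch. It assumes an idealized model in which the underlying heuristic costs exactly N^x operations and each of the p nodes aligns N/p sequences; the sampling, redistribution and merging overheads of the actual algorithm are not modeled. The function names are hypothetical.

```python
def heuristic_cost(n_sequences, x):
    """Idealized cost of a heuristic with O(N^x) time complexity."""
    return n_sequences ** x


def per_node_cost(n_sequences, n_nodes, x):
    """Idealized per-node cost when each node aligns N/p sequences."""
    return (n_sequences / n_nodes) ** x


# Example: N = 1000 sequences, p = 10 nodes, quadratic heuristic (x = 2).
sequential = heuristic_cost(1000, 2)
parallel = per_node_cost(1000, 10, 2)
print(sequential / parallel)  # 100.0, i.e. p**x
```

Under this model the per-node cost falls by p^x rather than by p, which is the factor of O(1/p)^x stated in the abstract.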
Mixing multi-core CPUs and GPUs for scientific simulation software
Recent technological and economic developments have led to widespread availability of
multi-core CPUs and specialist accelerator processors such as graphical processing units
(GPUs). The accelerated computational performance possible from these devices can be very
high for some application paradigms. Software languages and systems such as NVIDIA's
CUDA and Khronos consortium's open compute language (OpenCL) support a number of
individual parallel application programming paradigms. To scale up the performance of some
complex systems simulations, a hybrid of multi-core CPUs for coarse-grained parallelism and
very many core GPUs for data parallelism is necessary. We describe our use of hybrid
applications using threading approaches and multi-core CPUs to control independent GPU devices.
We present speed-up data and discuss multi-threading software issues for the applications
level programmer and offer some suggested areas for language development and integration
between coarse-grained and fine-grained multi-thread systems. We discuss results from three
common simulation algorithmic areas including: partial differential equations; graph cluster
metric calculations and random number generation. We report on programming experiences
and selected performance for these algorithms on: single and multiple GPUs; multi-core CPUs;
a CellBE; and using OpenCL. We discuss programmer usability issues and the outlook and
trends in multi-core programming for scientific applications developers.