2,893 research outputs found

    Scalability of broadcast performance in wireless network-on-chip

    Get PDF
    Networks-on-Chip (NoCs) are currently the paradigm of choice to interconnect the cores of a chip multiprocessor. However, conventional NoCs may not suffice to fulfill the on-chip communication requirements of processors with hundreds or thousands of cores. The main reason is that the performance of such networks drops as the number of cores grows, especially in the presence of multicast and broadcast traffic. This not only limits the scalability of current multiprocessor architectures, but also sets a performance wall that prevents the development of architectures that generate moderate-to-high levels of multicast. In this paper, a Wireless Network-on-Chip (WNoC) where all cores share a single broadband channel is presented. Such design is conceived to provide low latency and ordered delivery for multicast/broadcast traffic, in an attempt to complement a wireline NoC that will transport the rest of communication flows. To assess the feasibility of this approach, the network performance of WNoC is analyzed as a function of the system size and the channel capacity, and then compared to that of wireline NoCs with embedded multicast support. Based on this evaluation, preliminary results on the potential performance of the proposed hybrid scheme are provided, together with guidelines for the design of MAC protocols for WNoC.Peer ReviewedPostprint (published version

    Asynchronous Graph Pattern Matching on Multiprocessor Systems

    Full text link
    Pattern matching on large graphs is the foundation for a variety of application domains. Strict latency requirements and continuously increasing graph sizes demand the usage of highly parallel in-memory graph processing engines that need to consider non-uniform memory access (NUMA) and concurrency issues to scale up on modern multiprocessor systems. To tackle these aspects, graph partitioning becomes increasingly important. Hence, we present a technique to process graph pattern matching on NUMA systems in this paper. As a scalable pattern matching processing infrastructure, we leverage a data-oriented architecture that preserves data locality and minimizes concurrency-related bottlenecks on NUMA systems. We show in detail, how graph pattern matching can be asynchronously processed on a multiprocessor system.Comment: 14 Pages, Extended version for ADBIS 201

    A partitioning strategy for nonuniform problems on multiprocessors

    Get PDF
    The partitioning of a problem on a domain with unequal work estimates in different subddomains is considered in a way that balances the work load across multiple processors. Such a problem arises for example in solving partial differential equations using an adaptive method that places extra grid points in certain subregions of the domain. A binary decomposition of the domain is used to partition it into rectangles requiring equal computational effort. The communication costs of mapping this partitioning onto different microprocessors: a mesh-connected array, a tree machine and a hypercube is then studied. The communication cost expressions can be used to determine the optimal depth of the above partitioning

    A Domain Decomposition Strategy for Alignment of Multiple Biological Sequences on Multiprocessor Platforms

    Full text link
    Multiple Sequences Alignment (MSA) of biological sequences is a fundamental problem in computational biology due to its critical significance in wide ranging applications including haplotype reconstruction, sequence homology, phylogenetic analysis, and prediction of evolutionary origins. The MSA problem is considered NP-hard and known heuristics for the problem do not scale well with increasing number of sequences. On the other hand, with the advent of new breed of fast sequencing techniques it is now possible to generate thousands of sequences very quickly. For rapid sequence analysis, it is therefore desirable to develop fast MSA algorithms that scale well with the increase in the dataset size. In this paper, we present a novel domain decomposition based technique to solve the MSA problem on multiprocessing platforms. The domain decomposition based technique, in addition to yielding better quality, gives enormous advantage in terms of execution time and memory requirements. The proposed strategy allows to decrease the time complexity of any known heuristic of O(N)^x complexity by a factor of O(1/p)^x, where N is the number of sequences, x depends on the underlying heuristic approach, and p is the number of processing nodes. In particular, we propose a highly scalable algorithm, Sample-Align-D, for aligning biological sequences using Muscle system as the underlying heuristic. The proposed algorithm has been implemented on a cluster of workstations using MPI library. Experimental results for different problem sizes are analyzed in terms of quality of alignment, execution time and speed-up.Comment: 36 pages, 17 figures, Accepted manuscript in Journal of Parallel and Distributed Computing(JPDC

    Mixing multi-core CPUs and GPUs for scientific simulation software

    Get PDF
    Recent technological and economic developments have led to widespread availability of multi-core CPUs and specialist accelerator processors such as graphical processing units (GPUs). The accelerated computational performance possible from these devices can be very high for some applications paradigms. Software languages and systems such as NVIDIA's CUDA and Khronos consortium's open compute language (OpenCL) support a number of individual parallel application programming paradigms. To scale up the performance of some complex systems simulations, a hybrid of multi-core CPUs for coarse-grained parallelism and very many core GPUs for data parallelism is necessary. We describe our use of hybrid applica- tions using threading approaches and multi-core CPUs to control independent GPU devices. We present speed-up data and discuss multi-threading software issues for the applications level programmer and o er some suggested areas for language development and integration between coarse-grained and ne-grained multi-thread systems. We discuss results from three common simulation algorithmic areas including: partial di erential equations; graph cluster metric calculations and random number generation. We report on programming experiences and selected performance for these algorithms on: single and multiple GPUs; multi-core CPUs; a CellBE; and using OpenCL. We discuss programmer usability issues and the outlook and trends in multi-core programming for scienti c applications developers
    corecore