47,145 research outputs found
Recommended from our members
Percolation scheduling for non-VLIW machines
Percolation Scheduling, a technique for compile-time code parallelization, has proven very successful for exploiting fine-grain irregular parallelism in ordinary programs. Currently, this technology is targeted only to VLIW (Very Long Instruction Word) machines, which have the advantages of 'free' synchronization and communication. Shared memory multi-processors can simulate the execution characteristics of VLIW machines with the use of static barriers. Preliminary results show that Percolation Scheduling can be used with good results on this type of architecture by increasing the granularity from operation level to source statement level, removing any redundant synchronization, and providing an efficient implementation of multi-way jumps
A system-on-chip vector multiprocessor for transmission line modelling acceleration
We discuss a configurable, System-on-Chip vector
multiprocessor for accelerating the Transmission Line
Modeling (TLM) algorithm with an architecture capable of
exploiting the two primary forms of parallelism in the code,
thread and data level parallelism. Theoretical results
demonstrate an order of magnitude reduction in the dynamic
instruction count for a scalar-processor/vector-coprocessor
configuration at a vector length of sixteen 32-bit singleprecision
elements. Furthermore, a multi-vector SoC
architecture consisting of ten such vector accelerators provides
a near-linear theoretical performance benefit of the order of
88% in three out of four benchmark configurations which is
orthogonal to the benefit realized by vectorization alone. We
discuss in detail this potent architecture and present
implementation data for the 2-way multi-processor VLSI
macrocell
Performance Enhancement of Multicore Processors using Dynamic Load Balancing
Introduction of Multi-core Architecture has opened new area for researchers where dynamic load balancing can be applied to distribute the work load among the cores. Multi-core Architecture provides hardware parallelism through cores inside CPU. Its increased performance and low cost as compared to single-core machines, attracts High Performance Computing (HPC) community. The paper proposes a user level dynamic load balancing model for multi-core processors using Java multi-threading and use of Java I/O framework for I/O operations
LeoPARD --- A Generic Platform for the Implementation of Higher-Order Reasoners
LeoPARD supports the implementation of knowledge representation and reasoning
tools for higher-order logic(s). It combines a sophisticated data structure
layer (polymorphically typed {\lambda}-calculus with nameless spine notation,
explicit substitutions, and perfect term sharing) with an ambitious multi-agent
blackboard architecture (supporting prover parallelism at the term, clause, and
search level). Further features of LeoPARD include a parser for all TPTP
dialects, a command line interpreter, and generic means for the integration of
external reasoners.Comment: 6 pages, to appear in the proceedings of CICM'2015 conferenc
Configurable and Scalable High Throughput Turbo Decoder Architecture for Multiple 4GWireless Standards
In this paper, we propose a novel multi-code turbo decoder architecture for 4G wireless systems. To support various 4G standards, a configurable multi-mode MAP (maximum a posteriori) decoder is designed for both binary and duo-binary turbo codes with small resource overhead (less than 10%) compared to the single-mode architecture. To achieve high data rates in 4G, we present a parallel turbo decoder architecture with scalable parallelism tailored to the given throughput requirements. High-level parallelism is achieved by employing contention-free interleavers. Multi-banked memory structure and routing network among memories and MAP decoders are designed to operate at full speed with parallel interleavers. We designed a very low-complexity recursive on-line address generator supporting multiple interleaving patterns, which avoids the interleaver address memory. Design trade-offs in terms of area and power efficiency are explored to find the optimal architectures. A 711 Mbps data rate is feasible with 32 Radix-4 MAP decoders running at 200 MHz clock rate.Texas Instruments Incorporate
Exploiting different levels of parallelism in the biological sequence comparison problem
In the last years the fast growth of bioinformatics field has atracted the attention of computer scientists. At the same
time, de exponential growth of databases that contains biological information (such as protein and DNA data) demands great efforts to improve the performance of computational platforms. In this work, we investigate how bioinformatics applications benefit from parallel architectures that combine different alternatives to exploit coarse- and fine-grain parallelism. As a case of analysis, we study the performance behavior of the Ssearch application that implements the Smith-Waterman algorithm (SW), which is a dynamic programing
approach that explores the similarity between a pair of sequences. The inherent large parallelism of the application
makes it ideal for architectures supporting multiple dimensions of parallelism (thread-level parallelism, TLP; data-level
parallelism, DLP; instruction-level parallelism, ILP). We study how this algorithm can take advantage of different parallel machines like the SGI Altix, IBM Power6, IBM Cell BE and MareNostrum machines. Our study includes a qualitative analysis of the parallelization opportunities and also the quantification of the performance in terms of speedup and
execution time. These measures are collected taking into account the specific characteristics of each architecture. As
an example, our results show that a share memory multiprocessor architecture (SMP) like the PowerPC 970MP of Marenostrum machine can surpasses a heterogeneous multi-
processor machine like the current IBM Cell BE.Peer ReviewedPostprint (published version
Mitosis based speculative multithreaded architectures
In the last decade, industry made a right-hand turn and shifted towards multi-core processor designs, also known as Chip-Multi-Processors (CMPs), in order to provide further performance improvements under a reasonable power budget, design complexity, and validation cost. Over the years, several processor vendors have come out with multi-core chips in their product lines and they have become mainstream,
with the number of cores increasing in each processor generation. Multi-core processors improve the performance of applications by exploiting Thread Level Parallelism (TLP) while the Instruction Level Parallelism (ILP) exploited by each individual core is limited. These architectures are very efficient when multiple threads are available for execution. However, single-thread sections of code (single-thread
applications and serial sections of parallel applications) pose important constraints on the benefits achieved by parallel execution, as pointed out by Amdahl’s law.
Parallel programming, even with the help of recently proposed techniques like transactional memory, has proven to be a very challenging task. On the other hand, automatically partitioning applications into threads may be a straightforward task in regular applications, but becomes much harder for irregular programs, where compilers usually fail to discover sufficient TLP. In this scenario, two main
directions have been followed in the research community to take benefit of multi-core platforms: Speculative Multithreading (SpMT) and Non-Speculative Clustered architectures. The former splits a sequential application into speculative threads, while the later partitions the instructions among the cores based on data-dependences but avoid large degree of speculation. Despite the large amount of research on
both these approaches, the proposed techniques so far have shown marginal performance improvements.
In this thesis we propose novel schemes to speed-up sequential or lightly threaded applications in multi-core processors that effectively address the main unresolved challenges of previous approaches. In particular, we propose a SpMT architecture, called Mitosis, that leverages a powerful software value prediction technique to manage inter-thread dependences, based on pre-computation slices (p-slices).
Thanks to the accuracy and low cost of this technique, Mitosis is able to effectively parallelize applications even in the presence of frequent dependences among threads. We also propose a novel architecture, called Anaphase, that combines the best of SpMT schemes and clustered architectures. Anaphase effectively exploits ILP, TLP and Memory Level Parallelism (MLP), thanks to its unique finegrain thread decomposition algorithm that adapts to the available parallelism in the application
Mitosis based speculative multithreaded architectures
In the last decade, industry made a right-hand turn and shifted towards multi-core processor designs, also known as Chip-Multi-Processors (CMPs), in order to provide further performance improvements under a reasonable power budget, design complexity, and validation cost. Over the years, several processor vendors have come out with multi-core chips in their product lines and they have become mainstream,
with the number of cores increasing in each processor generation. Multi-core processors improve the performance of applications by exploiting Thread Level Parallelism (TLP) while the Instruction Level Parallelism (ILP) exploited by each individual core is limited. These architectures are very efficient when multiple threads are available for execution. However, single-thread sections of code (single-thread
applications and serial sections of parallel applications) pose important constraints on the benefits achieved by parallel execution, as pointed out by Amdahl’s law.
Parallel programming, even with the help of recently proposed techniques like transactional memory, has proven to be a very challenging task. On the other hand, automatically partitioning applications into threads may be a straightforward task in regular applications, but becomes much harder for irregular programs, where compilers usually fail to discover sufficient TLP. In this scenario, two main
directions have been followed in the research community to take benefit of multi-core platforms: Speculative Multithreading (SpMT) and Non-Speculative Clustered architectures. The former splits a sequential application into speculative threads, while the later partitions the instructions among the cores based on data-dependences but avoid large degree of speculation. Despite the large amount of research on
both these approaches, the proposed techniques so far have shown marginal performance improvements.
In this thesis we propose novel schemes to speed-up sequential or lightly threaded applications in multi-core processors that effectively address the main unresolved challenges of previous approaches. In particular, we propose a SpMT architecture, called Mitosis, that leverages a powerful software value prediction technique to manage inter-thread dependences, based on pre-computation slices (p-slices).
Thanks to the accuracy and low cost of this technique, Mitosis is able to effectively parallelize applications even in the presence of frequent dependences among threads. We also propose a novel architecture, called Anaphase, that combines the best of SpMT schemes and clustered architectures. Anaphase effectively exploits ILP, TLP and Memory Level Parallelism (MLP), thanks to its unique finegrain thread decomposition algorithm that adapts to the available parallelism in the application.Postprint (published version
High performance deep packet inspection on multi-core platform
Deep packet inspection (DPI) provides the ability to perform quality of service (QoS) and Intrusion Detection on network packets. But since the explosive growth of Internet, performance and scalability issues have been raised due to the gap between network and end-system speeds. This article describles how a desirable DPI system with multi-gigabits throughput and good scalability should be like by exploiting parallelism on network interface card, network stack and user applications. Connection-based parallelism, affinity-based scheduling and lock-free data structure are the main technologies introduced to alleviate the performance and scalability issues. A common DPI application L7-Filter is used as an example to illustrate the applicaiton level parallelism
Implementing PRISMA/DB in an OOPL
PRISMA/DB is implemented in a parallel object-oriented language to gain insight in the usage of parallelism. This environment allows us to experiment with parallelism by simply changing the allocation of objects to the processors of the PRISMA machine. These objects are obtained by a strictly modular design of PRISMA/DB. Communication between the objects is required to cooperatively handle the various tasks, but it limits the potential for parallelism. From this approach, we hope to gain a better understanding of parallelism, which can be used to enhance the performance of PRISMA/DB.\ud
The work reported in this document was conducted as part of the PRISMA project, a joint effort with Philips Research Eindhoven, partially supported by the Dutch "Stimuleringsprojectteam Informaticaonderzoek (SPIN)
- …