6 research outputs found

    Breaking serialization in lock-free multicore synchronization

    In multicores, performance-critical synchronization is increasingly performed in a lock-free manner using atomic instructions such as CAS or LL/SC. However, when many processors synchronize on the same variable, performance can still degrade significantly. Contending writes get serialized, creating a non-scalable condition. Past proposals that build hardware queues of synchronizing processors do not fundamentally solve this problem. At best, they help to efficiently serialize the contending writes. We propose a novel architecture that breaks the serialization of hardware queues and enables the queued processors to perform lock-free synchronization in parallel. The architecture, called Caspar, is able to (1) execute the CASes in the queued-up processors in parallel through eager forwarding of expected values, and (2) validate the CASes in parallel and dequeue groups of processors at a time. The result is highly scalable synchronization. We evaluate Caspar with simulations of a 64-core chip. Compared to existing proposals with hardware queues, Caspar improves the throughput of kernels by 32% on average and reduces the execution time of the sections considered in lock-free versions of applications by 47% on average. This makes these sections 2.5x faster than in the original applications.
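
The serialization problem described in this abstract can be illustrated with a small, deterministic model. The sketch below is not Caspar and does not use real atomic instructions; it is a hypothetical simulation (the function name `simulate_contended_cas` and the round-based scheduling are assumptions made for illustration) in which every waiting processor reads the same value and then attempts a compare-and-swap, so only one attempt per round can succeed.

```python
def simulate_contended_cas(num_processors: int) -> tuple[int, int]:
    """Model processors that each increment one shared counter once via CAS.

    Illustrative simulation only: real CAS is a hardware atomic instruction.
    Returns (final_counter_value, total_cas_attempts).
    """
    shared = 0
    pending = list(range(num_processors))  # processors that have not succeeded yet
    attempts = 0
    while pending:
        # All pending processors read the same current value as their "expected".
        expected = shared
        winners = []
        for proc in pending:
            attempts += 1
            # A CAS succeeds only if the variable still holds the expected value.
            if shared == expected:
                shared = expected + 1
                winners.append(proc)
        # Losers retry in the next round with a freshly read expected value.
        pending = [p for p in pending if p not in winners]
    return shared, attempts


if __name__ == "__main__":
    for n in (1, 4, 64):
        value, attempts = simulate_contended_cas(n)
        print(f"{n} processors: counter={value}, CAS attempts={attempts}")
```

Under this model, n processors need n rounds and n(n+1)/2 CAS attempts in total, which is the kind of non-scalable behavior the abstract attributes to contending writes.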

    ILP and TLP in Shared Memory Applications: A Limit Study

    The work in this dissertation explores the limits of chip multiprocessors (CMPs) with respect to shared-memory, multi-threaded benchmarks, which helps identify microarchitectural bottlenecks. This, in turn, will lead to more efficient CMP design. In the first part we introduce DotSim, a trace-driven toolkit designed to explore the limits of instruction- and thread-level scaling and identify microarchitectural bottlenecks in multi-threaded applications. DotSim constructs an instruction-level Data Flow Graph (DFG) from each thread in multi-threaded applications, adjusting for inter-thread dependencies. The DFGs dynamically change depending on the microarchitectural constraints applied. Exploiting these DFGs allows for the easy extraction of the performance upper bound. We perform a case study on modeling the upper-bound performance limits of a processor microarchitecture modeled on an AMD Opteron. In the second part, we conduct a limit study simultaneously analyzing the two dominant forms of parallelism exploited by modern computer architectures: Instruction Level Parallelism (ILP) and Thread Level Parallelism (TLP). This study gives insight into the upper bounds of performance that future architectures can achieve. Furthermore, it identifies the bottlenecks of emerging workloads. To the best of our knowledge, our work is the first study that combines the two forms of parallelism into one study with modern applications. We evaluate the PARSEC multithreaded benchmark suite using DotSim. We make several contributions describing the high-level behavior of next-generation applications. For example, we show that these applications contain up to a factor of 929X more ILP than what is currently being extracted from real machines. We then show the effects on exploitable ILP of breaking the application into increasing numbers of threads (exploiting TLP), instruction window size, realistic branch prediction, realistic memory latency, and thread dependencies. Our examination shows that these benchmarks differ vastly from one another. As a result, we expect that no single, homogeneous microarchitecture will work optimally for all, arguing for reconfigurable, heterogeneous designs. In the third part of this thesis, we use our novel simulator DotSim to study the benefits of prefetching shared memory within critical sections. In this chapter we calculate the upper bound of performance under our given constraints. Our intent is to provide motivation for new techniques to exploit the potential benefits of reducing latency of shared memory among threads. We conduct an idealized workload characterization study focusing on the data that is truly shared among threads, using a simplified memory model. We explore the degree of shared memory criticality, and characterize the benefits of being able to use latency-reducing techniques to reduce execution time and increase ILP. We find that, on average, true sharing among benchmarks is quite low compared to overall memory accesses on the critical path and in the overall program. We also find that truly shared memory between threads does not affect the critical path for the majority of benchmarks, and when it does the impact is less than 1%. Therefore, we conclude that it is not worth exploring latency-reducing techniques for truly shared memory within critical sections.
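
The idea of extracting a performance upper bound from a data flow graph can be sketched generically. The code below is not DotSim's implementation; it is a minimal illustration under stated assumptions: every instruction has unit latency, resources are unlimited, and the hypothetical `deps` mapping lists each instruction's producers, with instruction ids following program order. The bound is then the instruction count divided by the length of the longest dependency chain.

```python
def ilp_upper_bound(deps: dict[int, list[int]]) -> float:
    """Return instructions / critical-path length for a non-empty dependency graph.

    deps maps an instruction id to the ids it depends on. Ids are assumed to
    follow program order, so every producer has a smaller id than its consumer.
    Unit latency and unlimited resources are assumed (illustrative model only).
    """
    depth: dict[int, int] = {}
    for instr in sorted(deps):
        # An instruction can issue one step after its latest producer finishes.
        depth[instr] = 1 + max((depth[d] for d in deps[instr]), default=0)
    critical_path = max(depth.values())
    return len(deps) / critical_path


if __name__ == "__main__":
    # Hypothetical six-instruction trace: 0 and 1 feed 2, 2 feeds 3, 4 feeds 5.
    example = {0: [], 1: [], 2: [0, 1], 3: [2], 4: [], 5: [4]}
    print(ilp_upper_bound(example))
```

Adding constraints such as a finite instruction window or memory latency would lengthen the critical path in such a model and lower the bound, which is consistent with the abstract's remark that the DFGs change with the constraints applied.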

    Proceedings of the 11th International Conference on Cognitive Modeling : ICCM 2012

    The International Conference on Cognitive Modeling (ICCM) is the premier conference for research on computational models and computation-based theories of human behavior. ICCM is a forum for presenting, discussing, and evaluating the complete spectrum of cognitive modeling approaches, including connectionism, symbolic modeling, dynamical systems, Bayesian modeling, and cognitive architectures. ICCM includes basic and applied research, across a wide variety of domains, ranging from low-level perception and attention to higher-level problem-solving and learning. Online version published by Universitätsverlag der TU Berlin (www.univerlag.tu-berlin.de).

    Performance analysis for wireless G (IEEE 802.11G) and wireless N (IEEE 802.11N) in outdoor environment

    This paper analyzes the capabilities and limitations of two IEEE technologies used for data transmission to mobile devices. In this work, we compare IEEE 802.11g and IEEE 802.11n in an outdoor environment to determine which technology performs better. The comparison considers coverage area (mobility), throughput, and measured interference. The work is intended to help researchers select the best technology for their deployment case and to identify the best variant for outdoor use. The tool used is Iperf, which measures the data transmission performance of IEEE 802.11n and IEEE 802.11g.

    Performance Analysis For Wireless G (IEEE 802.11 G) And Wireless N (IEEE 802.11 N) In Outdoor Environment

    This paper analyzes the capabilities and limitations of two IEEE technologies used for data transmission to mobile devices. In this work, we compare IEEE 802.11g and IEEE 802.11n in an outdoor environment to determine which technology performs better. The comparison considers coverage area (mobility), throughput, and measured interference. The work is intended to help researchers select the best technology for their deployment case and to identify the best variant for outdoor use. The tool used is Iperf, which measures the data transmission performance of IEEE 802.11n and IEEE 802.11g.