
    Improving GPGPU Energy-Efficiency through Concurrent Kernel Execution and DVFS


    Comparative study of a time diversity scheme applied to G3 systems for narrowband power-line communications

    A dissertation submitted to the Faculty of Engineering and the Built Environment, University of the Witwatersrand, Johannesburg, in fulfilment of the requirements for the degree of Master of Science in Engineering (Electrical). Johannesburg, 2016. Power-line communications can be used for the transfer of data across electrical networks in applications such as automatic meter reading in smart grid technology. As the power-line channel is harsh and plagued with non-Gaussian noise, robust forward error correction schemes are required. This research is a comparative study in which a Luby transform code is concatenated with power-line communication systems specified by G3 PLC, an up-to-date standard published by Électricité Réseau Distribution France. Decoding using both Gaussian elimination and belief propagation is implemented to investigate and characterise their behaviour through computer simulations in MATLAB. Results show that a bit error rate performance improvement is achievable under non-worst-case channel conditions using a Gaussian elimination decoder. An adaptive system is thus recommended which decodes using Gaussian elimination and which has the appropriate data rate. The added complexity can be well tolerated, especially on the receiver side in automatic meter reading systems, because the network structure is built around a centralised agent which possesses more resources. MT201
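As a rough illustration of the Gaussian elimination decoding mentioned in the abstract above, the sketch below treats each received symbol as the XOR of a known subset of source symbols and solves the resulting linear system over GF(2). It is not the dissertation's MATLAB implementation: the function names `xor_encode` and `ge_decode` are hypothetical, the neighbour sets are supplied by hand rather than drawn from a soliton degree distribution, and channel noise and the G3 PLC physical layer are left out entirely.

```python
from functools import reduce


def xor_encode(source, neighbour_sets):
    """Build (neighbour_set, value) pairs; each value is the XOR of the
    source symbols indexed by the set. A real LT encoder would draw the
    sets at random from a soliton distribution (not modelled here)."""
    return [
        (set(nb), reduce(lambda a, b: a ^ b, (source[i] for i in nb), 0))
        for nb in neighbour_sets
    ]


def ge_decode(encoded, k):
    """Recover k source symbols by Gauss-Jordan elimination over GF(2).

    Returns the list of source symbols, or None when the received
    equations do not have full rank k."""
    pivots = {}  # pivot column -> (row bitmask, row value), kept fully reduced
    for neighbours, value in encoded:
        mask = sum(1 << i for i in neighbours)
        # Clear every already-solved pivot column from the incoming row.
        for col, (pmask, pvalue) in pivots.items():
            if (mask >> col) & 1:
                mask ^= pmask
                value ^= pvalue
        if mask == 0:
            continue  # row is linearly dependent on earlier rows
        col = mask.bit_length() - 1  # highest remaining bit becomes the pivot
        # Clear the new pivot column from the rows stored so far.
        for c in list(pivots):
            pmask, pvalue = pivots[c]
            if (pmask >> col) & 1:
                pivots[c] = (pmask ^ mask, pvalue ^ value)
        pivots[col] = (mask, value)
    if len(pivots) < k:
        return None
    return [pivots[i][1] for i in range(k)]


source = [0x12, 0x34, 0x56, 0x78]
sets = [{0, 1}, {1, 2}, {2}, {0, 3}, {0, 1, 2, 3}]
print(ge_decode(xor_encode(source, sets), len(source)))
```

Unlike belief propagation, which stalls when no degree-one symbol is available, this approach succeeds whenever the received equations have full rank, at the price of more computation on the receiver.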

    HARS, a Heterogeneity-Aware Runtime System for Self-Adaptive Multithreaded Applications

    Department of Computer Engineering. We are in an age of rapid change. In particular, the computer is no longer just a device for carrying out complex computations. Different users use these devices in different ways: one may use a computer only for writing documents, while others use one for video encoding or playing computer games. Each process has its own computational demands. Using a supercomputer only for writing documents is wasteful, like using a sledgehammer to crack a nut. If we can give the proper computation power to each process, the system becomes more efficient than before. This is the motivation behind heterogeneous multi-processing (HMP), a promising technique that can support both high- and low-demand tasks efficiently. The topic has been investigated in some prior works, but efficient system software to support HMP with self-adaptive computing has been little researched, especially for multithreaded applications. Therefore, we propose HARS, a heterogeneity-aware runtime system for self-adaptive multithreaded applications. HARS monitors application-level performance data and dynamically controls the system state to achieve the performance target with efficient power consumption. As an extended version of HARS, we also propose MP-HARS, which supports multiple applications. Our evaluation shows that HARS and MP-HARS achieve higher efficiency than the baseline version and that HARS is comparable to the static optimal version.

    Proceedings of the Second International Mobile Satellite Conference (IMSC 1990)

    Presented here are the proceedings of the Second International Mobile Satellite Conference (IMSC), held June 17-20, 1990 in Ottawa, Canada. Topics covered include future mobile satellite communications concepts, aeronautical applications, modulation and coding, propagation and experimental systems, mobile terminal equipment, network architecture and control, regulatory and policy considerations, vehicle antennas, and speech compression.

    Rubik: fast analytical power management for latency-critical systems

    Latency-critical workloads (e.g., web search), common in datacenters, require stable tail (e.g., 95th percentile) latencies of a few milliseconds. Servers running these workloads are kept lightly loaded to meet these stringent latency targets. This low utilization wastes billions of dollars in energy and equipment annually. Applying dynamic power management to latency-critical workloads is challenging. The fundamental issue is coping with their inherent short-term variability: requests arrive at unpredictable times and have variable lengths. Without knowledge of the future, prior techniques either adapt slowly and conservatively or rely on application-specific heuristics to maintain tail latency. We propose Rubik, a fine-grain DVFS scheme for latency-critical workloads. Rubik copes with variability through a novel, general, and efficient statistical performance model. This model allows Rubik to adjust frequencies at sub-millisecond granularity to save power while meeting the target tail latency. Rubik saves up to 66% of core power, widely outperforms prior techniques, and requires no application-specific tuning. Beyond saving core power, Rubik robustly adapts to sudden changes in load and system performance. We use this capability to design RubikColoc, a colocation scheme that uses Rubik to allow batch and latency-critical work to share hardware resources more aggressively than prior techniques. RubikColoc reduces datacenter power by up to 31% while using 41% fewer servers than a datacenter that segregates latency-critical and batch work, and achieves 100% core utilization. National Science Foundation (U.S.) (Grant CCF-1318384)
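The core idea in the abstract above, running at the lowest frequency that still meets a tail-latency target, can be sketched in a few lines. This is a deliberately simplified stand-in and not Rubik's statistical model, which the abstract does not spell out: it ignores queuing delay and arrival times, assumes service time scales as cycles divided by frequency, and uses hypothetical names (`percentile`, `pick_frequency`) and made-up sample values.

```python
import math


def percentile(samples, p):
    """Nearest-rank p-th percentile (0 < p <= 100) of a non-empty list."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def pick_frequency(cycle_samples, freqs_hz, target_s, p=95):
    """Return the lowest frequency whose p-th percentile service time
    (cycles / frequency) stays within target_s. Falls back to the highest
    frequency when no setting meets the target. Queuing is ignored."""
    tail_cycles = percentile(cycle_samples, p)
    for freq in sorted(freqs_hz):
        if tail_cycles / freq <= target_s:
            return freq
    return max(freqs_hz)


# Hypothetical per-request work, in CPU cycles, observed over a recent window.
samples = [i * 1_000_000 for i in range(1, 101)]
freqs = [1_000_000_000, 2_000_000_000, 3_000_000_000]
print(pick_frequency(samples, freqs, target_s=0.05))
```

A scheme of this kind saves power only if the sample window tracks the current load closely, which is why the abstract stresses adaptation at sub-millisecond granularity.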

    Enhancing the efficiency and practicality of software transactional memory on massively multithreaded systems

    Chip Multithreading (CMT) processors promise to deliver higher performance by running more than one stream of instructions in parallel. To exploit CMT's capabilities, programmers have to parallelize their applications, which is not a trivial task. Transactional Memory (TM) is one of the parallel programming models that aim to simplify synchronization by raising the level of abstraction between semantic atomicity and the means by which that atomicity is achieved. TM is a promising programming model, but important challenges must still be addressed to make it more practical and efficient in mainstream parallel programming. The first challenge addressed in this dissertation is that of making the evaluation of TM proposals more solid, with realistic TM benchmarks and the ability to run the same benchmarks on different STM systems. We first introduce RMS-TM, a comprehensive benchmark suite to evaluate HTMs and STMs. RMS-TM consists of seven applications from the Recognition, Mining and Synthesis (RMS) domain that are representative of future workloads. RMS-TM features current TM research issues such as nesting and I/O inside transactions, while also providing various TM characteristics. Most STM systems are implemented as user-level libraries: the programmer is expected to manually instrument not only transaction boundaries, but also individual loads and stores within transactions. This library-based approach is increasingly tedious and error prone and also makes it difficult to make reliable performance comparisons. To enable an "apples-to-apples" performance comparison, we then develop a software layer that allows researchers to test the same applications with interchangeable STM back ends. The second challenge addressed is that of enhancing performance and scalability of TM applications running on aggressive multi-core/multi-threaded processors.
Performance and scalability of current TM designs, in particular STM designs, do not always meet the programmer's expectation, especially at scale. To overcome this limitation, we propose a new STM design, STM2, based on an assisted execution model in which time-consuming TM operations are offloaded to auxiliary threads while application threads optimistically perform computation. Surprisingly, our results show that STM2 provides, on average, speedups between 1.8x and 5.2x over state-of-the-art STM systems. On the other hand, we notice that assisted-execution systems may show low processor utilization. To alleviate this problem and to increase the efficiency of STM2, we enriched STM2 with a runtime mechanism that automatically and adaptively detects the computing demands of application and auxiliary threads and dynamically partitions hardware resources between the pair through the hardware thread prioritization mechanism implemented in POWER machines. The third challenge is to define a notion of what it means for a TM program to be correctly synchronized. The current definition of transactional data race requires all transactions to be totally ordered "as if" serialized by a global lock, which limits the scalability of TM designs. To remove this constraint, we first propose to relax the current definition of transactional data race to allow a higher level of concurrency. Based on this definition we propose the first practical race detection algorithm for C/C++ applications (TRADE) and implement the corresponding race detection tool. Then, we introduce a new definition of transactional data race that is more intuitive, is transparent to the underlying TM implementation, and can be used for a broad set of C/C++ TM programs. Based on this new definition, we propose T-Rex, an efficient and scalable race detection tool for C/C++ TM applications.
Using TRADE and T-Rex, we have discovered subtle transactional data races in widely used STAMP applications that have not been reported in the past.

    On-board B-ISDN fast packet switching architectures. Phase 1: Study

    The broadband integrated services digital network (B-ISDN) is an emerging telecommunications technology that will meet most of the telecommunications networking needs from the mid-1990s to early next century. The satellite-based system is well positioned for providing B-ISDN service with its inherent capabilities of point-to-multipoint and broadcast transmission, virtually unlimited connectivity between any two points within a beam coverage, short deployment time of communications facilities, flexible and dynamic reallocation of space segment capacity, and distance-insensitive cost. On-board processing satellites, particularly in a multiple spot beam environment, will provide enhanced connectivity, better performance, optimized access and transmission link design, and lower user service cost. The following are described: the user and network aspects of broadband services; the current development status in broadband services; various satellite network architectures, including system design issues; and various fast packet switch architectures and their detailed designs.