168 research outputs found

    Analysis of False Cache Line Sharing Effects on Multicore CPUs

    False sharing (FS) is a well-known problem in multiprocessor systems. It degrades the performance of multi-threaded programs running in multiprocessor environments. As processor architecture has evolved, hardware designers have turned to multicore processors to increase performance while avoiding heat and power walls. To fully exploit the processing power of these multicore architectures, software programmers need to build applications using parallel programming concepts, which are based on multi-threaded programming principles. Since the architecture of a multicore processor is very similar to that of a multiprocessor system, false sharing is hypothesized to occur there as well. Its effects should be measurable as efficiency degradation in a concurrent environment on multicore systems. This project discusses the causes of false sharing in dual-core CPUs and demonstrates how it reduces system performance by measuring the efficiency of a test program in its sequential and parallel versions. To this end, demonstration programs are developed to read the CPU cache line size and to collect the execution results of the test program with and without false sharing on the specific system hardware. Several techniques are implemented to eliminate false sharing. These techniques are described, and their effectiveness in recovering the speed-up and efficiency lost to false sharing is analyzed.
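The abstract above measures false sharing through speed-up and efficiency. A minimal sketch of those two metrics follows, assuming the standard definitions (speed-up = sequential time / parallel time, efficiency = speed-up / number of cores). The function names and the timings are hypothetical illustrations, not values or code from the project.

```python
def speedup(t_sequential: float, t_parallel: float) -> float:
    """Speed-up S = T_seq / T_par (standard definition, assumed here)."""
    if t_sequential <= 0 or t_parallel <= 0:
        raise ValueError("timings must be positive")
    return t_sequential / t_parallel


def efficiency(t_sequential: float, t_parallel: float, n_cores: int) -> float:
    """Efficiency E = S / p, where p is the number of cores used."""
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    return speedup(t_sequential, t_parallel) / n_cores


if __name__ == "__main__":
    # Hypothetical wall-clock timings in seconds on a dual-core CPU.
    t_seq = 10.0
    parallel_runs = {"with false sharing": 8.0, "with padding": 5.0}
    for label, t_par in parallel_runs.items():
        s = speedup(t_seq, t_par)
        e = efficiency(t_seq, t_par, n_cores=2)
        print(f"{label}: speed-up {s:.2f}, efficiency {e:.3f}")
```

With these made-up timings, the run that suffers from false sharing reaches an efficiency well below 1, while the padded run approaches the ideal value of 1 on two cores.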

    Custom architecture for multicore audio Beamforming systems

    The audio Beamforming (BF) technique utilizes microphone arrays to extract acoustic sources recorded in a noisy environment. In this article, we propose a new approach for rapid development of multicore BF systems. A review of the literature reveals that the majority of such experimental and commercial audio systems are based on desktop PCs, due to their high-level programming support and potential for rapid system development. However, these approaches introduce performance bottlenecks, excessive power consumption, and increased overall cost. Systems based on DSPs require very low power, but their performance is still limited. Custom hardware solutions alleviate the aforementioned drawbacks; however, designers primarily focus on performance optimization without providing a high-level interface for system control and test. To address these problems, we propose a custom platform-independent architecture for reconfigurable audio BF systems. To evaluate our proposal, we implement our architecture as a heterogeneous multicore reconfigurable processor and map it onto FPGAs. Our approach combines the software flexibility of General-Purpose Processors (GPPs) with the computational power of multicore platforms. To evaluate our system, we compare it against a BF software application implemented on a low-power Atom 330, a mid-range Core2 Duo, and a high-end Core i3. Experimental results suggest that our proposed solution can extract up to 16 audio sources in real time under a 16-microphone setup. In contrast, under the same setup, the Atom 330 cannot extract any audio sources in real time, while the Core2 Duo and the Core i3 can process in real time only up to 4 and 6 sources respectively. Furthermore, a Virtex4-based BF system consumes more than an order of magnitude less energy than the aforementioned GPP-based approaches. © 2013 ACM

    Using the structures of modern computer systems to implement knowledge processing systems

    Given the rapid development of programmable logic integrated circuit technologies, developing new, higher-performance computer architectures for knowledge processing remains a relevant problem. This work analyzes the architectures of problem-oriented (task-oriented) systems as well as universal systems. It considers the advantages of, and ways of using, computers with a universal architecture to implement knowledge processing systems.

    Task Activity Vectors: A Novel Metric for Temperature-Aware and Energy-Efficient Scheduling

    This thesis introduces the abstraction of the task activity vector to characterize applications by the processor resources they utilize. Based on activity vectors, the thesis introduces scheduling policies for improving the temperature distribution on the processor chip and for increasing energy efficiency by reducing the contention for shared resources of multicore and multithreaded processors

    Benchmarking GPUs to tune dense linear algebra


    GPUs as Storage System Accelerators

    Massively multicore processors, such as Graphics Processing Units (GPUs), provide, at a comparable price, peak performance one order of magnitude higher than that of traditional CPUs. This drop in the cost of computation, like any order-of-magnitude drop in the cost per unit of performance for a class of system components, creates the opportunity to redesign systems and to explore new ways to engineer them to recalibrate the cost-to-performance relation. This project explores the feasibility of harnessing GPUs' computational power to improve the performance, reliability, or security of distributed storage systems. In this context, we present the design of a storage system prototype that uses GPU offloading to accelerate a number of computationally intensive primitives based on hashing, and introduce techniques to efficiently leverage the processing power of GPUs. We evaluate the performance of this prototype under two configurations: as a content addressable storage system that facilitates online similarity detection between successive versions of the same file and as a traditional system that uses hashing to preserve data integrity. Further, we evaluate the impact of offloading to the GPU on competing applications' performance. Our results show that this technique can bring tangible performance gains without negatively impacting the performance of concurrently running applications. Comment: IEEE Transactions on Parallel and Distributed Systems, 201

    Modeling Data Center Co-Tenancy Performance Interference

    A multi-core machine allows several applications to execute simultaneously. Those jobs are scheduled on different cores and compete for shared resources such as the last level cache and memory bandwidth. Such competition might cause performance degradation. Data centers often utilize virtualization to provide a certain level of performance isolation. However, some of the shared resources cannot be divided, even in a virtualized system, to ensure complete isolation. If the performance degradation of co-tenancy is not known to the cloud administrator, a data center often has to dedicate a whole machine to a latency-sensitive application to guarantee its quality of service. Co-run scheduling attempts to make good use of resources by scheduling compatible jobs onto one machine while maintaining their service level agreements. An ideal co-run scheduling scheme requires accurate contention modeling. Recent studies on co-run modeling and scheduling have made steady progress in predicting performance for two co-run applications sharing a specific system. This thesis advances co-tenancy modeling in three aspects. First, with an accurate co-run model for one system at hand, we propose a regression model to transfer the knowledge and create a model for a new system with a different hardware configuration. Second, by examining those programs that yield high prediction errors, we further leverage clustering techniques to create a model for each group of applications that show similar behavior. Clustering helps improve the prediction accuracy of those pathological cases. Third, existing research typically focuses on modeling two-application co-run cases. We extend a two-core model to a three- and four-core model by introducing a light-weight micro-kernel that emulates a complicated benchmark through program instrumentation.
Our experimental evaluation shows that our cross-architecture model achieves an average prediction error of less than 2% for pairwise co-runs across the SPEC CPU2006 benchmark suite. For co-tenancy modeling of more than two applications, we show that our model is more scalable and can achieve an average prediction error of 2-3%.
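The abstract above reports an "average prediction error" without defining it. One common reading, assumed here, is the mean absolute relative error between predicted and measured co-run slowdowns. The sketch below illustrates that assumed metric only. The function name and sample numbers are hypothetical and do not come from the thesis.

```python
from typing import Sequence


def mean_relative_error(predicted: Sequence[float],
                        measured: Sequence[float]) -> float:
    """Mean of |predicted - measured| / |measured| over all co-run cases.

    Assumed definition of "average prediction error"; the thesis may
    use a different formula.
    """
    if len(predicted) != len(measured) or len(measured) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(m == 0 for m in measured):
        raise ValueError("measured values must be non-zero")
    total = sum(abs(p - m) / abs(m) for p, m in zip(predicted, measured))
    return total / len(measured)


if __name__ == "__main__":
    # Hypothetical slowdown factors for three pairwise co-runs.
    predicted = [1.10, 1.25, 1.02]
    measured = [1.12, 1.22, 1.03]
    print(f"average prediction error: "
          f"{100 * mean_relative_error(predicted, measured):.2f}%")
```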

    Portable, scalable, per-core power estimation for intelligent resource management

    Performance, power, and temperature are now all first-order design constraints. Balancing power efficiency, thermal constraints, and performance requires some means to convey data about real-time power consumption and temperature to intelligent resource managers. Resource managers can use this information to meet performance goals, maintain power budgets, and obey thermal constraints. Unfortunately, obtaining the required machine introspection is challenging. Most current chips provide no support for per-core power monitoring, and when support exists, it is not exposed to software. We present a methodology for deriving per-core power models using sampled performance counter values and temperature sensor readings. We develop application-independent models for four different (four- to eight-core) platforms, validate their accuracy, and show how they can be used to guide scheduling decisions in power-aware resource managers. Model overhead is negligible, and estimations exhibit 1.1%-5.2% per-suite median error on the NAS, SPEC OMP, and SPEC 2006 benchmarks (and 1.2%-4.4% overall)

    Hard real-time performances in multiprocessor-embedded systems using ASMP-Linux

    Multiprocessor systems, especially those based on multicore or multithreaded processors, and new operating system architectures can satisfy the ever-increasing computational requirements of embedded systems. ASMP-LINUX is a modified, highly responsive, open-source hard real-time operating system for multiprocessor systems, capable of providing high real-time performance while keeping the code simple and without affecting the performance of the rest of the system. Moreover, ASMP-LINUX does not require code changes or application recompiling/relinking. To assess the performance of ASMP-LINUX, benchmarks have been run on several hardware platforms and configurations.