Search CORE

787 research outputs found

Memory hierarchy studies of multimedia- enhanced simultaneous multithreaded processors for MPEG-2 video decompression

Author: Sigmund Ulrich
Ungerer Theo
Publication venue
Publication date: 02/08/2007
Field of study

KITopen

Recommended from our members

Improving Performance Isolation on Chip Multiprocessors via an Operating System Scheduler

Author: Federova Alexandra
Seltzer Margo I.
Smith Michael D.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 18/12/2012
Field of study

We describe a new operating system scheduling algorithm that improves performance isolation on chip multiprocessors (CMP). Poor performance isolation occurs when an application’s performance is determined by the behaviour of its co-runners, i.e., other applications simultaneously running with it. This performance dependency is caused by unfair, corunner-dependent cache allocation on CMPs. Poor performance isolation interferes with the operating system’s control over priority enforcement and hinders QoS provisioning. Previous solutions required modifications to the hardware. We present a new software solution. Our cache-fair algorithm ensures that the application runs as quickly as it would under fair cache allocation, regardless of how the cache is actually allocated. If the thread executes fewer instructions per cycle than it would under fair cache allocation, the scheduler increases that thread’s CPU timeslice. This way, the thread’s overall performance does not suffer because it is allowed to use the CPU longer. We describe our implementation of the algorithm in Solaris™ 10, and show that it significantly improves performance isolation for SPEC CPU, SPEC JBB and TPC-C.Engineering and Applied Science

Harvard University - DASH

Object Database Scalability for Scientific Workloads

Author: Bunn Julian J.
Holtman Koen
Newman Harvey B.
Publication venue: 'California Institute of Technology Library'
Publication date: 01/01/1999
Field of study

We describe the PetaByte-scale computing challenges posed by the next generation of particle physics experiments, due to start operation in 2005. The computing models adopted by the experiments call for systems capable of handling sustained data acquisition rates of at least 100 MBytes/second into an Object Database, which will have to handle several PetaBytes of accumulated data per year. The systems will be used to schedule CPU intensive reconstruction and analysis tasks on the highly complex physics Object data which need then be served to clients located at universities and laboratories worldwide. We report on measurements with a prototype system that makes use of a 256 CPU HP Exemplar X Class machine running the Objectivity/DB database. Our results show excellent scalability for up to 240 simultaneous database clients, and aggregate I/O rates exceeding 150 Mbytes/second, indicating the viability of the computing models

Caltech Authors

Doctor of Philosophy

Author: Nellans David
Publication venue: University of Utah
Publication date: 01/12/2014
Field of study

dissertationWith the explosion of chip transistor counts, the semiconductor industry has struggled with ways to continue scaling computing performance in line with historical trends. In recent years, the de facto solution to utilize excess transistors has been to increase the size of the on-chip data cache, allowing fast access to an increased portion of main memory. These large caches allowed the continued scaling of single thread performance, which had not yet reached the limit of instruction level parallelism (ILP). As we approach the potential limits of parallelism within a single threaded application, new approaches such as chip multiprocessors (CMP) have become popular for scaling performance utilizing thread level parallelism (TLP). This dissertation identifies the operating system as a ubiquitous area where single threaded performance and multithreaded performance have often been ignored by computer architects. We propose that novel hardware and OS co-design has the potential to significantly improve current chip multiprocessor designs, enabling increased performance and improved power efficiency. We show that the operating system contributes a nontrivial overhead to even the most computationally intense workloads and that this OS contribution grows to a significant fraction of total instructions when executing several common applications found in the datacenter. We demonstrate that architectural improvements have had little to no effect on the performance of the OS over the last 15 years, leaving ample room for improvements. We specifically consider three potential solutions to improve OS execution on modern processors. First, we consider the potential of a separate operating system processor (OSP) operating concurrently with general purpose processors (GPP) in a chip multiprocessor organization, with several specialized structures acting as efficient conduits between these processors. Second, we consider the potential of segregating existing caching structures to decrease cache interference between the OS and application. Third, we propose that there are components within the OS itself that should be refactored to be both multithreaded and cache topology aware, which in turn, improves the performance and scalability of many-threaded applications

The University of Utah: J. Willard Marriott Digital Library

Effect of Hyper-Threading in Latency-Critical Multithreaded Cloud Applications and Utilization Analysis of the Major System Resources

Author: Feliu-Pérez Josué
Gómez Requena María Engracia
Huang Chaoyi
Petit Martí Salvador Vicente
Pons Terol Julio
Pons-Escat Lucía
Puche-Lara José
Sahuquillo Borrás Julio
Publication venue: Elsevier
Publication date: 01/06/2022
Field of study

[EN] Multithreaded latency-critical applications represent an important subset of workloads running on public cloud systems. Most of these systems deploy powerful computing servers including Intel Hyper-Threading processors. Understanding how performance is affected by the consumption of the main system resources is a major concern for cloud providers in order to devise virtualization strategies that improve the system efficiency. With this aim, this paper first characterizes the impact of QPS on tail latency, analyzing different scenarios varying the number of threads and the thread-to-core allocation (single-task and multi-task execution) policy. The characterization study reveals that the performance of some applications does not scale with the number of threads, and the performance of some others is insensitive to the Hyper-Threading technology, so they can be allocated in less physical cores and improve system utilization. Identifying these applications, however, at run-time is challenging. Despite identifying these applications at run-time is challenging, this paper shows that they can be successfully detected at run-time by analyzing the utilization trend of the major system resources. In addition to CPU, we have also studied how assigning the share of each application of other major shared system resources impacts on performance. We outline considerations cloud providers should take into account to improve performance and resource utilization.Acknowledgments This work has been supported by Huawei Cloud, and in part by Spanish Ministerio de Universidades under grant FPU18/01948, and by Spanish Ministerio de Universidades and European ERDF under grant RTI2018-098156-B-C51.Pons-Escat, L.; Feliu-Pérez, J.; Puche-Lara, J.; Huang, C.; Petit Martí, SV.; Pons Terol, J.; Gómez Requena, ME.... (2022). Effect of Hyper-Threading in Latency-Critical Multithreaded Cloud Applications and Utilization Analysis of the Major System Resources. Future Generation Computer Systems. 131:194-208. https://doi.org/10.1016/j.future.2022.01.02519420813

RiuNet

Performance analysis of Intel Core 2 Duo processor

Author: Prakash Tribuvan Kumar
Publication venue: LSU Digital Commons
Publication date: 01/01/2007
Field of study

With the emergence of thread level parallelism as a more efficient method of improving processor performance, Chip Multiprocessor (CMP) technology is being more widely used in developing processor architectures. Also, the widening gap between CPU and memory speed has evoked the interest of researchers to understand performance of memory hierarchical architectures. As part of this research, performance characteristic studies were carried out on the Intel Core 2 Duo, a dual core power efficient processor, using a variety of new generation benchmarks. This study provides a detailed analysis of the memory hierarchy performance and the performance scalability between single and dual core processors. The behavior of SPEC CPU2006 benchmarks running on Intel Core 2 Duo processor is also explained. Lastly, the overall execution time and throughput measurement using both multi-programmed and multi-threaded workloads for the Intel Core 2 Duo processor is reported and compared to that of the Intel Pentium D and AMD Athlon 64X2 processors. Results showed that the Intel Core 2 Duo had the best performance for a variety of workloads due to its advanced micro-architectural features such as the shared L2 cache, fast cache to cache communication and smart memory access

Louisiana State University

Analysis of Multi-Threading and Cache Memory Latency Masking on Processor Performance Using Thread Synchronization Technique

Author: Ehis Akhigbe-mudu Thursday
Publication venue: Brazilian Journal of Science Publications
Publication date: 25/09/2023
Field of study

Multithreading is a process in which a single processor executes multiple threads concurrently. This enables the processor to divide tasks into separate threads and run them simultaneously, thereby increasing the utilization of available system resources and enhancing performance. When multiple threads share an object and one or more of them modify it, unpredictable outcomes may occur. Threads that exhibit poor locality of memory reference, such as database applications, often experience delays while waiting for a response from the memory hierarchy. This observation suggests how to better manage pipeline contention. To assess the impact of memory latency on processor performance, a dual-core MT machine with four thread contexts per core is utilized. These specific benchmarks are chosen to allow the workload to include programs with both favorable and unfavorable cache locality. To eliminate the issue of wasting the wake-up signals, this work proposes an approach that involves storing all the wake-up calls. It asserts the wake-up calls to the consumer and the producer can store the wake-up call in a variable.   An assigned value in working system (or kernel) storage that each process can check is a semaphore. Semaphore is a variable that reads, and update operations automatically in bit mode. It cannot be actualized in client mode since a race condition may persistently develop when two or more processors endeavor to induce to the variable at the same time. This study includes code to measure the time taken to execute both functions and plot the graph. It should be noted that sending multiple requests to a website simultaneously could trigger a flag, ultimately blocking access to the data. This necessitates some computation on the collected statistics. The execution time is reduced to one third when using threads compared to executing the functions sequentially. This exemplifies the power of multithreading

Brazilian Journal of Science

Database Servers on Chip Multiprocessors: Limitations and Opportunities

Author: Ailamaki Anastassia
Falsafi Babak
Hardavellas Nikos
Johnson Ryan
Mancheril Naju
Pandis Ippokratis
Publication venue
Publication date: 10/10/2007
Field of study

Prior research shows that database system performance is dominated by off-chip data stalls, resulting in a concerted effort to bring data into on-chip caches. At the same time, high levels of integration have enabled the advent of chip multiprocessors and increasingly large (and slow) on-chip caches. These two trends pose the imminent technical and research challenge of adapting high-performance data management software to a shifting hardware landscape. In this paper we characterize the performance of a commercial database server running on emerging chip multiprocessor technologies. We find that the major bottleneck of current software is data cache stalls, with L2 hit stalls rising from oblivion to become the dominant execution time component in some cases. We analyze the source of this shift and derive a list of features for future database designs to attain maximum performance

Infoscience - École polytechnique fédérale de Lausanne

An Analysis of Database System Performance on Chip Multiprocessors

Author: Ailamaki Anastasia
Falsafi Babak
Hardavellas Nikos
Johnson Ryan
Mancheril Naju
Pandis Ippokratis
Publication venue
Publication date: 21/04/2009
Field of study

Infoscience - École polytechnique fédérale de Lausanne