Reducing branch delay to zero in pipelined processors
A mechanism to reduce the cost of branches in pipelined processors is described and evaluated. It is based on the use of multiple prefetch, early computation of the target address, delayed branch, and parallel execution of branches. The implementation of this mechanism using a branch target instruction memory is described. An analytical model of the performance of this implementation makes it possible to measure the efficiency of the mechanism with a very low computational cost. The model is used to determine the size of cache lines that maximizes the processor performance, to compare the performance of the mechanism with that of other schemes, and to analyze the performance of the mechanism with two alternative cache organizations.
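The abstract does not reproduce the analytical model itself, but its flavor can be sketched in a few lines of arithmetic. The toy calculation below, in Python and with entirely hypothetical parameter values, shows how branch frequency and branch penalty feed into the kind of effective-CPI estimate such a model optimizes over:

    # Minimal sketch of an analytical branch-cost model (hypothetical
    # parameters; not the paper's actual model).

    def effective_cpi(base_cpi, branch_freq, taken_frac, penalty_cycles):
        """Average cycles per instruction once branch stalls are included."""
        # Each taken branch adds `penalty_cycles` of stall on average.
        return base_cpi + branch_freq * taken_frac * penalty_cycles

    if __name__ == "__main__":
        # Example: 20% branches, 60% taken, 2-cycle penalty on a 1-CPI pipeline.
        cpi = effective_cpi(base_cpi=1.0, branch_freq=0.20,
                            taken_frac=0.60, penalty_cycles=2)
        print(f"effective CPI = {cpi:.2f}")  # 1.24

A model of this kind is cheap to evaluate, which is why it can be swept over cache line sizes or penalty values at negligible computational cost.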
InTune: Reinforcement Learning-based Data Pipeline Optimization for Deep Recommendation Models
Deep learning-based recommender models (DLRMs) have become an essential
component of many modern recommender systems. Several companies are now
building large compute clusters reserved only for DLRM training, driving new
interest in cost- and time-saving optimizations. The systems challenges faced
in this setting are unique; while typical deep learning training jobs are
dominated by model execution, the most important factor in DLRM training
performance is often online data ingestion.
In this paper, we explore the unique characteristics of this data ingestion
problem and provide insights into DLRM training pipeline bottlenecks and
challenges. We study real-world DLRM data processing pipelines taken from our
compute cluster at Netflix to observe the performance impacts of online
ingestion and to identify shortfalls in existing pipeline optimizers. We find
that current tooling either yields sub-optimal performance or frequent crashes, or
else requires impractical cluster re-organization to adopt. Our studies lead
us to design and build a new solution for data pipeline optimization, InTune.
InTune employs a reinforcement learning (RL) agent to learn how to distribute
the CPU resources of a trainer machine across a DLRM data pipeline to more
effectively parallelize data loading and improve throughput. Our experiments
show that InTune can build an optimized data pipeline configuration within only
a few minutes, and can easily be integrated into existing training workflows.
By exploiting the responsiveness and adaptability of RL, InTune achieves higher
online data ingestion rates than existing optimizers, thus reducing idle times
in model execution and increasing efficiency. We apply InTune to our real-world
cluster, and find that it increases data ingestion throughput by as much as
2.29X versus state-of-the-art data pipeline optimizers while also improving
both CPU & GPU utilization.
Comment: Accepted at RecSys 2023. 11 pages plus 2 pages of references, 8 figures, 2 tables.
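As a rough illustration of the optimization problem InTune tackles, the sketch below searches for a CPU allocation across data-pipeline stages that maximizes throughput. It is not InTune's RL agent: the stage names, the simulated throughput function, and the epsilon-greedy search are all stand-in assumptions.

    # Toy CPU allocator for a data pipeline (illustrative only; not InTune).
    import random

    STAGES = ["read", "decode", "shuffle", "batch"]  # hypothetical stages
    TOTAL_CORES = 16

    def simulated_throughput(alloc):
        """Stand-in for measuring pipeline throughput under an allocation."""
        # Throughput is limited by the slowest stage; weights are made up.
        weights = {"read": 1.0, "decode": 2.5, "shuffle": 0.8, "batch": 1.2}
        return min(alloc[s] / weights[s] for s in STAGES)

    def neighbors(alloc):
        """All allocations reachable by moving one core between two stages."""
        for src in STAGES:
            for dst in STAGES:
                if src != dst and alloc[src] > 1:
                    new = dict(alloc)
                    new[src] -= 1
                    new[dst] += 1
                    yield new

    def epsilon_greedy_search(steps=200, eps=0.2):
        alloc = {s: TOTAL_CORES // len(STAGES) for s in STAGES}
        for _ in range(steps):
            # Explore a random move with probability eps, otherwise exploit.
            cand = (random.choice(list(neighbors(alloc))) if random.random() < eps
                    else max(neighbors(alloc), key=simulated_throughput))
            if simulated_throughput(cand) >= simulated_throughput(alloc):
                alloc = cand
        return alloc, simulated_throughput(alloc)

    if __name__ == "__main__":
        alloc, tp = epsilon_greedy_search()
        print("allocation:", alloc, "throughput:", round(tp, 2))

In the real system the reward would come from measured ingestion throughput rather than a simulated function, and the agent would keep adapting online as the workload changes.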
Multi-core processors and the future of parallelism in software
The purpose of this thesis is to examine multi-core technology. Multi-core architecture provides benefits such as lower power consumption, scalability, and improved application performance enabled by thread-level parallelism.
A Shared memory multiprocessor system architecture utilizing a uniform
Due to VLSI lithography problems and the limitation of additional architectural enhancements, uniprocessor systems are nearing the end of their life cycle. Therefore, it is believed that Symmetric Multiprocessing (SMP) systems will be the next mainstream computing platform. These systems allow multiple processors, accessing the same memory image, to cooperate on a number of computational tasks as a single entity. While multiprocessor systems can offer a substantial performance increase compared to uniprocessor systems, major design considerations must be addressed to achieve desired system efficiency levels. Managing cache coherence is a significant problem in multiprocessor systems. Current implementations cope with this problem by utilizing a cache coherence protocol. This protocol puts a large amount of overhead on the system bus to ensure proper program execution, effectively decreasing overall system performance. This thesis approaches the cache coherence problem from a new angle. Instead of utilizing a cache coherence protocol, a new memory system is proposed which eliminates the need for a cache coherence protocol by utilizing a shared level 2 data-only cache. This new architecture allows for better utilization of the system and improved performance and scalability. A data rate analysis is conducted to demonstrate the potential performance increase from the proposed architecture over conventional approaches. The data rate model clearly shows an increase in system performance and utilization when using the architecture proposed in this thesis.
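As a back-of-the-envelope illustration of the kind of data rate analysis the abstract mentions, the sketch below compares the aggregate bus demand of a coherence-protocol design (which carries extra protocol traffic) against a shared level 2 cache that needs no protocol traffic. All parameter values are hypothetical and not taken from the thesis.

    # Toy data-rate comparison; every number below is a made-up assumption.

    def bus_demand(n_cpus, miss_rate, mem_refs_per_sec, bytes_per_line,
                   coherence_overhead):
        """Aggregate bytes/sec the shared interconnect must carry."""
        base = n_cpus * mem_refs_per_sec * miss_rate * bytes_per_line
        return base * (1.0 + coherence_overhead)

    if __name__ == "__main__":
        common = dict(n_cpus=8, miss_rate=0.02, mem_refs_per_sec=200e6,
                      bytes_per_line=64)
        snooping = bus_demand(**common, coherence_overhead=0.35)  # protocol traffic
        shared_l2 = bus_demand(**common, coherence_overhead=0.0)  # no protocol
        print(f"snooping bus demand : {snooping / 1e9:.2f} GB/s")
        print(f"shared-L2 demand    : {shared_l2 / 1e9:.2f} GB/s")

The point of such a model is only to expose how quickly protocol overhead eats into the fixed bandwidth of a shared bus as the processor count grows.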
The Performance Cost of Security
Historically, performance has been the most important feature when optimizing computer hardware. Modern processors are so highly optimized that every cycle of computation time matters. However, this practice of optimizing for performance at all costs has been called into question by new microarchitectural attacks, e.g. Meltdown and Spectre. Microarchitectural attacks exploit the effects of microarchitectural components or optimizations in order to leak data to an attacker. These attacks have caused processor manufacturers to introduce performance impacting mitigations in both software and silicon.
To investigate the performance impact of the various mitigations, a test suite of forty-seven different tests was created. This suite was run on a series of virtual machines that tested both Ubuntu 16 and Ubuntu 18. These tests investigated the performance change across version updates and the performance impact of CPU core count vs. default microarchitectural mitigations. The testing showed that the performance impact of the microarchitectural mitigations is non-trivial, as the percent difference in performance can be as high as 200%.
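A hedged sketch of the kind of measurement behind such a test suite is shown below: it reports the mitigation status Linux exposes under /sys/devices/system/cpu/vulnerabilities, times a toy CPU-bound workload, and includes a symmetric percent-difference helper (which tops out near 200%). The workload and the formula are illustrative assumptions, not the thesis's forty-seven tests.

    # Sketch: read kernel-reported mitigation status and time a toy workload.
    import pathlib
    import time

    VULN_DIR = pathlib.Path("/sys/devices/system/cpu/vulnerabilities")

    def mitigation_status():
        """Return {vulnerability: kernel-reported status} on Linux, else {}."""
        if not VULN_DIR.is_dir():
            return {}
        return {p.name: p.read_text().strip() for p in VULN_DIR.iterdir()}

    def cpu_bound_workload(n=2_000_000):
        """Toy compute loop; a real suite would mix syscall-heavy tests too."""
        total = 0
        for i in range(n):
            total += i * i
        return total

    def percent_difference(a, b):
        """Symmetric percent difference between two timings."""
        return abs(a - b) / ((a + b) / 2) * 100

    if __name__ == "__main__":
        for vuln, status in mitigation_status().items():
            print(f"{vuln}: {status}")
        start = time.perf_counter()
        cpu_bound_workload()
        print(f"elapsed: {time.perf_counter() - start:.3f}s")
        # Compare against a run booted with mitigations disabled (e.g. the
        # `mitigations=off` kernel parameter) using percent_difference().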
Space Station Freedom data management system growth and evolution report
The Information Sciences Division at the NASA Ames Research Center has completed a 6-month study of portions of the Space Station Freedom Data Management System (DMS). This study looked at the present capabilities and future growth potential of the DMS, and the results are documented in this report. Issues have been raised that were discussed with the appropriate Johnson Space Center (JSC) management and Work Package-2 contractor organizations. Areas requiring additional study have been identified and suggestions for long-term upgrades have been proposed. This activity has allowed the Ames personnel to develop a rapport with the JSC civil service and contractor teams that permits an independent check-and-balance technique for the DMS.
Castell: a heterogeneous CMP architecture scalable to hundreds of processors
Technology improvements and power constraints have led multicore architectures to dominate
microprocessor designs over uniprocessors. At the same time, accelerator-based architectures
have shown that heterogeneous multicores are very efficient and can provide high throughput for
parallel applications, but at the cost of a high programming effort. We propose Castell, a scalable chip
multiprocessor architecture that can be programmed like a uniprocessor while providing the high
throughput of accelerator-based architectures.
Castell relies on task-based programming models that simplify software development. These
models use a runtime system that dynamically finds, schedules, and adds hardware-specific features
to parallel tasks. One of these features is DMA transfers to overlap computation and data
movement, which is known as double buffering. This feature allows applications on Castell
to tolerate large memory latencies and lets us design the memory system focusing on memory
bandwidth.
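A minimal sketch of that double-buffering idea, with a plain background thread standing in for the DMA engine and made-up chunk contents, is given below; it starts fetching the next block while computing on the current one.

    # Double-buffering sketch: overlap "DMA" fetches with computation.
    from concurrent.futures import ThreadPoolExecutor

    def fetch(chunk_id):
        """Stand-in for a DMA transfer from main memory to a local buffer."""
        return [chunk_id * 1000 + i for i in range(1000)]

    def compute(chunk):
        """Stand-in for the task body operating on a local buffer."""
        return sum(chunk)

    def process_all(n_chunks):
        results = []
        with ThreadPoolExecutor(max_workers=1) as dma:
            pending = dma.submit(fetch, 0)              # prime the first buffer
            for i in range(n_chunks):
                current = pending.result()              # wait for the ready buffer
                if i + 1 < n_chunks:
                    pending = dma.submit(fetch, i + 1)  # start next transfer early
                results.append(compute(current))        # compute overlaps the fetch
        return results

    if __name__ == "__main__":
        print(process_all(4))

In the architecture described, the runtime system rather than the programmer would issue these transfers, which is what lets the memory system be designed for bandwidth instead of latency.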
In addition to programmability and the design of the memory system, we have used
a hierarchical NoC and added a synchronization module. The NoC design distributes memory
traffic efficiently to allow the architecture to scale. The synchronization module addresses
the large performance degradation that applications suffer under large synchronization latencies.
Castell is mainly an architecture framework that enables the definition of domain-specific
implementations, fine-tuned to a particular problem or application. So far, Castell has been
successfully used to propose heterogeneous multicore architectures for scientific kernels, video
decoding (using H.264), and protein sequence alignment (using Smith-Waterman and clustalW).
It has also been used to explore a number of architecture optimizations such as enhanced DMA
controllers, and architecture support for task-based programming models.