
    Reducing DRAM Row Activations with Eager Writeback

    This thesis describes and evaluates a new approach to optimizing DRAM performance and energy consumption, based on eagerly writing dirty cache lines to DRAM. Under this approach, dirty cache lines that have not been recently accessed are eagerly written to DRAM when the corresponding row has been activated by an ordinary access, such as a read. This enables clustering of reads and writes that target the same row, resulting in a significant reduction in row activations. Specifically, across 29 applications, it reduces the number of DRAM row activations by an average of 38% and a maximum of 81%. Results from a full-system simulator show that of the 29 applications, 11 have performance improvements between 10% and 20%, and 9 have improvements in excess of 20%. Furthermore, 10 consume between 10% and 20% less DRAM energy, and 10 have energy consumption reductions in excess of 20%.
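
    Below is a minimal sketch, in Python, of the eager-writeback policy described above: when an ordinary access opens a DRAM row, dirty cache lines that map to the same row and have not been recently accessed are written back while the row is open, avoiding extra activations. The CacheLine fields, the row-size constant, and the dram_write callback are illustrative assumptions, not the thesis's actual controller interface.

```python
# A sketch of the eager-writeback idea under assumed data structures.
from dataclasses import dataclass

ROW_SIZE = 8 * 1024  # assumed DRAM row size (8 KiB)

@dataclass
class CacheLine:
    addr: int            # physical address of the cached line
    dirty: bool          # line has been modified since it was fetched
    recently_used: bool  # e.g., tracked with a reference bit

def dram_row(addr: int) -> int:
    """Map a physical address to its DRAM row (simplified mapping)."""
    return addr // ROW_SIZE

def on_row_activation(active_row: int, cache: list[CacheLine], dram_write) -> None:
    """When an ordinary access (e.g., a read) opens active_row, eagerly
    write back dirty, not-recently-used lines mapping to the same row,
    clustering the writes with the read that opened the row."""
    for line in cache:
        if line.dirty and not line.recently_used and dram_row(line.addr) == active_row:
            dram_write(line.addr)  # the row is already open: no extra activation
            line.dirty = False     # the line stays cached, now clean
```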

    Cost-effective On-device Continual Learning over Memory Hierarchy with Miro

    Continual learning (CL) trains NN models incrementally from a continuous stream of tasks. To remember previously learned knowledge, prior studies store old samples over a memory hierarchy and replay them when new tasks arrive. Edge devices that adopt CL to preserve data privacy are typically energy-sensitive and thus require high model accuracy without compromising energy efficiency, i.e., cost-effectiveness. Our work is the first to explore the design space of hierarchical memory replay-based CL to gain insights into achieving cost-effectiveness on edge devices. We present Miro, a novel system runtime that carefully integrates our insights into the CL framework, enabling it to dynamically configure the CL system based on resource states for the best cost-effectiveness. To reach this goal, Miro also performs online profiling on parameters with clear accuracy-energy trade-offs and adapts to optimal values with low overhead. Extensive evaluations show that Miro significantly outperforms the baseline systems we built for comparison, consistently achieving higher cost-effectiveness. Comment: This paper is to be published in the 29th Annual International Conference on Mobile Computing and Networking (ACM MobiCom '23).
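
    The online-profiling step lends itself to a short sketch: score each candidate configuration by accuracy per unit of energy and adapt to the best one. The candidate parameter (a replay-buffer size) and the two measurement callbacks are hypothetical; Miro's actual runtime profiles a richer design space.

```python
# A sketch of cost-effectiveness-driven configuration; the callbacks
# measure_accuracy and measure_energy are assumed to be provided.

def pick_config(candidates, measure_accuracy, measure_energy):
    """Online profiling: score each configuration by accuracy per unit
    energy (a cost-effectiveness proxy) and return the best one."""
    best, best_score = None, float("-inf")
    for cfg in candidates:
        acc = measure_accuracy(cfg)   # e.g., accuracy on a held-out slice
        joules = measure_energy(cfg)  # e.g., measured device energy
        score = acc / joules
        if score > best_score:
            best, best_score = cfg, score
    return best

# Hypothetical candidates: how many old samples to keep for replay.
candidates = [{"replay_buffer": n} for n in (500, 1000, 2000)]
```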

    Data Diversification Analysis on Data Preprocessing

    We present a statistical analysis examining the diversity distributions produced by two different approaches. The first, the standard approach, is a baseline augmentation scheme in which a random augmentation is applied to each sample in each epoch independently. The second, the random-batch approach, is a newly designed augmentation scheme in which a random augmentation is applied to each tiny-batch in each epoch independently, and the assignment of samples to tiny-batches is random and independent across all epochs.
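
    The contrast between the two schemes is easy to express in code. The sketch below uses toy samples and trivial augmentations; the paper's contribution is the statistical analysis of the diversity each scheme induces, not any specific implementation.

```python
import random

def standard_epoch(samples, augmentations):
    """Standard approach: an independent random augmentation per sample."""
    return [random.choice(augmentations)(x) for x in samples]

def random_batch_epoch(samples, augmentations, batch_size):
    """Random-batch approach: samples are reassigned to tiny-batches at
    random each epoch, and one random augmentation is applied per tiny-batch."""
    order = random.sample(samples, len(samples))  # fresh assignment every epoch
    out = []
    for i in range(0, len(order), batch_size):
        aug = random.choice(augmentations)        # one choice per tiny-batch
        out.extend(aug(x) for x in order[i:i + batch_size])
    return out

# Toy usage with string "samples" and trivial "augmentations".
print(random_batch_epoch(["a", "b", "c", "d"], [str.upper, lambda s: s + "!"], 2))
```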

    Predictive Parallelization: A Framework for Reducing Tail Latencies of Web Search Queries

    We have become dependent on web search in our everyday lives. Web search services aim to provide fast responses to user queries, making the tail latency more important to reduce than the average latency. With modern multicore servers, intra-query parallelization becomes a desirable technique for reducing query response time. Our workload characterization of commercial search engine servers shows that using parallelization to reduce the tail latency is challenging: (1) the search workload consists mainly of short-running queries that do not benefit from parallelism, plus a few long-running queries that significantly impact the tail but exhibit high parallelism speedup; (2) the spare resources available to parallelize queries vary over time. This thesis presents predictive parallelization, a framework designed to address these challenges and reduce tail latencies in web search. Two fundamental techniques serve as the key elements of the framework design. First, intra-query parallelization of index searching parallelizes each individual query with small overhead. The key idea is for a parallel search to mimic the sequential order of execution, which almost never scans the entire index. Second, a query execution time predictor identifies the majority of long-running queries through machine learning. The predictor covers a comprehensive feature set to improve prediction accuracy while avoiding expensive features with excessive requirements such as large memory footprints. In turn, heuristic algorithms in the framework exploit both query and system load information to decide the parallelism degree on a query-by-query basis. At runtime, they selectively parallelize long-running queries with high parallelism efficiency and adapt the parallelism degree to the system load. All of the techniques and mechanisms proposed in this thesis have been implemented and evaluated experimentally on production servers and workloads.
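
    A minimal sketch of the per-query decision follows: predict the sequential execution time, parallelize only the predicted long-running queries, and cap the parallelism degree by the system's spare capacity. The threshold, the predictor callback, and the load signal are illustrative assumptions, not the thesis's trained predictor.

```python
LONG_QUERY_MS = 50.0  # assumed threshold separating long-running queries

def parallelism_degree(query_features, predict_ms, idle_cores, max_degree=8):
    """Decide a per-query parallelism degree from the predicted sequential
    execution time and the current spare capacity of the server."""
    predicted = predict_ms(query_features)  # ML-based execution time prediction
    if predicted < LONG_QUERY_MS or idle_cores <= 1:
        return 1                            # short query or loaded system: sequential
    return min(max_degree, idle_cores)      # long query: use available parallelism
```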

    Reliability of Large Scale GPU Clusters for Deep Learning Workloads

    Recent advances in deep learning technologies have made GPU clusters popular as training platforms. In this paper, we study reliability issues, focusing on training job failures, by analyzing logs collected from deep learning workloads running on a large-scale GPU cluster in production. These failures fall largely into two categories, infrastructure and user, based on their sources, and reveal diverse causes. With insights obtained from the failure analysis, we suggest several ways to improve the stability of shared GPU clusters designed for DL training and to optimize the user experience by reducing failure occurrences.
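
    As a rough illustration of the failure categorization, the sketch below assigns a failure record to the infrastructure or user category by matching log signatures. The keyword lists are hypothetical; the paper derives its categories from production logs.

```python
# Hypothetical failure signatures; real categories come from log analysis.
INFRA_KEYWORDS = ("ECC error", "NVLink", "node unreachable")
USER_KEYWORDS = ("CUDA out of memory", "Traceback", "AssertionError")

def classify_failure(log_text: str) -> str:
    """Group a training-job failure by its source."""
    if any(k in log_text for k in INFRA_KEYWORDS):
        return "infrastructure"
    if any(k in log_text for k in USER_KEYWORDS):
        return "user"
    return "unknown"
```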

    Domain Level Page Sharing in Xen Virtual Machine Systems

    Memory size limits the scalability of virtual machine systems. There has been research on sharing identical pages among guest systems to reduce memory usage. However, such schemes require a memory overcommitment feature based on a swap mechanism, which some virtual machine systems, including Xen, do not provide. In this paper, a new approach is proposed that shares identical pages within a designated sharing area. This approach reduces memory usage as well as redundant I/O operations. Moreover, it makes the characteristics of certain shared pages easier to understand. The conceptual design was evaluated by simulation based on real-world applications.
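
    A minimal sketch of detecting identical pages in a designated sharing area by content hashing: groups with more than one page index are candidates to collapse into a single shared copy. The data layout is an assumption, and Xen's actual sharing machinery is not shown.

```python
import hashlib

PAGE_SIZE = 4096  # assumed guest page size in bytes

def find_shareable(sharing_area):
    """Group page indices by content hash; any group with more than one
    page can be backed by a single shared physical copy."""
    groups = {}
    for idx, page in enumerate(sharing_area):    # each page is a bytes object
        digest = hashlib.sha1(page).hexdigest()  # content fingerprint
        groups.setdefault(digest, []).append(idx)
    return [idxs for idxs in groups.values() if len(idxs) > 1]

# Toy usage: pages 0 and 2 are identical and thus shareable.
pages = [b"\x00" * PAGE_SIZE, b"\x01" * PAGE_SIZE, b"\x00" * PAGE_SIZE]
print(find_shareable(pages))  # [[0, 2]]
```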

    SWAN: WAN-aware Stream Processing on Geographically-distributed Clusters

    Wide-area stream analytics is commonly used to extract operational or business insights from data issued by multiple distant datacenters. However, timely processing of such data streams is challenging because wide-area network (WAN) bandwidth is scarce and varies widely across both geo-locations (i.e., spatially) and points in time (i.e., temporally). Stream analytics under a WAN setup must consider path diversity and the associated bandwidth from data source to sink when placing operator tasks for the query execution plan. It also has to adapt quickly to dynamic resource conditions, e.g., changes in network bandwidth, to keep query execution stable. We present SWAN, a WAN stream analytics engine that incorporates two key techniques to meet these requirements. First, SWAN provides a fast heuristic model that captures WAN characteristics at runtime and evenly distributes tasks to nodes while maximizing the network bandwidth available for intermediate data. Second, SWAN exploits a stream relaying operator (or RO) to extend a query plan to better exploit path diversity. This is driven by our observation that, oftentimes, a longer path with more communication hops provides higher bandwidth to the data sink than a shorter path, allowing us to trade query latency for higher query throughput. SWAN stretches a given query plan by adding ROs at compile time to opportunistically place it over such a longer path. In practice, throughput gains do not necessarily lead to significant latency increases, because the higher network bandwidth allows more in-flight data transfers. Our prototype improves the latency and throughput of stream analytics by 77.6% and 5.64×, respectively, compared to existing approaches, and performs query adaptations within seconds.
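
    A minimal sketch of the path-selection intuition behind the relay operator: among candidate source-to-sink paths, pick the one whose bottleneck link is widest, even if it has more hops. The graph encoding is an assumption; SWAN's heuristic also handles task placement and runtime adaptation.

```python
from itertools import pairwise  # Python 3.10+

def bottleneck_bw(path, bw):
    """A path's usable bandwidth is the minimum bandwidth over its hops."""
    return min(bw[(a, b)] for a, b in pairwise(path))

def pick_path(paths, bw):
    """Pick the widest path; a longer path wins when its bottleneck link
    is faster, trading query latency for higher throughput."""
    return max(paths, key=lambda p: bottleneck_bw(p, bw))

# Toy example: the 3-hop path via a relay beats the direct path.
bw = {("src", "sink"): 50, ("src", "relay"): 200, ("relay", "sink"): 150}
print(pick_path([["src", "sink"], ["src", "relay", "sink"]], bw))
```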

    Reducing DRAM Row Activations with Eager Read/Write Clustering

    This article describes and evaluates a new approach to optimizing DRAM performance and energy consumption that is based on eagerly writing dirty cache lines to DRAM. Under this approach, many dirty cache lines are written to DRAM before they are evicted. In particular, dirty cache lines that have not been recently accessed are eagerly written to DRAM when the corresponding row has been activated by an ordinary, non-eager access, such as a read. This approach enables clustering of reads and writes that target the same row, resulting in a significant reduction in row activations. Specifically, for a variety of applications, it reduces the number of DRAM row activations by an average of 42% and a maximum of 82%. Moreover, the results from a full-system simulator show compelling performance improvements and energy consumption reductions. Out of 23 applications, 6 have overall performance improvements between 10% and 20%, and 3 have improvements in excess of 20%. Furthermore, 12 consume between 10% and 20% less DRAM energy, and 7 have energy consumption reductions in excess of 20%.