
    Reducing DRAM Row Activations with Eager Writeback

    This thesis describes and evaluates a new approach to optimizing DRAM performance and energy consumption, based on eagerly writing dirty cache lines to DRAM. Under this approach, dirty cache lines that have not been recently accessed are eagerly written to DRAM when the corresponding row has been activated by an ordinary access, such as a read. This enables clustering of reads and writes that target the same row, resulting in a significant reduction in row activations. Specifically, across 29 applications, it reduces the number of DRAM row activations by an average of 38% and a maximum of 81%. Results from a full-system simulator show that of the 29 applications, 11 have performance improvements between 10% and 20%, and 9 have improvements in excess of 20%. Furthermore, 10 consume between 10% and 20% less DRAM energy, and 10 have energy consumption reductions in excess of 20%.
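
    Below is a minimal sketch, in Python, of the eager-writeback policy described above: when an ordinary access opens a DRAM row, dirty cache lines that map to the same row and have not been recently accessed are written back while the row is open, avoiding extra activations. The CacheLine fields, the row-size constant, and the dram_write callback are illustrative assumptions, not the thesis's actual controller interface.

```python
# A sketch of the eager-writeback idea under assumed data structures.
from dataclasses import dataclass

ROW_SIZE = 8 * 1024  # assumed DRAM row size (8 KiB)

@dataclass
class CacheLine:
    addr: int            # physical address of the cached line
    dirty: bool          # line has been modified since it was fetched
    recently_used: bool  # e.g., tracked with a reference bit

def dram_row(addr: int) -> int:
    """Map a physical address to its DRAM row (simplified mapping)."""
    return addr // ROW_SIZE

def on_row_activation(active_row: int, cache: list[CacheLine], dram_write) -> None:
    """When an ordinary access (e.g., a read) opens active_row, eagerly
    write back dirty, not-recently-used lines mapping to the same row,
    clustering the writes with the read that opened the row."""
    for line in cache:
        if line.dirty and not line.recently_used and dram_row(line.addr) == active_row:
            dram_write(line.addr)  # the row is already open: no extra activation
            line.dirty = False     # the line stays cached, now clean
```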

    Cost-effective On-device Continual Learning over Memory Hierarchy with Miro

    Continual learning (CL) trains NN models incrementally from a continuous stream of tasks. To remember previously learned knowledge, prior studies store old samples over a memory hierarchy and replay them when new tasks arrive. Edge devices that adopt CL to preserve data privacy are typically energy-sensitive and thus require high model accuracy without compromising energy efficiency, i.e., cost-effectiveness. Our work is the first to explore the design space of hierarchical memory replay-based CL to gain insights into achieving cost-effectiveness on edge devices. We present Miro, a novel system runtime that carefully integrates our insights into the CL framework, enabling it to dynamically configure the CL system based on resource states for the best cost-effectiveness. To reach this goal, Miro also performs online profiling on parameters with clear accuracy-energy trade-offs and adapts to optimal values with low overhead. Extensive evaluations show that Miro significantly outperforms the baseline systems we built for comparison, consistently achieving higher cost-effectiveness. Comment: This paper is to be published in the 29th Annual International Conference on Mobile Computing and Networking (ACM MobiCom '23).
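
    The online-profiling step lends itself to a short sketch: score each candidate configuration by accuracy per unit of energy and adapt to the best one. The candidate parameter (a replay-buffer size) and the two measurement callbacks are hypothetical; Miro's actual runtime profiles a richer design space.

```python
# A sketch of cost-effectiveness-driven configuration; the callbacks
# measure_accuracy and measure_energy are assumed to be provided.

def pick_config(candidates, measure_accuracy, measure_energy):
    """Online profiling: score each configuration by accuracy per unit
    energy (a cost-effectiveness proxy) and return the best one."""
    best, best_score = None, float("-inf")
    for cfg in candidates:
        acc = measure_accuracy(cfg)   # e.g., accuracy on a held-out slice
        joules = measure_energy(cfg)  # e.g., measured device energy
        score = acc / joules
        if score > best_score:
            best, best_score = cfg, score
    return best

# Hypothetical candidates: how many old samples to keep for replay.
candidates = [{"replay_buffer": n} for n in (500, 1000, 2000)]
```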

    Data Diversification Analysis on Data Preprocessing

    We present a statistical analysis examining the diversity distributions produced by two different approaches. The first, the standard approach, is a baseline augmentation scheme in which a random augmentation is applied to each sample in each epoch independently. The second, the random-batch approach, is a newly designed augmentation scheme in which a random augmentation is applied to each tiny-batch in each epoch independently, and the assignment of samples to tiny-batches is random and independent across all epochs.
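
    The contrast between the two schemes is easy to express in code. The sketch below uses toy samples and trivial augmentations; the paper's contribution is the statistical analysis of the diversity each scheme induces, not any specific implementation.

```python
import random

def standard_epoch(samples, augmentations):
    """Standard approach: an independent random augmentation per sample."""
    return [random.choice(augmentations)(x) for x in samples]

def random_batch_epoch(samples, augmentations, batch_size):
    """Random-batch approach: samples are reassigned to tiny-batches at
    random each epoch, and one random augmentation is applied per tiny-batch."""
    order = random.sample(samples, len(samples))  # fresh assignment every epoch
    out = []
    for i in range(0, len(order), batch_size):
        aug = random.choice(augmentations)        # one choice per tiny-batch
        out.extend(aug(x) for x in order[i:i + batch_size])
    return out

# Toy usage with string "samples" and trivial "augmentations".
print(random_batch_epoch(["a", "b", "c", "d"], [str.upper, lambda s: s + "!"], 2))
```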

    Predictive Parallelization: A Framework for Reducing Tail Latencies of Web Search Queries

    We have become dependent on web search in our everyday lives. Web search services aim to provide fast responses to user queries, making the tail latency more important to reduce than the average latency. With modern multicore servers, intra-query parallelization becomes a desirable technique for reducing query response time. Our workload characterization of commercial search engine servers shows that using parallelization to reduce the tail latency is challenging: (1) the search workload consists mainly of short-running queries that do not benefit from parallelism, plus a few long-running queries that significantly impact the tail but exhibit high parallelism speedup; (2) the spare resources available to parallelize queries vary over time. This thesis presents predictive parallelization, a framework designed to address these challenges and reduce tail latencies in web search. Two fundamental techniques serve as the key elements of the framework design. First, intra-query parallelization of index searching parallelizes each individual query with small overhead. The key idea is for a parallel search to mimic the sequential order of execution, which almost never scans the entire index. Second, a query execution time predictor identifies the majority of long-running queries through machine learning. The predictor covers a comprehensive feature set to improve prediction accuracy while avoiding expensive features with excessive requirements such as large memory footprints. In turn, heuristic algorithms in the framework exploit both query and system load information to decide the parallelism degree on a query-by-query basis. At runtime, they selectively parallelize long-running queries with high parallelism efficiency and adapt the parallelism degree to the system load. All of the techniques and mechanisms proposed in this thesis have been implemented and evaluated experimentally on production servers and workloads.
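
    A minimal sketch of the per-query decision follows: predict the sequential execution time, parallelize only the predicted long-running queries, and cap the parallelism degree by the system's spare capacity. The threshold, the predictor callback, and the load signal are illustrative assumptions, not the thesis's trained predictor.

```python
LONG_QUERY_MS = 50.0  # assumed threshold separating long-running queries

def parallelism_degree(query_features, predict_ms, idle_cores, max_degree=8):
    """Decide a per-query parallelism degree from the predicted sequential
    execution time and the current spare capacity of the server."""
    predicted = predict_ms(query_features)  # ML-based execution time prediction
    if predicted < LONG_QUERY_MS or idle_cores <= 1:
        return 1                            # short query or loaded system: sequential
    return min(max_degree, idle_cores)      # long query: use available parallelism
```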

    Reliability of Large Scale GPU Clusters for Deep Learning Workloads

    Recent advances in deep learning technologies have made GPU clusters popular as training platforms. In this paper, we study reliability issues, focusing on training job failures, by analyzing logs collected from deep learning workloads running on a large-scale GPU cluster in production. These failures fall largely into two categories, infrastructure and user, based on their sources, and reveal diverse causes. With insights obtained from the failure analysis, we suggest several ways to improve the stability of shared GPU clusters designed for DL training and to optimize the user experience by reducing failure occurrences.
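
    As a rough illustration of the failure categorization, the sketch below assigns a failure record to the infrastructure or user category by matching log signatures. The keyword lists are hypothetical; the paper derives its categories from production logs.

```python
# Hypothetical failure signatures; real categories come from log analysis.
INFRA_KEYWORDS = ("ECC error", "NVLink", "node unreachable")
USER_KEYWORDS = ("CUDA out of memory", "Traceback", "AssertionError")

def classify_failure(log_text: str) -> str:
    """Group a training-job failure by its source."""
    if any(k in log_text for k in INFRA_KEYWORDS):
        return "infrastructure"
    if any(k in log_text for k in USER_KEYWORDS):
        return "user"
    return "unknown"
```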

    Domain Level Page Sharing in Xen Virtual Machine Systems

    Memory size limits the scalability of virtual machine systems. There has been research on sharing identical pages among guest systems to reduce memory usage. However, such schemes require a memory overcommitment feature based on a swap mechanism, which some virtual machine systems, including Xen, do not provide. In this paper, a new approach is proposed that shares identical pages within a designated sharing area. This approach reduces memory usage as well as redundant I/O operations. Moreover, it makes the characteristics of certain shared pages easier to understand. The conceptual design was evaluated by simulation based on real-world applications.
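
    A minimal sketch of detecting identical pages in a designated sharing area by content hashing: groups with more than one page index are candidates to collapse into a single shared copy. The data layout is an assumption, and Xen's actual sharing machinery is not shown.

```python
import hashlib

PAGE_SIZE = 4096  # assumed guest page size in bytes

def find_shareable(sharing_area):
    """Group page indices by content hash; any group with more than one
    page can be backed by a single shared physical copy."""
    groups = {}
    for idx, page in enumerate(sharing_area):    # each page is a bytes object
        digest = hashlib.sha1(page).hexdigest()  # content fingerprint
        groups.setdefault(digest, []).append(idx)
    return [idxs for idxs in groups.values() if len(idxs) > 1]

# Toy usage: pages 0 and 2 are identical and thus shareable.
pages = [b"\x00" * PAGE_SIZE, b"\x01" * PAGE_SIZE, b"\x00" * PAGE_SIZE]
print(find_shareable(pages))  # [[0, 2]]
```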

    SWAN: WAN-aware Stream Processing on Geographically-distributed Clusters

    Wide-area stream analytics is commonly used to extract operational or business insights from data issued by multiple distant datacenters. However, timely processing of such data streams is challenging because wide-area network (WAN) bandwidth is scarce and varies widely across both geo-locations (i.e., spatially) and points in time (i.e., temporally). Stream analytics under a WAN setup must consider path diversity and the associated bandwidth from data source to sink when placing operator tasks for the query execution plan. It also has to adapt quickly to dynamic resource conditions, e.g., changes in network bandwidth, to keep query execution stable. We present SWAN, a WAN stream analytics engine that incorporates two key techniques to meet these requirements. First, SWAN provides a fast heuristic model that captures WAN characteristics at runtime and evenly distributes tasks to nodes while maximizing the network bandwidth available for intermediate data. Second, SWAN exploits a stream relaying operator (or RO) to extend a query plan to better exploit path diversity. This is driven by our observation that, oftentimes, a longer path with more communication hops provides higher bandwidth to the data sink than a shorter path, allowing us to trade query latency for higher query throughput. SWAN stretches a given query plan by adding ROs at compile time to opportunistically place it over such a longer path. In practice, throughput gains do not necessarily lead to significant latency increases, because the higher network bandwidth allows more in-flight data transfers. Our prototype improves the latency and throughput of stream analytics by 77.6% and 5.64×, respectively, compared to existing approaches, and performs query adaptations within seconds.
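
    A minimal sketch of the path-selection intuition behind the relay operator: among candidate source-to-sink paths, pick the one whose bottleneck link is widest, even if it has more hops. The graph encoding is an assumption; SWAN's heuristic also handles task placement and runtime adaptation.

```python
from itertools import pairwise  # Python 3.10+

def bottleneck_bw(path, bw):
    """A path's usable bandwidth is the minimum bandwidth over its hops."""
    return min(bw[(a, b)] for a, b in pairwise(path))

def pick_path(paths, bw):
    """Pick the widest path; a longer path wins when its bottleneck link
    is faster, trading query latency for higher throughput."""
    return max(paths, key=lambda p: bottleneck_bw(p, bw))

# Toy example: the 3-hop path via a relay beats the direct path.
bw = {("src", "sink"): 50, ("src", "relay"): 200, ("relay", "sink"): 150}
print(pick_path([["src", "sink"], ["src", "relay", "sink"]], bw))
```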

    Reducing DRAM Row Activations with Eager Read/Write Clustering

    This article describes and evaluates a new approach to optimizing DRAM performance and energy consumption that is based on eagerly writing dirty cache lines to DRAM. Under this approach, many dirty cache lines are written to DRAM before they are evicted. In particular, dirty cache lines that have not been recently accessed are eagerly written to DRAM when the corresponding row has been activated by an ordinary, non-eager access, such as a read. This approach enables clustering of reads and writes that target the same row, resulting in a significant reduction in row activations. Specifically, for a variety of applications, it reduces the number of DRAM row activations by an average of 42% and a maximum of 82%. Moreover, the results from a full-system simulator show compelling performance improvements and energy consumption reductions. Out of 23 applications, 6 have overall performance improvements between 10% and 20%, and 3 have improvements in excess of 20%. Furthermore, 12 consume between 10% and 20% less DRAM energy, and 7 have energy consumption reductions in excess of 20%.