3 research outputs found

    Improved Cache-hot Page Allocation Technique for Reducing Page Initialization Latency of Linux Based Systems

    Get PDF
    Master's thesis (M.S.) -- Seoul National University Graduate School, Department of Electrical and Computer Engineering, February 2019. Advisor: Seongsoo Hong.

    Recently, user-interactive applications have come to request large amounts of memory from the OS frequently. When an application requests memory, the OS allocates it in units of pages, and it must initialize each page before handing it over. For user-interactive applications that allocate pages frequently, the latency of this page initialization degrades application performance. To shorten it, the legacy Linux kernel preferentially allocates cache-hot pages, i.e., pages already mapped into a CPU cache, which can therefore be accessed quickly during initialization. However, the legacy kernel considers only cache-hot pages mapped to each CPU's private cache and does not exploit cache-hot pages mapped to the shared cache. As a result, when no private-cache-hot page is available, the kernel allocates a cache-cold page even though shared-cache-hot pages exist in the system. This thesis proposes an improved cache-hot page allocation technique that considers cache-hot pages mapped to both the private and the shared caches, raising the probability that an application is allocated a cache-hot page above that of the legacy kernel.
    When a page allocation request occurs, the proposed method first allocates a cache-hot page mapped to the requesting core's private cache. If no such page exists, it allocates a cache-hot page mapped to the shared cache instead of falling back to a cache-cold page. This raises the probability of allocating a cache-hot page and consequently reduces the average page initialization latency. We implemented the proposed method in a desktop environment based on Linux kernel 4.18.10; experimental results show that the average page initialization latency is reduced by about 7% compared to the legacy Linux kernel.

    Table of contents: Chapter 1 Introduction (1.1 Background, 1.2 Research Content, 1.3 Thesis Organization); Chapter 2 Physical Memory Management in Linux (2.1 Physical Memory Management Data Structures, 2.2 Page Allocation and Free Handlers); Chapter 3 Problem Definition (3.1 Problem Description, 3.2 Solution Overview); Chapter 4 Cache-hot Page Allocation Using Multi-level Lists (4.1 Page Free, 4.2 Page Allocation); Chapter 5 Experimental Validation (5.1 Measuring Average Page Initialization Latency, 5.2 Measuring Overhead); Chapter 6 Related Work; Chapter 7 Conclusion; References; Abstract.
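    The allocation order described in the abstract is easy to picture as code. Below is a minimal user-space sketch of the idea, not the thesis' actual kernel patch: the names (struct hot_lists, alloc_page_hot, the three lists) are illustrative assumptions, and a real implementation would operate on the kernel's per-CPU page free lists.

/* Sketch of the multi-level free-list policy: each CPU keeps a list of
 * pages presumed hot in its private cache, plus a list of pages presumed
 * hot in the shared last-level cache. Allocation tries the private list
 * first, then the shared list, and only then falls back to cold pages. */
#include <stdio.h>
#include <stddef.h>

struct page_node {
    struct page_node *next;
    void *page;                     /* backing page frame */
};

struct hot_lists {
    struct page_node *private_hot;  /* hot in this core's private cache */
    struct page_node *shared_hot;   /* hot in the shared cache */
    struct page_node *cold;         /* everything else */
};

static void *pop(struct page_node **head)
{
    struct page_node *n = *head;
    if (!n)
        return NULL;
    *head = n->next;
    return n->page;
}

/* Allocate one page, preferring cache-hot candidates so the mandatory
 * zero-fill touches cache lines that are already resident. */
void *alloc_page_hot(struct hot_lists *pcp)
{
    void *page;

    if ((page = pop(&pcp->private_hot)))    /* fastest: private cache hit */
        return page;
    if ((page = pop(&pcp->shared_hot)))     /* added step: shared cache hit */
        return page;
    return pop(&pcp->cold);                 /* legacy fallback: cold page */
}

int main(void)
{
    static char frame[4096];
    struct page_node hot = { NULL, frame };
    struct hot_lists pcp = { &hot, NULL, NULL };

    /* First request drains the private-hot list; the second finds every
     * list empty and returns NULL (a real kernel would refill from the
     * buddy allocator instead). */
    printf("hot:  %p\n", alloc_page_hot(&pcp));
    printf("cold: %p\n", alloc_page_hot(&pcp));
    return 0;
}

    The ordering reflects the latency argument made in the abstract: initializing a page whose lines sit in the private cache is cheapest, a shared-cache hit still beats DRAM, and a cold page pays the full memory latency.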

    Introducing kernel-level page reuse for high performance computing

    No full text
    International audience

    Methods for efficient resource utilization in statistical machine learning algorithms

    Get PDF
    In recent years, statistical machine learning has emerged as a key technique for tackling problems that elude a classic algorithmic approach. One such problem, with a major impact on human life, is the analysis of complex biomedical data. Solving this problem fast and efficiently is of major importance, as it enables, e.g., predicting the efficacy of different drugs for therapy selection. While achieving the highest possible prediction quality appears desirable, doing so is often simply infeasible due to resource constraints. Statistical learning algorithms for predicting the health status of a patient, or for finding the best algorithm configuration for the prediction, require an excessively high amount of resources. Furthermore, these algorithms are often implemented with no awareness of the underlying system architecture, which leads to sub-optimal resource utilization. This thesis presents methods for efficient resource utilization in statistical learning applications. The goal is to reduce the resource demands of these algorithms to meet a given time budget while preserving prediction quality. As a first step, the resource consumption characteristics of learning algorithms are analyzed, as well as their scheduling on the underlying parallel architectures, in order to develop optimizations that let these algorithms scale to larger problem sizes. For this purpose, new profiling mechanisms are incorporated into a holistic profiling framework. The results show that one major contributor to the resource issues is memory consumption. To overcome this obstacle, a new optimization based on dynamic sharing of memory is developed; it speeds up computation by several orders of magnitude when available main memory is the bottleneck and the system would otherwise swap. One important technique for automated parameter tuning of learning algorithms is model-based optimization: within a huge search space, algorithm configurations are evaluated to find the configuration with the best prediction quality. An important step towards better managing this search space is to parallelize the search itself. However, high runtime variance within the configuration space can cause inefficient resource utilization. To address this, new resource-aware scheduling strategies are developed that efficiently map evaluations of configurations onto the parallel architecture according to their resource demands. In contrast to classical scheduling problems, the new scheduler interacts with the configuration proposal mechanism to select configurations with suitable resource demands. With these strategies, it becomes possible to exploit the full potential of parallel architectures. Compared to established parallel execution models, the results show that the new approach enables model-based optimization to converge to the optimum faster within a given time budget.
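    The interplay between scheduling and configuration proposal described above can be illustrated with a small sketch. This is an assumption-laden toy, not the thesis' actual interface: the predicted memory demands, the worst-fit worker choice in pick_worker, and the deferral of oversized candidates are illustrative stand-ins for the real resource-aware strategies.

/* Resource-aware dispatch sketch: each candidate configuration carries a
 * predicted memory demand, and the scheduler only dispatches it to a
 * worker whose remaining capacity covers that demand; otherwise the
 * candidate is deferred, signalling the proposal mechanism to offer a
 * cheaper one. */
#include <stdio.h>

#define NWORKERS 4

struct candidate {
    int id;
    double mem_gb;      /* predicted resource demand (assumed) */
    double quality;     /* surrogate-model score (assumed) */
};

/* Worst-fit choice: pick the fitting worker with the most free memory,
 * so large candidates still find room later. Returns -1 if none fits. */
static int pick_worker(const double free_gb[], double need)
{
    int best = -1;
    for (int w = 0; w < NWORKERS; w++)
        if (free_gb[w] >= need && (best < 0 || free_gb[w] > free_gb[best]))
            best = w;
    return best;
}

int main(void)
{
    double free_gb[NWORKERS] = { 16, 16, 8, 8 };
    /* Candidates in the order the proposal mechanism might rank them. */
    struct candidate cand[] = {
        { 1, 12.0, 0.91 }, { 2, 20.0, 0.90 }, { 3, 6.0, 0.88 },
    };

    for (size_t i = 0; i < sizeof cand / sizeof cand[0]; i++) {
        int w = pick_worker(free_gb, cand[i].mem_gb);
        if (w < 0) {    /* no worker fits: defer, ask for a cheaper config */
            printf("config %d (%.0f GB) deferred\n", cand[i].id, cand[i].mem_gb);
            continue;
        }
        free_gb[w] -= cand[i].mem_gb;
        printf("config %d -> worker %d\n", cand[i].id, w);
    }
    return 0;
}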