Reducing Application Launch Time by Using Execution-time Prefetching Techniques
Thesis (Ph.D.) -- Seoul National University Graduate School, Dept. of Electrical and Computer Engineering, February 2013.
Recently, as mobile devices have become widely used, application responsiveness has become critical to the user experience. Among many metrics, application launch performance is one of the key indices for evaluating user-perceived system performance. However, users suffer from long application launch delays even when they use flash-based disks as their system disks. This is mainly because system resources are used in a serialized manner during the application launch process, whereas processors and disk drives improve their performance by exploiting parallelism.
To optimize launch performance, this dissertation presents a new execution-time prefetching technique, which accurately monitors the blocks accessed during the first launch of each application and prefetches them into the disk cache, in an optimized order, at subsequent launches. The key idea is to overlap processor computation with disk I/O while effectively exploiting the internal parallelism of disk drives. To optimize prefetch performance, we employ various merging, logical-block-number sorting, and prefetch-level dependency-resolution schemes.
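A minimal sketch of two of the scheduling steps named above, logical-block-number sorting and distance-based merging, assuming requests are recorded as (LBN, block-count) pairs at first launch; the merge-gap threshold and request format are illustrative assumptions, not the dissertation's actual parameters:

```python
def schedule_prefetch(requests, merge_gap=8):
    """Sort recorded block requests by logical block number (LBN) and
    merge requests whose gap is at most `merge_gap` blocks, so the disk
    can serve each merged run as one sequential read."""
    # requests: list of (lbn, count) pairs captured at first launch
    ordered = sorted(requests)  # logical-block-number sort
    merged = []
    for lbn, count in ordered:
        if merged and lbn - (merged[-1][0] + merged[-1][1]) <= merge_gap:
            # Distance-based merge: extend the previous request, reading
            # the small gap in between as well.
            prev_lbn, _ = merged[-1]
            merged[-1] = (prev_lbn, lbn + count - prev_lbn)
        else:
            merged.append((lbn, count))
    return merged
```

For example, `schedule_prefetch([(100, 4), (10, 2), (106, 4)])` sorts the three requests and merges the two near LBN 100 into a single ten-block read, yielding `[(10, 2), (100, 10)]`.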
We implemented the proposed prefetcher on Linux kernel 3.5.0 and evaluated it by launching a set of widely-used applications. Experiments demonstrate an average 52% reduction of application launch time on an HDD-based system and a 34.1% reduction on an SSD-based system, compared to cold-start performance. We also achieved an average 28.1~31.4% reduction on the mobile MeeGo platform using an SSD as the system disk, and, after porting the proposed prefetcher to the Android platform, an average 12.8% reduction for widely-used Android applications on a Galaxy Nexus phone. In addition, we implemented the proposed prefetcher at the user level, which does not require kernel modification; it demonstrated an average 21.7~28.5% reduction of application launch time on SSDs.
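A user-level prefetcher can warm the page cache through standard system calls alone. One plausible mechanism, an assumption here rather than the dissertation's exact implementation, is `posix_fadvise` with `POSIX_FADV_WILLNEED`, which asks the kernel to read the listed extents into the page cache asynchronously:

```python
import os

def prefetch_files(extents):
    """Warm the page cache for each (path, offset, length) extent.
    POSIX_FADV_WILLNEED returns immediately; the kernel issues the
    reads in the background, overlapping them with application startup."""
    for path, offset, length in extents:
        fd = os.open(path, os.O_RDONLY)
        try:
            os.posix_fadvise(fd, offset, length, os.POSIX_FADV_WILLNEED)
        finally:
            os.close(fd)
```

Because the call is advisory and non-blocking, a launcher can invoke it for the whole recorded launch sequence and then start the application immediately, letting the prefetch reads overlap with the application's own computation.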
The proposed scheme incurs little overhead from its implementation and operation in the existing environment. It is expected to contribute significantly to the performance of desktop PCs and smartphones by improving both system responsiveness and user-perceived performance.
Chapter 1 Introduction
1.1 Research Motivation
1.2 Research Contents and Summary
1.3 Organization of the Dissertation
Chapter 2 Background
2.1 General-Purpose Disk Drives
2.1.1 Hard Disk Drives
2.1.2 NAND Flash-Based Solid-State Drives (SSD)
2.1.3 Hybrid Hard Disks
2.2 The Disk I/O Subsystem in Linux
2.2.1 The Linux Disk I/O Stack
2.2.2 The Linux Disk Cache
2.2.3 Types and Characteristics of I/O Schedulers
2.2.4 I/O Plug/Unplug
2.2.5 Analysis of Process Time Saved by Successful Prefetching
2.3 Prior Work on Fast Application Launch
2.3.1 Disk Caching Techniques for Fast Application Launch
2.3.2 Disk Caching Techniques for Fast Response of General Workloads
2.3.3 Other Techniques
Chapter 3 Analysis of Application Launch Behavior
3.1 Launch Scenario
3.2 Analysis of Disk I/O During Launch
3.3 Analysis of Processor and Disk Activity Patterns
Chapter 4 Design, Implementation, and Evaluation of the Kernel-Level Execution-Time Prefetcher
4.1 Overview and Goals of the Execution-Time Prefetcher
4.2 Launch-Sequence Collection
4.3 Prefetch-Sequence Scheduler
4.3.1 Extent-Dependency Analysis
4.3.2 Metadata Shift for Resolving Inter-Block Dependencies
4.3.3 Distance-Based Merging
4.3.4 Distance-Based Merging Across Gaps
4.3.5 Logical-Block-Number Sorting
4.3.6 Plug/Unplug
4.4 Parallelizing Application and Prefetcher Execution
4.4.1 HDD-Based Systems
4.4.2 SSD-Based Systems
4.4.3 Multi-Disk Systems
4.5 Managing Launch-Sequence Validity
4.6 OS Boot Prefetcher
4.7 Time-Limited Prefetcher Interface
4.8 Experimental Setup
4.9 Application Launch Time
4.10 Runtime and Storage Overhead
4.11 Stability of the Kernel-Level Prefetcher
Chapter 5 Design, Implementation, and Evaluation of the User-Level Execution-Time Prefetcher
5.1 Overview and Architecture of the User-Level Prefetcher
5.2 Generating an Application's Prefetch Sequence
5.2.1 Collecting Disk I/O Information
5.2.2 Extracting the Launch Sequence
5.2.3 Scheduling the Prefetch Sequence
5.3 Block-to-File Map
5.3.1 Overview of the Block-to-File Map
5.3.2 Collecting the List of Files Related to the Launch Sequence
5.4 Generating the User-Level Prefetcher Program
5.5 Application Launch Manager
5.6 Advantages and Disadvantages of the User-Level Prefetcher
5.7 Experimental Setup
5.8 Application Launch Time
5.9 Runtime and Storage Overhead
Chapter 6 Conclusion and Future Work
6.1 Conclusion
6.2 Future Work
References
Abstract
Analysis and optimization of storage IO in distributed and massive parallel high performance systems
Although Moore's law ensures continued growth in computational power, IO performance appears to be left behind, which limits the benefits gained from that growth: processors must idle for long periods waiting for IO. Another factor that slows IO is the increased parallelism required by today's computations. Most modern processing units are built from many weak cores, and since IO exhibits low parallelism, these weak cores decrease system performance. Furthermore, to avoid the added delay of external storage, future High Performance Computing (HPC) systems will employ Active Storage Fabrics (ASF), which embed storage directly into large HPC systems. The IO performance of a single HPC node will therefore require optimization, which can only be achieved with a full understanding of the IO stack's operation. Analysis of the IO stack under the new conditions of multi-core and massive parallelism leads to some important conclusions: the IO stack is generally built for single devices and is heavily optimized for HDDs. Two main optimization approaches are taken. The first is optimizing the IO stack to accommodate parallelism. The analysis shows that a design based on several storage devices operating in parallel is the best approach to parallelism in the IO stack. A parallel IO device with a unified storage space is introduced; the unified storage space allows optimal division of work among resources for both reads and writes. The design also avoids the overhead of large parallel file systems by making only limited changes to a conventional file system. Furthermore, the design leaves the interface of the IO stack unchanged, an important restriction that avoids application rewrites. An implementation of this design is shown to increase performance. The second approach is optimizing the IO stack for Solid State Drives (SSD). Optimization for the new storage technology demanded further analysis.
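The abstract does not describe the layout of the unified storage space; as a hedged illustration only, a RAID-0-style mapping in which fixed-size stripes rotate across the parallel devices could look like this (the stripe size, device count, and function name are assumptions):

```python
def stripe_map(lbn, num_devices, stripe_blocks=64):
    """Map a block number in the unified storage space to a
    (device index, block number on that device) pair.  Consecutive
    stripes of `stripe_blocks` blocks rotate across the devices, so a
    large sequential access is served by all devices in parallel."""
    stripe, offset = divmod(lbn, stripe_blocks)
    device = stripe % num_devices
    device_lbn = (stripe // num_devices) * stripe_blocks + offset
    return device, device_lbn
```

With four devices and 64-block stripes, blocks 0-63 land on device 0, blocks 64-127 on device 1, and so on, so a 256-block sequential read keeps all four devices busy at once.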
The analysis shows that the IO stack requires revision on many levels to accommodate SSDs optimally. File-system preallocation of free blocks is used as an example: preallocation is important for data contiguity on HDDs, but given the fast random access of SSDs it represents pure overhead. After careful analysis of the block allocation algorithms, preallocation is removed. As an additional optimization approach, IO compression is suggested for future work: it can utilize cores that sit idle during an IO transaction to perform on-the-fly compression of IO data.
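The suggested future-work direction, using idle cores for on-the-fly IO compression, can be sketched with a thread pool; the worker count and the choice of zlib are illustrative assumptions, not part of the proposal:

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

def compress_io_buffers(buffers, workers=4):
    """Compress outgoing IO buffers in parallel before they reach the
    storage device.  CPython's zlib releases the GIL while compressing
    large buffers, so the work spreads across otherwise idle cores."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(zlib.compress, buffers))
```

Whether this pays off depends on the workload: compression shrinks the bytes transferred to the device at the cost of CPU cycles, which is exactly the trade the abstract proposes when cores would otherwise stall on IO.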