72 research outputs found
SeLeP: Learning Based Semantic Prefetching for Exploratory Database Workloads
Prefetching is a crucial technique employed in traditional databases to
enhance interactivity, particularly in the context of data exploration. Data
exploration is a query processing paradigm in which users search for insights
buried in the data, often without knowing exactly what they are looking for. Data
exploration tools face multiple challenges, such as the need for
interactivity with no a priori knowledge available to help tune the system.
State-of-the-art prefetchers are designed specifically for
navigational workloads only, where the number of possible actions is limited.
The prefetchers that work with SQL-based workloads, on the other hand, mainly
rely on logical data addresses rather than data semantics. They fail to
predict complex access patterns in cases where the database size is
substantial, resulting in an extensive address space, or when there is frequent
co-accessing of data. In this paper, we propose SeLeP, a semantic prefetcher
that makes prefetching decisions for both types of workloads, based on the
encoding of the data values contained inside the accessed blocks. Following the
popular path of using machine learning approaches to automatically learn the
hidden patterns, we formulate the prefetching task as a time-series forecasting
problem and use an encoder-decoder LSTM architecture to learn the data access
pattern. Our extensive experiments, across real-life exploratory workloads,
demonstrate that SeLeP improves the hit ratio by up to 40% and reduces I/O time by up
to 45% compared to the state of the art, attaining an impressive 95% hit ratio and
80% I/O reduction on average.
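As a rough illustration of the approach the abstract describes, the sketch below shows an encoder-decoder LSTM that maps a window of block-content encodings to prefetch scores over candidate blocks. It is not SeLeP's actual model; the dimensions, the PrefetchSeq2Seq class, and the thresholding step are all assumptions for the example.

```python
# Hypothetical sketch of a semantic prefetcher's prediction core: an
# encoder-decoder LSTM over block-content encodings. Names and sizes are
# illustrative, not SeLeP's.
import torch
import torch.nn as nn

class PrefetchSeq2Seq(nn.Module):
    def __init__(self, enc_dim=128, hidden=256, n_blocks=4096, horizon=4):
        super().__init__()
        self.horizon = horizon
        self.encoder = nn.LSTM(enc_dim, hidden, batch_first=True)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_blocks)  # one score per candidate block

    def forward(self, x):
        # x: (batch, window, enc_dim) -- encodings of recently accessed blocks
        _, (h, c) = self.encoder(x)
        # Repeat the final encoder state as the decoder input at each step.
        dec_in = h[-1].unsqueeze(1).repeat(1, self.horizon, 1)
        out, _ = self.decoder(dec_in, (h, c))
        return torch.sigmoid(self.head(out))  # prefetch probabilities

model = PrefetchSeq2Seq()
window = torch.randn(2, 16, 128)            # 16 past accesses, batch of 2
probs = model(window)                       # (2, 4, 4096)
prefetch = (probs[0, 0] > 0.5).nonzero()    # blocks to fetch at the next step
```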
Securing Cloud File Systems using Shielded Execution
Cloud file systems offer organizations a scalable and reliable file storage
solution. However, cloud file systems have become prime targets for
adversaries, and traditional designs are not equipped to protect organizations
against the myriad of attacks that may be initiated by a malicious cloud
provider, co-tenant, or end-client. Recently proposed designs leveraging
cryptographic techniques and trusted execution environments (TEEs) still force
organizations to make undesirable trade-offs, consequently leading to either
security, functional, or performance limitations. In this paper, we introduce
TFS, a cloud file system that leverages the security capabilities provided by
TEEs to bootstrap new security protocols that meet real-world security,
functional, and performance requirements. Through extensive security and
performance analyses, we show that TFS can ensure stronger security guarantees
while still providing practical utility and performance w.r.t. state-of-the-art
systems; compared to the widely-used NFS, TFS achieves up to 2.1X speedups
across micro-benchmarks and incurs <1X overhead for most macro-benchmark
workloads. TFS demonstrates that organizations need not sacrifice file system
security to embrace the functional and performance advantages of outsourcing.
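The abstract does not detail TFS's protocols, but the following minimal sketch illustrates one building block that TEE-based storage designs commonly rely on: binding a MAC to each file block under a key that, in a real design, would be sealed inside the enclave, so a malicious provider cannot tamper with or replay blocks. All names and the block layout are invented for illustration.

```python
# Illustrative sketch (not TFS's actual protocol): per-block integrity
# protection keyed by a secret that, in a TEE design, never leaves the enclave.
import hmac, hashlib, os

ENCLAVE_KEY = os.urandom(32)   # stand-in for a key sealed inside the enclave
BLOCK = 4096

def protect(block_id: int, data: bytes) -> bytes:
    # Bind the MAC to the block's position to prevent swapping/replaying blocks.
    tag = hmac.new(ENCLAVE_KEY, block_id.to_bytes(8, "big") + data,
                   hashlib.sha256).digest()
    return tag + data

def verify(block_id: int, stored: bytes) -> bytes:
    tag, data = stored[:32], stored[32:]
    expect = hmac.new(ENCLAVE_KEY, block_id.to_bytes(8, "big") + data,
                      hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expect):
        raise IOError(f"block {block_id}: integrity check failed")
    return data

stored = protect(7, b"x" * BLOCK)
assert verify(7, stored) == b"x" * BLOCK
```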
TACKLING PERFORMANCE AND SECURITY ISSUES FOR CLOUD STORAGE SYSTEMS
Building data-intensive applications and emerging computing paradigms (e.g., machine learning (ML), artificial intelligence (AI), and the Internet of Things (IoT)) in cloud computing environments is becoming the norm, given the many advantages in scalability, reliability, security, and performance. However, under rapid changes in applications, system middleware, and underlying storage devices, service providers face new challenges in delivering performance and security isolation in the context of resources shared among multiple tenants. The gap between the decades-old storage abstraction and modern storage devices keeps widening, calling for software/hardware co-designs to achieve more effective performance and security protocols. This dissertation rethinks the storage subsystem from the device level to the system level and proposes new designs at different levels to tackle performance and security issues for cloud storage systems.
In the first part, we present an event-based SSD (solid-state drive) simulator that models modern protocols, firmware, and storage backends in detail. The proposed simulator can capture the nuances of SSD internal states under various I/O workloads, which helps researchers understand the impact of various SSD designs and workload characteristics on end-to-end performance.
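A minimal sketch of what an event-driven simulator core of this kind looks like is shown below; the channel count, latencies, and page-to-channel mapping are invented parameters, not the proposed simulator's model.

```python
# Toy event-driven SSD simulator skeleton: a min-heap of timestamped events
# and per-channel busy times. Parameters are made up for the example.
import heapq

class SSDSim:
    def __init__(self, channels=8, read_us=50, program_us=600):
        self.now = 0
        self.events = []            # (time, seq, callback) min-heap
        self.seq = 0
        self.channel_free = [0] * channels
        self.read_us, self.program_us = read_us, program_us

    def at(self, t, cb):
        heapq.heappush(self.events, (t, self.seq, cb)); self.seq += 1

    def submit(self, op, page):
        ch = page % len(self.channel_free)             # static page-to-channel map
        start = max(self.now, self.channel_free[ch])   # wait if channel busy
        dur = self.read_us if op == "read" else self.program_us
        self.channel_free[ch] = start + dur
        self.at(start + dur, lambda: print(f"{op} page {page} done @ {start + dur}us"))

    def run(self):
        while self.events:
            self.now, _, cb = heapq.heappop(self.events)
            cb()

sim = SSDSim()
for p in (0, 8, 1):                # pages 0 and 8 collide on channel 0
    sim.submit("read", p)
sim.run()
```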
In the second part, we study the security challenges of shared in-storage computing infrastructures. Many cloud providers offer isolation at multiple levels to secure data and instances; however, security measures in emerging in-storage computing infrastructures have not been studied. We first investigate the attacks that could be conducted by offloaded in-storage programs in a multi-tenant cloud environment. To defend against these attacks, we build a lightweight trusted execution environment, IceClave, to enable security isolation between in-storage programs and internal flash management functions. We show that while enforcing security isolation in the SSD controller with minimal hardware cost, IceClave retains the performance benefit of in-storage computing, delivering up to 2.4x better performance than the conventional host-based trusted computing approach.
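The sketch below is a toy illustration of the isolation property IceClave is described as enforcing: an offloaded program may touch only its own region of the device's address space, never the flash translation layer's metadata. The region boundaries and class names are invented.

```python
# Toy model of in-storage isolation: offloaded programs are confined to their
# own address region and blocked from the FTL's mapping-table range.
FTL_REGION = range(0x0000, 0x1000)        # flash translation layer metadata

class IsolatedProgram:
    def __init__(self, name, region):
        self.name, self.region = name, region

    def load(self, addr, memory):
        if addr in FTL_REGION or addr not in self.region:
            raise PermissionError(f"{self.name}: access to {addr:#x} blocked")
        return memory.get(addr, 0)

mem = {0x2000: 42}
prog = IsolatedProgram("tenant-A", range(0x2000, 0x3000))
print(prog.load(0x2000, mem))    # allowed: inside tenant-A's region
try:
    prog.load(0x0010, mem)       # blocked: FTL mapping table
except PermissionError as e:
    print(e)
```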
In the third part, we investigate the performance interference problem caused by other tenants' I/O flows. We demonstrate that I/O resource sharing can often lead to performance degradation and instability. The block device abstraction fails to expose SSD parallelism or convey application requirements. To this end, we propose a software/hardware co-design that enforces performance isolation by bridging this semantic gap. Our design significantly improves QoS (quality of service) by reducing throughput penalties and tail latency spikes.
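As a hedged illustration of per-tenant I/O performance isolation, the sketch below implements a simple token bucket that admits or defers each tenant's requests; the actual co-design operates in hardware and firmware, and these parameters are invented.

```python
# Per-tenant token-bucket admission control, a common building block for
# I/O performance isolation. Rates and burst sizes are illustrative.
import time

class TokenBucket:
    def __init__(self, iops, burst):
        self.rate, self.capacity = iops, burst
        self.tokens, self.last = burst, time.monotonic()

    def admit(self, n=1):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False            # caller queues or throttles the I/O

tenants = {"A": TokenBucket(iops=1000, burst=100),
           "B": TokenBucket(iops=200, burst=20)}
print(tenants["A"].admit(), tenants["B"].admit())
```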
Lastly, we explore more effective I/O control to address contention in the storage software stack. We illustrate that the state-of-the-art resource control mechanism, Linux cgroups, is insufficient for controlling I/O resources. Inappropriate cgroup configurations may even hurt the performance of co-located workloads in memory-intensive scenarios. We add kernel support for limiting page cache usage per cgroup and achieving I/O proportionality.
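For concreteness, the example below first configures cgroup v2's stock io.max knob from user space, then writes to a hypothetical per-cgroup page-cache limit file in the spirit of the proposed kernel support (no such file exists upstream). Paths assume a mounted unified hierarchy and root privileges.

```python
# Configuring cgroup v2 I/O limits from Python. Requires root and the io
# controller enabled on the unified hierarchy.
from pathlib import Path

cg = Path("/sys/fs/cgroup/tenant-a")
cg.mkdir(exist_ok=True)

# Stock cgroup v2 knob: cap tenant-a at 100 MB/s reads on device 8:0.
(cg / "io.max").write_text("8:0 rbps=104857600 wbps=max riops=max wiops=max\n")

# Hypothetical knob in the spirit of the proposal: bound page cache usage
# per cgroup so one tenant's cache footprint cannot evict another's.
(cg / "memory.pagecache.max").write_text(str(512 * 1024 * 1024))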
Next-Gen Hybrid Memory and Interconnect System Architectures
This dissertation mainly addresses two problems that emerge with the 'big data' trend: the increasing memory capacity demands of mobile computing platforms, and the need for interconnection networks with higher bandwidth and energy efficiency in HPC/data centers. Mobile applications have rapidly growing memory footprints, posing a great challenge for memory system design. Insufficient DRAM main memory incurs frequent data swaps between memory and storage, a process that hurts performance, consumes energy, and deteriorates the write endurance of typical flash storage devices. Alternatively, a larger DRAM has higher leakage power and drains the battery faster. Further, DRAM scaling trends make further growth of DRAM in the mobile space cost-prohibitive. Emerging non-volatile memory (NVM) has the potential to alleviate these issues due to its higher capacity per cost than DRAM and minimal static power. Recently, a wide spectrum of NVM technologies has emerged, including phase-change memory (PCM), memristors, and 3D XPoint. Despite these advantages, NVM has longer access latency than DRAM, and NVM writes can incur higher latencies and wear costs. Integrating these new memory technologies into the memory hierarchy therefore requires a fundamental rearchitecting of traditional system designs. In this work, we propose a hardware-accelerated memory management unit (HMMU) that manages both types of memory in a flat address space. We design a set of data placement and data migration policies within this memory manager to exploit the advantages of each memory technology. By augmenting the system with this HMMU, we reduce overall memory latency while also reducing energy consumption and writes to the NVM. Experimental results show that our design achieves a 39% reduction in energy consumption with only a 12% performance degradation versus an all-DRAM baseline that is likely untenable in the future.
After developing pure-hardware memory management for data migration between DRAM and NVM, we consider integrating information from the software stack into our system. Such software-level information, such as programmers' hints or application profiling results, reveals longer-term memory access patterns and data object properties, but it comes at the cost of high software latency. Hardware approaches can avoid the latencies of software kernel processes related to page migration, such as page fault handling; however, hardware's vision is limited to a short time window, as it can only monitor and analyze recently received memory requests. Ideally, the execution-time advantages of pure hardware approaches should be combined with data object properties in a global scope. Further, application programmers' hints could guide data placement at allocation time, so that data objects with similar properties are grouped together to reduce unnecessary page migrations. In this work, we propose such a hardware-software cooperative approach. In particular, we build a heap memory manager that allows the programmer to choose the memory type for each data object allocation. These annotations are relayed to the hardware memory manager as hints for data placement and migration decisions. Meanwhile, the hardware memory manager remains capable of capturing per-application phase changes and retains flexibility in its data redistribution. The integration of the two mechanisms leads to results that are optimal from both long-term and short-term perspectives. Experimental results show that our design shortens overall memory latency while also reducing energy consumption and writes to the NVM versus prior approaches, achieving a 40% reduction in energy consumption with only a 16% performance degradation versus the all-DRAM memory system.
In the HPC/data center domain, a primary problem is how to scale up the interconnection network to service an ever-increasing number of nodes. Photonic links, with their high bandwidth and low signal loss over long propagation distances, are a promising technology for solving this problem. The higher bandwidth allows a router to connect more nodes, while long-distance connections make it possible to implement more advanced topologies, such as the flattened butterfly. Both factors help reduce the average number of hops between nodes across the network. Such a high-radix, low-hop network is essential for provisioning low-latency communication in massive-scale systems. However, due to their different physical and device properties, interconnection networks must be redesigned to adopt photonic links. We first present the basic formulas and design flow for interconnection networks and introduce a highly efficient event-driven simulator. We then conduct a series of experiments to explore the design space and give a quantitative comparison between interconnection networks built from purely electrical links and those with a hybrid electronic/photonic design.
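A toy model of the placement-and-migration policy idea is sketched below: pages are allocated cold in NVM, access counts are tracked per epoch, and the hottest pages are promoted to DRAM at each rebalance. Capacity, thresholds, and names are invented; the real HMMU operates in hardware on a flat address space.

```python
# Toy hot/cold page placement across a hybrid DRAM + NVM memory. Hot pages
# go to DRAM; cold pages stay in NVM to save energy and NVM write wear.
DRAM_CAPACITY = 4          # pages, deliberately tiny for the example

class HybridMemoryManager:
    def __init__(self):
        self.location = {}          # page -> "DRAM" | "NVM"
        self.hotness = {}           # page -> access count in current epoch

    def access(self, page):
        self.location.setdefault(page, "NVM")    # allocate cold, in NVM
        self.hotness[page] = self.hotness.get(page, 0) + 1

    def rebalance(self):
        # Promote the hottest pages into DRAM; everything else lives in NVM.
        ranked = sorted(self.hotness, key=self.hotness.get, reverse=True)
        for i, page in enumerate(ranked):
            self.location[page] = "DRAM" if i < DRAM_CAPACITY else "NVM"
        self.hotness.clear()        # start a new observation epoch

mmu = HybridMemoryManager()
for page in [1, 1, 1, 2, 3, 3, 4, 5, 6, 1]:
    mmu.access(page)
mmu.rebalance()
print(mmu.location)   # the four hottest pages (1, 3, 2, 4) end up in DRAM
```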
Optimization of a data buffer for a hardwired prefetcher in a PCRAM controller buffer
Master's thesis, Department of Electrical and Information Engineering, College of Engineering, Seoul National University Graduate School, February 2021. Advisor: 이혁재.
In this thesis, we study a prefetcher structure to improve the cache buffer performance of PCRAM-based storage. We optimize the NVM storage cache buffer by proposing a lightweight hardware prefetching structure for a hard-wired PCRAM controller, rather than a general software prefetcher as used for HDDs and NAND SSDs.
To reduce the area overhead of the history buffer and the hardware complexity of the prefetching algorithm, which are the main drawbacks of implementing a hardware prefetcher, the history buffer is made lightweight by a filter built from application ID polling and a sequential address boundary detector. The application ID poller selects the dominant (populated) application ID in each unit time interval and forwards only that application's cache miss addresses to the sequential boundary detector. The sequential boundary detector detects the sequentiality of the miss addresses, records it in the history buffer, and generates prefetch requests of the appropriate type based on this history.
The average latency of the controller was measured on real-life storage workloads; the read latency improved by about 14%, and the cache size was optimized such that only 50% of the cache buffer size is needed for the same performance.
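The following rough sketch, with invented thresholds, illustrates the filter pipeline the abstract describes: application ID polling keeps only the dominant application's miss stream, and a sequential boundary detector issues prefetch requests once a run is confirmed.

```python
# Toy rendition of the application-ID-polling + sequential-boundary-detector
# filter. Thresholds and the miss-stream format are illustrative.
from collections import Counter

class SequentialBoundaryDetector:
    def __init__(self, run_threshold=3, prefetch_depth=4):
        self.last_addr, self.run = None, 0
        self.run_threshold, self.depth = run_threshold, prefetch_depth

    def on_miss(self, addr):
        self.run = self.run + 1 if self.last_addr == addr - 1 else 1
        self.last_addr = addr
        if self.run >= self.run_threshold:        # sequential run confirmed
            return [addr + i for i in range(1, self.depth + 1)]
        return []

def dominant_app(misses):
    # "Application ID polling": keep only misses from the busiest application.
    top, _ = Counter(app for app, _ in misses).most_common(1)[0]
    return [(a, addr) for a, addr in misses if a == top]

misses = [(1, 100), (2, 7), (1, 101), (1, 102), (2, 9)]
det = SequentialBoundaryDetector()
for _, addr in dominant_app(misses):
    req = det.on_miss(addr)
    if req:
        print("prefetch", req)     # -> prefetch [103, 104, 105, 106]
```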
Progger 3: A low-overhead, tamper-proof provenance system
Data provenance, which describes how data is accessed and used since the time it is created, is a valuable resource with a wide range of uses. It can be used simply to know who has accessed one's data, or be used in more complex scenarios such as detecting malware. One method for collecting data provenance is to observe system calls. This thesis presents Progger 3, a system that observes system calls on Linux in order to collect data provenance. There are several existing provenance systems that observe system calls, but they have limitations regarding security, efficiency, and usability. Progger 3 remedies many of these limitations. As a result, Progger 3 is a working implementation of a provenance system that can observe any system call, guarantee tamper-proof provenance collection as long as the kernel on the client is not compromised, and transfer the provenance to other systems with confidentiality and integrity, all with a relatively low performance overhead
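Progger 3 hooks system calls inside the kernel, but the user-space sketch below conveys the same basic idea: observe a process's file-related system calls and turn them into provenance records. It shells out to the real strace tool; the record format is invented for illustration.

```python
# User-space illustration of syscall-based provenance collection: run a
# command under strace and convert its file-related syscalls into records.
import subprocess, datetime, re

def collect_provenance(cmd):
    # -f follows children; -e trace=%file restricts to file-handling syscalls.
    trace = subprocess.run(
        ["strace", "-f", "-e", "trace=%file"] + cmd,
        stderr=subprocess.PIPE, text=True)
    records = []
    for line in trace.stderr.splitlines():
        m = re.search(r'(\w+)\(.*?"([^"]+)"', line)   # syscall name + first path
        if m:
            records.append({"time": datetime.datetime.now().isoformat(),
                            "syscall": m.group(1), "path": m.group(2)})
    return records

for rec in collect_provenance(["cat", "/etc/hostname"])[:5]:
    print(rec)
```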
Memory Management for Emerging Memory Technologies
The Memory Wall, or the gap between CPU speed and main memory latency, is ever increasing. The latency of Dynamic Random-Access Memory (DRAM) is now of the order of hundreds of CPU cycles. Additionally, the DRAM main memory is experiencing power, performance and capacity constraints that limit process technology scaling. On the other hand, the workloads running on such systems are themselves changing due to virtualization and cloud computing demanding more performance of the data centers. Not only do these workloads have larger working set sizes, but they are also changing the way memory gets used, resulting in higher sharing and increased bandwidth demands. New Non-Volatile Memory technologies (NVM) are emerging as an answer to the current main memory issues.
This thesis looks at memory management issues as the emerging memory technologies get integrated into the memory hierarchy. We consider the problems at various levels in the memory hierarchy, including sharing of CPU LLC, traffic management to future non-volatile memories behind the LLC, and extending main memory through the employment of NVM.
The first solution we propose is “Adaptive Replacement and Insertion” (ARI), an adaptive approach to last-level CPU cache management that optimizes the cache miss rate and writeback rate simultaneously. Our specific focus is to reduce writebacks as much as possible while maintaining or improving the miss rate relative to the conventional LRU replacement policy, with minimal hardware overhead. ARI reduces writebacks on benchmarks from the SPEC2006 suite by 32.9% on average while also decreasing misses by 4.7% on average. In a PCM-based memory system, this decreases energy consumption by 23% compared to LRU and provides a 49% lifetime improvement beyond what is possible with randomized wear-leveling.
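A simplified, illustrative rendition of the writeback-aware replacement idea is given below: when choosing a victim, prefer a clean line near the LRU end so evictions avoid writebacks where possible. The scan window and set size are invented; this is not ARI's actual policy.

```python
# Writeback-aware eviction for one cache set: among the few least-recently
# used lines, evict a clean one if any, to avoid a writeback.
from collections import OrderedDict

class WritebackAwareSet:
    def __init__(self, ways=8, scan=3):
        self.lines = OrderedDict()   # addr -> dirty flag, oldest first
        self.ways, self.scan = ways, scan
        self.writebacks = 0

    def access(self, addr, write=False):
        if addr in self.lines:
            dirty = self.lines.pop(addr) or write
        else:
            dirty = write
            if len(self.lines) >= self.ways:
                self.evict()
        self.lines[addr] = dirty     # move/insert at MRU position

    def evict(self):
        # Scan the `scan` LRU-side lines; evict a clean one if possible.
        candidates = list(self.lines)[: self.scan]
        victim = next((a for a in candidates if not self.lines[a]), candidates[0])
        if self.lines.pop(victim):
            self.writebacks += 1

s = WritebackAwareSet()
for a in range(8):
    s.access(a, write=(a % 2 == 0))   # even addresses are dirty
s.access(99)                 # set is full; a clean LRU-side line is evicted
print(s.writebacks)          # -> 0: the eviction avoided a writeback
```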
Our second proposal is “Variable-Timeslice Thread Scheduling" (VATS), an OS kernel-level approach to CPU cache sharing. With modern, large, last-level caches (LLC), the time to fill the LLC is greater than the OS scheduling window. As a result, when a thread aggressively thrashes the LLC by replacing much of the data in it, another thread may not be able to recover its working set before being rescheduled. We isolate the threads in time by increasing their allotted time quanta, and allowing larger periods of time between interfering threads. Our approach, compared to conventional scheduling, mitigates up to 100% of the performance loss caused by CPU LLC interference. The system throughput is boosted by up to 15%.
As an unconventional approach to utilizing emerging memory technologies, we present a ternary content-addressable memory (TCAM) design built with Flash transistors. TCAM is successfully used in network routing and can also be utilized in OS virtual memory applications. Based on our layout and circuit simulation experiments, we conclude that our FTCAM block achieves an area improvement of 7.9× and a power improvement of 1.64× compared to a CMOS approach.
In order to lower the cost of main memory in systems with huge memory demands, it is becoming practical to extend the DRAM in the system with less-expensive NVMe Flash at a much lower system cost. However, given the relatively high access latency of Flash devices, naively using them as main memory leads to serious performance degradation. We propose OSVPP, a software-only, OS swap-based page prefetching scheme for managing such hybrid DRAM + NVM systems. We show that it is possible to regain about 50% of the performance lost to swapping into the NVM, and thus to enable the use of such hybrid systems for memory-hungry applications, lowering the memory cost while keeping performance comparable to a DRAM-only system.
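The sketch below is a toy model of swap-in prefetching in the spirit of OSVPP: on a major fault, neighbouring swapped-out pages are read from the slow tier in one batch so that likely-next faults hit DRAM. The window size and the two-dict model of the tiers are assumptions.

```python
# Toy swap-in prefetching: on a fault, batch-read the faulting page plus its
# swap-space neighbours from the slow NVM/Flash tier into DRAM.
PREFETCH_WINDOW = 4

def handle_fault(page, dram, nvm):
    # One batched read amortizes the device's high access latency.
    batch = [p for p in range(page, page + PREFETCH_WINDOW) if p in nvm]
    for p in batch:
        dram[p] = nvm.pop(p)
    return batch

dram, nvm = {}, {10: "a", 11: "b", 12: "c", 40: "z"}
print(handle_fault(10, dram, nvm))   # -> [10, 11, 12]: neighbours prefetched
```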
Retro: Targeted Resource Management in Multi-tenant Distributed Systems
In distributed systems shared by multiple tenants, effective resource management is an important prerequisite to providing quality-of-service guarantees. Many systems deployed today lack performance isolation and experience contention, slowdown, and even outages caused by aggressive workloads or by improperly throttled maintenance tasks such as data replication. In this work we present Retro, a resource management framework for shared distributed systems. Retro monitors per-tenant resource usage both within and across distributed systems, and exposes this information to centralized resource management policies through a high-level API. A policy can shape the resources consumed by a tenant using Retro's control points, which enforce sharing and rate-limiting decisions. We demonstrate Retro through three policies providing bottleneck resource fairness, dominant resource fairness, and latency guarantees to high-priority tenants, and evaluate the system across five distributed systems: HBase, YARN, MapReduce, HDFS, and ZooKeeper. Our evaluation shows that Retro has low overhead and achieves the policies' goals, accurately detecting contended resources, throttling tenants responsible for slowdown and overload, and fairly distributing the remaining cluster capacity.
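As an illustration of one policy Retro is evaluated with, the snippet below computes dominant-resource shares and picks the tenant to throttle; the capacities and usage numbers are invented, and Retro's real API and control points are not modeled.

```python
# Dominant resource fairness in miniature: each tenant's dominant share is
# its largest usage fraction across resources; throttle the largest one.
def dominant_share(usage, capacity):
    return max(usage[r] / capacity[r] for r in usage)

capacity = {"cpu": 64, "disk_iops": 10_000, "net_mbps": 4_000}
tenants = {
    "analytics":   {"cpu": 40, "disk_iops": 1_000, "net_mbps": 200},
    "replication": {"cpu": 4,  "disk_iops": 8_000, "net_mbps": 900},
}
shares = {t: dominant_share(u, capacity) for t, u in tenants.items()}
print(shares)                              # analytics 0.625 (cpu), replication 0.8 (disk)
print("throttle:", max(shares, key=shares.get))   # -> replication
```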
Optimization of file caches in virtualized environments
The demand for computing resources has been growing sharply for several decades, whether for social-network applications, high-performance computing, or big data. Companies therefore turn to solutions that outsource their IT services, such as cloud computing. Cloud computing pools computing resources in a datacenter and generally relies on virtualization. Virtualization decomposes a physical machine, called the host, into several guest virtual machines (VMs). Virtualization raises new challenges in operating system design, particularly for memory management. Memory is often used to speed up costly disk accesses by keeping or prefetching disk data in the file cache. Memory, however, is a limited and limiting resource in virtualized environments, affecting the performance of user applications. It is therefore necessary to optimize the use of the file cache in these environments. In this thesis, we propose two orthogonal approaches to improve application performance through better use of the file cache. In virtualized environments, the host and the guests each run their own operating system (OS) and thus each have their own file cache. When a file is read, its data ends up in both caches, yet the two OSes use the same physical memory: cache pages are duplicated. The first contribution addresses this problem with Cacol, a cache eviction policy running in the host that is non-intrusive with respect to the VM. Cacol avoids these duplicated pages, reducing the memory usage of a physical machine. The second approach extends the VMs' file cache by exploiting memory available on other machines in the datacenter. This second contribution, called Infinicache, relies on Infiniband, a high-speed RDMA network, and exploits its ability to read and write remote memory. Implemented directly in the guest cache, Infinicache stores pages evicted from its cache in remote memory. Subsequent accesses to these pages are then faster than accesses to storage disks, thereby improving application performance. Moreover, datacenter-wide memory utilization increases, reducing overall waste.
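A toy model of Cacol's idea appears below: the host cache preferentially evicts pages the guest already holds, so each disk page occupies physical memory only once. The guest_cached oracle stands in for Cacol's non-intrusive inference, and all sizes are invented.

```python
# Duplication-aware host cache eviction: prefer victims that the guest's own
# file cache already holds, so evicting them loses no data from memory.
from collections import OrderedDict

class HostCache:
    def __init__(self, capacity, guest_cached):
        self.pages = OrderedDict()        # page -> data, oldest first
        self.capacity, self.guest_cached = capacity, guest_cached

    def insert(self, page, data):
        if len(self.pages) >= self.capacity:
            # Evict a page duplicated in the guest cache if one exists,
            # otherwise fall back to plain LRU.
            dup = next((p for p in self.pages if p in self.guest_cached), None)
            self.pages.pop(dup if dup is not None else next(iter(self.pages)))
        self.pages[page] = data

guest = {1, 3}                     # pages the guest's own cache already holds
host = HostCache(capacity=2, guest_cached=guest)
for p in (1, 2, 3, 4):
    host.insert(p, b"...")
print(list(host.pages))            # -> [2, 4]: duplicated pages 1 and 3 went first
```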
- …