Nature of System Calls in CPU-centric Computing Paradigm
Modern operating systems are typically POSIX-compliant, with their major system
calls specified decades ago. The next generation of non-volatile memory (NVM)
technologies raises concerns about the efficiency of traditional POSIX-based
systems. As one step toward building high-performance NVM systems, in this
paper we explore the potential dependencies between system call performance and
major hardware components (e.g., CPU, memory, storage) under typical use cases
(e.g., software compilation, installation, web browsing, office suites).
We build histograms for the most frequent and most time-consuming system calls
with the goal of understanding the nature of their distributions on different
platforms. We find a strong dependency between system call performance and the
CPU architecture. In contrast, the type of persistent storage plays a less
important role in affecting performance.
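One way such per-syscall latency histograms could be collected is by parsing `strace -T` output, which appends each call's wall-clock latency in angle brackets. The paper does not specify its measurement tooling, so the sample trace and the power-of-two microsecond bucketing below are purely illustrative:

```python
import re
from collections import defaultdict

# Hypothetical sample of `strace -T` output (latencies in <seconds>).
SAMPLE = """\
openat(AT_FDCWD, "a.c", O_RDONLY) = 3 <0.000021>
read(3, "int main"..., 4096) = 4096 <0.000012>
read(3, ""..., 4096) = 0 <0.000008>
close(3) = 0 <0.000005>
"""

# Matches the syscall name at line start and the <seconds> suffix.
LINE_RE = re.compile(r"^(\w+)\(.*<(\d+\.\d+)>$")

def latency_histogram(trace: str) -> dict:
    """Bucket per-syscall latencies (microseconds) into power-of-two bins."""
    hist = defaultdict(lambda: defaultdict(int))
    for line in trace.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue
        name, secs = m.group(1), float(m.group(2))
        usecs = secs * 1e6
        bucket = 1
        while bucket < usecs:   # smallest power of two >= latency
            bucket *= 2
        hist[name][bucket] += 1
    return {k: dict(v) for k, v in hist.items()}

hist = latency_histogram(SAMPLE)
print(hist["read"])   # two reads land in the 8us and 16us buckets
```

Comparing such per-platform histograms side by side is one way the CPU-architecture dependency described above could be made visible.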
Understanding Persistent-Memory Related Issues in the Linux Kernel
Persistent memory (PM) technologies have inspired a wide range of PM-based
system optimizations. However, building correct PM-based systems is difficult
due to the unique characteristics of PM hardware. To better understand the
challenges as well as the opportunities to address them, this paper presents a
comprehensive study of PM-related issues in the Linux kernel. By analyzing
1,553 PM-related kernel patches in-depth and conducting experiments on
reproducibility and tool extension, we derive multiple insights in terms of PM
patch categories, PM bug patterns, consequences, fix strategies, triggering
conditions, and remedy solutions. We hope our results could contribute to the
development of robust PM-based storage systems.
Comment: ACM Transactions on Storage (TOS '23)
Manifesting Reliability Issues in Storage Systems
Storage systems are vital in managing the ever-increasing data generated by High Performance Computing and cloud-based applications.
Therefore, ensuring reliability while providing the desired performance is important.
However, building reliable storage systems is challenging, and a system may fail for reasons such as power faults, device failures, software bugs, etc.
In such events, storage systems rely on recovery components to bring the system back to a consistent state. Unfortunately, similar failure events may occur while performing system recovery and can lead to severe corruption in the file system.
On the other hand, storage systems are constantly updated to accommodate new storage technologies such as Persistent Memory (PM) devices to satisfy the demand for high performance. PM devices are storage-class memory devices that offer low access latency and data persistence. In addition, these devices offer new features such as Direct Access (DAX), which bypasses the complex Linux storage stack.
However, building new storage systems using PM devices is quite a challenge.
Firstly, there is a new method for accessing data on these devices. Unlike traditional storage devices, which operate over a block IO interface, PM devices operate over a memory IO interface. Therefore, system developers need to develop new methods to access data.
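The contrast between the two interfaces can be sketched in a few lines of Python. An ordinary temporary file stands in for a DAX-mapped PM region here (on real hardware this would be a file on a DAX-mounted file system), and `mmap.flush()` stands in for the cache-line flushes that real PM requires:

```python
import mmap
import os
import tempfile

# Ordinary temp file as a stand-in for a DAX-mapped PM region.
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(b"\0" * 4096)   # size the region first; mmap cannot grow a file
tmp.close()
path = tmp.name

fd = os.open(path, os.O_RDWR)

# Block-IO style: explicit read()/write() calls through the storage stack.
os.pwrite(fd, b"block-io", 0)

# Memory-IO style: map the region and update it with plain loads/stores.
with mmap.mmap(fd, 4096) as m:
    m[0:9] = b"memory-io"
    m.flush()             # plays the role of persisting stores on real PM
os.close(fd)

with open(path, "rb") as f:
    data = f.read(9)
os.unlink(path)
print(data)               # b'memory-io'
```

With DAX, the stores in the memory-IO path reach the device directly, with no page cache in between, which is precisely why the access methods had to be redesigned.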
Secondly, the Linux kernel had to be modified by adding new drivers to accommodate these devices and by modifying file systems to support the new DAX feature. These modifications increase the complexity of the storage stack and may hinder the reliability of the storage system.
Therefore, as a first step towards building reliable storage systems, this dissertation focuses on manifesting the reliability issues explained above. We first analyze the impact of interrupted recovery procedures on the durability of storage systems. To do this, we build a fault injection framework to systematically interrupt the recovery procedures of four popular Linux file systems (Ext4, XFS, Btrfs, and F2FS). We observe that not only does interrupted recovery induce severe corruption in the file system, but these corruptions are also permanent and cannot be fixed by another run of recovery. We conclude this part by building a generalized redo log library with transaction support that can be easily integrated with existing recovery components to provide some resilience against interruptions.
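The redo log idea can be illustrated with a toy sketch: records are persisted together with a commit marker before the real data is touched, and replay is idempotent, so an interrupted recovery can simply be run again. The actual library described above is not shown in this abstract; the class and record layout below are hypothetical:

```python
import json
import os
import tempfile

class RedoLog:
    """Toy write-ahead redo log; replay is idempotent, so an interrupted
    replay can safely be re-run (illustrative only, not the real library)."""
    def __init__(self, path):
        self.path = path
        self.records = []

    def log(self, key, value):
        self.records.append({"key": key, "value": value})

    def commit(self):
        # Persist records plus a commit marker before updating the data:
        # if we crash mid-apply, the log tells us how to redo the update.
        with open(self.path, "w") as f:
            json.dump({"committed": True, "records": self.records}, f)
            f.flush()
            os.fsync(f.fileno())

    def replay(self, state):
        if not os.path.exists(self.path):
            return state
        with open(self.path) as f:
            entry = json.load(f)
        if entry.get("committed"):       # uncommitted logs are discarded
            for r in entry["records"]:
                state[r["key"]] = r["value"]
        return state

log = RedoLog(os.path.join(tempfile.mkdtemp(), "redo.log"))
log.log("inode-7", "size=4096")
log.commit()

# Simulate a crash after commit but before the update reached the data,
# then an interruption during the first recovery attempt:
state = {}
state = log.replay(state)   # first recovery run
state = log.replay(state)   # re-running replay is safe (idempotent)
print(state)                # {'inode-7': 'size=4096'}
```

The idempotent-replay property is what lets such a log tolerate the interrupted-recovery scenario studied above: a second run of recovery redoes the same updates rather than compounding the damage.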
Second, we analyze the impact of the PM software stack on system reliability by performing a study of PM-related issues reported in the Linux kernel. To do this, we collect all patches submitted to the Linux kernel over the last decade and extract 1,553 PM-related kernel patches. We study these patches in depth and characterize PM-related bugs based on their causes. In addition, we conduct experiments on PM bug reproducibility and evaluate existing bug detection tools to derive multiple insights such as bug-manifesting conditions, remedy solutions, etc. The goal of this study is to assist future work in building tools that can effectively manifest these bugs. Therefore, we have open-sourced our dataset and the workloads used to reproduce a subset of PM bugs.