Achieving High Reliability and Efficiency in Maintaining Large-Scale Storage Systems through Optimal Resource Provisioning and Data Placement
With the explosive increase in the amount of data generated by various applications, large-scale distributed and parallel storage systems have become common data storage solutions, widely deployed in both industry and academia. While these high-performance storage systems significantly accelerate data storage and retrieval, they also introduce critical issues in system maintenance and management. In this dissertation, I propose three methodologies to address three of these issues.
First, I develop an optimal resource management and spare-provisioning model that minimizes the impact of component failures and ensures a highly operational experience in maintaining large-scale storage systems. Second, to cost-effectively integrate solid-state drives (SSDs) into large-scale storage systems, I design a holistic algorithm that adaptively predicts the popularity of data objects by leveraging temporal locality in their access patterns and adjusts their placement between solid-state drives and regular hard disk drives, improving both the data access throughput and the storage space efficiency of large-scale heterogeneous storage systems. Finally, I propose a new checkpoint placement optimization model that maximizes the computation efficiency of large-scale scientific applications while guaranteeing the endurance requirements of the SSD-based burst buffer in high-performance hierarchical storage systems. All these models and algorithms are validated through extensive evaluation using data collected from deployed large-scale storage systems, and the results demonstrate that they can significantly improve the reliability and efficiency of large-scale distributed and parallel storage systems.
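The SSD/HDD placement component lends itself to a compact illustration. The following is a minimal sketch, assuming a simple exponentially weighted access-count predictor and a fixed SSD capacity; the decay factor, capacity, and function names are hypothetical stand-ins, not the dissertation's actual model.

# Illustrative sketch (not the dissertation's algorithm): predict object
# popularity from temporal locality and keep the hottest objects on SSD.
from collections import defaultdict

DECAY = 0.9          # weight given to historical popularity (assumed)
SSD_CAPACITY = 100   # number of objects the SSD tier can hold (assumed)

popularity = defaultdict(float)   # object id -> predicted popularity
recent_hits = defaultdict(int)    # accesses observed in the current epoch

def record_access(obj_id):
    recent_hits[obj_id] += 1

def end_epoch_and_place():
    """Fold the latest epoch into the popularity estimate (temporal
    locality: recent accesses predict near-future accesses), then keep
    the hottest objects on SSD and demote the rest to HDD."""
    for obj_id in set(popularity) | set(recent_hits):
        popularity[obj_id] = (DECAY * popularity[obj_id]
                              + (1 - DECAY) * recent_hits[obj_id])
    recent_hits.clear()
    ranked = sorted(popularity, key=popularity.get, reverse=True)
    return set(ranked[:SSD_CAPACITY])   # objects to place on SSD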
Greedy-based Value Representation for Optimal Coordination in Multi-agent Reinforcement Learning
Due to the representation limitation of the joint Q value function, multi-agent reinforcement learning methods with linear value decomposition (LVD) or monotonic value decomposition (MVD) suffer from relative overgeneralization. As a result, they cannot ensure optimal consistency (i.e., the correspondence between individual greedy actions and the maximal true Q value). In this paper, we derive the expression of the joint Q value function under LVD and MVD. Based on this expression, we draw a transition diagram in which each self-transition node (STN) is a possible point of convergence. To ensure optimal consistency, the optimal node must be the unique STN. We therefore propose greedy-based value representation (GVR), which turns the optimal node into an STN via inferior target shaping and further eliminates the non-optimal STNs via superior experience replay. In addition, GVR achieves an adaptive trade-off between optimality and stability. Our method outperforms state-of-the-art baselines in experiments on various benchmarks. Theoretical proofs and empirical results on matrix games demonstrate that GVR ensures optimal consistency under sufficient exploration.
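To make the failure mode concrete, here is a small numpy illustration (ours, not taken from the paper) of how LVD loses optimal consistency on a one-step cooperative matrix game: a least-squares additive fit of the joint payoff under uniform exploration drives both agents' greedy actions away from the optimal joint action.

# Q_jt(a1, a2) is approximated as Q_1(a1) + Q_2(a2); fitting this under
# fully exploratory (uniform) data shows relative overgeneralization.
import numpy as np

# Classic "climb"-style payoff: optimal joint action is (0, 0) with 8,
# but miscoordination around it is heavily punished.
payoff = np.array([[  8, -12, -12],
                   [-12,   0,   0],
                   [-12,   0,   0]], dtype=float)

# Least-squares fit of payoff[i, j] ~ q1[i] + q2[j].
n = payoff.shape[0]
A = np.zeros((n * n, 2 * n))
for i in range(n):
    for j in range(n):
        A[i * n + j, i] = 1.0        # coefficient of q1[i]
        A[i * n + j, n + j] = 1.0    # coefficient of q2[j]
theta, *_ = np.linalg.lstsq(A, payoff.ravel(), rcond=None)
q1, q2 = theta[:n], theta[n:]

greedy = (int(np.argmax(q1)), int(np.argmax(q2)))
print("greedy joint action:", greedy)   # (1, 1): a suboptimal STN, payoff 0
print("optimal joint action:",
      np.unravel_index(payoff.argmax(), payoff.shape))  # (0, 0), payoff 8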
Imagine, Initialize, and Explore: An Effective Exploration Method in Multi-Agent Reinforcement Learning
Effective exploration is crucial to discovering optimal strategies for multi-agent reinforcement learning (MARL) in complex coordination tasks. Existing methods mainly utilize intrinsic rewards to enable committed exploration or use role-based learning to decompose joint action spaces, rather than directly conducting a collective search in the entire action-observation space. However, they often face challenges in obtaining specific joint action sequences that reach successful states in long-horizon tasks. To address this limitation, we propose Imagine, Initialize, and Explore (IIE), a novel method that offers a promising solution for efficient multi-agent exploration in complex scenarios. IIE employs a transformer model to imagine how the agents reach a critical state that can influence each other's transition functions. Then, we initialize the environment at this state using a simulator before the exploration phase. We formulate the imagination as a sequence modeling problem, where the states, observations, prompts, actions, and rewards are predicted autoregressively. The prompt consists of timestep-to-go, return-to-go, influence value, and a one-shot demonstration, specifying the desired state and trajectory as well as guiding the action generation. By initializing agents at critical states, IIE significantly increases the likelihood of discovering potentially important under-explored regions. Despite its simplicity, empirical results demonstrate that our method outperforms multi-agent exploration baselines on the StarCraft Multi-Agent Challenge (SMAC) and SMACv2 environments. In particular, IIE shows improved performance in the sparse-reward SMAC tasks and produces more effective curricula over the initialized states than other generative methods, such as CVAE-GAN and diffusion models.
Comment: The 38th Annual AAAI Conference on Artificial Intelligence
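The sequence-modeling formulation can be sketched as follows. This is a hedged illustration of how such a prompt-conditioned token stream might be assembled: the field names mirror the abstract (timestep-to-go, return-to-go, influence value, one-shot demonstration), while the data layout and helper names are assumptions, not the paper's implementation.

# Hypothetical sketch of the prompt + trajectory token stream consumed
# autoregressively by the imagination transformer.
from dataclasses import dataclass
from typing import List

@dataclass
class Prompt:
    timestep_to_go: int          # steps remaining to the critical state
    return_to_go: float          # desired cumulative reward
    influence_value: float       # how strongly agents influence each other
    demonstration: List[dict]    # one-shot demo trajectory (assumed format)

def build_token_sequence(prompt: Prompt, trajectory: List[dict]) -> list:
    """Flatten prompt + trajectory into one autoregressive stream:
    [prompt tokens, s_0, o_0, a_0, r_0, s_1, o_1, a_1, r_1, ...]."""
    tokens = [prompt.timestep_to_go, prompt.return_to_go,
              prompt.influence_value, prompt.demonstration]
    for step in trajectory:
        tokens += [step["state"], step["obs"], step["action"], step["reward"]]
    return tokens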
MGARD+: Optimizing Multilevel Methods for Error-Bounded Scientific Data Reduction
Data reduction is becoming increasingly important for coping with the large amounts of data generated by scientific applications. Existing multilevel compression algorithms offer a promising way to manage scientific data at scale but may suffer from relatively low performance and reduction quality. In this paper, we propose MGARD+, a multilevel data reduction and refactoring framework drawing on previous multilevel methods, to achieve high-performance data decomposition and high-quality error-bounded lossy compression. Our contributions are four-fold: 1) we propose a level-wise coefficient quantization method, which uses different error tolerances to quantize the multilevel coefficients; 2) we propose an adaptive decomposition method which treats the multilevel decomposition as a preconditioner and terminates the decomposition process at an appropriate level; 3) we leverage a set of algorithmic optimization strategies to significantly improve the performance of multilevel decomposition/recomposition; 4) we evaluate our proposed method using four real-world scientific datasets and compare it with several state-of-the-art lossy compressors. Experiments demonstrate that our optimizations improve the decomposition/recomposition performance of the existing multilevel method by up to 70x, and that the proposed compression method can improve the compression ratio by up to 2x compared with other state-of-the-art error-bounded lossy compressors at the same level of data distortion.
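The level-wise quantization idea (contribution 1) admits a short sketch. The following is an illustrative simplification, not MGARD+'s exact scheme: each decomposition level carries its own absolute tolerance, and uniform scalar quantization with bin width 2*tol bounds the per-coefficient error by tol at every level.

# Illustrative per-level uniform scalar quantization (assumptions ours).
import numpy as np

def quantize_by_level(level_coeffs, level_tols):
    """level_coeffs: list of 1-D arrays, one per decomposition level;
    level_tols: matching per-level absolute error tolerances.
    Rounding to bins of width 2*tol keeps each coefficient within tol."""
    quantized = []
    for coeffs, tol in zip(level_coeffs, level_tols):
        bins = np.round(coeffs / (2.0 * tol)).astype(np.int64)
        quantized.append(bins)        # integers: cheap to entropy-code
    return quantized

def dequantize_by_level(quantized, level_tols):
    return [bins * (2.0 * tol) for bins, tol in zip(quantized, level_tols)]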
High-performance Data Management for Whole Slide Image Analysis in Digital Pathology
When dealing with giga-pixel digital pathology in whole-slide imaging, a notable proportion of the data records is relevant to each analysis operation. For instance, when deploying an image analysis algorithm on whole-slide images (WSI), the computational bottleneck often lies in the input-output (I/O) system, particularly because patch-level processing places a considerable I/O load on the computer system. However, this data management process can be further parallelized, given that patch-level image processing is typically independent across patches. This paper details our efforts to tackle this data access challenge by employing the Adaptable IO System version 2 (ADIOS2). Our focus has been on constructing and releasing a digital pathology-centric pipeline using ADIOS2, which facilitates streamlined data management across WSIs. Additionally, we have developed strategies aimed at curtailing data retrieval times. The performance evaluation encompasses two key scenarios: (1) a pure CPU-based image analysis scenario ("CPU scenario"), and (2) a GPU-based deep learning framework scenario ("GPU scenario"). Under the CPU scenario, ADIOS2 achieves a two-fold speed-up over the brute-force approach. In the GPU scenario, its performance is on par with the cutting-edge GPU I/O acceleration framework, NVIDIA Magnum IO GPU Direct Storage (GDS). To the best of our knowledge, this is among the first uses of ADIOS2 in the field of digital pathology. The source code is publicly available at https://github.com/hrlblab/adios.
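For readers unfamiliar with ADIOS2, a patch-level read along these lines might look as follows. This sketch assumes the pre-2.10 high-level Python API (adios2.open with start/count selections); the BP file layout, the variable name "patches", and the patch geometry are hypothetical, so consult the released pipeline at the URL above for the actual code.

# Hedged sketch: read one tile of a WSI stored as an ADIOS2 variable.
import adios2
import numpy as np

PATCH = 256  # assumed square patch edge, in pixels

def read_patch(bp_path, row, col):
    """Read one PATCH x PATCH x 3 tile using a start/count selection,
    so only the needed bytes are touched on disk."""
    with adios2.open(bp_path, "r") as fh:
        tile = fh.read("patches",
                       start=[row * PATCH, col * PATCH, 0],
                       count=[PATCH, PATCH, 3])
    return np.asarray(tile)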
Region-Adaptive, Error-Controlled Scientific Data Compression using Multilevel Decomposition
The increase in computer processing speed is significantly outpacing improvements in network and storage bandwidth, leading to the big data challenge in modern science, where scientific applications can quickly generate far more data than can be transferred and stored. As a result, big scientific data must be reduced by a few orders of magnitude while the accuracy of the reduced data is guaranteed for further scientific exploration. Moreover, scientists are often interested in specific spatial/temporal regions of their data, where higher accuracy is required. The locations of the regions requiring high accuracy can sometimes be prescribed based on application knowledge, while at other times they must be estimated from general spatial/temporal variation. In this paper, we develop a novel multilevel approach which allows users to impose region-wise compression error bounds. Our method utilizes the byproduct of a multilevel compressor to detect regions where details are rich, and we provide the theoretical underpinning for region-wise error control. With spatially varying precision preservation, our approach can achieve significantly higher compression ratios than single-error-bounded compression approaches while controlling errors in the regions of interest. We conduct evaluations on two climate use cases: one targeting small-scale node features and the other focusing on long, areal features. For both use cases, the locations of the features were unknown ahead of compression. By selecting approximately 16% of the data based on multi-scale spatial variations and compressing those regions with smaller error tolerances than the rest, our approach improves the accuracy of post-analysis by approximately 2x compared with single-error-bounded compression at the same compression ratio. Using the same error bound for the region of interest, our approach achieves an increase of more than 50% in overall compression ratio.
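A minimal sketch of the region-adaptive idea, under assumptions of ours rather than the paper's algorithm: rank fixed-size blocks by a cheap one-level detail proxy, then give the top ~16% of blocks a tight error bound and the remainder a loose one.

# Illustrative block-wise tolerance assignment (all parameters assumed).
import numpy as np

def region_tolerances(data, block=32, frac=0.16, tight=1e-3, loose=1e-1):
    """data: 2-D field; returns a per-block tolerance map. Local variation
    is proxied by the block-wise maximum deviation from the block mean,
    standing in for the multilevel compressor's detail coefficients."""
    H, W = data.shape
    nb_h, nb_w = H // block, W // block
    detail = np.empty((nb_h, nb_w))
    for i in range(nb_h):
        for j in range(nb_w):
            blk = data[i*block:(i+1)*block, j*block:(j+1)*block]
            detail[i, j] = np.abs(blk - blk.mean()).max()
    cutoff = np.quantile(detail, 1.0 - frac)   # keep the top `frac` blocks
    return np.where(detail >= cutoff, tight, loose)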