FlashX: Massive Data Analysis Using Fast I/O

Abstract

With the explosion of data and the increasing complexity of data analysis, large-scale data analysis imposes significant challenges in systems design. While current research focuses on scaling out to large clusters, these scale-out solutions introduce a significant amount of overhead. This thesis is motivated by the advance of new I/O technologies such as flash memory. Instead of scaling out, we explore efficient system designs in a single commodity machine with non-uniform memory architecture (NUMA) and scale to large datasets by utilizing commodity solid-state drives (SSDs). This thesis explores the impact of the new I/O technologies on large-scale data analysis. Instead of implementing individual data analysis algorithms for SSDs, we develop a data analysis ecosystem called FlashX to target a large range of data analysis tasks. FlashX includes three subsystems: SAFS, FlashGraph and FlashMatrix. SAFS is a user-space filesystem optimized for a large SSD array to deliver maximal I/O throughput from SSDs. FlashGraph is a general-purpose graph analysis framework that processes graphs in a semi-external memory fashion, i.e., keeping vertex state in memory and edges on SSDs, and scales to graphs with billions of vertices by utilizing SSDs through SAFS. FlashMatrix is a matrix-oriented programming framework that supports both sparse matrices and dense matrices for general data analysis. Similar to FlashGraph, it scales matrix operations beyond memory capacity by utilizing SSDs. We demonstrate that with the current I/O technologies FlashGraph and FlashMatrix in the (semi-)external-memory meets or even exceeds state-of-the-art in-memory data analysis frameworks while scaling to massive datasets for a large variety of data analysis tasks

    Similar works