Making State Explicit for Imperative Big Data Processing

Abstract

Data scientists often implement machine learning algorithms in imperative languages such as Java, Matlab and R. Yet such implementations fail to achieve the performance and scalability of specialised data-parallel processing frameworks. Our goal is to execute imperative Java programs in a data-parallel fashion with high throughput and low latency. This raises two challenges: how to support the arbitrary mutable state of Java programs without compromising scalability, and how to recover that state after failure with low overhead. Our idea is to infer the dataflow and the types of state accesses from a Java program and use this information to generate a stateful dataflow graph (SDG). By explicitly separating data from mutable state, SDGs have specific features to enable this translation: to ensure scalability, distributed state can be partitioned across nodes if computation can occur entirely in parallel; if this is not possible, partial state gives nodes local instances for independent computation, which are reconciled according to application semantics. For fault tolerance, large in-memory state is checkpointed asynchronously without global coordination. We show that the performance of SDGs for several imperative online applications matches that of existing data-parallel processing frameworks.
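To make the distinction between partitioned and partial state concrete, the following is a minimal Java sketch and not the SDG implementation or its API. The class name `StateSketch`, the methods `ownerOf` and `merge`, and the choice of summation as the reconciliation step are all assumptions made for illustration; an application with different semantics would reconcile differently.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: none of these names come from the SDG system.
public class StateSketch {

    // Partitioned state: each key is owned by exactly one node, so an
    // update to that key can proceed without coordinating with other nodes.
    static int ownerOf(String key, int numNodes) {
        return Math.floorMod(key.hashCode(), numNodes);
    }

    // Partial state: every node holds its own local instance of the state.
    // The instances are reconciled here by summing per-key counts, which is
    // an assumed application semantics chosen for this sketch.
    static Map<String, Long> merge(List<Map<String, Long>> partials) {
        Map<String, Long> global = new HashMap<>();
        for (Map<String, Long> partial : partials) {
            for (Map.Entry<String, Long> e : partial.entrySet()) {
                global.merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        return global;
    }

    public static void main(String[] args) {
        int numNodes = 2;
        String[] words = {"a", "b", "a", "c"};

        // Partitioned: route each update to the node that owns the key.
        List<Map<String, Long>> partitions = new ArrayList<>();
        for (int i = 0; i < numNodes; i++) {
            partitions.add(new HashMap<>());
        }
        for (String w : words) {
            partitions.get(ownerOf(w, numNodes)).merge(w, 1L, Long::sum);
        }

        // Partial: send updates to any node (round-robin here), then reconcile.
        List<Map<String, Long>> partials = new ArrayList<>();
        for (int i = 0; i < numNodes; i++) {
            partials.add(new HashMap<>());
        }
        for (int i = 0; i < words.length; i++) {
            partials.get(i % numNodes).merge(words[i], 1L, Long::sum);
        }

        System.out.println("partitioned: " + partitions);
        System.out.println("reconciled partial: " + merge(partials));
    }
}
```

In this sketch the partitioned maps never share a key, so no reconciliation step is needed, whereas the partial maps may each hold a count for the same key and only become a consistent view after `merge`.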
