In this paper, we evaluate Apache Spark for a data-intensive machine learning
problem. Our use case focuses on policy diffusion detection across the state
legislatures in the United States over time. Previous work on policy diffusion
has been unable to make an all-pairs comparison between bills because of its
computational cost; instead, scholars have limited their analyses to single
topic areas.
We provide an implementation of this analysis workflow as a distributed text
processing pipeline using Spark DataFrames and the Scala application
programming interface (API). We discuss the challenges and strategies of
unstructured data processing, data formats for storage and efficient access,
and graph processing at scale.
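
To illustrate the shape of such a pipeline, the following is a minimal sketch and not the paper's implementation: it tokenizes bill texts with Spark DataFrames and the Scala API and scores all pairs of bills with a simple Jaccard token-overlap measure. The column names, sample rows, and similarity measure are illustrative assumptions.

// Minimal sketch (assumed column names and toy data): all-pairs bill comparison
// with Spark DataFrames and the Scala API, scored by Jaccard token overlap.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object BillSimilaritySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("bill-similarity").getOrCreate()
    import spark.implicits._

    // Hypothetical input: one row per bill with an identifier and its raw text.
    val bills = Seq(
      ("CA_2015_AB1", "firearm background check required"),
      ("TX_2016_HB2", "background check required for firearm purchase")
    ).toDF("billId", "text")

    // Tokenize each bill into a set of distinct lower-cased words.
    val tokenized = bills.withColumn(
      "tokens", array_distinct(split(lower(col("text")), "\\s+")))

    // All-pairs comparison via a self-join, keeping each unordered pair once.
    val pairs = tokenized.as("a")
      .crossJoin(tokenized.as("b"))
      .filter(col("a.billId") < col("b.billId"))

    // Jaccard similarity over the token sets as a proxy for text similarity.
    val scored = pairs.withColumn(
      "jaccard",
      size(array_intersect(col("a.tokens"), col("b.tokens"))).cast("double") /
        size(array_union(col("a.tokens"), col("b.tokens"))))
      .select(col("a.billId").as("billA"), col("b.billId").as("billB"), col("jaccard"))

    scored.orderBy(desc("jaccard")).show(truncate = false)
    spark.stop()
  }
}

In practice the cross join is the expensive step; the paper's pipeline addresses exactly this all-pairs cost, for example by pruning candidate pairs before scoring, whereas the sketch above compares every pair directly.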