Asynchronous Complex Analytics in a Distributed Dataflow Architecture
Scalable distributed dataflow systems have recently experienced widespread
adoption, with commodity dataflow engines such as Hadoop and Spark, and even
commodity SQL engines routinely supporting increasingly sophisticated analytics
tasks (e.g., support vector machines, logistic regression, collaborative
filtering). However, these systems' synchronous (often Bulk Synchronous
Parallel) dataflow execution model is at odds with an increasingly important
trend in the machine learning community: the use of asynchrony via shared,
mutable state (i.e., data races) in convex programming tasks, which has, in a
single-node context, delivered noteworthy empirical performance gains and
inspired new research into asynchronous algorithms. In this work, we attempt to
bridge this gap by evaluating the use of lightweight, asynchronous state
transfer within a commodity dataflow engine. Specifically, we investigate the
use of asynchronous sideways information passing (ASIP) that presents
single-stage parallel iterators with a Volcano-like intra-operator iterator
that can be used for asynchronous information passing. We port two synchronous
convex programming algorithms, stochastic gradient descent and the alternating
direction method of multipliers (ADMM), to use ASIPs. We evaluate an
implementation of ASIPs on Apache Spark that exhibits considerable
speedups as well as a rich set of performance trade-offs in the use of these
asynchronous algorithms.
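The idea of asynchrony via shared, mutable state can be illustrated with a minimal single-process sketch. This is not the paper's ASIP mechanism and uses no Spark API; it only assumes a toy noise-free least-squares problem, and every name in it (`make_data`, `worker`, `async_sgd`) is hypothetical. Several threads run stochastic gradient descent on disjoint partitions and update one shared weight vector without any locking, so updates may race.

```python
import random
import threading


def make_data(n=200, seed=0):
    """Toy noise-free linear data y = 2*x0 - 3*x1 (an assumption for illustration)."""
    rng = random.Random(seed)
    true_w = [2.0, -3.0]
    data = []
    for _ in range(n):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        y = sum(wi * xi for wi, xi in zip(true_w, x))
        data.append((x, y))
    return data


def mean_squared_error(w, data):
    """Average squared prediction error of weights w over the data."""
    return sum(
        (sum(wi * xi for wi, xi in zip(w, x)) - y) ** 2 for x, y in data
    ) / len(data)


def worker(w, partition, step, epochs, seed):
    """Run SGD over one partition, writing to the shared list w with no lock."""
    rng = random.Random(seed)
    for _ in range(epochs):
        for x, y in rng.sample(partition, len(partition)):
            # The read of w may observe partial updates from other threads.
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for j, xj in enumerate(x):
                w[j] -= step * err * xj  # unsynchronized write (a data race)


def async_sgd(data, n_workers=4, step=0.05, epochs=20):
    """Start one thread per partition; all threads mutate the same weights."""
    w = [0.0, 0.0]  # shared, mutable state
    partitions = [data[i::n_workers] for i in range(n_workers)]
    threads = [
        threading.Thread(target=worker, args=(w, part, step, epochs, i))
        for i, part in enumerate(partitions)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w


if __name__ == "__main__":
    data = make_data()
    weights = async_sgd(data)
    print(weights, mean_squared_error(weights, data))
```

A synchronous (BSP-style) variant would instead compute per-partition gradients, wait at a barrier, and apply one combined update per round; the sketch removes that barrier at the price of possibly stale reads.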
Incremental Knowledge Base Construction Using DeepDive
Populating a database with unstructured information is a long-standing
problem in industry and research that encompasses problems of extraction,
cleaning, and integration. Recent names used for this problem include dealing
with dark data and knowledge base construction (KBC). In this work, we describe
DeepDive, a system that combines database and machine learning ideas to help
develop KBC systems, and we present techniques to make the KBC process more
efficient. We observe that the KBC process is iterative, and we develop
techniques to incrementally produce inference results for KBC systems. We
propose two methods for incremental inference, based respectively on sampling
and variational techniques. We also study the tradeoff space of these methods
and develop a simple rule-based optimizer. DeepDive includes all of these
contributions, and we evaluate DeepDive on five KBC systems, showing that it
can speed up KBC inference tasks by up to two orders of magnitude with
negligible impact on quality.
Optimizing Statistical Information Extraction Programs over Evolving Text (http://pages.cs.wisc.edu/~fchen/papers/crflex-tr.pdf)
Statistical information extraction (IE) programs are increasingly used to build real-world IE systems such as Alibaba, CiteSeer, Kylin, and YAGO. Current statistical IE approaches consider the text corpora underlying the extraction program to be static. However, many real-world text corpora are dynamic (documents are inserted, modified, and removed). As the corpus evolves, IE programs must be applied repeatedly to consecutive corpus snapshots to keep extracted information up to date. Applying IE from scratch to each snapshot may be inefficient: a pair of consecutive snapshots may change very little, but unaware of this, the program must run again from scratch. In this paper, we present CRFlex, a system that efficiently executes such repeated statistical IE by recycling previous IE results to enable incremental update. We focus on statistical IE programs that use a leading statistical model, Conditional Random Fields (CRFs). We show how to model properties of the CRF inference algorithms for incremental update and how to exploit them to correctly recycle previous inference results. Then we show how to efficiently capture and store intermediate results of IE programs for subsequent recycling. We find that there is a tradeoff between the I/O cost spent on reading and writing intermediate results and the CPU cost we can save by recycling those intermediate results. Therefore, we present a cost-based solution to determine the most efficient recycling approach for any given CRF-based IE program and evolving corpus. We present extensive experiments with CRF-based IE programs for three IE tasks over a real-world data set to demonstrate the utility of our approach.
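The I/O versus CPU tradeoff described above can be sketched as a simple cost comparison. This is only an illustration under stated assumptions, not CRFlex's actual cost model: the `RecyclingPlan` type, the plan names, and all cost figures are hypothetical, and each plan is assumed to come with pre-estimated I/O and CPU costs in the same unit.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class RecyclingPlan:
    """A hypothetical candidate strategy with estimated costs in seconds."""
    name: str
    io_cost: float   # reading and writing stored intermediate results
    cpu_cost: float  # inference work that still has to be executed


def total_cost(plan: RecyclingPlan) -> float:
    """Estimated end-to-end cost: I/O spent plus CPU not saved."""
    return plan.io_cost + plan.cpu_cost


def choose_plan(plans: Iterable[RecyclingPlan]) -> RecyclingPlan:
    """Pick the candidate with the lowest estimated total cost."""
    return min(plans, key=total_cost)


if __name__ == "__main__":
    # Made-up estimates for one corpus snapshot (illustrative only).
    candidates = [
        RecyclingPlan("from_scratch", io_cost=0.0, cpu_cost=100.0),
        RecyclingPlan("recycle_all", io_cost=60.0, cpu_cost=10.0),
        RecyclingPlan("recycle_partial", io_cost=20.0, cpu_cost=40.0),
    ]
    best = choose_plan(candidates)
    print(best.name, total_cost(best))
```

With these made-up numbers, storing everything saves the most CPU but pays more in I/O than it gains, so an intermediate plan has the lowest total; different estimates would shift the choice.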