Shared Arrangements: practical inter-query sharing for streaming dataflows
Current systems for data-parallel, incremental processing and view
maintenance over high-rate streams isolate the execution of independent
queries. This creates unwanted redundancy and overhead in the presence of
concurrent incrementally maintained queries: each query must independently
maintain the same indexed state over the same input streams, and new queries
must build this state from scratch before they can begin to emit their first
results. This paper introduces shared arrangements: indexed views of maintained
state that allow concurrent queries to reuse the same in-memory state without
compromising data-parallel performance and scaling. We implement shared
arrangements in a modern stream processor and show order-of-magnitude
improvements in query response time and resource consumption for interactive
queries against high-throughput streams, while also significantly improving
performance in other domains including business analytics, graph processing,
and program analysis.
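The core idea in the abstract above, several concurrent queries reading one maintained index instead of each building its own, can be illustrated with a small single-process sketch. This is not the paper's implementation: the `Arrangement` class, its methods, and the two example queries are hypothetical names chosen for illustration, and the sketch ignores data-parallelism and timestamps entirely.

```python
from collections import defaultdict


class Arrangement:
    """Toy shared index (hypothetical): key -> {value: multiplicity}."""

    def __init__(self):
        self.index = defaultdict(lambda: defaultdict(int))
        self.updates_applied = 0

    def apply(self, key, value, diff):
        """Apply one incremental update (diff=+1 inserts, diff=-1 deletes)."""
        bucket = self.index[key]
        bucket[value] += diff
        if bucket[value] == 0:
            del bucket[value]
        if not bucket:
            del self.index[key]
        self.updates_applied += 1

    def lookup(self, key):
        """Return a copy of the values stored under key (empty if absent)."""
        return dict(self.index.get(key, {}))


# Two independent "queries" that read the same maintained state.
def out_degree(arrangement, node):
    return sum(arrangement.lookup(node).values())


def neighbors(arrangement, node):
    return sorted(arrangement.lookup(node))


# The index is maintained once from the input stream of edge updates.
edges = Arrangement()
for src, dst in [("a", "b"), ("a", "c"), ("b", "c")]:
    edges.apply(src, dst, +1)
edges.apply("a", "b", -1)  # a later retraction on the same stream

# Both queries answer from the shared index; neither rebuilds it.
print(out_degree(edges, "a"), neighbors(edges, "a"))
```

A query that arrives later, such as `neighbors` here, starts from the already-built index rather than replaying the stream, which is the start-up cost the abstract says isolated queries pay.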
Efficient Latent Semantic Extraction from Cross Domain Data with Declarative Language
With large amounts of data continuously generated by intelligent devices, efficient analysis of huge data collections to unearth valuable insights has become one of the most elusive challenges for both academia and industry. The key elements of a scalable analysis framework should include (1) an intuitive interface to describe the desired outcome, (2) a well-crafted model that integrates all available information sources to derive the optimal outcome, and (3) an efficient algorithm that performs the data integration and extraction within a reasonable amount of time. In this dissertation, we address these challenges by proposing (1) a cross-language interface for the succinct expression of recursive queries, (2) a domain-specific neural network model that can incorporate information from multiple modalities, and (3) a sample-efficient training method that can be used even for classifiers with an extremely large number of output classes. Our contributions in this thesis are thus threefold. First, for the recursive queries that are ubiquitous in advanced data analytics, we design, on top of BigDatalog and Apache Spark, a succinct and expressive analytics tool encapsulating the functionality and classical algorithms of Datalog, a quintessential logic programming language. We provide the Logical Library (LLib), a Spark MLlib-like high-level API supporting a wide range of recursive algorithms, and the Logical DataFrame (LFrame), an extension to the Spark DataFrame supporting both relational and logical operations. LLib and LFrame enable smooth collaboration between logical applications and other Spark libraries, as well as cross-language logical programming in Scala, Java, or Python. Second, we use variants of recurrent neural networks (RNNs) to incorporate informative sequential information overlooked by prior work in two domains: Spoken Language Understanding (SLU) and Internet Embedding (IE).
In SLU, we address the problem caused by relying solely on the first-best interpretation (hypothesis) of an audio command through a series of new architectures comprising bidirectional LSTM and pooling layers. These architectures jointly use the texts or embedding vectors of the other hypotheses, which are usually neglected but carry valuable information that the first-best hypothesis misses. In IE, we propose DIP, an extension of the RNN, to build an internet coordinate system from IP address sequences, which conventional distance-based internet embedding algorithms also overlook but which encode structural information about the network. Both DIP and the integration of all hypotheses bring significant performance improvements for the corresponding downstream tasks. Finally, we investigate the training algorithm for multi-class classifiers with a large number of output classes, which are common in deep neural networks and typically implemented as a softmax final layer with one output neuron per class. To avoid the expensive computation of the softmax normalizing constant for each training data point, we analyze the well-known negative sampling algorithm and improve it into the amplified negative sampling algorithm, which achieves much higher performance at lower training cost.
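To make the cost argument in the last part of this abstract concrete, the sketch below contrasts a full softmax loss, which touches every class to compute the normalizing constant, with standard negative sampling, which scores only the target and a few sampled classes. It shows plain negative sampling with uniform sampling, not the amplified variant the dissertation proposes. The function names, the `score_fn` callback, and the uniform sampler are assumptions made for illustration.

```python
import math
import random


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def softmax_loss(scores, target):
    """Cross-entropy with a full softmax: needs the score of every class."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[target]


def negative_sampling_loss(score_fn, num_classes, target, num_negatives, rng):
    """Standard negative sampling: scores only 1 + num_negatives classes.

    score_fn(c) is a hypothetical callback returning the model's score for
    class c. Negatives are drawn uniformly, with replacement, excluding
    the target.
    """
    negatives = []
    while len(negatives) < num_negatives:
        c = rng.randrange(num_classes)
        if c != target:
            negatives.append(c)
    loss = -log_sigmoid(score_fn(target))
    for c in negatives:
        loss -= log_sigmoid(-score_fn(c))
    return loss, negatives


rng = random.Random(0)
loss, negatives = negative_sampling_loss(
    score_fn=lambda c: 0.0, num_classes=100_000, target=3,
    num_negatives=4, rng=rng,
)
print(loss, len(negatives))
```

With 100,000 classes, the full softmax would evaluate 100,000 scores per training example, while the sampled loss above evaluates five.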
Declarative Data Analytics: a Survey
The area of declarative data analytics explores the application of the
declarative paradigm on data science and machine learning. It proposes
declarative languages for expressing data analysis tasks and develops systems
which optimize programs written in those languages. The execution engine can be
either centralized or distributed, as the declarative paradigm advocates
independence from particular physical implementations. The survey explores a
wide range of declarative data analysis frameworks by examining both the
programming model and the optimization techniques used, in order to provide
conclusions on the current state of the art in the area and identify open
challenges.
Comment: 36 pages, 2 figures
MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That's Not a Nail!
Hadoop is currently the large-scale data analysis "hammer" of choice, but
there exist classes of algorithms that aren't "nails", in the sense that they
are not particularly amenable to the MapReduce programming model. To address
this, researchers have proposed MapReduce extensions or alternative programming
models in which these algorithms can be elegantly expressed. This essay
espouses a very different position: that MapReduce is "good enough", and that
instead of trying to invent screwdrivers, we should simply get rid of
everything that's not a nail. To be more specific, much discussion in the
literature surrounds the fact that iterative algorithms are a poor fit for
MapReduce: the simple solution is to find alternative non-iterative algorithms
that solve the same problem. This essay captures my personal experiences as an
academic researcher as well as a software engineer in a "real-world" production
analytics environment. From this combined perspective I reflect on the current
state and future of "big data" research.