Parallel clustering of high-dimensional social media data streams
We introduce Cloud DIKW as an analysis environment supporting scientific
discovery through integrated parallel batch and streaming processing, and apply
it to one representative domain application: social media data stream
clustering. Recent work demonstrated that high-quality clusters can be
generated by representing the data points using high-dimensional vectors that
reflect textual content and social network information. Due to the high cost of
similarity computation, sequential implementations of even single-pass
algorithms cannot keep up with the speed of real-world streams. This paper
presents our efforts to meet the constraints of real-time social stream
clustering through parallelization. We focus on two system-level issues. Most
stream processing engines like Apache Storm organize distributed workers in the
form of a directed acyclic graph, making it difficult to dynamically
synchronize the state of parallel workers. We tackle this challenge by creating
a separate synchronization channel using a pub-sub messaging system. Due to the
sparsity of the high-dimensional vectors, the size of centroids grows quickly
as new data points are assigned to the clusters. Traditional synchronization
that directly broadcasts cluster centroids becomes too expensive and limits the
scalability of the parallel algorithm. We address this problem by communicating
only dynamic changes of the clusters rather than the whole centroid vectors.
Our algorithm under Cloud DIKW can process the Twitter 10% data stream in
real-time with 96-way parallelism. With natural improvements to Cloud DIKW,
including advanced collective communication techniques developed in our Harp
project, we will be able to process the full Twitter stream in real-time with
1000-way parallelism. Our use of powerful general software subsystems will
enable many other applications that need integration of streaming and batch
data analytics.
Comment: IEEE/ACM CCGrid 2015: 15th IEEE/ACM International Symposium on
Cluster, Cloud and Grid Computing, 201
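To illustrate why centroids grow as sparse points are assigned to them, here is a minimal sequential sketch of single-pass clustering over sparse vectors. It is not the authors' algorithm: the dict-based vector representation, the sum-style centroid, the cosine similarity measure, the `threshold` value, and the toy `stream` are all assumptions made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def single_pass_cluster(stream, threshold=0.25):
    """Assign each point to its most similar centroid, or open a new cluster.

    Each centroid is kept as the sparse sum of its members (an assumption).
    """
    centroids = []
    labels = []
    for point in stream:
        best, best_sim = None, threshold
        for i, centroid in enumerate(centroids):
            sim = cosine(point, centroid)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best is None:
            # No centroid is similar enough: start a new cluster.
            centroids.append(dict(point))
            best = len(centroids) - 1
        else:
            # Merging adds every new dimension of the point to the centroid.
            for k, w in point.items():
                centroids[best][k] = centroids[best].get(k, 0.0) + w
        labels.append(best)
    return centroids, labels

# Hypothetical toy stream: each point has only two non-zero dimensions.
stream = [
    {"storm": 1.0, "twitter": 1.0},
    {"storm": 1.0, "stream": 1.0},
    {"twitter": 1.0, "cluster": 1.0},
    {"football": 1.0, "goal": 1.0},
]
centroids, labels = single_pass_cluster(stream)
print(labels)             # cluster index assigned to each point
print(len(centroids[0]))  # non-zero dimensions in the first centroid
```

In this toy run the first centroid ends up with more non-zero dimensions than any single point, which is the growth effect that makes broadcasting whole centroids costly.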
Scalable Architecture for Integrated Batch and Streaming Analysis of Big Data
Thesis (Ph.D.) - Indiana University, Computer Sciences, 2015
As Big Data processing problems evolve, many modern applications demonstrate special characteristics. Data exists in the form of both large historical datasets and high-speed real-time streams, and many analysis pipelines require integrated parallel batch processing and stream processing. Despite the large size of the whole dataset, most analyses focus on specific subsets according to certain criteria. Correspondingly, integrated support for efficient queries and post-query analysis is required.
To address the system-level requirements brought by such characteristics, this dissertation proposes a scalable architecture for integrated queries, batch analysis, and streaming analysis of Big Data in the cloud. We verify its effectiveness using a representative application domain - social media data analysis - and tackle related research challenges emerging from each module of the architecture by integrating and extending multiple state-of-the-art Big Data storage and processing systems.
In the storage layer, we reveal that existing text indexing techniques do not work well for the unique queries of social data, which put constraints on both textual content and social context. To address this issue, we propose a flexible indexing framework over NoSQL databases to support fully customizable index structures, which can embed necessary social context information for efficient queries.
The batch analysis module demonstrates that analysis workflows consist of multiple algorithms with different computation and communication patterns, which are suitable for different processing frameworks. To achieve efficient workflows, we build an integrated analysis stack based on YARN, and make novel use of customized indices in developing sophisticated analysis algorithms.
In the streaming analysis module, the high-dimensional data representation of social media streams poses special challenges to the problem of parallel stream clustering. Due to the sparsity of the high-dimensional data, the traditional synchronization method becomes expensive and severely impacts the scalability of the algorithm. Therefore, we design a novel strategy that broadcasts the incremental changes rather than the whole centroids of the clusters to achieve scalable parallel stream clustering algorithms.
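The incremental-broadcast idea can be sketched as follows. This is a hedged, in-process illustration rather than the dissertation's implementation: the `ClusterReplica` class, the `make_delta` and `message_size` helpers, the sum-style centroid, and the plain loop standing in for the broadcast channel are all hypothetical names and assumptions.

```python
from collections import defaultdict

class ClusterReplica:
    """A worker's local copy of the cluster centroids (sparse sums)."""

    def __init__(self, n_clusters):
        self.centroids = [defaultdict(float) for _ in range(n_clusters)]

    def apply_delta(self, delta):
        """Apply one delta message of the form (cluster_id, {dim: change})."""
        cluster_id, changes = delta
        for dim, change in changes.items():
            self.centroids[cluster_id][dim] += change

def make_delta(cluster_id, point):
    """Adding a point to a sum-style centroid touches only the point's dimensions."""
    return (cluster_id, dict(point))

def message_size(delta):
    """Message cost measured as the number of (dimension, value) pairs sent."""
    return len(delta[1])

# Three replicas stand in for parallel workers; the loop below stands in
# for a broadcast over a separate synchronization channel (an assumption).
workers = [ClusterReplica(n_clusters=1) for _ in range(3)]
points = [{"a": 1.0, "b": 1.0}, {"b": 1.0, "c": 1.0}, {"c": 1.0, "d": 1.0}]

delta_cost = 0  # entries sent when broadcasting only the changes
full_cost = 0   # entries sent if the whole centroid were broadcast each time
for point in points:
    delta = make_delta(0, point)
    for worker in workers:
        worker.apply_delta(delta)
    delta_cost += message_size(delta)
    full_cost += len(workers[0].centroids[0])

print(delta_cost, full_cost)
```

Each delta message stays as small as the point that caused it, while a full-centroid message grows with the centroid, so the gap between the two costs widens as more points arrive.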
Performance tests using real applications show that our solutions for parallel data loading/indexing, queries, analysis tasks, and stream clustering all significantly outperform implementations using current state-of-the-art technologies.
A Clustering System for Dynamic Data Streams Based on Metaheuristic Optimisation
Open access article
This article presents the Optimised Stream clustering algorithm (OpStream), a novel approach to clustering dynamic data streams. The proposed system displays desirable features, such as a low number of parameters and good scalability with respect to both the dimensionality of the data and the number of clusters in the dataset. It is based on a hybrid structure that uses deterministic clustering methods and stochastic optimisation approaches to optimally centre the clusters. Similar to other state-of-the-art methods available in the literature, it uses “microclusters” and other established techniques, such as density-based clustering. Unlike other methods, it makes use of metaheuristic optimisation to maximise performance during the initialisation phase, which precedes the classic online phase. Experimental results show that OpStream outperforms the state-of-the-art methods in several cases, and it is always competitive against other comparison algorithms regardless of the chosen optimisation method. Three variants of OpStream, each coming with a different optimisation algorithm, are presented in this study. A thorough sensitivity analysis is performed using the best variant to point out OpStream’s robustness to noise and resilience to parameter changes.