Real-Time Data Processing With Lambda Architecture
Data has evolved immensely in recent years in type, volume, and velocity. Several frameworks exist to handle big data applications. This project focuses on the Lambda Architecture (LA) proposed by Marz and its application to real-time data processing. The architecture unites the benefits of batch and stream processing techniques: data can be processed historically with high precision and involved algorithms without losing short-term information, alerts, and insights. The Lambda Architecture can serve a wide range of use cases and workloads, and it tolerates hardware failures and human mistakes. The layered architecture enhances loose coupling and flexibility in the system. This is a huge benefit that allows understanding the trade-offs and applying various tools and technologies across the layers. The approach to building the LA has advanced due to improvements in the underlying tools. The project demonstrates a simplified architecture for the LA that is maintainable
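As a rough illustration of how batch and stream processing can be combined, the sketch below counts events per key with a batch view, a speed view, and a query that merges the two. It is a toy, single-process model and not the project's implementation; the class `LambdaCounter` and its method names are hypothetical and chosen only for this example.

```python
from collections import Counter


class LambdaCounter:
    """Toy Lambda Architecture that counts events per key (hypothetical example)."""

    def __init__(self):
        self.master = []             # append-only master dataset
        self.batch_view = Counter()  # view precomputed by the batch layer
        self.speed_view = Counter()  # incremental view of events not yet batched

    def ingest(self, key):
        # New data is appended to the master dataset and also
        # applied immediately to the speed layer's view.
        self.master.append(key)
        self.speed_view[key] += 1

    def run_batch(self):
        # Batch layer: recompute the view from scratch over all master data.
        self.batch_view = Counter(self.master)
        # The batch view now covers everything, so the speed view is reset.
        self.speed_view = Counter()

    def query(self, key):
        # Serving layer: merge the batch view with the real-time view.
        return self.batch_view[key] + self.speed_view[key]


lc = LambdaCounter()
for event in ["a", "b", "a"]:
    lc.ingest(event)
lc.run_batch()       # batch view now holds a=2, b=1
lc.ingest("a")       # arrives after the batch run, held in the speed view
print(lc.query("a"))
```

Recomputing the batch view from the full master dataset is what lets the batch layer correct earlier mistakes, while the speed view keeps queries current between batch runs.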
Experimental Performance Evaluation of Cloud-Based Analytics-as-a-Service
An increasing number of Analytics-as-a-Service solutions have recently emerged
in the landscape of cloud-based services. These services allow
flexible composition of compute and storage components that create powerful
data ingestion and processing pipelines. This work is a first attempt at an
experimental evaluation of analytic application performance executed using a
wide range of storage service configurations. We present an intuitive notion of
data locality, which we use as a proxy to rank different service compositions in
terms of expected performance. Through an empirical analysis, we dissect the
performance achieved by analytic workloads and unveil problems due to the
impedance mismatch that arise in some configurations. Our work paves the way to
a better understanding of modern cloud-based analytic services and their
performance, both for their end-users and their providers.
Comment: Longer version of the paper in submission at IEEE CLOUD'1
Scalable Solutions for Automated Single Pulse Identification and Classification in Radio Astronomy
Data collection for scientific applications is increasing exponentially and
is forecasted to soon reach peta- and exabyte scales. Applications which
process and analyze scientific data must be scalable and focus on execution
performance to keep pace. In the field of radio astronomy, in addition to
increasingly large datasets, tasks such as the identification of transient
radio signals from extrasolar sources are computationally expensive. We present
a scalable approach to radio pulsar detection written in Scala that
parallelizes candidate identification to take advantage of in-memory task
processing using Apache Spark on a YARN distributed system. Furthermore, we
introduce a novel automated multiclass supervised machine learning technique
that we combine with feature selection to reduce the time required for
candidate classification. Experimental testing on a Beowulf cluster with 15
data nodes shows that the parallel implementation of the identification
algorithm offers a speedup of up to 5X over a similar multithreaded
implementation. Further, we show that the combination of automated multiclass
classification and feature selection speeds up the execution performance of the
RandomForest machine learning algorithm by an average of 54% with less than a
2% average reduction in the algorithm's ability to correctly classify pulsars.
The generalizability of these results is demonstrated by using two real-world
radio astronomy data sets.
Comment: In Proceedings of the 47th International Conference on Parallel
Processing (ICPP 2018). ACM, New York, NY, USA, Article 11, 11 page