Declarative Data Analytics: a Survey
The area of declarative data analytics explores the application of the
declarative paradigm to data science and machine learning. It proposes
declarative languages for expressing data analysis tasks and develops systems
which optimize programs written in those languages. The execution engine can be
either centralized or distributed, as the declarative paradigm advocates
independence from particular physical implementations. The survey explores a
wide range of declarative data analysis frameworks by examining both the
programming model and the optimization techniques used, in order to provide
conclusions on the current state of the art in the area and identify open
challenges. Comment: 36 pages, 2 figure
Time Series Management Systems: A Survey
The collection of time series data increases as more monitoring and
automation are being deployed. These deployments range in scale from an
Internet of things (IoT) device located in a household to enormous distributed
Cyber-Physical Systems (CPSs) producing large volumes of data at high velocity.
To store and analyze these vast amounts of data, specialized Time Series
Management Systems (TSMSs) have been developed to overcome the limitations of
general-purpose Database Management Systems (DBMSs) for time series
management. In this paper, we present a thorough analysis and classification of
TSMSs developed through academic or industrial research and documented through
publications. Our classification is organized into categories based on the
architectures observed during our analysis. In addition, we provide an overview
of each system with a focus on the motivational use case that drove the
development of the system, the functionality for storage and querying of time
series a system implements, the components the system is composed of, and the
capabilities of each system with regard to Stream Processing and Approximate
Query Processing (AQP). Last, we provide a summary of research directions
proposed by other researchers in the field and present our vision for a
next-generation TSMS. Comment: 20 Pages, 15 Figures, 2 Tables, Accepted for publication in IEEE TKD
Building Near-Real-Time Processing Pipelines with the Spark-MPI Platform
Advances in detectors and computational technologies provide new
opportunities for applied research and the fundamental sciences. Concurrently,
dramatic increases in the three Vs (Volume, Velocity, and Variety) of
experimental data and the scale of computational tasks have produced a demand for
new real-time processing systems at experimental facilities. Recently, this
demand was addressed by the Spark-MPI approach connecting the Spark
data-intensive platform with the MPI high-performance framework. In contrast
with existing data management and analytics systems, Spark introduced a new
middleware based on resilient distributed datasets (RDDs), which decoupled
various data sources from high-level processing algorithms. The RDD middleware
significantly advanced the scope of data-intensive applications, spreading from
SQL queries to machine learning to graph processing. Spark-MPI further extended
the Spark ecosystem with the MPI applications using the Process Management
Interface. The paper explores this integrated platform within the context of
online ptychographic and tomographic reconstruction pipelines. Comment: New York Scientific Data Summit, August 6-9, 201
Efficient Data Management and Statistics with Zero-Copy Integration
Statistical analysts have long been struggling with ever-growing
data volumes. While specialized data management
systems such as relational databases would be able to handle
the data, statistical analysis tools are far more convenient to
express complex data analyses. An integration of these two
classes of systems has the potential to overcome the data
management issue while at the same time keeping analysis
convenient. However, one must keep a careful eye on implementation
overheads such as serialization. In this paper, we
propose the in-process integration of data management and
analytical tools. Furthermore, we argue that a zero-copy integration
is feasible due to the omnipresence of C-style arrays
containing native types. We discuss the general concept and
present a prototype of this integration based on the columnar
relational database MonetDB and the R environment for
statistical computing. We evaluate the performance of this
prototype in a series of micro-benchmarks of common data
management tasks.
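The zero-copy idea in the abstract above can be illustrated with a minimal sketch. It does not use MonetDB or R; it assumes only the Python standard library, and the `column` array is a hypothetical stand-in for a database column held as a C-style array of native doubles. The point is that a second component can read the same memory through a view, whereas a copy goes stale when the source changes.

```python
from array import array

# Hypothetical stand-in for a database column stored as a C-style array of doubles.
column = array("d", [1.5, 2.5, 4.0])

# memoryview exposes the same underlying buffer without copying it.
view = memoryview(column)

# A copy, for contrast: later writes to the column are not reflected here.
copied = column.tolist()

# Mutate the underlying column in place.
column[0] = 10.0

shared_first = view[0]    # reads the updated value through the shared buffer
copied_first = copied[0]  # still holds the value from before the write
total = sum(view)         # aggregate computed directly over the shared buffer
```

Here `shared_first` reflects the in-place write while `copied_first` does not, which is the behavior a zero-copy integration relies on: no serialization or duplication step sits between the storage layer and the analysis code.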
Big Data Analytics for Earth Sciences: the EarthServer approach
Big Data Analytics is an emerging field since massive storage and computing capabilities have been made available by advanced e-infrastructures. Earth and Environmental sciences are likely to benefit from Big Data Analytics techniques supporting the processing of the large number of Earth Observation datasets currently acquired and generated through observations and simulations. However, Earth Science data and applications present specificities in terms of relevance of the geospatial information, wide heterogeneity of data models and formats, and complexity of processing. Therefore, Big Earth Data Analytics requires specifically tailored techniques and tools. The EarthServer Big Earth Data Analytics engine offers a solution for coverage-type datasets, built around a high performance array database technology, and the adoption and enhancement of standards for service interaction (OGC WCS and WCPS). The EarthServer solution, led by the collection of requirements from scientific communities and international initiatives, provides a holistic approach that ranges from query languages and scalability up to mobile access and visualization. The result is demonstrated and validated through the development of lighthouse applications in the Marine, Geology, Atmospheric, Planetary and Cryospheric science domains
Storage Solutions for Big Data Systems: A Qualitative Study and Comparison
Big data systems development is full of challenges in view of the variety of
application areas and domains that this technology promises to serve.
Typically, fundamental design decisions involved in big data systems design
include choosing appropriate storage and computing infrastructures. In this age
of heterogeneous systems that integrate different technologies into an optimized
solution to a specific real-world problem, big data systems are no exception.
As far as the storage aspect of any big data system is
concerned, the primary facet in this regard is a storage infrastructure and
NoSQL seems to be the right technology that fulfills its requirements. However,
every big data application has different data characteristics, and thus the
corresponding data fits a different data model. This paper presents a
feature and use-case analysis and comparison of the four main data models,
namely document-oriented, key-value, graph, and wide-column. Moreover, a feature
analysis of 80 NoSQL solutions has been provided, elaborating on the criteria
and points that a developer must consider while making a possible choice.
Typically, big data storage needs to communicate with the execution engine and
other processing and visualization technologies to create a comprehensive
solution. This brings the second facet of big data storage, big data file
formats, into the picture. The second half of the research paper compares the
advantages, shortcomings and possible use cases of available big data file
formats for Hadoop, which is the foundation for most big data computing
technologies. Decentralized storage and blockchain are seen as the next
generation of big data storage, and their challenges and future prospects are
also discussed.
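As a rough illustration of the four data models named in the abstract above, the sketch below stores one small fact in each shape using plain Python structures. It is not taken from the paper: the record ("user u1, named Ada, follows user u2"), the key layout, and the helper `followed_by` are all hypothetical, and real NoSQL stores differ in many details.

```python
import json

# Key-value: an opaque value retrieved by a single key.
key_value = {"user:u1": json.dumps({"name": "Ada", "follows": ["u2"]})}

# Document-oriented: a nested, self-describing record that can be queried by field.
document = {"_id": "u1", "name": "Ada", "follows": ["u2"]}

# Wide-column: row key -> column family -> column -> value;
# different rows may carry different columns.
wide_column = {"u1": {"profile": {"name": "Ada"}, "follows": {"u2": True}}}

# Graph: explicit nodes and labeled edges.
nodes = {"u1": {"name": "Ada"}, "u2": {}}
edges = [("u1", "follows", "u2")]


def followed_by(user_id):
    """Return the ids that user_id follows, by scanning the edge list."""
    return [dst for src, label, dst in edges if src == user_id and label == "follows"]
```

The same question ("whom does u1 follow?") is answered by decoding a value, reading a field, listing columns, or traversing edges, which is one way to see why the fit between data characteristics and data model matters.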