Storage Solutions for Big Data Systems: A Qualitative Study and Comparison
Big data systems development is full of challenges given the variety of
application areas and domains that this technology promises to serve.
Fundamental design decisions in big data systems design typically include
choosing appropriate storage and computing infrastructures. In this age of
heterogeneous systems that integrate different technologies into an optimized
solution to a specific real-world problem, big data systems are no exception.
As far as the storage aspect of any big data system is concerned, the primary
facet is the storage infrastructure, and NoSQL appears to be the technology
that best fulfills its requirements. However, every big data application has
different data characteristics, and the corresponding data therefore fits a
different data model. This paper presents a feature and use-case analysis and
comparison of the four main data models, namely document-oriented, key-value,
graph, and wide-column. Moreover, a feature analysis of 80 NoSQL solutions is
provided, elaborating on the criteria and points that a developer must
consider when making a choice.
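To make the four data models concrete, the sketch below (not from the paper; all record names and field layouts are illustrative assumptions) expresses the same "user" record in each model:

```python
# Illustrative sketch: the same record in the four NoSQL data models the
# abstract names. Names and layouts are assumptions, not the paper's examples.

# Document-oriented (e.g. MongoDB style): a self-contained nested document.
document = {
    "_id": "u42",
    "name": "Alice",
    "orders": [{"sku": "A1", "qty": 2}],
}

# Key-value (e.g. Redis style): opaque values addressed by composite keys.
key_value = {
    "user:u42:name": "Alice",
    "user:u42:orders": '[{"sku": "A1", "qty": 2}]',  # value is just a blob
}

# Wide-column (e.g. Cassandra/HBase style): rows of sparse column families.
wide_column = {
    "u42": {
        "profile": {"name": "Alice"},
        "orders": {"A1": 2},  # column name encodes the order, value the qty
    }
}

# Graph (e.g. Neo4j style): nodes plus explicit, typed relationships.
graph = {
    "nodes": {
        "u42": {"label": "User", "name": "Alice"},
        "A1": {"label": "Product"},
    },
    "edges": [("u42", "ORDERED", "A1", {"qty": 2})],
}
```

The trade-off the paper's comparison turns on is visible even here: the document model keeps related data together, the key-value model is the simplest to scale but opaque to query, the wide-column model supports sparse per-row schemas, and the graph model makes relationships first-class.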
Typically, big data storage needs to communicate with the execution engine and
with other processing and visualization technologies to create a comprehensive
solution. This brings the second facet of big data storage, big data file
formats, into the picture. The second half of the paper compares the
advantages, shortcomings, and possible use cases of the available big data
file formats for Hadoop, which is the foundation of most big data computing
technologies. Decentralized storage and blockchain are seen as the next
generation of big data storage, and their challenges and future prospects are
also discussed.
Experience with the Open Source based implementation for ATLAS Conditions Data Management System
Conditions Data in high energy physics experiments is frequently understood as
all data needed for reconstruction besides the event data itself. This
includes all sorts of slowly evolving data, such as detector alignment,
calibration and robustness, and data from the detector control system. Every
Conditions Data Object is also associated with a time interval of validity and
a version. Besides that, it is quite often useful to tag collections of
Conditions Data Objects together. These issues have already been investigated,
and a data model has been proposed and used for different implementations
based on commercial DBMSs, both at CERN and for the BaBar experiment. The
special case of the complex ATLAS trigger, which requires online access to
calibration and alignment data, poses new challenges that have to be met with
a flexible and customizable solution more in line with Open Source components.
Motivated by the ATLAS challenges, we have developed an alternative
implementation based on an Open Source RDBMS. Several issues were investigated
and will be described in this paper:
- The best way to map the conditions data model onto the relational database
concept, considering what are foreseen as the most frequent queries.
- The clustering model best suited to address the scalability problem.
- Extensive tests that were performed and will be described.
The very promising results from these tests are attracting attention from the
HEP community and driving further developments.
Comment: 8 pages, 4 figures, 3 tables, conferenc
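The conditions data model the abstract describes can be sketched in a few lines: each object carries an interval of validity and a version, and a tag groups a collection of objects. This is a hedged illustrative toy, not ATLAS code; all class and function names are assumptions.

```python
# Minimal sketch (an assumption, not the ATLAS implementation) of the
# conditions data model: objects with an interval of validity and a version,
# grouped under tags, with a time-based lookup.
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class ConditionsObject:
    since: int      # start of the interval of validity (e.g. a timestamp)
    until: int      # end of the interval of validity (exclusive)
    version: int
    payload: Any    # calibration, alignment, or control-system data


def lookup(store: Dict[str, List[ConditionsObject]],
           tag: str, t: int) -> Optional[Any]:
    """Return the payload valid at time t under the given tag,
    preferring the highest version when intervals overlap."""
    candidates = [o for o in store.get(tag, []) if o.since <= t < o.until]
    if not candidates:
        return None
    return max(candidates, key=lambda o: o.version).payload


store = {
    "align-v2": [
        ConditionsObject(0, 100, version=1, payload="alignment-A"),
        ConditionsObject(50, 100, version=2, payload="alignment-B"),
    ]
}
# At t=75 both intervals cover the time; version 2 wins.
assert lookup(store, "align-v2", 75) == "alignment-B"
assert lookup(store, "align-v2", 10) == "alignment-A"
```

The interval-plus-version lookup is exactly the query pattern that drives the paper's first question, how to map this model onto relational tables so that the most frequent queries stay fast.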
Image databases: Problems and perspectives
With the increasing number of computer graphics, image processing, and pattern recognition applications, economical storage, efficient representation and manipulation, and powerful and flexible query languages for the retrieval of image data are of paramount importance. These and related issues pertinent to image databases are examined.
Curriculum Guidelines for Undergraduate Programs in Data Science
The Park City Math Institute (PCMI) 2016 Summer Undergraduate Faculty Program
met for the purpose of composing guidelines for undergraduate programs in Data
Science. The group consisted of 25 undergraduate faculty from a variety of
institutions in the U.S., primarily from the disciplines of mathematics,
statistics, and computer science. These guidelines are meant to provide some
structure for institutions planning for or revising a major in Data Science.
DataHub: Collaborative Data Science & Dataset Version Management at Scale
Relational databases have limited support for data collaboration, in which
teams collaboratively curate and analyze large datasets. Inspired by software
version control systems like git, we propose (a) a dataset version control
system, giving users the ability to create, branch, merge, difference, and
search large, divergent collections of datasets, and (b) a platform, DataHub,
that gives users the ability to perform collaborative data analysis on top of
this version control system. We outline the challenges in providing dataset
version control at scale.
Comment: 7 page
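The git-inspired operations the abstract lists (create, branch, commit, difference) can be sketched as a toy in-memory repository. This is an illustrative assumption, not the DataHub API; all class and method names are hypothetical.

```python
# Toy sketch (not DataHub) of git-like version control over datasets:
# branches hold lists of dataset versions; diff compares branch heads.
import copy


class DatasetRepo:
    def __init__(self, rows):
        # Each branch maps to a list of dataset versions (lists of rows).
        self.branches = {"master": [list(rows)]}

    def branch(self, src, dst):
        # A new branch starts from the latest version of the source branch.
        self.branches[dst] = [copy.deepcopy(self.branches[src][-1])]

    def commit(self, branch, rows):
        # Record a new dataset version on the given branch.
        self.branches[branch].append(list(rows))

    def head(self, branch):
        return self.branches[branch][-1]

    def diff(self, a, b):
        # Rows present in b's head but absent from a's head.
        return [r for r in self.head(b) if r not in self.head(a)]


repo = DatasetRepo([("u1", 10), ("u2", 20)])
repo.branch("master", "cleaned")
repo.commit("cleaned", [("u1", 10), ("u2", 20), ("u3", 30)])
assert repo.diff("master", "cleaned") == [("u3", 30)]
```

The scale challenge the paper outlines is visible even in this toy: storing every version as a full copy is simple but wasteful, which is why a real system needs delta encoding and shared storage across divergent branches.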