Grids and the Virtual Observatory
We consider several projects from astronomy that benefit from the Grid paradigm and
associated technology, many of which involve either massive datasets or the federation
of multiple datasets. We cover image computation (mosaicking, multi-wavelength
images, and synoptic surveys); database computation (representation through XML,
data mining, and visualization); and semantic interoperability (publishing, ontologies,
directories, and service descriptions).
The use of alternative data models in data warehousing environments
Data Warehouses are increasing their data volume at an accelerated rate; high disk
space consumption, slow query response times and complex database administration are
common problems in these environments. The root causes of these problems are the lack
of a proper data model and of an architecture specifically targeted at these
environments.
Inefficient management of stored data includes duplicate values at column level and
poor handling of data sparsity; both result in low data density and inflate the final
size of Data Warehouses. It has been demonstrated that the Relational Model and
relational technology are not the best techniques for managing duplicates and data
sparsity.
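The column-level duplicate elimination that motivates models such as the Binary-Relational one can be illustrated with a simple dictionary encoding. This is a hypothetical sketch, not code from the thesis: each distinct value is stored once, and every row keeps only a small integer code.

```python
def dictionary_encode(column):
    """Encode a column as (distinct values, one integer code per row).

    Duplicate values are stored exactly once; rows reference them by
    position in the value dictionary, which raises data density.
    """
    values = []      # distinct values, in first-seen order
    index_of = {}    # value -> position in `values`
    codes = []       # one small integer per row
    for v in column:
        if v not in index_of:
            index_of[v] = len(values)
            values.append(v)
        codes.append(index_of[v])
    return values, codes

def dictionary_decode(values, codes):
    """Rebuild the original column from the encoded form."""
    return [values[c] for c in codes]

city = ["London", "Paris", "London", "London", "Paris"]
values, codes = dictionary_encode(city)
# Five rows collapse to two stored values plus five small codes.
assert dictionary_decode(values, codes) == city
```

The saving grows with the number of repeated values per column, which is exactly the situation the thesis describes for warehouse fact tables.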
The novelty of this research is to compare some data models considering their data
density and their data sparsity management to optimise Data Warehouse environments.
The Binary-Relational, the Associative/Triple Store and the Transrelational models
have been investigated and based on the research results a novel Alternative Data
Warehouse Reference architectural configuration has been defined.
For the Transrelational model, no database implementation existed. It was therefore
necessary to develop an instantiation of its storage mechanism, and as far as could be
determined this is the first publicly available instantiation of the storage
mechanism for the Transrelational model.
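A deliberately simplified sketch of the Transrelational storage idea (illustrative only, not the thesis's implementation): each column is stored sorted, so equal values become adjacent and easy to condense, and a per-column permutation records where each record's value sits in the sorted column so that rows can be reconstructed.

```python
def build_tr_storage(records):
    """Simplified Transrelational-style column storage (illustrative).

    Every column is kept in sorted order, and for each column a
    permutation maps a record index to that record's position in the
    sorted column.
    """
    ncols = len(records[0])
    sorted_cols, perms = [], []
    for c in range(ncols):
        col = [r[c] for r in records]
        order = sorted(range(len(col)), key=lambda i: col[i])
        sorted_cols.append([col[i] for i in order])
        perm = [0] * len(col)          # record index -> sorted position
        for pos, i in enumerate(order):
            perm[i] = pos
        perms.append(perm)
    return sorted_cols, perms

def reconstruct(sorted_cols, perms, i):
    """Rebuild record i from the column-wise storage."""
    return tuple(col[perm[i]] for col, perm in zip(sorted_cols, perms))

rows = [("Smith", 30), ("Jones", 25), ("Adams", 30)]
cols, perms = build_tr_storage(rows)
assert reconstruct(cols, perms, 1) == ("Jones", 25)
```

The full model condenses runs of equal values in the sorted columns, which is where the duplicate and sparsity savings discussed above come from; this sketch keeps only the sorting-plus-permutation core.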
Joining Entities Across Relation and Graph with a Unified Model
This paper introduces the RG (Relational Genetic) model, a revised relational
model that represents graph-structured data in an RDBMS while preserving its
topology, for efficiently and effectively extracting data in different formats
from disparate sources. It comes with: (a) an SQL dialect augmented
with graph pattern queries and tuple-vertex joins, such that one can extract
graph properties via graph pattern matching and "semantically" match entities
across relations and graphs; (b) a logical representation of graphs in an RDBMS,
which introduces an exploration operator for efficient pattern querying and
also supports browsing and updating graph-structured data; and (c) a strategy
to uniformly evaluate SQL, pattern and hybrid queries that join tuples and
vertices, all inside an RDBMS, by leveraging its optimizer without the
performance degradation of switching between different execution engines. A
lightweight system, WhiteDB, is developed as an implementation to evaluate the
benefits the model can bring on real-life data. We empirically verified that
the RG model enables graph pattern queries to be answered as efficiently as in
native graph engines, can consider access to graphs and relations in any order
when searching for an optimal plan, and supports effective data enrichment.
Comment: 24 pages, 16 figures, 5 tables
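The general idea of answering graph pattern queries inside an RDBMS can be sketched with plain relational tables and joins. This is a hypothetical illustration (the table layout and query are not taken from the paper): vertices and edges live in ordinary tables, and a two-hop path pattern a->b->c becomes a self-join that the database optimizer plans like any other query.

```python
import sqlite3

# Store a small graph relationally: one vertex table, one edge table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE vertex(id INTEGER PRIMARY KEY, label TEXT);
    CREATE TABLE edge(src INTEGER, dst INTEGER);
""")
conn.executemany("INSERT INTO vertex VALUES (?, ?)",
                 [(1, "alice"), (2, "bob"), (3, "carol")])
conn.executemany("INSERT INTO edge VALUES (?, ?)", [(1, 2), (2, 3)])

# Pattern ?a -> ?b -> ?c, evaluated as a self-join on the edge table;
# the RDBMS optimizer chooses the join order.
rows = conn.execute("""
    SELECT a.label, b.label, c.label
    FROM edge e1
    JOIN edge e2 ON e1.dst = e2.src
    JOIN vertex a ON a.id = e1.src
    JOIN vertex b ON b.id = e1.dst
    JOIN vertex c ON c.id = e2.dst
""").fetchall()
print(rows)  # [('alice', 'bob', 'carol')]
```

The RG model goes well beyond this baseline (topology-preserving representation, an exploration operator, hybrid tuple-vertex joins), but the sketch shows why a single relational engine can host both kinds of queries.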
Storage Solutions for Big Data Systems: A Qualitative Study and Comparison
Big data systems development is full of challenges in view of the variety of
application areas and domains that this technology promises to serve.
Typically, fundamental design decisions in big data systems design include
choosing appropriate storage and computing infrastructures. In this age of
heterogeneous systems that integrate different technologies into an optimized
solution to a specific real-world problem, big data systems are no exception.
As far as the storage aspect of any big data system is concerned, the primary
facet is the storage infrastructure, and NoSQL seems to be the right technology
to fulfil its requirements. However, every big data application has different
data characteristics, and thus the corresponding data fits a different data
model. This paper presents a feature and use-case analysis and comparison of
the four main data models, namely document-oriented, key-value, graph and
wide-column. Moreover, a feature analysis of 80 NoSQL solutions is provided,
elaborating on the criteria and points that a developer must consider while
making a choice. Typically, big data storage needs to communicate with the
execution engine and other processing and visualization technologies to create
a comprehensive solution. This brings the second facet of big data storage, big
data file formats, into the picture. The second half of the paper compares the
advantages, shortcomings and possible use cases of available big data file
formats for Hadoop, which is the foundation for most big data computing
technologies. Decentralized storage and blockchain are seen as the next
generation of big data storage, and their challenges and future prospects are
also discussed.
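The difference between the four data models compared in the paper can be made concrete by mapping the same entity onto each one. The record shapes and names below are hypothetical, chosen only for illustration:

```python
# One "user" entity expressed in each of the four NoSQL data models.

# Document-oriented: a self-contained, possibly nested document.
document = {"_id": "u1", "name": "Ada", "tags": ["admin", "ops"]}

# Key-value: an opaque value behind a single key; the store knows
# nothing about the value's internal structure.
key_value = {"user:u1": '{"name": "Ada", "tags": ["admin", "ops"]}'}

# Wide-column: rows keyed by id, each holding a sparse map of
# columnfamily:column -> value; rows need not share columns.
wide_column = {"u1": {"profile:name": "Ada", "profile:tag0": "admin"}}

# Graph: entities as vertices, relationships as first-class edges.
vertices = {"u1": {"name": "Ada"}, "g1": {"group": "admin"}}
edges = [("u1", "MEMBER_OF", "g1")]
```

Which shape fits best depends on the access pattern: whole-entity reads favour documents, pure lookups favour key-value, sparse per-row attributes favour wide-column, and relationship traversals favour the graph model.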
Augmenting data warehousing architectures with Hadoop
Dissertation presented as the partial requirement for obtaining a Master's degree in Information Management, specialization in Information Systems and Technologies Management.
As the volume of available data increases exponentially, traditional data warehouses struggle to transform this data into actionable knowledge. Data strategies that include the creation and maintenance of data warehouses have a lot to gain by incorporating technologies from the Big Data spectrum. Hadoop, as a transformation tool, can add a theoretically infinite dimension of data processing, feeding transformed information into traditional data warehouses that ultimately will retain their value as central components in organizations' decision support systems.
This study explores the potentialities of Hadoop as a data transformation tool in the setting of a traditional data warehouse environment. Hadoop’s execution model, which is oriented for distributed parallel processing, offers great capabilities when the amounts of data to be processed require the infrastructure to expand. Horizontal scalability, which is a key aspect in a Hadoop cluster, will allow for proportional growth in processing power as the volume of data increases.
Through the use of Hive on Tez in a Hadoop cluster, this study transforms television viewing events, extracted from Ericsson's Mediaroom Internet Protocol Television infrastructure, into pertinent audience metrics, like Rating, Reach and Share. These measurements are then made available in a traditional data warehouse, supported by a traditional Relational Database Management System, where they are presented through a set of reports.
The main contribution of this research is a proposed augmented data warehouse architecture where the traditional ETL layer is replaced by a Hadoop cluster, running Hive on Tez, with the purpose of performing the heaviest transformations that convert raw data into actionable information. Through a typification of the SQL statements, responsible for the data transformation processes, we were able to understand that Hadoop, and its distributed processing model, delivers outstanding performance results associated with the analytical layer, namely in the aggregation of large data sets.
Ultimately, we demonstrate, empirically, the performance gains that can be extracted from Hadoop, in comparison to an RDBMS, regarding speed, storage usage and scalability potential, and suggest how this can be used to evolve data warehouses into the age of Big Data.
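The kind of aggregation the dissertation pushes down into Hive on Tez can be sketched in miniature. This is an illustrative example, not the study's code, and it uses simplified textbook definitions of the metrics: Reach as the share of the population that watched at all, and Rating as the average audience over a time slot.

```python
def audience_metrics(events, population, slot_minutes):
    """Compute simplified Reach and Rating from raw viewing events.

    events: (viewer_id, minutes_watched) tuples for one time slot.
    Reach  = distinct viewers / population.
    Rating = total viewer-minutes / slot length / population,
             i.e. the average audience as a share of the population.
    """
    viewers = {viewer for viewer, _ in events}
    viewer_minutes = sum(minutes for _, minutes in events)
    reach = len(viewers) / population
    rating = viewer_minutes / slot_minutes / population
    return reach, rating

# Viewer "a" appears twice: distinct for Reach, summed for Rating.
events = [("a", 30), ("b", 60), ("a", 30)]
reach, rating = audience_metrics(events, population=10, slot_minutes=60)
assert reach == 0.2    # 2 of 10 viewers watched
assert rating == 0.2   # 120 viewer-minutes / 60 min / 10 people
```

At warehouse scale the same GROUP BY-style aggregation runs over billions of events, which is exactly where the dissertation found Hadoop's distributed processing model to outperform a single RDBMS.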