Augmenting data warehousing architectures with hadoop

Abstract

Dissertation presented as the partial requirement for obtaining a Master's degree in Information Management, specialization in Information Systems and Technologies ManagementAs the volume of available data increases exponentially, traditional data warehouses struggle to transform this data into actionable knowledge. Data strategies that include the creation and maintenance of data warehouses have a lot to gain by incorporating technologies from the Big Data’s spectrum. Hadoop, as a transformation tool, can add a theoretical infinite dimension of data processing, feeding transformed information into traditional data warehouses that ultimately will retain their value as central components in organizations’ decision support systems. This study explores the potentialities of Hadoop as a data transformation tool in the setting of a traditional data warehouse environment. Hadoop’s execution model, which is oriented for distributed parallel processing, offers great capabilities when the amounts of data to be processed require the infrastructure to expand. Horizontal scalability, which is a key aspect in a Hadoop cluster, will allow for proportional growth in processing power as the volume of data increases. Through the use of a Hive on Tez, in a Hadoop cluster, this study transforms television viewing events, extracted from Ericsson’s Mediaroom Internet Protocol Television infrastructure, into pertinent audience metrics, like Rating, Reach and Share. These measurements are then made available in a traditional data warehouse, supported by a traditional Relational Database Management System, where they are presented through a set of reports. The main contribution of this research is a proposed augmented data warehouse architecture where the traditional ETL layer is replaced by a Hadoop cluster, running Hive on Tez, with the purpose of performing the heaviest transformations that convert raw data into actionable information. Through a typification of the SQL statements, responsible for the data transformation processes, we were able to understand that Hadoop, and its distributed processing model, delivers outstanding performance results associated with the analytical layer, namely in the aggregation of large data sets. Ultimately, we demonstrate, empirically, the performance gains that can be extracted from Hadoop, in comparison to an RDBMS, regarding speed, storage usage and scalability potential, and suggest how this can be used to evolve data warehouses into the age of Big Data

    Similar works