Data warehouse technology has been successfully integrated into the information infrastructure of major organizations as a potential solution for eliminating redundancy and providing comprehensive data integration. Realizing the importance of a data warehouse as the main data repository within an organization, this dissertation addresses different aspects of data warehouse architecture and performance.

Many data warehouse architectures have been presented by industry analysts and research organizations. These architectures vary from independent, physical, business-unit-centric data marts to the centralised two-tier hub-and-spoke data warehouse. The operational data store is a third tier that was offered later to address the business requirements for inter-day data loading. While the industry-available architectures are all valid, I found them to be suboptimal in efficiency (cost) and effectiveness (productivity).

In this dissertation, I advocate a new architecture (the Hybrid Architecture) which encompasses the industry-advocated architectures. The hybrid architecture demands the acquisition, loading and consolidation of enterprise atomic and detailed data into a single integrated enterprise data store (the Enterprise Data Warehouse), where business-unit-centric Data Marts and Operational Data Stores (ODS) are built in the same instance of the Enterprise Data Warehouse.

To highlight the role of data warehouses in different applications, we describe an effort to develop a data warehouse for a geographical information system (GIS). We further study the importance of data practices, quality and governance for financial institutions by commenting on the RBC Financial Group case.

The development and deployment of the Enterprise Data Warehouse based on the Hybrid Architecture spawned its own issues and challenges.
Organic data growth and business requirements to load additional new data will significantly increase the amount of stored data. Consequently, the number of users will also increase significantly. Enterprise data warehouse obesity, performance degradation and navigation difficulties are chief amongst the issues and challenges.

Association rule mining and social networks have been adopted in this thesis to address the above-mentioned issues and challenges. We describe an approach that uses frequent pattern mining and social network techniques to discover different communities within the data warehouse. These communities include sets of tables frequently accessed together, sets of tables retrieved together most of the time and sets of attributes that mostly appear together in the queries. We concentrate on tables in the discussion; however, the model is general enough to discover other communities. We first build a frequent pattern mining model by considering each query as a transaction and the tables as items. Then, we mine closed frequent itemsets of tables; these itemsets include tables that are mostly accessed together and hence should be treated as one unit in storage and retrieval for better overall performance. We utilize social network construction and analysis to find maximum-sized sets of related tables; this is a more robust approach than taking a union of overlapping itemsets. We derive the Jaccard distance between the closed itemsets and construct the social network of tables by adding links that represent distance above a given threshold. The constructed network is analyzed to discover communities of tables that are mostly accessed together. The reported test results are promising and demonstrate the applicability and effectiveness of the developed approach.
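
The pipeline described above (queries as transactions, closed frequent itemsets of tables, Jaccard-based links, community discovery) can be illustrated with a minimal sketch. It is not the implementation used in this thesis: the function names, the toy query log, the brute-force miner, the `min_support` and `min_similarity` values, and the use of connected components as a stand-in for community detection are all assumptions made for illustration. The sketch also assumes that nodes are closed itemsets and that two itemsets are linked when their Jaccard similarity (one minus the Jaccard distance) reaches the threshold, which is only one possible reading of how the threshold is applied.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Brute-force miner: map each frequent set of tables to its support count."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        found = False
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            support = sum(1 for t in transactions if candidate <= t)
            if support >= min_support:
                result[candidate] = support
                found = True
        if not found:
            # No frequent itemset of this size, so none of a larger size exists.
            break
    return result


def closed_itemsets(frequent):
    """Keep itemsets that have no proper superset with the same support."""
    return {
        s: sup
        for s, sup in frequent.items()
        if not any(s < other and sup == frequent[other] for other in frequent)
    }


def jaccard_similarity(a, b):
    """Jaccard similarity of two sets; the Jaccard distance is 1 minus this."""
    return len(a & b) / len(a | b)


def table_communities(closed, min_similarity):
    """Link similar closed itemsets, then merge each connected group of itemsets."""
    nodes = list(closed)
    adjacency = {n: set() for n in nodes}
    for a, b in combinations(nodes, 2):
        if jaccard_similarity(a, b) >= min_similarity:
            adjacency[a].add(b)
            adjacency[b].add(a)
    seen, communities = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, tables = [start], set()
        while stack:  # depth-first walk over one connected component
            node = stack.pop()
            tables |= node
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        communities.append(frozenset(tables))
    return communities


# Hypothetical query log: each query is the set of tables it touches.
queries = [
    {"orders", "customers"},
    {"orders", "customers"},
    {"orders", "customers", "products"},
    {"orders", "customers", "products"},
    {"accounts", "branches"},
    {"accounts", "branches"},
]

closed = closed_itemsets(frequent_itemsets(queries, min_support=2))
communities = table_communities(closed, min_similarity=0.5)
for community in communities:
    print(sorted(community))
```

In this toy log the closed itemsets {orders, customers} and {orders, customers, products} overlap strongly enough to be linked, so they merge into one community, while {accounts, branches} stays separate. A production setting would presumably replace the brute-force miner with a dedicated closed-itemset algorithm and the connected-components step with a proper community detection method.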