
    Striving towards Near Real-Time Data Integration for Data Warehouses

    The amount of information available to large-scale enterprises is growing rapidly. While operational systems are designed to meet well-specified (short) response time requirements, the focus of data warehouses is generally the strategic analysis of business data integrated from heterogeneous source systems. The decision-making process in traditional data warehouse environments is often delayed because data cannot be propagated from the source system to the data warehouse in time. A real-time data warehouse aims to decrease the time it takes to make business decisions and tries to attain zero latency between the cause and effect of a business decision. In this paper we present an architecture of an ETL environment for real-time data warehouses, which supports continual near real-time data propagation. The architecture takes full advantage of existing J2EE (Java 2 Platform, Enterprise Edition) technology and enables the implementation of a distributed, scalable, near real-time ETL environment. Instead of using vendor-proprietary ETL (extraction, transformation, loading) solutions, which are often hard to scale and often do not support optimization of the time frames allocated for data extracts, we propose ETLets (pronounced "et-lets") and Enterprise JavaBeans (EJB) for the ETL processing tasks.
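
    The abstract describes schedulable, component-based ETL tasks rather than giving an API. The following minimal Java sketch illustrates that idea under stated assumptions: the Etlet interface and every name in it are hypothetical, invented for illustration, and are not the authors' actual J2EE components.

        import java.util.List;
        import java.util.concurrent.*;

        // Hypothetical "ETLet": a schedulable ETL task in the spirit of the paper's
        // architecture. The interface and all names are illustrative assumptions.
        interface Etlet {
            List<String> extract();          // pull changed records from a source
            String transform(String record); // apply a transformation rule
            void load(String record);        // write into the warehouse staging area
        }

        public class EtletScheduler {
            public static void main(String[] args) {
                Etlet etlet = new Etlet() {
                    public List<String> extract() { return List.of("order#1", "order#2"); }
                    public String transform(String r) { return r.toUpperCase(); }
                    public void load(String r) { System.out.println("loaded " + r); }
                };
                ScheduledExecutorService pool = Executors.newScheduledThreadPool(2);
                // Near real-time: run the extract-transform-load cycle every few seconds
                // within an allocated time frame, instead of a nightly batch window.
                pool.scheduleAtFixedRate(
                    () -> etlet.extract().stream().map(etlet::transform).forEach(etlet::load),
                    0, 5, TimeUnit.SECONDS);
            }
        }

    In a real J2EE deployment a container-managed EJB would replace the explicit thread pool; the executor here merely stands in for the container's scheduling.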

    An event-based near real-time data integration architecture

    Extract-Transform-Load (ETL) tools feed data from operational databases into data warehouses. Traditionally, these ETL tools use batch processing and operate offline at regular time intervals, for example on a nightly or weekly basis. Naturally, users prefer up-to-date data for making their decisions, so there is a demand for real-time ETL tools. In this paper we investigate an event-based near real-time ETL layer for transferring and transforming data from the operational database to the data warehouse. One of our main concerns is master data management in the ETL layer. We present the architecture of a novel, general-purpose, event-driven, near real-time ETL layer that uses a Database Queue (DBQ), works on a push-technology principle, and directly supports content enrichment. We also observe that the system architecture is consistent with the information architecture of a classical Online Transaction Processing (OLTP) application, allowing us to distinguish between different kinds of data to increase the clarity of the design. Keywords: event-based architecture, content enrichment, master data, extract-transform-load, enterprise service bus
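
    As a rough illustration of the push principle and content enrichment described above, the Java sketch below uses an in-memory queue as a stand-in for the Database Queue and a map as the master-data store; all identifiers are assumptions for the example, not the paper's components.

        import java.util.Map;
        import java.util.concurrent.*;

        // Illustrative sketch of a push-based flow: source events are placed on a
        // queue (standing in for the DBQ) and a consumer enriches each event with
        // master data before loading it. Names and schema are assumptions.
        public class EventEtlSketch {
            public static void main(String[] args) throws InterruptedException {
                BlockingQueue<String> dbq = new LinkedBlockingQueue<>();
                Map<String, String> masterData = Map.of("C42", "Acme Corp"); // master-data lookup

                // Producer side: an OLTP transaction pushes an event instead of
                // waiting for a periodic batch run.
                dbq.put("sale,C42,199.00");

                // Consumer side: dequeue, enrich with master data (content enrichment), load.
                String event = dbq.take();
                String customerId = event.split(",")[1];
                String enriched = event + ",customerName=" + masterData.getOrDefault(customerId, "unknown");
                System.out.println("load into DW: " + enriched);
            }
        }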

    Grouping and joining transformations in the data extraction process

    In this paper we present a method of describing ETL (Extraction, Transformation and Loading) processes using graphs. We focus on implementation aspects such as the division of the whole process into threads, communication and data exchange between threads, and deadlock prevention. Methods for processing large data sets with insufficient memory resources are also presented, using joining and grouping nodes as examples. Our solution is compared with the efficiency of OS-level virtual memory in a few tests, whose results are presented and discussed.
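
    The threading model sketched in this abstract (graph nodes as threads exchanging data) can be illustrated as follows. This is a minimal sketch of the general pattern, assuming bounded queues for inter-thread data exchange and an end-of-stream marker for clean shutdown; it is not the paper's implementation.

        import java.util.concurrent.*;

        // Each graph node runs on its own thread; rows travel over a bounded queue,
        // and a terminal marker ("poison pill") lets the downstream node stop
        // cleanly instead of blocking forever. All names are illustrative.
        public class EtlGraphSketch {
            static final String EOF = "\u0000EOF";

            public static void main(String[] args) throws InterruptedException {
                BlockingQueue<String> extractToGroup = new ArrayBlockingQueue<>(1024);

                Thread extractNode = new Thread(() -> {
                    try {
                        for (String row : new String[]{"a,1", "b,2", "a,3"}) extractToGroup.put(row);
                        extractToGroup.put(EOF); // signal completion downstream
                    } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
                });

                Thread groupNode = new Thread(() -> {
                    var sums = new java.util.HashMap<String, Integer>();
                    try {
                        for (String row; !(row = extractToGroup.take()).equals(EOF); ) {
                            String[] kv = row.split(",");
                            sums.merge(kv[0], Integer.parseInt(kv[1]), Integer::sum); // grouping node
                        }
                    } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
                    System.out.println("grouped: " + sums);
                });

                extractNode.start(); groupNode.start();
                extractNode.join(); groupNode.join();
            }
        }

    The bounded queue applies back-pressure so a fast extractor cannot exhaust memory, and the end-of-stream marker prevents the consumer from waiting forever, one simple form of deadlock prevention.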

    Framework for Interoperable and Distributed Extraction-Transformation-Loading (ETL) Based on Service Oriented Architecture

    Extraction, Transformation and Loading (ETL) are the major functionalities in data warehouse (DW) solutions. The lack of component distribution and interoperability, caused by the tightly coupled components of the current ETL framework, leads to many problems in the ETL domain. This research discusses how to distribute the Extraction, Transformation and Loading components so as to achieve distribution and interoperability of these ETL components, and shows how the ETL framework can be extended. To achieve that, Service Oriented Architecture (SOA) is adopted to address the missing features of distribution and interoperability by restructuring the current ETL framework. This research contributes to the field of ETL by adding the concepts of distribution and interoperability to the ETL framework, which in turn contributes to the areas of data warehousing and business intelligence, because ETL is a core concept in this area. The Design Science Approach (DSA) and Scrum methodologies were adopted, and their integration provided suitable methods for achieving the research goals. The new ETL framework is realized by developing and testing a prototype based on it. This prototype was successfully evaluated in three case studies conducted using the data and tools of three different organizations that use data warehouse solutions to generate statistical reports which help their top management make decisions. Results of the case studies show that distribution and interoperability can be achieved by using the new ETL framework.
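
    To make the decoupling concrete, here is a minimal Java sketch in which extraction, transformation, and loading are separate service contracts that could each be deployed behind a SOAP or REST endpoint. The interfaces and the orchestrator are hypothetical illustrations, not the framework from the paper.

        import java.util.List;
        import java.util.stream.Collectors;

        // Each ETL stage is an independent service contract; in a SOA deployment
        // each would be a remote stub rather than a local lambda.
        interface ExtractionService { List<String> extract(String sourceId); }
        interface TransformationService { List<String> transform(List<String> rows); }
        interface LoadingService { void load(List<String> rows); }

        public class SoaEtlOrchestrator {
            public static void main(String[] args) {
                // Local stand-ins for what would be remote, independently deployed services.
                ExtractionService e = src -> List.of("row1@" + src, "row2@" + src);
                TransformationService t = rows -> rows.stream().map(String::toUpperCase).collect(Collectors.toList());
                LoadingService l = rows -> rows.forEach(r -> System.out.println("load " + r));

                // The orchestrator composes loosely coupled services instead of a monolithic ETL job.
                l.load(t.transform(e.extract("crm-db")));
            }
        }

    Because each stage sits behind its own contract, a stage can be replaced, scaled, or relocated without touching the others, which is the interoperability and distribution argument the abstract makes.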

    Data virtualization design model for near real time decision making in business intelligence environment

    The main purpose of Business Intelligence (BI) is to support an organization's strategic, operational and tactical decisions by providing comprehensive, accurate and vivid data to decision makers. A data warehouse (DW), which serves as the input for decision-making activities, is created through a complex process known as Extract, Transform and Load (ETL). ETL operates at pre-defined times and requires time to process and transfer data, so providing near real-time information for data integration in support of the decision-making process is a known issue. Inaccessibility to near real-time information can be overcome with Data Virtualization (DV), as it provides a unified, abstracted, near real-time, and encapsulated view of information for querying. Nevertheless, there is currently a lack of studies on BI models for developing and managing data in a virtual manner that can fulfil an organization's needs. Therefore, the main aim of this study is to propose a DV model for near real-time decision making in a BI environment. Design science research methodology was adopted to accomplish the research objectives. As a result of this study, a model called the Data Virtualization Development Model (DVDeM) is proposed that addresses the phases and components which affect the BI environment. To validate the model, expert reviews and focus group discussions were conducted. A prototype based on the proposed model was developed and then implemented in two case studies, and an instrument was developed to measure the usability of the prototype in providing near real-time data. In total, 60 participants were involved, and 93% of them agreed that the DVDeM-based prototype was able to provide near real-time data for supporting the decision-making process. The findings also showed that the majority of participants (more than 90%) in both the education and business sectors affirmed the workability of the DVDeM and the usability of the prototype, in particular its ability to deliver near real-time decision-making data. The findings indicate theoretical and practical contributions for developers building efficient BI applications using the DV technique. The mean values for each measurement item were greater than 4, indicating that respondents agreed with each statement, and the mean scores for the overall usability attributes of the DVDeM design model fell under "High" or "Fairly High". These results give a sufficient indication that a system developed by adopting the DVDeM model is perceived by the majority of respondents as highly usable and able to support near real-time decision making.
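
    As a loose illustration of the data virtualization idea (querying live sources on demand instead of materializing them through ETL batches), consider the following Java sketch; the source names, schema, and merge rule are invented for the example and do not come from the DVDeM model.

        import java.util.*;
        import java.util.stream.*;

        // A single virtual view answers a query by federating live sources at
        // request time, so no batch ETL run has to materialize the data first.
        public class VirtualViewSketch {
            public static void main(String[] args) {
                // Two heterogeneous "sources" queried live (hypothetical data).
                Map<String, Double> salesDb = Map.of("north", 120.0, "south", 80.0);
                Map<String, Double> webShopApi = Map.of("north", 30.0, "east", 55.0);

                // The virtual view merges both on demand; decision makers see current data.
                Map<String, Double> unified = Stream.of(salesDb, webShopApi)
                    .flatMap(m -> m.entrySet().stream())
                    .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue, Double::sum));

                System.out.println("near real-time revenue by region: " + unified);
            }
        }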

    A distributed tree data structure for real-time OLAP on cloud architectures

    In contrast to queries for on-line transaction processing (OLTP) systems, which typically access only a small portion of a database, OLAP queries may need to aggregate large portions of a database, which often leads to performance issues. In this paper we introduce CR-OLAP, a Cloud-based Real-time OLAP system built on a new distributed index structure for OLAP, the distributed PDCR tree, which utilizes a cloud infrastructure consisting of (m + 1) multi-core processors. With increasing database size, CR-OLAP dynamically increases m to maintain performance. Our distributed PDCR tree data structure supports multiple dimension hierarchies and efficient query processing on the elaborate dimension hierarchies that are so central to OLAP systems. It is particularly efficient for complex OLAP queries that need to aggregate large portions of the data warehouse, such as 'report the total sales in all stores located in California and New York during the months February-May of all years'. We evaluated CR-OLAP on the Amazon EC2 cloud using the TPC-DS benchmark data set. The tests demonstrate that CR-OLAP scales well with an increasing number of processors, even for complex queries. For example, on an Amazon EC2 cloud instance with eight processors, for a TPC-DS OLAP query stream on a data warehouse with 80 million tuples where every OLAP query aggregates more than 50% of the database, CR-OLAP achieved a query latency of 0.3 seconds, which can be considered a real-time response.
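
    The query pattern in the example above (aggregate sales filtered on a dimension hierarchy across distributed partitions) can be illustrated with a much-simplified Java sketch; the partitioning scheme and the Sale record are assumptions, and the actual distributed PDCR tree index is far more elaborate than this.

        import java.util.List;
        import java.util.Set;

        // Each of m workers holds a partition of the fact data; a query filters on
        // the dimension hierarchy and the partial sums are combined.
        public class DistributedAggSketch {
            record Sale(String state, int month, double amount) {}

            public static void main(String[] args) {
                // Two "workers'" partitions of the fact table (hypothetical data).
                List<List<Sale>> partitions = List.of(
                    List.of(new Sale("CA", 2, 10.0), new Sale("NY", 7, 5.0)),
                    List.of(new Sale("NY", 3, 7.5), new Sale("TX", 4, 9.0)));

                Set<String> states = Set.of("CA", "NY");
                double total = partitions.parallelStream()    // one stream per worker
                    .flatMap(List::stream)
                    .filter(s -> states.contains(s.state) && s.month >= 2 && s.month <= 5)
                    .mapToDouble(s -> s.amount)
                    .sum();
                System.out.println("total sales = " + total); // 17.5
            }
        }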

    MESHJOIN*: An Algorithm Supporting Streaming Updates in a Real-time Data Warehouse

    A new algorithm called MESHJOIN* is proposed to support streaming updates in a real-time data warehouse environment. It has the following distinct features: (1) the relation R is organized in blocks and hash buckets so as to avoid, as far as possible, reading tuples that are useless for the current join operation; this greatly reduces the number of tuples involved in a join and thus improves the efficiency of the join algorithm; (2) multi-threaded parallel execution is adopted, and the scheduling of the read and join operations on relation R is optimized according to engineering principles so as to maximize the efficiency of the join algorithm; (3) real-time and near-real-time tuples are scheduled according to the relationship between the current system service rate and the tuple arrival rate, so that the processing requirements for real-time tuples are satisfied. Experimental results show that MESHJOIN* achieves much better performance than MESHJOIN. Supported by the National Natural Science Foundation of China under Grant No. 50604012.
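
    Feature (1), probing only the hash bucket of R that can match an arriving stream tuple, can be illustrated with the following simplified Java sketch. The multi-threaded scheduling of the real algorithm is omitted, and all names and data are illustrative assumptions rather than the authors' code.

        import java.util.*;

        // Relation R is pre-partitioned by join key, so an arriving stream tuple
        // probes only the bucket that can match instead of scanning all of R.
        public class HashedStreamJoinSketch {
            public static void main(String[] args) {
                // Relation R partitioned into hash buckets on the join key.
                Map<String, List<String>> rBuckets = new HashMap<>();
                rBuckets.computeIfAbsent("k1", k -> new ArrayList<>()).add("R-row-A");
                rBuckets.computeIfAbsent("k2", k -> new ArrayList<>()).add("R-row-B");

                // Arriving stream tuples (key, payload) probe only their own bucket.
                for (String[] tuple : new String[][]{{"k1", "s1"}, {"k3", "s2"}}) {
                    for (String match : rBuckets.getOrDefault(tuple[0], List.of()))
                        System.out.println("join: " + tuple[1] + " x " + match);
                }
            }
        }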

    MESHJOIN*: An Algorithm Supporting Streaming Updates in a Real-time Data Warehouse

    …methods and schema-graph-based methods; the principles of each approach, together with their respective advantages and disadvantages, are described in detail, and future research directions are outlined. Supported by the National Natural Science Foundation of China under Grant No. 50604012.