
    A framework for detecting unnecessary industrial data in ETL processes

    Extract, transform, and load (ETL) is a critical process used by industrial organisations to move data from one database to another, such as from an operational system to a data warehouse. With the increasing amount of data stored by industrial organisations, some ETL processes can take in excess of 12 hours to complete; this can leave decision makers stranded while they wait for the data needed to support their decisions. After the ETL processes have been designed, data requirements inevitably change, and much of the data that goes through the ETL process may never be used or needed. This paper therefore proposes a framework for dynamically detecting and predicting unnecessary data and preventing it from slowing down ETL processes, either by removing it entirely or by deprioritizing it. Other advantages of the framework include the ability to prioritise data cleansing tasks and to determine which data should be processed first and placed into fast-access memory. We show existing example algorithms that can be used for each component of the framework, and present some initial testing results as part of our research to determine whether the framework can help to reduce ETL time. This is the author accepted manuscript. The final version is available from IEEE via http://dx.doi.org/10.1109/INDIN.2014.694555
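    As a hedged illustration of the kind of component such a framework could contain (not the algorithms evaluated in the paper), the Python sketch below ranks ETL columns by how often downstream warehouse queries actually reference them, so that never-used columns can be removed or deprioritised; the query log and column names are hypothetical.

        # Illustrative sketch: rank ETL columns by observed downstream usage so
        # that rarely or never-queried columns can be deprioritised or skipped.
        from collections import Counter

        def rank_columns(query_log, etl_columns):
            """query_log: iterable of column names referenced by warehouse queries."""
            usage = Counter(col for col in query_log if col in etl_columns)
            # Columns never seen in the log are candidates for removal or late loading.
            return sorted(etl_columns, key=lambda c: usage[c], reverse=True)

        if __name__ == "__main__":
            log = ["order_id", "amount", "order_id", "customer_id"]
            cols = ["order_id", "amount", "customer_id", "legacy_flag"]
            print(rank_columns(log, cols))  # legacy_flag ranks last -> deprioritise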

    GCIP: Exploiting the Generation and Optimization of Integration Processes

    As a result of the changing scope of data management towards the management of highly distributed systems and applications, integration processes have gained in importance. Such integration processes represent an abstraction of workflow-based integration tasks. In practice, integration processes are pervasive, and the performance of complete IT infrastructures strongly depends on the performance of the central integration platform that executes the specified integration processes. In this area, the three major problems are: (1) significant development effort, (2) low portability, and (3) inefficient execution. To overcome these problems, we follow a model-driven generation approach for integration processes. In this demo proposal, we introduce the so-called GCIP Framework (Generation of Complex Integration Processes), which allows the modeling of integration processes and the generation of different concrete integration tasks. The model-driven approach opens opportunities for rule-based and workload-based optimization techniques.
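    To make the idea of a generated, rule-optimised integration process more concrete, here is a minimal sketch (not the GCIP implementation; the operation model and the merge rule are assumptions) in which a process modelled as abstract operations is rewritten by collapsing consecutive filters before concrete tasks would be generated.

        # Illustrative sketch: an integration process as a list of abstract
        # operations, with a simple rule-based rewrite merging adjacent filters.
        from dataclasses import dataclass

        @dataclass
        class Op:
            kind: str      # e.g. "extract", "filter", "load"
            params: dict

        def optimize(process):
            """Rule: collapse consecutive filters into one conjunctive filter."""
            result = []
            for op in process:
                if result and op.kind == "filter" and result[-1].kind == "filter":
                    result[-1].params["predicate"] += " AND " + op.params["predicate"]
                else:
                    result.append(op)
            return result

        process = [Op("extract", {"source": "crm"}),
                   Op("filter", {"predicate": "country = 'DE'"}),
                   Op("filter", {"predicate": "active = 1"}),
                   Op("load", {"target": "dw.customers"})]
        print([(o.kind, o.params) for o in optimize(process)])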

    ETL Design with a Higher Education Case Study (Desain ETL dengan Contoh Kasus Perguruan Tinggi)

    A Data Warehouse for higher education is a paradigm for helping top management make effective and efficient strategic decisions based on reliable, trusted reports produced from the Data Warehouse itself. A Data Warehouse is not a piece of software, hardware, or a tool; it is an environment in which the transactional database is remodelled into a different view for decision-making purposes. ETL (Extraction, Transformation and Loading) is the bridge that builds the Data Warehouse and transforms data from the transactional database. Each fact and dimension table is extended with fields that support the merge-loading step of the ETL extraction. ETL requires an ETL table and an ETL process: the ETL table records the connections between tables in the OLTP database and tables in the Data Warehouse, and the ETL process transforms data from the OLTP tables into the Data Warehouse tables based on that ETL table. The extraction process is driven by this table and runs automatically as an ETL algorithm during idle transactional periods, together with the daily transactional database backup, when the information system is not in use.
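    The role of the ETL table described above can be sketched as a simple column mapping that drives the transformation from OLTP tables to warehouse tables; the table and column names below are illustrative assumptions, not taken from the paper.

        # Minimal sketch of an "ETL table" mapping OLTP columns to DW columns,
        # and a transform step driven by that mapping.
        ETL_TABLE = [
            # (oltp_table, oltp_column, dw_table, dw_column)
            ("students",   "student_id", "dim_student", "student_key"),
            ("students",   "name",       "dim_student", "student_name"),
            ("enrollment", "credits",    "fact_enroll", "credit_hours"),
        ]

        def transform(oltp_rows):
            """oltp_rows: {table: [ {column: value, ...}, ... ]} -> rows per DW table."""
            dw_rows = {}
            for src_tab, rows in oltp_rows.items():
                mappings = [m for m in ETL_TABLE if m[0] == src_tab]
                for row in rows:
                    by_target = {}
                    for _, src_col, dst_tab, dst_col in mappings:
                        by_target.setdefault(dst_tab, {})[dst_col] = row.get(src_col)
                    for dst_tab, dw_row in by_target.items():
                        dw_rows.setdefault(dst_tab, []).append(dw_row)
            return dw_rows

        print(transform({"students": [{"student_id": 7, "name": "Ani"}]}))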

    Colored Petri nets in the simulation of ETL standard tasks: the surrogate key pipelining case

    ETL (Extract-Transform-Load) systems are formed by processes responsible for extracting data from several sources, cleaning and transforming it in accordance with the requirements of a data warehouse, and finally loading it into its multidimensional structures. ETL processes are the most complex tasks involved in the development of a Data Warehousing System, so it is crucial to model them beforehand so that, during the implementation stage, the correct set of requirements is considered. Coloured Petri Nets are a graphical modelling language used in the design, specification, simulation and validation of large systems characterized as strongly concurrent. The objective of this manuscript is to discuss the application of Coloured Petri Nets to the specification and validation of ETL systems. To demonstrate their viability for such tasks we have selected one of the most relevant and widely used cases in ETL system implementation: surrogate key pipelining.
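    The surrogate key pipelining task itself can be summarised in a few lines: each incoming natural key is replaced by a warehouse surrogate key, and a new surrogate is allocated when the natural key has not been seen before. The sketch below is only an informal illustration of that task, not the Coloured Petri Net model discussed in the paper.

        # Sketch of a surrogate key pipelining step: natural keys are replaced
        # by surrogate keys, allocating new ones for previously unseen keys.
        def surrogate_key_pipeline(rows, key_field, mapping=None, next_key=1):
            mapping = {} if mapping is None else mapping
            for row in rows:
                natural = row[key_field]
                if natural not in mapping:
                    mapping[natural] = next_key
                    next_key += 1
                yield {**row, key_field: mapping[natural]}

        rows = [{"customer": "C-17", "amount": 10}, {"customer": "C-17", "amount": 4}]
        print(list(surrogate_key_pipeline(rows, "customer")))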

    Invisible Deployment of Integration Processes

    Due to the changing scope of data management towards the management of heterogeneous and distributed systems and applications, integration processes are gaining in importance. This is particularly true for those processes used as abstractions of workflow-based integration tasks, which are widely applied in practice. In such scenarios, a typical IT infrastructure comprises multiple integration systems with overlapping functionalities. The major problems in this area are high development effort, low portability and inefficiency. Therefore, in this paper, we introduce the vision of invisible deployment, which addresses the virtualization of multiple, heterogeneous, physical integration systems into a single logical integration system. This vision comprises several challenging issues concerning both deployment and runtime aspects. Here, we describe those challenges, discuss possible solutions and present a detailed system architecture for this approach. As a result, the development effort can be reduced, and both portability and performance can be improved significantly.
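    A rough sketch of the virtualization idea, under assumed backend names and a naive load-based placement rule (neither is taken from the paper), is a single logical facade that decides on which physical integration system a deployed process should run.

        # Illustrative facade: one logical integration system dispatching
        # deployments to the least-loaded capable physical backend.
        class LogicalIntegrationSystem:
            def __init__(self, backends):
                self.backends = backends                  # {name: set of capabilities}
                self.load = {name: 0 for name in backends}

            def deploy(self, process_name, required_capability):
                candidates = [b for b, caps in self.backends.items()
                              if required_capability in caps]
                target = min(candidates, key=lambda b: self.load[b])
                self.load[target] += 1
                return f"{process_name} -> {target}"

        lis = LogicalIntegrationSystem({"etl_server": {"bulk"},
                                        "eai_server": {"msg", "bulk"}})
        print(lis.deploy("nightly_sync", "bulk"))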

    Using Cloud Technologies to Optimize Data-Intensive Service Applications

    Data analytics plays an increasingly important role in several application domains that must cope with large amounts of captured data. In general, analytics workloads are data-intensive processes whose efficient execution is a challenging task. Each process consists of a collection of related structured activities, where huge data sets have to be exchanged between several loosely coupled services. The implementation of such processes in a service-oriented environment offers some advantages, but the efficient realization of data flows is difficult. Therefore, in this paper we propose a novel SOA-aware approach with a special focus on the data flow. The tight interaction of new cloud technologies with SOA technologies enables us to optimize the execution of data-intensive service applications by reducing data exchange tasks to a minimum. Fundamentally, our core concept for optimizing the data flows is the use of data clouds. Moreover, we can exploit our approach to derive efficient process execution strategies with respect to different optimization objectives for the data flows.
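    The pass-by-reference idea behind the data-cloud concept can be hinted at with a toy shared store: services exchange small handles while the payload stays in the cloud store and is pulled locally only where it is needed. The store API below is a stand-in assumption, not the system described in the paper.

        # Toy data cloud: only references travel between services, payloads stay put.
        class DataCloud:
            def __init__(self):
                self._store = {}

            def put(self, key, payload):
                self._store[key] = payload
                return key            # only this reference is passed along the process

            def get(self, key):
                return self._store[key]

        cloud = DataCloud()
        ref = cloud.put("orders_q1", [{"id": 1, "amount": 99.0}])
        # a downstream service receives just `ref` and fetches the data locally
        print(len(cloud.get(ref)))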

    Framework for Interoperable and Distributed Extraction-Transformation-Loading (ETL) Based on Service Oriented Architecture

    Extraction, Transformation and Loading (ETL) are the major functionalities in data warehouse (DW) solutions. Lack of component distribution and interoperability is a gap that leads to many problems in the ETL domain, caused by the tightly coupled components of the current ETL framework. This research discusses how to distribute the Extraction, Transformation and Loading components so as to achieve distribution and interoperability of these ETL components. In addition, it shows how the ETL framework can be extended. To achieve that, Service Oriented Architecture (SOA) is adopted to address the missing features of distribution and interoperability by restructuring the current ETL framework. This research contributes to the field of ETL by adding the distribution and interoperability concepts to the ETL framework. This leads to contributions to the area of data warehousing and business intelligence, because ETL is a core concept in this area. The Design Science Approach (DSA) and Scrum methodologies were adopted for achieving the research goals; their integration provides suitable methods for achieving the research objectives. The new ETL framework is realized by developing and testing a prototype based on it. This prototype is successfully evaluated using three case studies conducted with the data and tools of three different organizations. These organizations use data warehouse solutions for the purpose of generating statistical reports that help their top management take decisions. Results of the case studies show that distribution and interoperability can be achieved by using the new ETL framework.
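    As a loose illustration of the distribution idea (service names and interfaces are assumptions, not the prototype described above), the E, T and L steps can be exposed as independently deployable services behind a common interface, so that each one could run on a different node and be replaced without touching the others.

        # Sketch: extract, transform and load as loosely coupled services; in a
        # real SOA deployment each run() call could be a remote service invocation.
        class ExtractService:
            def run(self, payload): return list(payload["rows"])

        class TransformService:
            def run(self, rows): return [{**r, "amount": round(r["amount"], 2)} for r in rows]

        class LoadService:
            def run(self, rows): print(f"loaded {len(rows)} rows"); return len(rows)

        def orchestrate(services, payload):
            for service in services:
                payload = service.run(payload)
            return payload

        orchestrate([ExtractService(), TransformService(), LoadService()],
                    {"rows": [{"id": 1, "amount": 3.14159}]})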

    A BPMN-Based Design and Maintenance Framework for ETL Processes

    Business Intelligence (BI) applications require the design, implementation, and maintenance of processes that extract, transform, and load suitable data for analysis. The development of these processes (known as ETL) is an inherently complex problem that is typically costly and time consuming. In a previous work, we proposed a vendor-independent language for reducing the design complexity caused by disparate ETL languages tailored to specific design tools with steep learning curves. Nevertheless, the designer still faces two major issues during the development of ETL processes: (i) how to implement the designed processes in an executable language, and (ii) how to maintain the implementation when the organization's data infrastructure evolves. In this paper, we propose a model-driven framework that provides automatic code generation capability and improves the maintenance support of our ETL language. We present a set of model-to-text transformations able to produce code for different commercial ETL tools, as well as model-to-model transformations that automatically update the ETL models with the aim of supporting the maintenance of the generated code according to data source evolution. A demonstration using an example is conducted as an initial validation to show that the framework, covering modeling, code generation and maintenance, could be used in practice.
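    A toy model-to-text transformation in the spirit of the framework above: the same vendor-independent ETL model is rendered into code for two hypothetical target tools via per-tool templates. The model, the tool names and the template syntax are all illustrative assumptions.

        # Sketch of model-to-text generation: one abstract model, two targets.
        MODEL = [{"op": "extract", "table": "sales"},
                 {"op": "load", "table": "dw_sales"}]

        TEMPLATES = {
            "tool_a": {"extract": "READ {table};", "load": "WRITE {table};"},
            "tool_b": {"extract": "source -> {table}", "load": "sink -> {table}"},
        }

        def generate(model, tool):
            return "\n".join(TEMPLATES[tool][step["op"]].format(**step) for step in model)

        print(generate(MODEL, "tool_a"))
        print(generate(MODEL, "tool_b"))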