    GCIP: Exploiting the Generation and Optimization of Integration Processes

    As a result of the changing scope of data management towards the management of highly distributed systems and applications, integration processes have gained in importance. Such integration processes represent an abstraction of workflow-based integration tasks. In practice, integration processes are pervasive, and the performance of complete IT infrastructures strongly depends on the performance of the central integration platform that executes the specified integration processes. In this area, the three major problems are (1) significant development effort, (2) low portability, and (3) inefficient execution. To overcome these problems, we follow a model-driven generation approach for integration processes. In this demo proposal, we introduce the GCIP Framework (Generation of Complex Integration Processes), which allows the modeling of integration processes and the generation of different concrete integration tasks. The model-driven approach also opens opportunities for rule-based and workload-based optimization techniques.
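
    The abstract describes the model-driven idea only at a high level. The following is a minimal sketch, assuming a simple platform-independent process model from which different concrete integration tasks are generated; the classes, operation kinds, and target platform names are illustrative assumptions, not part of GCIP.

        # Hypothetical sketch of model-driven generation: one abstract process model,
        # several concrete targets. Names are invented for illustration.
        from dataclasses import dataclass, field
        from typing import Dict, List

        @dataclass
        class Operation:
            kind: str                                    # e.g. "receive", "translate", "invoke"
            params: Dict[str, str] = field(default_factory=dict)

        @dataclass
        class IntegrationProcess:
            name: str
            operations: List[Operation] = field(default_factory=list)

        def generate(process: IntegrationProcess, platform: str) -> str:
            """Emit a concrete, platform-specific task list from the abstract model."""
            lines = [f"# {platform} tasks for process '{process.name}'"]
            for op in process.operations:
                lines.append(f"{platform}.{op.kind}({op.params})")
            return "\n".join(lines)

        order_sync = IntegrationProcess("order_sync", [
            Operation("receive", {"queue": "orders"}),
            Operation("translate", {"mapping": "xml_to_relational"}),
            Operation("invoke", {"target": "warehouse_load"}),
        ])
        print(generate(order_sync, "platform_a"))        # same model, any target platform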

    GraphX: Unifying Data-Parallel and Graph-Parallel Analytics

    From social networks to language modeling, the growing scale and importance of graph data has driven the development of numerous new graph-parallel systems (e.g., Pregel, GraphLab). By restricting the computation that can be expressed and introducing new techniques to partition and distribute the graph, these systems can efficiently execute iterative graph algorithms orders of magnitude faster than more general data-parallel systems. However, the same restrictions that enable the performance gains also make it difficult to express many of the important stages in a typical graph-analytics pipeline: constructing the graph, modifying its structure, or expressing computation that spans multiple graphs. As a consequence, existing graph analytics pipelines compose graph-parallel and data-parallel systems through external storage systems, leading to extensive data movement and a complicated programming model. To address these challenges we introduce GraphX, a distributed graph computation framework that unifies graph-parallel and data-parallel computation. GraphX provides a small, core set of graph-parallel operators expressive enough to implement the Pregel and PowerGraph abstractions, yet simple enough to be cast in relational algebra. GraphX uses a collection of query optimization techniques such as automatic join rewrites to efficiently implement these graph-parallel operators. We evaluate GraphX on real-world graphs and workloads and demonstrate that GraphX achieves performance comparable to specialized graph computation systems while outperforming them in end-to-end graph pipelines. Moreover, GraphX achieves a balance between expressiveness, performance, and ease of use.
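
    GraphX itself is a Scala library on Spark; the sketch below is only a hedged illustration of the abstract's central claim that graph-parallel computation can be cast in relational algebra. It shows one PageRank-style iteration expressed as joins and aggregations over vertex and edge tables using pandas, and is not GraphX's API.

        # One PageRank-style step as relational operations over vertex/edge tables.
        import pandas as pd

        vertices = pd.DataFrame({"id": [1, 2, 3], "rank": [1.0, 1.0, 1.0]})
        edges = pd.DataFrame({"src": [1, 1, 2, 3], "dst": [2, 3, 3, 1]})

        out_deg = edges.groupby("src").size().rename("deg").reset_index()

        # join: attach each source vertex's rank and out-degree to its outgoing edges
        msgs = edges.merge(vertices, left_on="src", right_on="id").merge(out_deg, on="src")
        msgs["contrib"] = msgs["rank"] / msgs["deg"]

        # aggregate messages by destination, then update ranks (damping factor 0.85)
        summed = msgs.groupby("dst")["contrib"].sum().rename("sum").reset_index()
        vertices = vertices.merge(summed, left_on="id", right_on="dst", how="left")
        vertices = vertices.fillna({"sum": 0.0})
        vertices["rank"] = 0.15 + 0.85 * vertices["sum"]
        print(vertices[["id", "rank"]])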

    Integrating Online Wire Transfer Fraud Data with Suspicious Wire Transfer Data Using SSIS

    The client is a banking institution that is in the process of offering online wire transfers to its customers. This project describes how suspicious wire transfer data and confirmed wire transfer fraud data are handled. Currently, both kinds of data are entered through two different applications and stored separately. Because a business must be able to analyze how it is performing in order to make further decisions, the data from these two applications should be integrated. This can be achieved using SSIS (SQL Server Integration Services).
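
    SSIS packages are assembled in a visual designer rather than written as code, so the following is only a hedged sketch of the equivalent merge logic in Python/pandas; the column names and sample rows are invented for illustration and are not the project's actual schema.

        import pandas as pd

        # stand-ins for the extracts produced by the two separate applications
        suspicious = pd.DataFrame({"transfer_id": [101, 102, 103],
                                   "amount": [5000.0, 120.0, 98000.0],
                                   "flagged_on": ["2023-01-05", "2023-01-06", "2023-01-07"]})
        confirmed = pd.DataFrame({"transfer_id": [103],
                                  "fraud_confirmed_on": ["2023-02-01"]})

        # left join on the shared transfer identifier and flag confirmed fraud cases
        integrated = suspicious.merge(confirmed, on="transfer_id", how="left")
        integrated["confirmed_fraud"] = integrated["fraud_confirmed_on"].notna()
        print(integrated)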

    Cost-Based Optimization of Integration Flows

    Integration flows are increasingly used to specify and execute data-intensive integration tasks between heterogeneous systems and applications. There are many different application areas, such as real-time ETL and data synchronization between operational systems. Because of increasing data volumes, highly distributed IT infrastructures, and high requirements for data consistency and up-to-date query results, many instances of integration flows are executed over time. Due to this high load and the blocking of synchronous source systems, the performance of the central integration platform is crucial for an IT infrastructure. To meet these high performance requirements, we introduce the concept of cost-based optimization of imperative integration flows, which relies on incremental statistics maintenance and inter-instance plan re-optimization. As a foundation, we introduce the concept of periodical re-optimization, including novel cost-based optimization techniques that are tailor-made for integration flows. Furthermore, we refine periodical re-optimization to on-demand re-optimization in order to overcome the problems of many unnecessary re-optimization steps and of adaptation delays, during which optimization opportunities are missed. This approach ensures low optimization overhead and fast workload adaptation.
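
    A minimal sketch of the on-demand re-optimization idea, under the assumption that statistics are maintained incrementally per operator and a new plan is chosen only when they drift past a threshold. The threshold, smoothing scheme, and optimize() callback are placeholders, not the paper's actual algorithm.

        class OnDemandReoptimizer:
            """Re-optimize a flow plan only when maintained statistics drift too far."""

            def __init__(self, optimize, threshold=0.2, alpha=0.1):
                self.optimize = optimize        # cost-based plan selection, supplied by caller
                self.threshold = threshold      # relative drift that triggers re-optimization
                self.alpha = alpha              # exponential-smoothing weight for observations
                self.current_stats = {}         # incrementally maintained cardinality estimates
                self.plan_stats = None          # statistics the current plan was optimized for

            def record(self, operator, cardinality):
                old = self.current_stats.get(operator, cardinality)
                self.current_stats[operator] = (1 - self.alpha) * old + self.alpha * cardinality
                self._maybe_reoptimize()

            def _drifted(self):
                for op, observed in self.current_stats.items():
                    planned = self.plan_stats.get(op, observed)
                    if planned and abs(observed - planned) / planned > self.threshold:
                        return True
                return False

            def _maybe_reoptimize(self):
                if self.plan_stats is None or self._drifted():
                    self.optimize(self.current_stats)          # choose a new plan
                    self.plan_stats = dict(self.current_stats)

        reopt = OnDemandReoptimizer(optimize=lambda stats: print("re-optimizing with", stats))
        for cardinality in [1000, 1020, 990, 5000, 5100]:      # workload shift after 3 instances
            reopt.record("join_customers", cardinality)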

    A Goal and Ontology Based Approach for Generating ETL Process Specifications

    Data warehouse (DW) systems development involves several tasks such as defining requirements, designing DW schemas, and specifying data transformation operations. Indeed, the success of DW systems depends largely on the proper design of the extracting, transforming, and loading (ETL) processes. However, common design-related problems in ETL processes, such as defining user requirements and data transformation specifications, are far from being resolved. These problems are due to data heterogeneity in data sources, ambiguity of user requirements, and the complexity of data transformation activities. Current approaches have limitations in reconciling DW requirement semantics when designing ETL processes, which has prolonged the generation of ETL process specifications. The semantic framework of DW systems established in this study is used to develop a requirement analysis method for designing ETL processes (RAMEPs) from the perspectives of the organization, the decision-maker, and the developer, using goal and ontology approaches. The correctness of the RAMEPs approach was validated using modified and newly developed compliant tools. RAMEPs was evaluated in three real case studies, i.e., a Student Affairs System, a Gas Utility System, and a Graduate Entrepreneur System. These case studies illustrate how the RAMEPs approach can be applied to design and generate ETL process specifications. Moreover, the RAMEPs approach was reviewed by DW experts to assess the strengths and weaknesses of the method, and the new approach was accepted. The RAMEPs method shows that ETL process specifications can be derived from the early phases of DW systems development by using the goal-ontology approach.
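
    The abstract does not detail how goals and ontology concepts map to ETL steps, so the following is a purely illustrative sketch of that general idea, not the RAMEPs method itself; the ontology entries, concept names, and generated steps are all hypothetical.

        # Toy goal-to-ETL mapping: matched ontology concepts contribute ETL steps.
        ontology = {
            "student_enrollment": {"source": "sis.enrollments", "transform": "aggregate_by_semester"},
            "gas_consumption": {"source": "billing.meter_readings", "transform": "normalize_units"},
        }

        def etl_steps_for_goal(goal_concepts):
            """Derive extract/transform/load steps for the concepts named in a goal."""
            steps = []
            for concept in goal_concepts:
                entry = ontology.get(concept)
                if entry:
                    steps += [("extract", entry["source"]),
                              ("transform", entry["transform"]),
                              ("load", f"dw.{concept}")]
            return steps

        print(etl_steps_for_goal(["student_enrollment"]))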

    A Domain Specific Model for Generating ETL Workflows from Business Intents

    Extract-Transform-Load (ETL) tools have provided organizations with the ability to build and maintain workflows (consisting of graphs of data transformation tasks) that can process the flood of digital data. Currently, however, the specification of ETL workflows is largely manual, human time intensive, and error prone. As these workflows become increasingly complex, the users who build and maintain them must retain an increasing amount of knowledge about how to produce solutions to business objectives using their domain's ETL workflow system. A program that can reduce the human time and expertise required to define such workflows, producing accurate ETL solutions with fewer errors, would therefore be valuable. This dissertation presents a means to automate the specification of ETL workflows using a domain-specific modeling language. To provide such a solution, the knowledge relevant to the construction of ETL workflows for the operations and objectives of a given domain is identified and captured. The approach provides a rich model of ETL workflows capable of representing such knowledge. This knowledge representation is leveraged by a domain-specific modeling language which maps declarative statements into workflow requirements. Users are then provided with the ability to assertionally express the intents that describe a desired ETL solution at a high level of abstraction, from which procedural workflows satisfying the intent specification are automatically generated using a planner.
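
    A minimal sketch of the intent-to-workflow idea under simplifying assumptions: operations are described by required and produced artifacts, and a greedy planner chains them until the stated intent is satisfied. The operation catalogue, intent encoding, and planner are invented for illustration and are not the dissertation's actual model or planner.

        # Declarative goal in, procedural workflow out, via a tiny forward planner.
        operations = [
            {"name": "extract_orders", "requires": set(),            "produces": {"raw_orders"}},
            {"name": "cleanse_orders", "requires": {"raw_orders"},   "produces": {"clean_orders"}},
            {"name": "load_orders",    "requires": {"clean_orders"}, "produces": {"orders_in_dw"}},
        ]

        def plan(goal, available=frozenset()):
            """Greedily apply any operation whose inputs are available until the goal holds."""
            state, workflow = set(available), []
            while goal not in state:
                applicable = [op for op in operations
                              if op["requires"] <= state and not op["produces"] <= state]
                if not applicable:
                    raise ValueError("no operation can advance the plan")
                op = applicable[0]
                workflow.append(op["name"])
                state |= op["produces"]
            return workflow

        print(plan("orders_in_dw"))   # ['extract_orders', 'cleanse_orders', 'load_orders']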

    Towards BIM/GIS interoperability: A theoretical framework and practical generation of spaces to support infrastructure Asset Management

    The past ten years have seen the widespread adoption of Building Information Modelling (BIM) among both the Architectural, Engineering and Construction (AEC) and the Asset Management/Facilities Management (AM/FM) communities. This has been driven by the use of digital information to support collaborative working and a vision for more efficient reuse of data. Within this context, spatial information is either held in a Geographic Information System (GIS) or as Computer-Aided Design (CAD) models in a Common Data Environment (CDE). However, these being heterogeneous systems, there are inevitable interoperability issues that result in poor integration. For this thesis, the interoperability challenges were investigated within a case study to ask: can a better understanding of the conceptual and technical challenges to the integration of BIM and GIS provide improved support for the management of asset information in the context of a major infrastructure project? Within their respective fields, the terms BIM and GIS have acquired a range of accepted meanings that do not align well with each other. A seven-level socio-technical framework is developed to harmonise concepts in spatial information systems. This framework is used to explore the interoperability gaps that must be resolved to enable design and construction information to be joined up with operational asset information. The Crossrail GIS and BIM systems were used to investigate some of the interoperability challenges that arise during the design, construction and operation of an infrastructure asset. One particular challenge concerns a missing link between AM-based information and CAD-based geometry, which hinders engineering assets from being located within the geometric model and prevents geospatial analysis. A process is developed to link these CAD-based elements with AM-based assets using defined 3D spaces to locate assets. However, other interoperability challenges must first be overcome: first, the extraction, transformation and loading of geometry from CAD to GIS; second, the creation of an explicit representation of each 3D space from the implicit enclosing geometry. This thesis develops an implementation of the watershed transform algorithm to use real-world Crossrail geometry to generate voxelated interior spaces that can then be converted into a B-Rep mesh for use in 3D GIS. The issues faced at the technical level in this case study provide insight into the differences that must also be addressed at the conceptual level. With this in mind, this thesis develops a Spatial Information System Framework to classify the nature of differences between BIM, GIS and other spatial information systems.
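
    A hedged sketch of the space-generation step on a toy voxel grid, assuming the pipeline described in the abstract: label empty voxels into separate spaces with a watershed transform, then extract a triangle mesh for one space. Marching cubes stands in here for the B-Rep mesh-extraction step, and the toy geometry replaces the real Crossrail CAD data; neither is the thesis's actual implementation.

        import numpy as np
        from scipy import ndimage
        from skimage.segmentation import watershed
        from skimage.measure import marching_cubes

        # toy solid: a 20x20x20 block containing two rooms linked by a small opening
        solid = np.ones((20, 20, 20), dtype=bool)
        solid[2:18, 2:9, 2:18] = False       # room A
        solid[2:18, 11:18, 2:18] = False     # room B
        solid[8:12, 9:11, 2:6] = False       # opening between the rooms

        empty = ~solid
        distance = ndimage.distance_transform_edt(empty)            # distance to nearest wall
        markers = np.zeros_like(distance, dtype=int)
        markers[10, 5, 10] = 1                                       # seed inside room A
        markers[10, 14, 10] = 2                                      # seed inside room B
        spaces = watershed(-distance, markers=markers, mask=empty)   # label the two spaces

        room_a = (spaces == 1).astype(float)
        verts, faces, normals, values = marching_cubes(room_a, level=0.5)
        print(f"room A mesh: {len(verts)} vertices, {len(faces)} faces")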

    Data generator for evaluating ETL process quality

    Obtaining the right set of data for evaluating the fulfillment of different quality factors in the extract-transform-load (ETL) process design is rather challenging. First, the real data might be out of reach due to privacy constraints, while manually providing a synthetic set of data is known to be a labor-intensive task that needs to take various combinations of process parameters into account. More importantly, a single dataset usually does not represent the evolution of data throughout the complete process lifespan, hence missing a plethora of possible test cases. To facilitate this demanding task, in this paper we propose an automatic data generator (i.e., Bijoux). Starting from a given ETL process model, Bijoux extracts the semantics of data transformations, analyzes the constraints they imply over input data, and automatically generates testing datasets. Bijoux is highly modular and configurable, enabling end-users to generate datasets for a variety of interesting test scenarios (e.g., evaluating specific parts of an input ETL process design, with different input dataset sizes, different distributions of data, and different operation selectivities). We have developed a running prototype that implements the functionality of our data generation framework, and here we report experimental findings showing the effectiveness and scalability of our approach.
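
    A hedged illustration of the kind of constraint-driven generation described in the abstract, not Bijoux's actual algorithm or model: for a filter operation with a known predicate and a target selectivity, generate a dataset in which exactly that fraction of rows satisfies the predicate. Field names and parameters are invented.

        import random

        def generate_for_filter(n_rows, selectivity, threshold=100.0, seed=42):
            """Generate rows for the predicate `amount > threshold` at a given selectivity."""
            rng = random.Random(seed)
            n_pass = round(n_rows * selectivity)
            rows = []
            for i in range(n_rows):
                if i < n_pass:
                    amount = rng.uniform(threshold + 1, threshold * 10)   # satisfies predicate
                else:
                    amount = rng.uniform(0, threshold)                    # violates predicate
                rows.append({"id": i, "amount": round(amount, 2)})
            rng.shuffle(rows)
            return rows

        data = generate_for_filter(n_rows=1000, selectivity=0.25)
        assert sum(r["amount"] > 100.0 for r in data) == 250              # selectivity holds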

    Internship Report on Data Merging at the Bank of Portugal: A Comprehensive Dive into Full Stack Development - Leveraging Modern Technology to Innovate Financial Infrastructure and Enhance User Experience

    Internship report presented as a partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Data Science. This report details my full-stack development internship experience at the Bank of Portugal, with a particular emphasis on the creation of a website intended to increase operational effectiveness in the DAS Department. My main contributions met a clear need: the absence of a reliable platform that could manage and combine data from many sources. I was actively involved in creating functionality for the Integrator and BAII applications using Django, a high-level Python web framework. The features I designed and programmed addressed several problems, including daily data extraction from several SQL databases, entity error detection, data merging, and user-friendly interfaces for data manipulation. A feature that enables the attribution of litigation to specific entities was also developed. The developed features have proven useful, giving the Institutional Intervention Area, the Sanctioning Action Area, the Illicit Financial Activity Investigation Area, and the Preventive Supervision Area for Money Laundering and Terrorism Financing tools to carry out their duties more effectively. This internship experience has contributed to the advancement and use of full-stack development approaches in the banking industry, notably in data management and web application development.
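
    The report mentions daily extraction from several SQL databases, entity error detection, and data merging. The sketch below is a hedged, simplified stand-in for that logic, not the bank's actual code; the column names (including the "nif" tax-number key) and sample rows are assumptions for illustration.

        import pandas as pd

        def detect_entity_conflicts(entities_a, entities_b):
            """Flag entities whose tax number matches but whose registered names differ."""
            merged = entities_a.merge(entities_b, on="nif", suffixes=("_a", "_b"))
            differs = (merged["name_a"].str.strip().str.lower()
                       != merged["name_b"].str.strip().str.lower())
            return merged[differs]

        def merge_sources(entities_a, entities_b):
            """Combine both extracts, keeping one row per tax number."""
            return pd.concat([entities_a, entities_b]).drop_duplicates(subset="nif")

        # In the real setting these frames would come from daily pd.read_sql() extractions
        # against the source databases; small literals stand in for them here.
        a = pd.DataFrame({"nif": ["123", "456"], "name": ["Alfa Lda", "Beta SA"]})
        b = pd.DataFrame({"nif": ["123", "789"], "name": ["ALFA LDA", "Gama SA"]})
        print(detect_entity_conflicts(a, b))   # empty: names match ignoring case
        print(merge_sources(a, b))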