    Striving towards Near Real-Time Data Integration for Data Warehouses

    Abstract. The amount of information available to large-scale enterprises is growing rapidly. While operational systems are designed to meet well-specified (short) response time requirements, the focus of data warehouses is generally the strategic analysis of business data integrated from heterogeneous source systems. The decision-making process in traditional data warehouse environments is often delayed because data cannot be propagated from the source systems to the data warehouse in time. A real-time data warehouse aims at decreasing the time it takes to make business decisions and tries to attain zero latency between the cause and effect of a business decision. In this paper we present an architecture of an ETL environment for real-time data warehouses, which supports continual near real-time data propagation. The architecture takes full advantage of existing J2EE (Java 2 Platform, Enterprise Edition) technology and enables the implementation of a distributed, scalable, near real-time ETL environment. Instead of using proprietary vendor ETL (extraction, transformation, loading) solutions, which are often hard to scale and often do not support optimizing the time frames allocated for data extracts, our approach proposes ETLets (pronounced “et-lets”) and Enterprise JavaBeans (EJB) for the ETL processing tasks.
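    The abstract only names the component model, so the following is a minimal sketch, assuming a servlet-like lifecycle managed by the J2EE container; the interface and method names (Etlet, extract, transform, load, runCycle) are illustrative assumptions, not the paper's actual API.

        // Hypothetical ETLet-style component with a container-managed
        // lifecycle (init/destroy); all names here are assumptions.
        import java.util.List;
        import java.util.Map;

        public interface Etlet {
            void init(Map<String, String> config);     // called once by the container
            List<Map<String, Object>> extract();       // pull a delta from the source system
            List<Map<String, Object>> transform(List<Map<String, Object>> rows);
            void load(List<Map<String, Object>> rows); // write into the warehouse
            void destroy();                            // released on container shutdown

            // One near real-time propagation cycle, scheduled by the container.
            default void runCycle() {
                load(transform(extract()));
            }
        }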

    Container-Managed ETL Applications for Integrating Data in Near Real-Time

    As the analytical capabilities and applications of e-business systems expand, providing real-time access to critical business performance indicators to improve the speed and effectiveness of business operations has become crucial. The monitoring of business activities requires focused, yet incremental enterprise application integration (EAI) efforts and balancing real-time information requirements with historical perspectives. The decision-making process in traditional data warehouse environments is often delayed because data cannot be propagated from the source system to the data warehouse in a timely manner. In this paper, we present an architecture for a container-based ETL (extraction, transformation, loading) environment, which supports continual near real-time data integration with the aim of decreasing the time it takes to make business decisions and of minimizing the latency between the cause and effect of a business decision. Instead of using proprietary vendor ETL solutions, we use an ETL container for managing ETLets (pronounced “et-lets”) for the ETL processing tasks. The architecture takes full advantage of existing J2EE (Java 2 Platform, Enterprise Edition) technology and enables the implementation of a distributed, scalable, near real-time ETL environment. We have fully implemented the proposed architecture. Furthermore, we compare the ETL container to alternative continuous data integration approaches.
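    As a companion to the ETLet sketch above, this is one hedged guess at what "container-managed" could mean operationally: the container owns the ETLet lifecycle and schedules propagation cycles at a fixed near real-time period. Class and method names are again assumptions, not the implemented system's API.

        // Hypothetical ETL container: deploys ETLets and triggers their cycles.
        import java.util.ArrayList;
        import java.util.List;
        import java.util.concurrent.Executors;
        import java.util.concurrent.ScheduledExecutorService;
        import java.util.concurrent.TimeUnit;

        public class EtlContainer {
            private final List<Etlet> etlets = new ArrayList<>();
            private final ScheduledExecutorService scheduler =
                    Executors.newScheduledThreadPool(4);

            // Register an ETLet and run its cycle every periodMillis milliseconds.
            public void deploy(Etlet etlet, long periodMillis) {
                etlets.add(etlet);
                etlet.init(java.util.Map.of());
                scheduler.scheduleAtFixedRate(etlet::runCycle, 0, periodMillis,
                        TimeUnit.MILLISECONDS);
            }

            // Stop scheduling and let each ETLet release its resources.
            public void shutdown() {
                scheduler.shutdown();
                etlets.forEach(Etlet::destroy);
            }
        }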

    Transparent Forecasting Strategies in Database Management Systems

    Whereas traditional data warehouse systems assume that data is complete or has been carefully preprocessed, increasingly more data is imprecise, incomplete, and inconsistent. This is especially true in the context of big data, where massive amounts of data arrive continuously in real time from vast numbers of sources. At the same time, modern data analysis involves sophisticated statistical algorithms that go well beyond traditional BI and is increasingly performed by non-expert users. Both trends require transparent data mining techniques that efficiently handle missing data and present a complete view of the database to the user. Time series forecasting estimates future, not yet available, data of a time series and represents one way of dealing with missing data. Moreover, it enables queries that retrieve a view of the database at any point in time - past, present, and future. This article presents an overview of forecasting techniques in database management systems. After discussing possible application areas for time series forecasting, we give a short mathematical background on the main forecasting concepts. We then outline various general strategies for integrating time series forecasting inside a database and discuss some individual techniques from the database community. We conclude the article by introducing a novel forecasting-enabled database management architecture that natively and transparently integrates forecast models.
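    The survey's mathematical background is not reproduced in the abstract; as an example of the kind of model such a forecast-enabled DBMS would maintain, simple exponential smoothing (our choice for illustration, not necessarily the article's) estimates the "future, not yet available" values as follows:

        % Simple exponential smoothing: s_t is the smoothed level after
        % observing x_t, and the h-step-ahead forecast is flat at that level.
        \begin{align*}
          s_t           &= \alpha x_t + (1 - \alpha)\, s_{t-1}, \qquad 0 < \alpha \le 1,\\
          \hat{x}_{t+h} &= s_t
        \end{align*}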

    DualTable: A Hybrid Storage Model for Update Optimization in Hive

    Hive is the most mature and prevalent data warehouse tool providing a SQL-like interface in the Hadoop ecosystem. It is successfully used in many Internet companies and shows its value for big data processing in traditional industries. However, enterprise big data processing systems, as in Smart Grid applications, usually require complicated business logic and involve many data manipulation operations such as updates and deletes, and Hive cannot offer sufficient support for these while preserving high query performance: Hive on the Hadoop Distributed File System (HDFS) cannot implement data manipulation efficiently, while Hive on HBase supports faster data manipulation but suffers from poor query performance. There is a project, based on Hive issue HIVE-5317, to support update operations, but it has not been finished in Hive's latest version, and since this ACID-compliant extension adopts the same data storage format on HDFS, the update performance problem is not solved. In this paper, we propose a hybrid storage model called DualTable, which combines the efficient streaming reads of HDFS with the random write capability of HBase. Hive on DualTable provides better data manipulation support while preserving query performance. Experiments on a TPC-H data set and on a real smart grid data set show that Hive on DualTable is up to 10 times faster than Hive when executing update and delete operations.
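    The abstract already states the division of labor; the sketch below restates it in code under our own simplifying assumptions (an in-memory stand-in for HDFS base files and an HBase delta store, merged at scan time). All class names are illustrative; the real DualTable hooks into Hive's storage layer.

        // Toy model of the hybrid storage idea: sequential scans come from an
        // immutable base store, random updates/deletes land in a delta store,
        // and reads merge the two. Not the actual DualTable implementation.
        import java.util.HashMap;
        import java.util.LinkedHashMap;
        import java.util.Map;

        public class DualTableSketch {
            private final Map<Long, String> baseStore = new LinkedHashMap<>(); // "HDFS" rows
            private final Map<Long, String> deltaStore = new HashMap<>();      // "HBase" overrides; null = delete

            public void appendBase(long rowId, String value) { baseStore.put(rowId, value); } // bulk load
            public void update(long rowId, String newValue)  { deltaStore.put(rowId, newValue); }
            public void delete(long rowId)                   { deltaStore.put(rowId, null); }

            // Full scan: stream the base rows and patch each one from the delta.
            public void scan() {
                for (Map.Entry<Long, String> row : baseStore.entrySet()) {
                    String value = deltaStore.containsKey(row.getKey())
                            ? deltaStore.get(row.getKey()) : row.getValue();
                    if (value != null) {                       // null means deleted
                        System.out.println(row.getKey() + " -> " + value);
                    }
                }
            }
        }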

    Building data warehouses in the era of big data: an approach for scalable and flexible big data warehouses

    During the last few years, the concept of Big Data Warehousing has gained significant attention from the scientific community, highlighting the need to make design changes to the traditional Data Warehouse (DW) due to its limitations, in order to achieve new characteristics relevant in Big Data contexts (e.g., scalability on commodity hardware, real-time performance, and flexible storage). The state of the art in Big Data Warehousing reflects the young age of the concept, as well as ambiguity and the lack of common approaches to building Big Data Warehouses (BDWs). Consequently, an approach to design and implement these complex systems is of major relevance to business analytics researchers and practitioners. This tutorial targets the design and implementation of BDWs, presenting a general approach that researchers and practitioners can follow in their Big Data Warehousing projects and exploring several demonstration cases that focus on system design and data modelling examples in areas such as smart cities, retail, finance, and manufacturing, among others.

    Index Structures for Data Warehouses

    Get PDF
    Table of contents: 1. Introduction; 2. State of the Art of Data Warehouse Research; 3. Data Storage and Index Structures; 4. Mixed Integer Problems for Finding Optimal Tree-based Index Structures; 5. Aggregated Data in Tree-Based Index Structures; 6. Performance Models for Tree-Based Index Structures; 7. Techniques for Comparing Index Structures; 8. Conclusion and Outlook; Bibliography; Appendix.
    This thesis investigates which index structures support query processing in typical data warehouse environments most efficiently. Data warehouse applications differ significantly from traditional transaction-oriented operational applications; therefore, the techniques applied in transaction-oriented systems cannot be used in the context of data warehouses, and new techniques must be developed. The thesis shows that the time complexity of computing optimal tree-based index structures prohibits their use in real-world applications. We therefore improve heuristic techniques (e.g., the R*-tree) to process range queries on aggregated data more efficiently. Experiments show the benefits of this approach for different kinds of typical data warehouse queries. Performance models estimate the behavior of standard index structures and of the extended index structures. We introduce a new model that considers the distribution of data and show experimentally that it is more precise than other models known from the literature. Two techniques compare two tree-based index structures with two bitmap indexing techniques. The performance of these index structures depends on a set of parameters, and our results show which index structure performs most efficiently depending on those parameters.
    This thesis investigates which index structures efficiently support queries in typical data warehouse systems. Index structures, a subject of database research for more than twenty years, were optimized in the past for transaction-oriented systems, whose hallmark is the efficient support of insert, update, and delete operations on individual records. Typical operations in data warehouse systems, by contrast, are complex queries over large, relatively static data sets. Because of these changed requirements, database management systems used for data warehouses must employ other techniques to support complex queries efficiently. First, an approach is examined that computes an optimal index structure by means of a mixed integer optimization problem. Since the cost of computing this optimal index structure grows exponentially with the number of records to be indexed, the subsequent parts of the thesis pursue heuristic approaches that scale with the size of the indexed data. One approach extends tree-based index structures with aggregated data in their inner nodes; experiments show that the materialized intermediate results in the inner nodes allow range queries on aggregated data to be processed considerably faster. To study the performance of index structures with and without materialized intermediate results, the PISA model (Performance of Index Structures with and without Aggregated Data) is developed; this model takes the distribution of the data and the distribution of the queries into account. The PISA model is adapted to uniformly, skewed, and normally distributed data sets, and experiments show that it works with higher precision than the models previously known from the literature. The performance of index structures depends on various parameters, and two techniques are presented that compare index structures depending on a given set of parameters. Using classification trees it is shown, for example, that block size influences relative performance less than other parameters. A further result is that bitmap index structures benefit more from the improvements of newer secondary storage than the tree-based index structures common today; bitmap indexing techniques thus still offer great potential for further performance gains in the database field.
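    The thesis extends multidimensional R*-tree variants; as a one-dimensional stand-in for the idea of materializing aggregates in inner nodes, the following sketch answers range-sum queries from inner-node aggregates alone, without touching individual records:

        // 1-D illustration of aggregated data in tree-based index structures:
        // inner nodes hold partial sums, so a range aggregate costs O(log n).
        // The thesis works with multidimensional trees; this is only a sketch.
        public class AggregateTree {
            private final long[] sums; // nodes 1..n-1 are inner aggregates, n..2n-1 leaves
            private final int n;

            public AggregateTree(long[] values) {
                n = values.length;
                sums = new long[2 * n];
                System.arraycopy(values, 0, sums, n, n);    // leaves
                for (int i = n - 1; i >= 1; i--)            // materialize inner sums
                    sums[i] = sums[2 * i] + sums[2 * i + 1];
            }

            // Sum over the half-open index range [lo, hi).
            public long rangeSum(int lo, int hi) {
                long total = 0;
                for (lo += n, hi += n; lo < hi; lo /= 2, hi /= 2) {
                    if ((lo & 1) == 1) total += sums[lo++];
                    if ((hi & 1) == 1) total += sums[--hi];
                }
                return total;
            }
        }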

    Business intelligence in modern banking

    Business intelligence is the process of collecting all available and relevant external data and transforming it into useful information that helps bank management make business decisions. In modern banking, a business intelligence system enables multimedia analysis, online analytical processing, and data mining, which bank managers can use to discover important trends that are “hidden” in large databases. Integral parts of business intelligence include, among others, the Data Warehouse, executive information systems, online analytical processing, and Balanced Scorecard (BSC) implementation. Among the most important goals of business intelligence is the identification and anticipation of favorable and unfavorable circumstances in the bank's business environment. A sound architecture for bank decision-support systems should include the trinity of Data Warehouse, OLAP, and Data Mining. The value of business intelligence should be viewed from the standpoint of a modern understanding of management and decision making: banks that are able to manage their data resources, information, and knowledge are more successful than their competitors. Banks have many information resources, but the real challenge is knowing how to collect the right information, in a definite time period, from the appropriate category of clients. The main idea of CRM is no longer to focus on products and services but on clients. This has become possible through the development of databases that store data about specific clients, together with software that enables optimal use of those data. Knowing the clients is the basis of CRM, and it is the information from bank-client interactions that makes it possible to build stable, profitable relationships with clients. The concept of electronic business intelligence, as its main support, is of significant importance for the development of CRM in banking. Banks that remain oriented toward traditional ways of managing therefore become uncompetitive in the highly complex banking market.

    Using Ontologies for the Design of Data Warehouses

    Obtaining an implementation of a data warehouse is a complex task that forces designers to acquire wide knowledge of the domain, thus requiring a high level of expertise and making it a failure-prone task. Based on our experience, we have identified a set of situations encountered in real-world projects in which we believe the use of ontologies would improve several aspects of data warehouse design. The aim of this article is to describe several shortcomings of current data warehouse design approaches and to discuss the benefit of using ontologies to overcome them. This work is a starting point for discussing the convenience of using ontologies in data warehouse design.
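    The article argues at the level of design methodology rather than code; purely as our own illustration of one benefit it hints at, subclass ("is-a") relations in a small domain ontology can be read off as candidate roll-up hierarchies for a star-schema dimension:

        // Hypothetical illustration (not the article's implementation):
        // deriving a dimension roll-up path from ontology subclass edges.
        import java.util.HashMap;
        import java.util.Map;

        public class OntologyToDimension {
            private final Map<String, String> isA = new HashMap<>(); // child -> parent

            public void addSubclass(String child, String parent) { isA.put(child, parent); }

            // Follow is-a edges upward to produce a roll-up path.
            public String rollUpPath(String leafConcept) {
                StringBuilder path = new StringBuilder(leafConcept);
                for (String c = isA.get(leafConcept); c != null; c = isA.get(c))
                    path.append(" -> ").append(c);
                return path.toString();
            }

            public static void main(String[] args) {
                OntologyToDimension o = new OntologyToDimension();
                o.addSubclass("City", "Region");
                o.addSubclass("Region", "Country");
                System.out.println(o.rollUpPath("City")); // City -> Region -> Country
            }
        }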

    Feasibility of Warehouse Drone Adoption and Implementation

    While aerial delivery drones capture headlines, it is in warehouses that drone adoption has accelerated fastest. Warehousing constitutes 30% of the cost of logistics in the US. The rise of e-commerce, greater customer service demands of retail stores, and a shortage of skilled labor have intensified competition for efficient warehouse operations, in an era of shortening technology life cycles. This paper integrates several theoretical perspectives on technology diffusion and adoption to propose a framework that informs supply chain decision-makers on when to invest in new robotics technology