
    Towards Principles for Structuring and Managing Very Large Semantic Multidimensional Data Models

    The management of semantic multidimensional data models plays an important role during the development and maintenance of data warehouse systems. Unfortunately, it has not yet received the attention it deserves; reasons may lie in the plethora of semantic notations and the insufficient tool support for multidimensional modeling. This paper reports on experiences gained in a project with an industry partner from the telecommunications industry, whose problem is a very large data warehouse with more than 400 data cubes and several hundred key performance indicators. We developed a repository-based solution for managing the semantic data models. Our lessons learned show that, especially for very large data models, a repository-based solution is needed, together with a clear concept of how to break the models up into their component parts. The aim of our principles is to increase both the understandability and the maintainability of semantic multidimensional data models.
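
    As a loose illustration of what such a repository-based solution might look like (the concrete schema below is ours, not the paper's), the following Python sketch indexes cubes and their dimensions so that shared dimensions, which are natural seams for breaking a large model into component parts, become visible:

```python
# Hypothetical repository model for large multidimensional designs;
# all names are illustrative and not taken from the paper.
from dataclasses import dataclass, field

@dataclass
class Dimension:
    name: str
    levels: list[str]              # e.g. ["day", "month", "year"]

@dataclass
class Cube:
    name: str
    dimensions: list[Dimension]
    measures: list[str]            # key performance indicators live here

@dataclass
class Repository:
    cubes: dict[str, Cube] = field(default_factory=dict)

    def register(self, cube: Cube) -> None:
        self.cubes[cube.name] = cube

    def shared_dimensions(self) -> dict[str, list[str]]:
        """Map each dimension name to the cubes using it; heavily shared
        dimensions mark natural component boundaries of the model."""
        index: dict[str, list[str]] = {}
        for cube in self.cubes.values():
            for dim in cube.dimensions:
                index.setdefault(dim.name, []).append(cube.name)
        return index
```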

    Automating the multidimensional design of data warehouses

    Previous experiences in the data warehouse field have shown that the multidimensional conceptual schema of the data warehouse must be derived from a hybrid approach: i.e., by considering both the end-user requirements and the data sources as first-class citizens. As in any other system, requirements guarantee that the system devised meets the end-user's needs. In addition, since data warehouse design is a reengineering process, it must consider the underlying data sources of the organization: (i) to guarantee that the data warehouse can be populated from data available within the organization, and (ii) to allow the end-user to discover unknown additional analysis capabilities.

    Several methods for supporting the data warehouse modeling task have been proposed, but they suffer from significant drawbacks. In short, requirement-driven approaches assume that the requirements are exhaustive (and therefore do not consider that the data sources may contain additional interesting evidence for analysis), whereas data-driven approaches (i.e., those leading the design task from a thorough analysis of the data sources) try to discover as much multidimensional knowledge as possible from the data sources and consequently generate too many results, which mislead the user. Furthermore, automating the design task is essential in this scenario, as it removes the dependency on an expert's ability to properly apply the chosen method, and spares the designer the tedious and time-consuming analysis of the data sources (which can be unfeasible for large databases). Current automatable methods follow a data-driven approach, whereas current requirement-driven approaches overlook automation, since they work with requirements expressed in high-level languages that a computer cannot handle. The same situation recurs in the data-driven and requirement-driven stages of current hybrid approaches, which propose a sequential framework and therefore suffer from the same drawbacks as the pure approaches.

    In this thesis we introduce two approaches for automating the multidimensional design of the data warehouse: MDBE (Multidimensional Design Based on Examples) and AMDO (Automating the Multidimensional Design from Ontologies). Both were devised to overcome the limitations of current approaches and, although they start from opposite initial assumptions, both treat the end-user requirements and the data sources as first-class citizens.

    1. MDBE follows a classical approach in which the end-user requirements are well known beforehand. It benefits from the knowledge captured in the data sources, but guides the design task according to the requirements; consequently, it can work with semantically poorer data sources. In other words, given high-quality end-user requirements, we can guide the process from the knowledge they contain and overcome data sources that do not properly capture our domain.

    2. AMDO, in contrast, assumes a scenario in which the available data sources are semantically rich. Its process is therefore guided by a thorough analysis of the data sources, with the requirements used to shape and adapt the output to the end-user's needs. In this context, high-quality data sources compensate for the lack of expressive end-user requirements beforehand.

    Importantly, our methods establish a combined framework that can be used to decide, according to the inputs available in each scenario, which approach to follow. For example, we cannot follow the same approach in a scenario where the end-user requirements are clear and well known as in one where they are not evident or cannot be easily elicited (a recurrent case is when users are not aware of the analysis capabilities of their own sources). Indeed, having good requirements beforehand softens the need for semantically rich data sources; conversely, if the data sources adequately capture our domain, the requirements are not needed beforehand. Thus, the combination of the two approaches covers all the scenarios that may be found during the data warehouse modeling task.
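
    To make the two starting points concrete, here is a small hypothetical sketch (not the MDBE or AMDO algorithms themselves) of a requirement-driven derivation step: a requirement names a measure and some analysis dimensions, and the step keeps only what the source metadata can actually populate. The table and column names are invented for illustration.

```python
# Toy requirement-driven derivation in the spirit of MDBE; the source
# metadata and the requirement below are invented examples.
source_tables = {
    "sales":    {"amount", "product_id", "store_id", "date"},
    "products": {"product_id", "category"},
    "stores":   {"store_id", "region"},
}

def derive_star(measure: str, dimensions: set[str]) -> dict:
    """Propose a candidate star schema for one requirement, keeping only
    the dimensions the sources can populate."""
    fact = next((t for t, cols in source_tables.items() if measure in cols), None)
    if fact is None:
        raise ValueError(f"measure {measure!r} is not available in any source")
    covered = {d for d in dimensions
               if any(d in cols for cols in source_tables.values())}
    return {"fact": fact, "measure": measure, "dimensions": sorted(covered)}

print(derive_star("amount", {"category", "region", "date"}))
# {'fact': 'sales', 'measure': 'amount', 'dimensions': ['category', 'date', 'region']}
```

    A data-driven method such as AMDO would instead start from everything the sources expose and use the requirement only to filter the output, mirroring the trade-off described above.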

    Factors in the Design and Development of a Data Warehouse for Academic Data

    Data warehousing is a relatively new field in the realm of information technology, and current research centers primarily on data warehousing in business environments. Only recently have educational institutions begun to embark on data warehousing projects, and little research has been done regarding the special considerations and characteristics of academic data and the complexity of analyzing such data. Educational institutions measure success very differently from business-oriented organizations, and the analyses that are meaningful in such environments pose unique and intricate problems in data warehousing. This research describes the process of developing a data warehouse for a community college, focusing on issues specific to academic data.

    Implementation of the multidimensional schemas integration method ORE

    The goal of the project is the implementation of the semi-automatic method named ORE, which creates multidimensional schemas for data warehouses by integrating information requirements in an iterative way.
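
    As a minimal sketch of what iterative integration of information requirements could look like (the data structures are our assumption, not ORE's), each requirement below is folded into an evolving multidimensional schema, unioning dimensions that already exist instead of duplicating them:

```python
# Illustrative iterative integration of requirements into one schema;
# the requirement format is an assumption made for this example.
def integrate(schema: dict, requirement: dict) -> dict:
    """Merge one requirement (measures plus dimension levels) into the
    evolving schema."""
    schema.setdefault("measures", set()).update(requirement["measures"])
    dims = schema.setdefault("dimensions", {})
    for name, levels in requirement["dimensions"].items():
        dims.setdefault(name, set()).update(levels)
    return schema

schema: dict = {}
for req in [
    {"measures": {"revenue"}, "dimensions": {"time": {"day", "month"}}},
    {"measures": {"units"},   "dimensions": {"time": {"year"}, "store": {"city"}}},
]:
    schema = integrate(schema, req)
print(schema)   # "time" accumulates day, month and year across iterations
```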

    An approach for testing the extract-transform-load process in data warehouse systems

    Enterprises use data warehouses to accumulate data from multiple sources for data analysis and research. Since organizational decisions are often made based on the data stored in a data warehouse, all its components must be rigorously tested. In this thesis, we first present a comprehensive survey of data warehouse testing approaches, and then develop and evaluate an automated testing approach for validating the Extract-Transform-Load (ETL) process, which is a common activity in data warehousing.

    In the survey we present a classification framework that categorizes the testing and evaluation activities applied to the different components of data warehouses. These approaches include dynamic analysis as well as static evaluation and manual inspections. The classification framework uses information related to what is tested, in terms of the data warehouse component that is validated, and how it is tested, in terms of the various types of testing and evaluation approaches. We discuss the specific challenges and open problems for each component and propose research directions.

    The ETL process involves extracting data from source databases, transforming it into a form suitable for research and analysis, and loading it into a data warehouse. ETL processes can use complex one-to-one, many-to-one, and many-to-many transformations involving sources and targets that use different schemas, databases, and technologies. Since faulty implementations in any of the ETL steps can result in incorrect information in the target data warehouse, ETL processes must be thoroughly validated. In this thesis, we propose automated balancing tests that check for discrepancies between the data in the source databases and that in the target warehouse. Balancing tests ensure that the data obtained from the source databases is not lost or incorrectly modified by the ETL process.

    First, we categorize and define a set of properties to be checked in balancing tests. We identify the various types of discrepancies that may exist between the source and the target data, and formalize three categories of properties, namely completeness, consistency, and syntactic validity, that must be checked during testing. Next, we automatically identify source-to-target mappings from the ETL transformation rules provided in the specifications. We identify one-to-one, many-to-one, and many-to-many mappings for the tables, records, and attributes involved in the ETL transformations. We then automatically generate test assertions to verify the properties: using the source-to-target mappings, an assertion is generated for each property, comparing the data in the target data warehouse with the corresponding data in the sources.

    We evaluate our approach on a health data warehouse that uses data sources with different data models running on different platforms, and demonstrate that it can find previously undetected real faults in the ETL implementation. We also provide an automatic mutation testing approach to evaluate the fault-finding ability of our balancing tests; using mutation analysis, we demonstrate that the auto-generated assertions can detect faults in the data inside the target data warehouse when faulty ETL scripts execute on mock source data.
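
    As a hedged sketch of what an auto-generated balancing assertion might look like (the table, column, and mapping names here are invented, not taken from the thesis), the following checks a completeness property via record counts and a consistency property via order-independent digests of a one-to-one attribute mapping:

```python
# Hypothetical balancing-test assertions for a one-to-one mapping
# source.sales(id, amount) -> target.fact_sales(pk, total); all names
# are invented for illustration.
import hashlib

source = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 20.5}]
target = [{"pk": 1, "total": 10.0}, {"pk": 2, "total": 20.5}]

# Completeness: no records are lost by the ETL process.
assert len(source) == len(target), "record counts differ between source and target"

def digest(rows, key, value):
    """Order-independent digest of (key, value) pairs for comparison."""
    items = sorted((row[key], row[value]) for row in rows)
    return hashlib.sha256(repr(items).encode()).hexdigest()

# Consistency: attribute values survive the mapping amount -> total.
assert digest(source, "id", "amount") == digest(target, "pk", "total"), \
    "attribute values were modified by the ETL process"
print("balancing assertions passed")
```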

    Systematic construction of goal-oriented COTS taxonomies

    The process of building software systems by assembling and integrating pre-packaged solutions in the form of Commercial-Off-The-Shelf (COTS) software components has become a strategic need in a wide variety of application areas. In general, COTS components are software components that provide a specific functionality and are available in the market to be purchased, interfaced, and integrated into other software systems. The potential benefits of this technology are mainly reduced costs and shorter development time while maintaining quality. Nevertheless, many challenges, ranging from technical to legal issues, must be faced to adapt the traditional software engineering activities in order to exploit these benefits.

    Nowadays there is an increasingly huge marketplace of COTS components; therefore, one of the most critical activities in COTS-based development is the selection of the components to be integrated into the system under development. Selection is basically composed of two main processes: searching for candidates in the marketplace and evaluating them with respect to the system requirements. Unfortunately, most existing methods for COTS selection focus their efforts on evaluation, setting aside the problem of searching for components in the marketplace. Searching for candidate COTS components is not an easy task: one has to cope with challenging marketplace characteristics, namely its dispersed, ever-growing, and constantly evolving nature, and with the lack of available, well-suited information needed for a quality-assured search. Moreover, traditional reuse approaches also lack appropriate solutions for reusing COTS components and the knowledge gained in each selection process. This lack of proposals is a serious drawback that makes the whole selection process highly risky, and often expensive and inefficient.

    This dissertation introduces the GOThIC (Goal-Oriented Taxonomy and reuse Infrastructure Construction) method, aimed at building a domain reuse infrastructure that facilitates the searching and reuse of COTS components. It is based on goal-oriented approaches for building abstract, well-founded, and stable taxonomies capable of dealing with the COTS marketplace characteristics. The nodes of these taxonomies are characterized by means of goals, their relationships are declared as dependencies among them, and several artifacts are constructed and managed for reusability and evolution purposes. The GOThIC method has been elaborated through an iterative action-research process to identify the actual challenges related to searching for COTS components. Possible solutions were then envisaged and implemented in several industrial and academic case studies in different domains, and the most relevant results were articulated into the GOThIC method, followed by its preliminary industrial evaluation in some Norwegian companies.
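
    As a minimal sketch of the kind of structure the method builds (the node layout and the search routine are our assumptions for illustration, not the GOThIC artifacts), taxonomy nodes can be characterized by goals, with COTS candidates attached beneath them:

```python
# Hypothetical goal-characterized taxonomy for COTS searching; the
# goals and component names are invented examples.
from dataclasses import dataclass, field

@dataclass
class GoalNode:
    goal: str
    children: list["GoalNode"] = field(default_factory=list)
    components: list[str] = field(default_factory=list)   # COTS candidates

def search(node: GoalNode, keyword: str) -> list[str]:
    """Descend the taxonomy and collect candidates under matching goals."""
    hits = list(node.components) if keyword.lower() in node.goal.lower() else []
    for child in node.children:
        hits.extend(search(child, keyword))
    return hits

comms = GoalNode("Provide team communication", children=[
    GoalNode("Send and receive e-mail", components=["MailServerX"]),
    GoalNode("Hold real-time meetings", components=["MeetToolY"]),
])
print(search(comms, "e-mail"))   # ['MailServerX']
```

    Because the nodes are goals rather than product categories, the taxonomy stays stable while the marketplace beneath it evolves; only the candidate lists need updating.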

    A framework for the analysis and evaluation of enterprise models

    The purpose of this study is the development and validation of a comprehensive framework for the analysis and evaluation of enterprise models. The study starts with an extensive literature review of modelling concepts and an overview of the various reference disciplines concerned with enterprise modelling; this overview is more extensive than usual in order to accommodate readers from different backgrounds. The proposed framework is based on the distinction between the syntactic, semantic, and pragmatic model aspects, and is populated with evaluation criteria drawn from an extensive literature survey.

    In order to operationalize and empirically validate the framework, an exhaustive survey of enterprise models was conducted. From this survey, an XML database of more than twenty relatively large, publicly available enterprise models was constructed. A strong emphasis was placed on the interdisciplinary nature of this database, and models were drawn from ontology research, linguistics, and analysis patterns, as well as the traditional fields of data modelling, data warehousing, and enterprise systems. The resulting database forms the test bed for the detailed framework-based analysis, and its public availability should constitute a useful contribution to the modelling research community.

    The bulk of the research is dedicated to implementing and validating specific analysis techniques to quantify the various model evaluation criteria of the framework. The aim for each of the analysis techniques is that it can, where possible, be automated and generalized to other modelling domains. The syntactic measures and analysis techniques originate largely from the disciplines of systems engineering, graph theory, and computer science. Various metrics to measure model hierarchy, architecture, and complexity are tested and discussed. It is found that many are not particularly useful or valid for enterprise models; hence some new measures are proposed to assist with model visualization, and an original "model signature" consisting of three key metrics is proposed.

    Perhaps the most significant contribution of the research lies in the development and validation of a significant number of semantic analysis techniques, drawing heavily on current developments in lexicography, linguistics, and ontology research. Some novel and interesting techniques are proposed to measure, inter alia, domain coverage, model genericity, quality of documentation, perspicuity, and model similarity. Model similarity in particular is explored in depth by means of various similarity and clustering algorithms, as well as ways to visualize the similarity between models.

    Finally, a number of pragmatic analysis techniques are applied to the models. These include face validity, degree of use, authority of the model author, availability, cost, flexibility, adaptability, model currency, maturity, and degree of support. This analysis relies mostly on searching for and ranking certain specific information details, often involving a degree of subjective interpretation, although more specific quantitative procedures are suggested for some of the criteria. To aid future researchers, a separate chapter lists some promising analysis techniques that were investigated but found to be problematic from a methodological perspective. More interestingly, this chapter also presents a strong conceptual case for how the proposed framework and the analysis techniques associated with its various criteria can be applied to many other information systems research areas. The case rests on the underlying isomorphism between the various research areas and is illustrated by suggesting the application of the framework to evaluate web sites, algorithms, software applications, programming languages, system development methodologies, and user interfaces.
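
    To give one concrete, minimal instance of a model-similarity measure (our choice for illustration; the thesis explores several similarity and clustering algorithms, not necessarily this one), Jaccard similarity over the sets of concept names in two enterprise models is simple to compute and easy to visualize:

```python
# Jaccard similarity over concept labels; the two example models are
# invented and not drawn from the XML database described above.
def jaccard(model_a: set[str], model_b: set[str]) -> float:
    """Return |A intersect B| / |A union B| over concept names."""
    if not model_a and not model_b:
        return 1.0
    return len(model_a & model_b) / len(model_a | model_b)

erp = {"Customer", "Order", "Invoice", "Product"}
crm = {"Customer", "Contact", "Opportunity", "Product"}
print(f"similarity: {jaccard(erp, crm):.2f}")   # 2 shared / 6 total = 0.33
```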