26 research outputs found

    Extending and inferring functional dependencies in schema transformation

    Scalable mining for classification rules in relational databases

    doi:10.1214/lnms/1196285404
    Data mining is a process of discovering useful patterns (knowledge) hidden in extremely large datasets. Classification is a fundamental data mining function, and some other functions can be reduced to it. In this paper we propose a novel classification algorithm (classifier) called MIND (MINing in Databases). MIND can be phrased in such a way that its implementation is very easy using the extended relational calculus SQL, and this in turn allows the classifier to be built into a relational database system directly. MIND is truly scalable with respect to I/O efficiency, which is important since scalability is a key requirement for any data mining algorithm. We have built a prototype of MIND in the relational database management system DB2 and have benchmarked its performance. We describe the working prototype and report the measured performance with respect to the previous method of choice. MIND scales not only with the size of datasets but also with the number of processors on an IBM SP2 computer system. Even on uniprocessors, MIND scales well beyond dataset sizes previously published for classifiers. We also give some insights that may have an impact on the evolution of the extended relational calculus SQL.
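As a rough illustration of how a classifier step can be phrased in SQL, the sketch below lets the database compute a class histogram with one GROUP BY query and then scores candidate split points in Python. It is not the MIND algorithm itself: the Gini impurity criterion, the `applicants` table, its `age` and `risk` columns, and the helper names `gini`, `best_split`, and `demo` are all assumptions made for this example, and SQLite stands in for DB2.

```python
import sqlite3


def gini(counts):
    """Gini impurity of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


def best_split(conn, table, attribute, label):
    """Return (threshold, score) for the split `attribute <= threshold`
    with the lowest weighted Gini impurity.

    The per-value class histogram is computed inside the database;
    only the small count table is pulled into Python.
    Table and column names are trusted identifiers in this sketch.
    """
    rows = conn.execute(
        f"SELECT {attribute}, {label}, COUNT(*) FROM {table} "
        f"GROUP BY {attribute}, {label} ORDER BY {attribute}"
    ).fetchall()
    classes = sorted({r[1] for r in rows})
    values = sorted({r[0] for r in rows})
    hist = {v: {c: 0 for c in classes} for v in values}
    for value, cls, n in rows:
        hist[value][cls] = n
    total = {c: sum(hist[v][c] for v in values) for c in classes}
    n_rows = sum(total.values())

    left = {c: 0 for c in classes}  # running counts for attribute <= v
    best = (None, float("inf"))
    for v in values[:-1]:  # the largest value would leave the right side empty
        for c in classes:
            left[c] += hist[v][c]
        right = {c: total[c] - left[c] for c in classes}
        n_left = sum(left.values())
        score = (
            n_left * gini(list(left.values()))
            + (n_rows - n_left) * gini(list(right.values()))
        ) / n_rows
        if score < best[1]:
            best = (v, score)
    return best


def demo():
    """Build a tiny hypothetical table in memory and find the best split."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE applicants (age INTEGER, risk TEXT)")
    conn.executemany(
        "INSERT INTO applicants VALUES (?, ?)",
        [(20, "high"), (22, "high"), (25, "high"),
         (40, "low"), (45, "low"), (50, "low")],
    )
    return best_split(conn, "applicants", "age", "risk")
```

On the toy data, `demo()` picks the threshold 25, which separates the two classes completely. The point of the design is that the scan over the raw rows happens in the database engine, so the client only handles a table whose size depends on the number of distinct attribute values and classes.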

    On Resolving Semantic Heterogeneities and Deriving Constraints in Schema Integration

    Ph.D. (Doctor of Philosophy)

    INCREMENTAL QUERY PROCESSING IN INFORMATION FUSION SYSTEMS

    This dissertation studies the methodology and techniques of information retrieval in fusion systems, where information referring to the same objects is assessed on the basis of data from multiple heterogeneous data sources. A wide range of important applications can be categorized as information fusion systems, e.g. multisensor surveillance systems, local search systems, and multisource medical diagnosis systems. Up to the time of this dissertation, most information retrieval methods in fusion systems are highly domain specific, and most query systems do not adequately address the fusion problem. In this dissertation, I describe a broadly applicable query-based information retrieval approach for general fusion systems: user information needs are interpreted as fusion queries, and query processing techniques such as the source dependence graph (SDG), query refinement, and optimization are described. To remove the query building bottleneck, a novel incremental query method is proposed, which can eliminate the accumulated complexity in query building as well as in query execution. A query pattern is defined to capture and reuse repeated structures in the incremental queries. Several new techniques for query pattern matching and learning are described in detail. Experiments in a real-world multisensor fusion system, the intelligent vehicle tracking (IVET) system, are presented to validate the proposed methodology and techniques.

    Local Radiance

    Recent years have seen a proliferation of web applications based on content management systems (CMS). Using a CMS, non-technical content authors are able to define custom content types to support their needs. These content type names and the attribute names in each content type are typically domain-specific and meaningful to the content authors. The ability of a CMS to support a multitude of content types allows for endless creation and customization but also leads to a large amount of heterogeneity within a single application. While this meaningful heterogeneity is beneficial, it introduces the problem of how to write reusable functionality (e.g., general purpose widgets) that can work across all the different types. Traditional information integration can solve the problem of schema heterogeneity by defining a single global schema that captures the shared semantics of the heterogeneous (local) schemas. Functionality and queries can then be written against the global schema and return data from local sources in the form of the global schema, but the meaningful local semantics (such as type and attribute names) are not returned. Mappings are also complex and require skilled developers to create. Here we propose a system that we call local radiance (LR) that captures both global shared semantics as well as local, beneficial heterogeneity. We provide a formal definition of our system that includes domain structures (small, global schema fragments that represent shared domain-specific semantics) and canonical structures (domain-independent global schema fragments used to build generic global widgets). We define mappings between local, domain, and canonical levels. Our query language extends the relational algebra to support queries that radiate local semantics to the domain and canonical levels as well as inserting and updating heterogeneous local data from generic global widgets. We characterize the expressive power of our mapping language and show how it can be used to perform complex data and metadata transformations. Through a user study, we evaluate the ability of non-technical users to perform mapping tasks and find that it is both understandable and usable. We report on the ongoing development (in CMSs and a relational database) of LR systems, demonstrate how widgets can be built using local radiance, and show how LR is being used in a number of online public educational repositories.

    Reflective Model Driven Engineering

    In many large organizations, the model transformations allowing the engineers to more or less automatically go from platform-independent models (PIM) to platform-specific models (PSM) are increasingly seen as vital assets. As tools evolve, it is critical that these transformations are not prisoners of a given CASE tool. Considering in this paper that a CASE tool can be seen as a platform for processing a model transformation, we propose to reflectively apply the MDA to itself. We propos

    Tree algorithms for mining association rules

    With the increasing reliability of digital communication, the falling cost of hardware and increased computational power, the gathering and storage of data has become easier than at any other time in history. Commercial and public agencies are able to hold extensive records about all aspects of their operations. Witness the proliferation of point of sale (POS) transaction recording within retailing, digital storage of census data and computerized hospital records. Whilst the gathering of such data has uses in terms of answering specific queries and allowing visualisation of certain trends, the volumes of data can hide significant patterns that would be impossible to locate manually. These patterns, once found, could provide an insight into customer behaviour, demographic shifts and patient diagnosis hitherto unseen and unexpected. Remaining competitive in a modern business environment, or delivering services in a timely and cost effective manner for public services, is a crucial part of modern economics. Analysis of the data held by an organisation, by a system that "learns", can allow predictions to be made based on historical evidence. Users may guide the process but essentially the software is exploring the data unaided. The research described within this thesis develops current ideas regarding the exploration of large data volumes. Particular areas of research are the reduction of the search space within the dataset and the generation of rules which are deduced from the patterns within the data. These issues are discussed within an experimental framework which extracts information from binary data.
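To make the idea of rules deduced from patterns in binary data concrete, here is a minimal sketch that computes support and confidence for one-to-one rules over a handful of transactions. It is a brute-force enumeration, not the tree algorithms the thesis develops; the basket contents, the thresholds, and the function names `support` and `rules` are hypothetical.

```python
from itertools import combinations


def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def rules(transactions, min_support, min_confidence):
    """Enumerate rules lhs -> rhs between single items.

    A rule is kept when the pair {lhs, rhs} is frequent enough
    (support) and rhs appears in enough of the transactions that
    contain lhs (confidence).
    """
    items = sorted(set().union(*transactions))
    found = []
    for a, b in combinations(items, 2):
        pair_support = support(transactions, {a, b})
        if pair_support < min_support:
            continue  # infrequent pair: no rule can be built from it
        for lhs, rhs in ((a, b), (b, a)):
            confidence = pair_support / support(transactions, {lhs})
            if confidence >= min_confidence:
                found.append((lhs, rhs, pair_support, confidence))
    return found


# Hypothetical point-of-sale baskets; each item is either present or absent.
BASKETS = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
```

With these baskets, `rules(BASKETS, 0.5, 0.9)` keeps only the rule from butter to bread, since every basket containing butter also contains bread. Skipping infrequent pairs early is the simplest form of the search space reduction mentioned in the abstract; larger itemsets would need a more careful enumeration.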

    Peer Data Management

    Peer Data Management (PDM) deals with the management of structured data in unstructured peer-to-peer (P2P) networks. Each peer can store data locally and define relationships between its data and the data provided by other peers. Queries posed to any of the peers are then answered by also considering the information implied by those mappings. The overall goal of PDM is to provide semantically well-founded integration and exchange of heterogeneous and distributed data sources. Unlike traditional data integration systems, peer data management systems (PDMSs) thereby allow for full autonomy of each member and need no central coordinator. The promise of such systems is to provide flexible data integration and exchange at low setup and maintenance costs. However, building such systems raises many challenges. Besides the obvious scalability problem, choosing an appropriate semantics that can deal with arbitrary, even cyclic topologies, data inconsistencies, or updates while at the same time allowing for tractable reasoning has been an area of active research in the last decade. In this survey we provide an overview of the different approaches suggested in the literature to tackle these problems, focusing on appropriate semantics for query answering and data exchange rather than on implementation-specific problems.