7 research outputs found

    Complex adaptive systems based data integration: theory and applications

    Data Definition Languages (DDLs) have been created and used to represent data in programming languages and in database dictionaries. This representation includes descriptions in the form of data fields and relations in the form of a hierarchy, with the common exception of relational databases, where relations are flat. Network computing created an environment that enables relatively easy and inexpensive exchange of data. What followed was the creation of new DDLs claiming better support for automatic data integration. It is uncertain from the literature whether any real progress has been made toward achieving an ideal state or limit condition of automatic data integration. This research asserts that difficulties in accomplishing integration are indicative of socio-cultural systems in general and are caused by some measurable attributes common to DDLs. This research's main contributions are: (1) a theory of data integration requirements to fully support automatic data integration from autonomous heterogeneous data sources; (2) the identification of measurable related abstract attributes (Variety, Tension, and Entropy); and (3) the development of tools to measure them. The research uses a multi-theoretic lens to define and articulate these attributes and their measurements. The proposed theory is founded on the Law of Requisite Variety, Information Theory, Complex Adaptive Systems (CAS) theory, Sowa's Meaning Preservation framework, and Zipf distributions of words and meanings. Using the theory, the attributes, and their measures, this research proposes a framework for objectively evaluating the suitability of any data definition language with respect to degrees of automatic data integration. This research uses thirteen data structures constructed with various DDLs from the 1960s to date. No DDL examined (and therefore no DDL similar to those examined) is designed to satisfy the Law of Requisite Variety. No DDL examined is designed to support CAS evolutionary processes that could result in fully automated integration of heterogeneous data sources. There is no significant difference in measures of Variety, Tension, and Entropy among the DDLs investigated in this research. A direction for overcoming the common limitations discovered in this research is suggested and tested by proposing GlossoMote, a theoretical, mathematically sound description language that satisfies the data integration theory requirements. GlossoMote is not merely a new syntax; it is a drastic departure from existing DDL constructs. The feasibility of the approach is demonstrated with a small-scale experiment and evaluated using the proposed assessment framework and other means. The promising results warrant additional research to evaluate the commercial potential of GlossoMote's approach.
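
    The abstract does not spell out how Variety or Entropy are computed. As a rough, hypothetical illustration of the information-theoretic flavour of such measures (not the thesis' actual instruments), the Python sketch below scores a DDL fragment by the Shannon entropy of its token distribution and by a crude variety count; the tokenization scheme and all function names are assumptions.

        import math
        import re
        from collections import Counter

        def tokens(ddl_text):
            # Naive tokenization: identifiers and keywords only (an assumption;
            # the thesis does not describe its measurement tools at this level).
            return re.findall(r"[A-Za-z_]\w*", ddl_text.lower())

        def token_entropy(ddl_text):
            # Shannon entropy (bits per token) of the token distribution.
            counts = Counter(tokens(ddl_text))
            total = sum(counts.values())
            return -sum((c / total) * math.log2(c / total) for c in counts.values())

        def variety(ddl_text):
            # Crude 'variety' proxy: number of distinct tokens in the fragment.
            return len(set(tokens(ddl_text)))

        schema = """
        CREATE TABLE customer (
            id INTEGER PRIMARY KEY,
            name VARCHAR(80),
            region_id INTEGER REFERENCES region(id)
        );
        """
        print(f"entropy: {token_entropy(schema):.2f} bits/token")
        print(f"variety: {variety(schema)} distinct tokens")

    The intuition behind such measures, per the Law of Requisite Variety, is that an integrating language must exhibit at least as much variety as the heterogeneous sources it must absorb.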

    Foundations research in information retrieval inspired by quantum theory

    In the information age, information is useless unless it can be found and used; search engines thus form a crucial component of research. For something so crucial, information retrieval (IR), the formal discipline investigating search, can be a confusing area of study. There is an underlying difficulty with the very definition of information retrieval, and there are weaknesses in its operational method, which together prevent it from being called a 'science'. The work in this thesis aims to create a formal definition for search, scientific methods for evaluating and comparing different search strategies, and methods for dealing with the uncertainty associated with user interactions, so that one has the necessary formal foundation to perceive IR as a "search science". The key problems restricting a science of search pertain to the ambiguity in the current way in which search scenarios and concepts are specified. This especially affects the evaluation of search systems, since under the traditional retrieval approach evaluations are not repeatable, and thus not collectively verifiable. This is mainly due to dependence on user studies, the method currently dominating evaluation methodology. This evaluation problem is related to the problem of not being able to formally define the users in user studies. The problem of defining users relates in turn to one of the main retrieval-specific motivations of the thesis, which can be understood by noticing that the uncertainties associated with the interpretation of user interactions are collectively inscribed in a relevance concept, the representation and use of which defines the overall character of a retrieval model. Current research is limited in its understanding of how best to model relevance, a key factor restricting extensive formalization of the IR discipline as a whole. Thus, the problems of defining search systems and search scenarios are the principal issues preventing formal comparisons of systems and scenarios, in turn limiting the strength of experimental evaluation. Alternative models of search are proposed that remove the need for ambiguous relevance concepts; instead, by arguing for the use of simulation as a normative evaluation strategy for retrieval, some new concepts are introduced that can be employed in judging the effectiveness of search systems. Included are techniques for simulating search, techniques for formal user modelling, and techniques for generating measures of effectiveness for search models. The problems of evaluation and of defining users are generalized by proposing that they are related to the need for a unified framework for defining arbitrary search concepts, search systems, user models, and evaluation strategies. It is argued that this framework depends on a re-interpretation of the concept of search that accommodates the increasingly embedded and implicit nature of search on modern operating systems, the internet, and networks. The re-interpretation of the concept of search is approached by considering a generalization of the concept of ostensive retrieval, producing definitions of search, information need, user, and system that formally accommodate the perception of search as an abstract process that can be physical and/or computational. The feasibility of both the mathematical formalism and the physical conceptualizations of quantum theory (QT) are investigated for the purpose of modelling this abstract search process as a physical process.
    Techniques for representing a search process in the Hilbert space formalism of QT are presented, from which techniques are proposed for generating measures of effectiveness that combine static information, such as term weights, with dynamically changing information, such as probabilities of relevance. These techniques are used for deducing methods for modelling information need change. In mapping the 'macro-level search' process to 'micro-level physics', some generalizations were made to the use and interpretation of basic QT concepts, such as the wave function description of state and the reversible evolution of states, corresponding to the first and second postulates of quantum theory respectively. Several ways of expressing relevance (and other retrieval concepts) within the derived framework are proposed, arguing that the increase in modelling power gained by the use of QT provides effective ways to characterize this complex concept. Mapping the mathematical formalism of search to that of quantum theory presented insightful perspectives on the nature of search. However, differences between the operational semantics of quantum theory and search restricted the usefulness of the mapping. In trying to resolve these semantic differences, a semi-formal framework was developed that is mid-way between a programmatic language, a state-based language resembling the way QT models states, and a process description language. By using this framework, this thesis attempts to intimately link the theory and practice of information retrieval and the evaluation of the retrieval process. The result is a novel and useful way to formally discuss, model, and evaluate search concepts, search systems, and search processes.
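
    As a loose illustration of the Hilbert space representation described above (an analogy in the spirit of quantum-inspired IR, not the thesis' actual model), the sketch below encodes documents and a query as unit state vectors and reads off a Born-rule style 'probability of relevance' as the squared projection of the document state onto the query state; the three-dimensional term space and all weights are invented for the example.

        import numpy as np

        def normalize(v):
            # States in the Hilbert space formalism are unit vectors.
            return v / np.linalg.norm(v)

        # Hypothetical 3-term space; the static weights could come from tf-idf.
        docs = {
            "d1": normalize(np.array([0.9, 0.1, 0.0])),
            "d2": normalize(np.array([0.2, 0.7, 0.1])),
        }
        query = normalize(np.array([1.0, 0.5, 0.0]))

        # Born-rule style score: probability of observing 'relevance' when the
        # document state is measured against the query state, |<q|d>|^2.
        scores = {name: float(np.dot(query, d) ** 2) for name, d in docs.items()}
        for name, p in sorted(scores.items(), key=lambda kv: -kv[1]):
            print(name, round(p, 3))

    In this picture, information need change corresponds to an evolution of the query state, which is where the second postulate (reversible evolution) enters.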

    Requirement-based Root Cause Analysis Using Log Data

    Root cause analysis for software systems is a challenging diagnostic task due to the complexity emanating from the interactions between system components. Furthermore, the sheer size of the logged data often makes it difficult for human operators and administrators to perform problem diagnosis and root cause analysis. The diagnostic task is further complicated by the lack of models that could be used to support the diagnostic process. Traditionally, this diagnostic task is conducted by human experts who create mental models of systems in order to generate hypotheses and conduct the analysis, even in the presence of incomplete logged data. A challenge in this area is to provide the necessary concepts, tools, and techniques for operators to focus their attention on specific parts of the logged data, and ultimately to automate the diagnostic process. The work described in this thesis proposes a framework that includes techniques, formalisms, and algorithms aimed at automating the process of root cause analysis. In particular, this work uses annotated requirement goal models to represent the monitored systems' requirements and runtime behavior. The goal models are used in combination with log data to generate a ranked set of diagnoses that represent the combinations of tasks whose failure led to the observed failure. In addition, the framework uses a combination of word-based and topic-based information retrieval techniques to reduce the size of the log data by filtering out a subset of it to facilitate the diagnostic process. The log data filtering and reduction process is based on goal model annotations and generates a sequence of logical literals that represent the possible system observations. A second level of investigation consists of looking for evidence of any malicious activity (i.e., intentionally caused by a third party) leading to task failures. This analysis uses annotated anti-goal models that denote possible actions an external user can take to threaten a given system task. The framework uses a novel probabilistic approach based on Markov Logic Networks. Our experiments show that our approach improves over existing proposals by handling uncertainty in observations, by using natively generated log data, and by providing ranked diagnoses. The proposed framework has been evaluated using a test environment based on commercial off-the-shelf software components, a publicly available Java-based ATM machine, and the large, publicly available DARPA 2000 dataset.
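
    As a hypothetical sketch of the word-based filtering step only (the topic-based retrieval and the Markov Logic Network diagnosis are omitted), the snippet below ranks log lines against a goal-model annotation by tf-idf cosine similarity and keeps the closest matches as candidate observations; the annotation text, log lines, and threshold are all invented.

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        # Invented goal-model annotation and log lines for illustration.
        goal_annotation = "transfer funds debit account credit account commit transaction"
        log_lines = [
            "2023-01-01 10:02:11 DEBUG heartbeat ok",
            "2023-01-01 10:02:12 INFO debit account 4411 amount 200",
            "2023-01-01 10:02:12 ERROR credit account 7730 failed: timeout",
            "2023-01-01 10:02:13 INFO transaction rollback issued",
        ]

        vec = TfidfVectorizer()
        matrix = vec.fit_transform([goal_annotation] + log_lines)
        sims = cosine_similarity(matrix[0:1], matrix[1:]).ravel()

        # Keep only lines similar to the annotated task as candidate observations.
        for line, s in sorted(zip(log_lines, sims), key=lambda t: -t[1]):
            if s > 0.05:
                print(f"{s:.2f}  {line}")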

    Variable latent semantic indexing

    Latent Semantic Indexing is a classical method to produce optimal low-rank approximations of a term-document matrix. However, in the context of a particular query distribution, the approximation thus produced need not be optimal. We propose VLSI, a new query-dependent (or “variable”) low-rank approximation that minimizes approximation error for any specified query distribution. With this tool, it is possible to tailor the LSI technique to particular settings, often resulting in vastly improved approximations at much lower dimensionality. We validate this method via a series of experiments on classical corpora, showing that VLSI typically performs similarly to LSI with an order of magnitude fewer dimensions.
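
    One plausible reading of the query-dependent construction (an assumption for illustration, not necessarily the paper's exact algorithm) is to reweight the term-document matrix by a square root of the query distribution's second-moment matrix before truncating the SVD, so that approximation rank is spent where queries actually fall. The sketch below contrasts this with plain LSI on synthetic data.

        import numpy as np

        rng = np.random.default_rng(0)
        A = rng.random((50, 200))    # term-document matrix (terms x documents)
        Q = rng.random((50, 1000))   # sample queries from the query distribution

        def truncate(M, k):
            # Best rank-k approximation of M in the Frobenius norm (truncated SVD).
            U, s, Vt = np.linalg.svd(M, full_matrices=False)
            return (U[:, :k] * s[:k]) @ Vt[:k, :]

        # Plain LSI: query-oblivious rank-k approximation of A.
        A_lsi = truncate(A, k=10)

        # Query-aware variant: whiten by L with L @ L.T == E[q q^T],
        # truncate in the whitened space, then map back.
        S = Q @ Q.T / Q.shape[1]
        L = np.linalg.cholesky(S + 1e-9 * np.eye(A.shape[0]))
        A_vlsi = np.linalg.solve(L.T, truncate(L.T @ A, k=10))

        def expected_error(A_hat):
            # Root-mean-square retrieval error over the sampled queries.
            return np.linalg.norm(Q.T @ (A - A_hat)) / np.sqrt(Q.shape[1])

        print("LSI error over queries :", round(expected_error(A_lsi), 3))
        print("query-aware error      :", round(expected_error(A_vlsi), 3))

    Because the whitened truncation minimizes exactly the query-weighted error measured here, the query-aware figure can never exceed the plain-LSI figure (up to the tiny regularization term).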
