5,659 research outputs found

    Schema Independent Relational Learning

    Full text link
    Learning novel concepts and relations from relational databases is an important problem with many applications in database systems and machine learning. Relational learning algorithms learn the definition of a new relation in terms of existing relations in the database. Nevertheless, the same data set may be represented under different schemas for various reasons, such as efficiency, data quality, and usability. Unfortunately, the output of current relational learning algorithms tends to vary quite substantially over the choice of schema, both in terms of learning accuracy and efficiency. This variation complicates their off-the-shelf application. In this paper, we introduce and formalize the property of schema independence of relational learning algorithms, and study both the theoretical and empirical dependence of existing algorithms on the common class of (de) composition schema transformations. We study both sample-based learning algorithms, which learn from sets of labeled examples, and query-based algorithms, which learn by asking queries to an oracle. We prove that current relational learning algorithms are generally not schema independent. For query-based learning algorithms we show that the (de) composition transformations influence their query complexity. We propose Castor, a sample-based relational learning algorithm that achieves schema independence by leveraging data dependencies. We support the theoretical results with an empirical study that demonstrates the schema dependence/independence of several algorithms on existing benchmark and real-world datasets under (de) compositions

    Generating Preview Tables for Entity Graphs

    Full text link
    Users are tapping into massive, heterogeneous entity graphs for many applications. It is challenging to select entity graphs for a particular need, given abundant datasets from many sources and the oftentimes scarce information for them. We propose methods to produce preview tables for compact presentation of important entity types and relationships in entity graphs. The preview tables assist users in attaining a quick and rough preview of the data. They can be shown in a limited display space for a user to browse and explore, before she decides to spend time and resources to fetch and investigate the complete dataset. We formulate several optimization problems that look for previews with the highest scores according to intuitive goodness measures, under various constraints on preview size and distance between preview tables. The optimization problem under distance constraint is NP-hard. We design a dynamic-programming algorithm and an Apriori-style algorithm for finding optimal previews. Results from experiments, comparison with related work and user studies demonstrated the scoring measures' accuracy and the discovery algorithms' efficiency.Comment: This is the camera-ready version of a SIGMOD16 paper. There might be tiny differences in layout, spacing and linebreaking, compared with the version in the SIGMOD16 proceedings, since we must submit TeX files and use arXiv to compile the file

    Towards a Linked Semantic Web: Precisely, Comprehensively and Scalably Linking Heterogeneous Data in the Semantic Web

    Get PDF
    The amount of Semantic Web data is growing rapidly today. Individual users, academic institutions and businesses have already published and are continuing to publish their data in Semantic Web standards, such as RDF and OWL. Due to the decentralized nature of the Semantic Web, the same real world entity may be described in various data sources with different ontologies and assigned syntactically distinct identifiers. Furthermore, data published by each individual publisher may not be complete. This situation makes it difficult for end users to consume the available Semantic Web data effectively. In order to facilitate data utilization and consumption in the Semantic Web, without compromising the freedom of people to publish their data, one critical problem is to appropriately interlink such heterogeneous data. This interlinking process is sometimes referred to as Entity Coreference, i.e., finding which identifiers refer to the same real world entity. In the Semantic Web, the owl:sameAs predicate is used to link two equivalent (coreferent) ontology instances. An important question is where these owl:sameAs links come from. Although manual interlinking is possible on small scales, when dealing with large-scale datasets (e.g., millions of ontology instances), automated linking becomes necessary. This dissertation summarizes contributions to several aspects of entity coreference research in the Semantic Web. First of all, by developing the EPWNG algorithm, we advance the performance of the state-of-the-art by 1% to 4%. EPWNG finds coreferent ontology instances from different data sources by comparing every pair of instances and focuses on achieving high precision and recall by appropriately collecting and utilizing instance context information domain-independently. We further propose a sampling and utility function based context pruning technique, which provides a runtime speedup factor of 30 to 75. Furthermore, we develop an on-the-fly candidate selection algorithm, P-EPWNG, that enables the coreference process to run 2 to 18 times faster than the state-of-the-art on up to 1 million instances while only making a small sacrifice in the coreference F1-scores. This is achieved by utilizing the matching histories of the instances to prune instance pairs that are not likely to be coreferent. We also propose Offline, another candidate selection algorithm, that not only provides similar runtime speedup to P-EPWNG but also helps to achieve higher candidate selection and coreference F1-scores due to its more accurate filtering of true negatives. Different from P-EPWNG, Offline pre-selects candidate pairs by only comparing their partial context information that is selected in an unsupervised, automatic and domain-independent manner.In order to be able to handle really heterogeneous datasets, a mechanism for automatically determining predicate comparability is proposed. Combing this property matching approach with EPWNG and Offline, our system outperforms state-of-the-art algorithms on the 2012 Billion Triples Challenge dataset on up to 2 million instances for both coreference F1-score and runtime. An interesting project, where we apply the EPWNG algorithm for assisting cervical cancer screening, is discussed in detail. By applying our algorithm to a combination of different patient clinical test results and biographic information, we achieve higher accuracy compared to its ablations. We end this dissertation with the discussion of promising and challenging future work

    Predicing Ecological Effects of Watershed-Wide Rain Garden Implementation Using a Low-Cost Methodology

    Get PDF
    Stormwater control measures (SCMs) have been employed to mitigate peak flows and pollutants ssociated with watershed urbanization. Downstream ecological effects caused by the implementation of SCMs are largely unknown, especially at the watershed scale. Knowledge of these effects could help with setting goals for and targeting locations of local restoration efforts. Unfortunately, studies such as these typically require a high level of time and effort for the investigating party, of which resources are often limited. This study proposes a low-cost investigation method for the prediction of ecological effects on the watershed scale with the implementation of rain garden systems by using publicly available data and software. For demonstration purposes, a typical urban watershed was modeled using Storm Water Management Model (SWMM) 5.0. Forty-five models were developed in which the percent impervious area was varied 3 to 80%, and the fraction of rain gardens implemented with respect to the number of structures was varied from to 100%. The river chub fish (Nocomis micropogon) and its congeners (Nocomis spp.) were chosen as ecological indicators, as they are considered to be keystone species through interspecific nesting association. Depth and velocity criteria for successful nest building locations of the river chub were determined; these criteria can then be applied to many other watersheds. In this study, both base flow conditions and a typical summer storm event (1.3 cm, 6 h duration) were evaluated. During the simulated storm, nest-building locations were not affected in the 3 and 5% impervious cover models. Nest destruction was found to occur in approximately 54% of the original nest building sites for the 9% and 10% impervious areas. Nearly all of the nest-building locations were uninhabitable for impervious areas 20% and greater. Rain garden implementation significantly improved river chub habitat in the simulation, with greatest marginal benefit at lower levels of implementation

    The Effect of Applying Design of Experiments Techniques to Software Performance Testing

    Get PDF
    Effective software performance testing is essential to the development and delivery of quality software products. Many software testing investigations have reported software performance testing improvements, but few have quantitatively validated measurable software testing performance improvements across an aggregate of studies. This study addressed that gap by conducting a meta-analysis to assess the relationship between applying Design of Experiments (DOE) techniques in the software testing process and the reported software performance testing improvements. Software performance testing theories and DOE techniques composed the theoretical framework for this study. Software testing studies (n = 96) were analyzed, where half had DOE techniques applied and the other half did not. Five research hypotheses were tested, where findings were measured in (a) the number of detected defects, (b) the rate of defect detection, (c) the phase in which the defect was detected, (d) the total number of hours it took to complete the testing, and (e) an overall hypothesis which included all measurements for all findings. The data were analyzed by first computing standard difference in means effect sizes, then through the Z test, the Q test, and the t test in statistical comparisons. Results of the meta-analysis showed that applying DOE techniques in the software testing process improved software performance testing (p \u3c 05). These results have social implications for the software testing industry and software testing professionals, providing another empirically-validated testing methodology. Software organizations can use this methodology to differentiate their software testing process, to create more quality products, and to benefit the consumer and society in general

    Query Rewriting and Optimization for Ontological Databases

    Full text link
    Ontological queries are evaluated against a knowledge base consisting of an extensional database and an ontology (i.e., a set of logical assertions and constraints which derive new intensional knowledge from the extensional database), rather than directly on the extensional database. The evaluation and optimization of such queries is an intriguing new problem for database research. In this paper, we discuss two important aspects of this problem: query rewriting and query optimization. Query rewriting consists of the compilation of an ontological query into an equivalent first-order query against the underlying extensional database. We present a novel query rewriting algorithm for rather general types of ontological constraints which is well-suited for practical implementations. In particular, we show how a conjunctive query against a knowledge base, expressed using linear and sticky existential rules, that is, members of the recently introduced Datalog+/- family of ontology languages, can be compiled into a union of conjunctive queries (UCQ) against the underlying database. Ontological query optimization, in this context, attempts to improve this rewriting process so to produce possibly small and cost-effective UCQ rewritings for an input query.Comment: arXiv admin note: text overlap with arXiv:1312.5914 by other author

    Model based test suite minimization using metaheuristics

    Get PDF
    Software testing is one of the most widely used methods for quality assurance and fault detection purposes. However, it is one of the most expensive, tedious and time consuming activities in software development life cycle. Code-based and specification-based testing has been going on for almost four decades. Model-based testing (MBT) is a relatively new approach to software testing where the software models as opposed to other artifacts (i.e. source code) are used as primary source of test cases. Models are simplified representation of a software system and are cheaper to execute than the original or deployed system. The main objective of the research presented in this thesis is the development of a framework for improving the efficiency and effectiveness of test suites generated from UML models. It focuses on three activities: transformation of Activity Diagram (AD) model into Colored Petri Net (CPN) model, generation and evaluation of AD based test suite and optimization of AD based test suite. Unified Modeling Language (UML) is a de facto standard for software system analysis and design. UML models can be categorized into structural and behavioral models. AD is a behavioral type of UML model and since major revision in UML version 2.x it has a new Petri Nets like semantics. It has wide application scope including embedded, workflow and web-service systems. For this reason this thesis concentrates on AD models. Informal semantics of UML generally and AD specially is a major challenge in the development of UML based verification and validation tools. One solution to this challenge is transforming a UML model into an executable formal model. In the thesis, a three step transformation methodology is proposed for resolving ambiguities in an AD model and then transforming it into a CPN representation which is a well known formal language with extensive tool support. Test case generation is one of the most critical and labor intensive activities in testing processes. The flow oriented semantic of AD suits modeling both sequential and concurrent systems. The thesis presented a novel technique to generate test cases from AD using a stochastic algorithm. In order to determine if the generated test suite is adequate, two test suite adequacy analysis techniques based on structural coverage and mutation have been proposed. In terms of structural coverage, two separate coverage criteria are also proposed to evaluate the adequacy of the test suite from both perspectives, sequential and concurrent. Mutation analysis is a fault-based technique to determine if the test suite is adequate for detecting particular types of faults. Four categories of mutation operators are defined to seed specific faults into the mutant model. Another focus of thesis is to improve the test suite efficiency without compromising its effectiveness. One way of achieving this is identifying and removing the redundant test cases. It has been shown that the test suite minimization by removing redundant test cases is a combinatorial optimization problem. An evolutionary computation based test suite minimization technique is developed to address the test suite minimization problem and its performance is empirically compared with other well known heuristic algorithms. Additionally, statistical analysis is performed to characterize the fitness landscape of test suite minimization problems. The proposed test suite minimization solution is extended to include multi-objective minimization. As the redundancy is contextual, different criteria and their combination can significantly change the solution test suite. Therefore, the last part of the thesis describes an investigation into multi-objective test suite minimization and optimization algorithms. The proposed framework is demonstrated and evaluated using prototype tools and case study models. Empirical results have shown that the techniques developed within the framework are effective in model based test suite generation and optimizatio

    Efficient Algorithms for Prokaryotic Whole Genome Assembly and Finishing

    Get PDF
    De-novo genome assembly from DNA fragments is primarily based on sequence overlap information. In addition, mate-pair reads or paired-end reads provide linking information for joining gaps and bridging repeat regions. Genome assemblers in general assemble long contiguous sequences (contigs) using both overlapping reads and linked reads until the assembly runs into an ambiguous repeat region. These contigs are further bridged into scaffolds using linked read information. However, errors can be made in both phases of assembly due to high error threshold of overlap acceptance and linking based on too few mate reads. Identical as well as similar repeat regions can often cause errors in overlap and mate-pair evidence. In addition, the problem of setting the correct threshold to minimize errors and optimize assembly of reads is not trivial and often requires a time-consuming trial and error process to obtain optimal results. The typical trial-and-error with multiple assembler, which can be computationally intensive, and is very inefficient, especially when users must learn how to use a wide variety of assemblers, many of which may be serial requiring long execution time and will not return usable or accurate results. Further, we show that the comparison of assembly results may not provide the users with a clear winner under all circumstances. Therefore, we propose a novel scaffolding tool, Correlative Algorithm for Repeat Placement (CARP), capable of joining short low error contigs using mate pair reads, computationally resolved repeat structures and synteny with one or more reference organisms. The CARP tool requires a set of repeat sequences such as insertion sequences (IS) that can be found computationally found without assembling the genome. Development of methods to identify such repeating regions directly from raw sequence reads or draft genomes led to the development of the ISQuest software package. ISQuest identifies bacterial ISs and their sequence elements—inverted and direct repeats—in raw read data or contigs using flexible search parameters. ISQuest is capable of finding ISs in hundreds of partially assembled genomes within hours; making it a valuable high-throughput tool for a global search of IS and repeat elements. The CARP tool matches very low error contigs with strong overlap using the ambiguous partial repeat sequence at the ends of the contig annotated using the repeat sequences discovered using ISQuest. These matches are verified by synteny with genomes of one or more reference organisms. We show that the CARP tool can be used to verify low mate pair evidence regions, independently find new joins and significantly reduce the number of scaffolds. Finally, we are demonstrate a novel viewer that presents to the user the computationally derived joins along with the evidence used to make the joins. The viewer allows the user to independently assess their confidence in the joins made by the finishing tools and make an informed decision of whether to invest the resources necessary to confirm a particular portion of the assembly. Further, we allow users to manually record join evidence, re-order contigs, and track the assembly finishing process
    • …
    corecore