
    Canonical queries as a query answering device (Information Science)

    Issued as Annual reports [nos. 1-2], and Final report, Project no. G-36-60

    Online Data Cleaning

    Data-centric applications have never been more ubiquitous in our lives, e.g., search engines, route navigation and social media. This has brought along a new age where digital data is at the core of many decisions we make as individuals, e.g., looking for the most scenic route to plan a road trip, or as professionals, e.g., analysing customers’ transactions to predict the best time to restock different products. However, the surge in data generation has also led to creating massive amounts of dirty data, i.e., inaccurate or redundant data. Using dirty data to inform business decisions comes with dire consequences; for instance, an IBM report estimates that dirty data costs the U.S. $3.1 trillion a year. Dirty data is the product of many factors, which include data entry errors and integration of several data sources. Data integration of multiple sources is especially prone to producing dirty data. For instance, while individual sources may not have redundant data, they often carry redundant data across each other. Furthermore, different data sources may obey different business rules (sometimes not even known), which makes it challenging to reconcile the integrated data. Even if the data is clean at the time of the integration, data updates would compromise its quality over time. There is a wide spectrum of errors that can be found in the data, e.g., duplicate records, missing values, obsolete data, etc. To address these problems, several data cleaning efforts have been proposed, e.g., record linkage to identify duplicate records, data fusion to fuse duplicate data items into a single representation, and enforcing integrity constraints on the data. However, most existing efforts make two key assumptions: (1) data cleaning is done in one shot; and (2) the data is available in its entirety. Those two assumptions do not hold in our age where data is highly volatile and integrated from several sources. This calls for a paradigm shift in approaching data cleaning: it has to be made iterative, where data comes in chunks and not all at once. Consequently, cleaning the data should not be repeated from scratch whenever the data changes, but instead should be done only for data items affected by the updates. Moreover, the repair should be computed efficiently to support applications where cleaning is performed online (e.g., query-time data cleaning). In this dissertation, we present several proposals to realize this paradigm for two major types of data errors: duplicates and integrity constraint violations. We first present a framework that supports online record linkage and fusion over Web databases. Our system processes queries posted to Web databases. Query results are deduplicated, fused and then stored in a cache for future reference. The cache is updated iteratively with new query results. This effort makes it possible to perform record linkage and fusion not only efficiently, but also effectively, i.e., the cache contains data items seen in previous queries which are jointly cleaned with incoming query results. To address integrity constraint violations, we propose a novel way to approach Functional Dependency repairs, develop a new class of repairs and then demonstrate that it is superior to existing efforts in both runtime and accuracy. We then show how our framework can be easily tuned to work iteratively to support online applications. We implement a proof-of-concept query answering system to demonstrate the iterative capability of our system.
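
    As an illustration of the incremental-repair idea sketched above, the toy example below flags violations of a single functional dependency as data arrives in chunks, re-checking only the keys touched by each update. The class and method names are hypothetical and do not correspond to the dissertation's actual system; this is a minimal sketch of the general technique, not the proposed repair algorithm.

```python
from collections import defaultdict

class IncrementalFDChecker:
    """Minimal sketch: incrementally flag violations of a functional
    dependency lhs -> rhs as records arrive in chunks, re-examining
    only the keys touched by each update (illustrative names, not the
    dissertation's API)."""

    def __init__(self, lhs, rhs):
        self.lhs = lhs                    # tuple of attribute names, e.g. ("zip",)
        self.rhs = rhs                    # single attribute name, e.g. "city"
        self.index = defaultdict(set)     # lhs value -> set of rhs values seen so far

    def add_chunk(self, records):
        """Add a chunk of records; return the lhs keys that now violate the FD."""
        touched = set()
        for rec in records:
            key = tuple(rec[a] for a in self.lhs)
            self.index[key].add(rec[self.rhs])
            touched.add(key)
        # Only keys affected by this chunk need re-checking.
        return {k for k in touched if len(self.index[k]) > 1}

# Usage: the dependency zip -> city should hold.
checker = IncrementalFDChecker(lhs=("zip",), rhs="city")
checker.add_chunk([{"zip": "47906", "city": "West Lafayette"}])
violations = checker.add_chunk([{"zip": "47906", "city": "Lafayette"}])
print(violations)   # {('47906',)} -- only the updated key is revisited
```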

    Research in the Language, Information and Computation Laboratory of the University of Pennsylvania

    This report takes its name from the Computational Linguistics Feedback Forum (CLiFF), an informal discussion group for students and faculty. However, the scope of the research covered in this report is broader than the title might suggest; this is the yearly report of the LINC Lab, the Language, Information and Computation Laboratory of the University of Pennsylvania. It may at first be hard to see the threads that bind together the work presented here, work by faculty, graduate students and postdocs in the Computer Science and Linguistics Departments, and the Institute for Research in Cognitive Science. It includes prototypical Natural Language fields such as Combinatory Categorial Grammars, Tree Adjoining Grammars, syntactic parsing and the syntax-semantics interface; but it extends to statistical methods, plan inference, instruction understanding, intonation, causal reasoning, free word order languages, geometric reasoning, medical informatics, connectionism, and language acquisition. Naturally, this introduction cannot spell out all the connections between these abstracts; we invite you to explore them on your own. In fact, with this issue it’s easier than ever to do so: this document is accessible on the “information superhighway”. Just call up http://www.cis.upenn.edu/~cliff-group/94/cliffnotes.html In addition, you can find many of the papers referenced in the CLiFF Notes on the net. Most can be obtained by following links from the authors’ abstracts in the web version of this report. The abstracts describe the researchers’ many areas of investigation, explain their shared concerns, and present some interesting work in Cognitive Science. We hope its new online format makes the CLiFF Notes a more useful and interesting guide to Computational Linguistics activity at Penn.

    Relaxing and Restraining Queries for OBDA

    In ontology-based data access (OBDA), ontologies have been successfully employed for querying possibly unstructured and incomplete data. In this paper, we advocate using ontologies not only to formulate queries and compute their answers, but also for modifying queries by relaxing or restraining them, so that they can retrieve either more or fewer answers over a given dataset. Towards this goal, we first illustrate that some domain knowledge that could be naturally leveraged in OBDA can be expressed using complex role inclusions (CRI). Queries over ontologies with CRI are not first-order (FO) rewritable in general. We propose an extension of DL-Lite with CRI, and show that conjunctive queries over ontologies in this extension are FO rewritable. Our main contribution is a set of rules to relax and restrain conjunctive queries (CQs). Firstly, we define rules that use the ontology to produce CQs that are relaxations/restrictions over any dataset. Secondly, we introduce a set of data-driven rules that leverage patterns in the current dataset to obtain more fine-grained relaxations and restrictions.
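
    To make the relaxation/restriction intuition concrete, here is a toy sketch in which a class atom of a conjunctive query is relaxed by moving up a class hierarchy (retrieving more answers) or restrained by moving down it (retrieving fewer). The hierarchy and function names are illustrative only; the paper's formal rules over DL-Lite with CRI are considerably richer.

```python
# Toy sketch of ontology-driven relaxation and restriction of conjunctive
# query atoms: replacing a class atom with a superclass retrieves more
# answers, a subclass fewer. Hierarchy and names are illustrative, not the
# paper's formal rules.

SUBCLASS_OF = {                # child -> parent
    "Professor": "FacultyMember",
    "Lecturer": "FacultyMember",
    "FacultyMember": "Employee",
}

def relax_atom(cls):
    """Relaxation: move one step up the class hierarchy (more answers)."""
    return SUBCLASS_OF.get(cls, cls)

def restrain_atom(cls):
    """Restriction: move one step down the hierarchy (fewer answers);
    several subclasses may qualify, so return all candidates."""
    return [c for c, p in SUBCLASS_OF.items() if p == cls] or [cls]

# q(x) :- Professor(x)  ->  relaxed to FacultyMember(x), then Employee(x)
print(relax_atom("Professor"))         # FacultyMember
print(restrain_atom("FacultyMember"))  # ['Professor', 'Lecturer']
```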

    Adaptive runtime techniques for power and resource management on multi-core systems

    Energy-related costs are among the major contributors to the total cost of ownership of data centers and high-performance computing (HPC) clusters. As a result, future data centers must be energy-efficient to meet the continuously increasing computational demand. Constraining the power consumption of the servers is a widely used approach for managing energy costs and complying with power delivery limitations. In tandem, virtualization has become a common practice, as virtualization reduces hardware and power requirements by enabling consolidation of multiple applications onto a smaller set of physical resources. However, administration and management of data center resources have become more complex due to the growing number of virtualized servers installed in data centers. Therefore, designing autonomous and adaptive energy efficiency approaches is crucial to achieving sustainable and cost-efficient operation in data centers. Many modern data centers running enterprise workloads successfully implement energy efficiency approaches today. However, the nature of multi-threaded applications, which are becoming more common in all computing domains, brings additional design and management challenges. Tackling these challenges requires a deeper understanding of the interactions between the applications and the underlying hardware nodes. Although cluster-level management techniques bring significant benefits, node-level techniques provide more visibility into application characteristics, which can then be used to further improve the overall energy efficiency of the data centers. This thesis proposes adaptive runtime power and resource management techniques on multi-core systems. It demonstrates that taking multi-threaded workload characteristics into account during management significantly improves the energy efficiency of the server nodes, which are the basic building blocks of data centers. The key distinguishing features of this work are as follows: we implement the proposed runtime techniques on state-of-the-art commodity multi-core servers and show that their energy efficiency can be significantly improved by (1) taking multi-threaded application-specific characteristics into account while making resource allocation decisions, (2) accurately tracking dynamically changing power constraints by using low-overhead application-aware runtime techniques, and (3) coordinating dynamic adaptive decisions at various layers of the computing stack, specifically at the system and application levels. Our results show that efficient resource distribution under power constraints yields energy savings of up to 24% compared to existing approaches, along with the ability to meet power constraints 98% of the time for a diverse set of multi-threaded applications.
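
    The sketch below illustrates the general shape of a power-capping feedback loop of the kind discussed above: periodically measure node power, compare it with a cap, and step the DVFS level up or down. The sensor and actuator functions are placeholders (e.g., for RAPL counters and cpufreq interfaces), not the thesis's actual API, and the controller is far simpler than the coordinated, application-aware techniques the work proposes.

```python
import random
import time

# Minimal sketch of a power-capping control loop. read_power_watts() and
# set_dvfs_level() are placeholders for platform interfaces; the real thesis
# work coordinates decisions across system and application levels.

DVFS_LEVELS = [1.2, 1.6, 2.0, 2.4, 2.8]    # available core frequencies (GHz)

def read_power_watts():
    return random.uniform(80, 140)          # stand-in for a RAPL/IPMI reading

def set_dvfs_level(freq_ghz):
    print(f"setting core frequency to {freq_ghz} GHz")

def power_cap_loop(cap_watts=110.0, interval_s=0.5, iterations=5):
    level = len(DVFS_LEVELS) - 1            # start at the highest frequency
    for _ in range(iterations):
        power = read_power_watts()
        if power > cap_watts and level > 0:
            level -= 1                       # over the cap: slow down
        elif power < 0.9 * cap_watts and level < len(DVFS_LEVELS) - 1:
            level += 1                       # comfortably under: speed up
        set_dvfs_level(DVFS_LEVELS[level])
        time.sleep(interval_s)

power_cap_loop()
```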

    Theory and Application of Dynamic Spatial Time Series Models

    Stochastic economic processes are often characterized by dynamic interactions between variables that are dependent in both space and time. Analyzing these processes raises a number of questions about the econometric methods used that are both practically and theoretically interesting. This work studies econometric approaches for analyzing spatial data that evolves dynamically over time. The book provides a background on least squares and maximum likelihood estimators, and discusses some of the limits of basic econometric theory. It then discusses the importance of addressing spatial heterogeneity in policies. The next chapters cover parametric modeling of linear and nonlinear spatial time series, non-parametric modeling of nonlinearities in panel data, modeling of multiple spatial time series variables that exhibit long and short memory, and probabilistic causality in spatial time series settings.
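
    As a small, concrete example of the kind of process this literature studies, the sketch below simulates a dynamic spatial lag model, y_t = λ W y_t + γ y_{t-1} + ε_t, via its reduced form. The contiguity matrix and parameter values are illustrative and are not taken from the book.

```python
import numpy as np

# Minimal sketch of a dynamic spatial lag process, a basic workhorse of the
# spatial time series literature: y_t = lam * W y_t + gamma * y_{t-1} + eps_t,
# simulated via the reduced form y_t = (I - lam W)^{-1} (gamma y_{t-1} + eps_t).
# The row-normalised ring contiguity matrix W and parameters are illustrative.

rng = np.random.default_rng(0)
n, T = 5, 50
lam, gamma, sigma = 0.4, 0.5, 1.0

# Ring contiguity: each unit's neighbours are the two adjacent units.
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] = W[i, (i + 1) % n] = 0.5   # row-normalised weights

A_inv = np.linalg.inv(np.eye(n) - lam * W)
y = np.zeros((T, n))
for t in range(1, T):
    eps = rng.normal(scale=sigma, size=n)
    y[t] = A_inv @ (gamma * y[t - 1] + eps)

print(y[-1])   # cross-section of the process at the final period
```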

    Multiscale Modeling and Simulation of Deformation Accumulation in Fault Networks

    Strain accumulation and stress release along multiscale geological fault networks are fundamental mechanisms for earthquake and rupture processes in the lithosphere. Due to long periods of seismic quiescence, the scarcity of large earthquakes and the incompleteness of the paleoseismic, historical and instrumental record, there is a fundamental lack of insight into the multiscale, spatio-temporal nature of earthquake dynamics in fault networks. This thesis constitutes another step towards reliable earthquake prediction and quantitative hazard analysis. Its focus lies on developing a mathematical model for prototypical, layered fault networks on short time scales as well as their efficient numerical simulation. This exposition begins by establishing a fault system consisting of layered bodies with viscoelastic Kelvin–Voigt rheology and non-intersecting faults featuring rate-and-state friction as proposed by Dieterich and Ruina. The individual bodies are assumed to experience small viscoelastic deformations, but possibly large relative tangential displacements. Thereafter, semi-discretization in time of the variational formulation with the classical Newmark scheme yields a sequence of continuous, nonsmooth, coupled, spatial minimization problems for the velocities and states in each time step, which are decoupled by means of a fixed-point iteration. Subsequently, spatial discretization is based on linear and piecewise constant finite elements for the rate and state problems, respectively. A dual mortar discretization of the non-penetration constraints entails a hierarchical decomposition of the discrete solution space, which enables the localization of the non-penetration condition. Exploiting the resulting structure, an algebraic representation of the parametrized rate problem can be solved efficiently using a variant of the Truncated Nonsmooth Newton Multigrid (TNNMG) method. It is globally convergent due to nonlinear, block Gauß–Seidel type smoothing and employs nonsmooth Newton and multigrid ideas to enhance the robustness and efficiency of the overall method. A key step in the TNNMG algorithm is the efficient computation of a correction obtained from a linearized, inexact Newton step. The second part addresses the numerical homogenization of elliptic variational problems featuring fractal interface networks, which are structurally similar to the ones arising in the linearized correction step of the TNNMG method. Contrary to the previous setting, this model incorporates the full spatial complexity of geological fault networks in terms of truly multiscale fractal interface geometries. Here, the construction of projections from a fractal function space to finite element spaces with suitable approximation and stability properties constitutes the main contribution of this thesis. The existence of these projections enables the application of well-known approaches to numerical homogenization, such as localized orthogonal decomposition (LOD) for the construction of multiscale discretizations with optimal a priori error estimates, or subspace correction methods, which lead to algebraic solvers with mesh- and scale-independent convergence rates. Finally, numerical experiments with a single fault and the layered multiscale fault system illustrate the properties of the mathematical model as well as the efficiency, reliability and scale-independence of the suggested algebraic solver.
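
    To illustrate the time discretization named above, the following sketch applies the classical Newmark scheme to a single-degree-of-freedom Kelvin–Voigt element (spring and dashpot in parallel, plus a mass). It is a toy scalar problem under stated assumptions and does not include the multibody contact, rate-and-state friction, or TNNMG components of the thesis.

```python
import numpy as np

# Minimal sketch of classical Newmark time stepping for a scalar Kelvin-Voigt
# element with mass m, damping c and stiffness k: m*a + c*v + k*u = f(t).
# Average-acceleration parameters beta = 1/4, gamma = 1/2 give an
# unconditionally stable, second-order scheme for this linear problem.

def newmark(m, c, k, f, u0, v0, dt, steps, beta=0.25, gamma=0.5):
    u, v = u0, v0
    a = (f(0.0) - c * v - k * u) / m          # consistent initial acceleration
    traj = [u]
    for n in range(1, steps + 1):
        t = n * dt
        # Newmark predictors with the new acceleration still unknown.
        u_pred = u + dt * v + dt**2 * (0.5 - beta) * a
        v_pred = v + dt * (1.0 - gamma) * a
        # Solve the (here scalar) system for the new acceleration.
        a_new = (f(t) - c * v_pred - k * u_pred) / (m + gamma * dt * c + beta * dt**2 * k)
        u = u_pred + beta * dt**2 * a_new
        v = v_pred + gamma * dt * a_new
        a = a_new
        traj.append(u)
    return np.array(traj)

# Free vibration of a lightly damped element released from u0 = 1.
displacements = newmark(m=1.0, c=0.1, k=4.0, f=lambda t: 0.0,
                        u0=1.0, v0=0.0, dt=0.05, steps=200)
print(displacements[:5])
```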