
    Infinite Probabilistic Databases

    Probabilistic databases (PDBs) are used to model uncertainty in data in a quantitative way. In the standard formal framework, PDBs are finite probability spaces over relational database instances. It has been argued convincingly that this is not compatible with an open-world semantics (Ceylan et al., KR 2016) and with application scenarios that are modeled by continuous probability distributions (Dalvi et al., CACM 2009). We recently introduced a model of PDBs as infinite probability spaces that addresses these issues (Grohe and Lindner, PODS 2019). While that work was mainly concerned with countably infinite probability spaces, our focus here is on uncountable spaces. Such an extension is necessary to model typical continuous probability distributions that appear in many applications. However, an extension beyond countable probability spaces raises nontrivial foundational issues concerned with the measurability of events and queries, and ultimately with the question of whether queries have a well-defined semantics. It turns out that so-called finite point processes are the appropriate model from probability theory for dealing with probabilistic databases. This model allows us to construct suitable (uncountable) probability spaces of database instances in a systematic way. Our main technical results are measurability statements for relational algebra queries as well as aggregate queries and Datalog queries.
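The "standard formal framework" mentioned in this abstract, a finite probability space over database instances, can be illustrated with a minimal sketch. It covers only the finite case, not the uncountable spaces or point processes the paper studies. The names `make_pdb`, `probability`, the `LivesIn` relation, and the example facts are hypothetical and chosen purely for illustration.

```python
from fractions import Fraction

# Assumption: an instance is a set of facts, and a fact is a pair
# (relation_name, tuple_of_values). These encodings are illustrative only.

def make_pdb(worlds):
    """Build a finite PDB from (instance, probability) pairs.

    Raises ValueError unless the probabilities sum to exactly 1.
    """
    total = sum((p for _, p in worlds), Fraction(0))
    if total != 1:
        raise ValueError(f"probabilities sum to {total}, expected 1")
    return [(frozenset(instance), p) for instance, p in worlds]

def probability(pdb, boolean_query):
    """Probability that a Boolean query holds: sum over satisfying worlds."""
    return sum((p for instance, p in pdb if boolean_query(instance)),
               Fraction(0))

# Three possible worlds (hypothetical data).
pdb = make_pdb([
    ({("LivesIn", ("alice", "Aachen"))}, Fraction(1, 2)),
    ({("LivesIn", ("alice", "Berlin"))}, Fraction(1, 3)),
    (set(), Fraction(1, 6)),
])

def alice_has_home(instance):
    """Boolean query: is there any LivesIn fact about alice?"""
    return any(rel == "LivesIn" and args[0] == "alice"
               for rel, args in instance)

print(probability(pdb, alice_has_home))
```

In this finite setting every set of instances is an event, so every query trivially has a well-defined probability. The measurability questions raised in the abstract only arise once the space of instances becomes uncountable.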

    Generalized h-index for Disclosing Latent Facts in Citation Networks

    What is the value of a scientist and their impact upon scientific thinking? How can we measure the prestige of a journal or of a conference? The evaluation of the scientific work of a scientist and the estimation of the quality of a journal or conference has long attracted significant interest, due to the benefits from obtaining an unbiased and fair criterion. Although it appears to be simple, defining a quality metric is not an easy task. To overcome the disadvantages of the present metrics used for ranking scientists and journals, J.E. Hirsch proposed a pioneering metric, the now famous h-index. In this article, we demonstrate several inefficiencies of this index and develop a pair of generalizations and effective variants of it to deal with scientist ranking and with publication forum ranking. The new citation indices are able to disclose trendsetters in scientific research, as well as researchers that constantly shape their field with their influential work, no matter how old they are. We exhibit the effectiveness and the benefits of the new indices to unfold the full potential of the h-index, with extensive experimental results obtained from DBLP, a widely known on-line digital library. Comment: 19 pages, 17 tables, 27 figures
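For readers unfamiliar with the baseline metric, the following minimal sketch computes Hirsch's original h-index: the largest h such that at least h papers have at least h citations each. It does not implement the generalized indices proposed in the article, which the abstract does not define. The function name `h_index` and the sample citation counts are illustrative assumptions.

```python
def h_index(citations):
    """Return the largest h such that h papers have >= h citations each."""
    # Rank papers from most cited to least cited.
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank      # the top `rank` papers all have >= rank citations
        else:
            break         # counts only decrease from here on
    return h

# Hypothetical citation counts for one author's papers.
print(h_index([10, 8, 5, 4, 3]))
```

Because the index depends only on citation counts, it treats old and recent citations alike. That kind of limitation is among those the abstract says its variants aim to address.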

    DataSpread: Unifying Databases and Spreadsheets.

    Spreadsheet software is often the tool of choice for ad-hoc tabular data management, processing, and visualization, especially on tiny data sets. On the other hand, relational database systems offer significant power, expressivity, and efficiency over spreadsheet software for data management, while lacking in the ease of use and ad-hoc analysis capabilities. We demonstrate DataSpread, a data exploration tool that holistically unifies databases and spreadsheets. It continues to offer a Microsoft Excel-based spreadsheet front-end, while in parallel managing all the data in a back-end database, specifically, PostgreSQL. DataSpread retains all the advantages of spreadsheets, including ease of use, ad-hoc analysis and visualization capabilities, and a schema-free nature, while also adding the advantages of traditional relational databases, such as scalability and the ability to use arbitrary SQL to import, filter, or join external or internal tables and have the results appear in the spreadsheet. DataSpread needs to reason about and reconcile differences in the notions of schema, addressing of cells and tuples, and the current pane (which exists in spreadsheets but not in traditional databases), and support data modifications at both the front-end and the back-end. Our demonstration will center on our first and early prototype of DataSpread, and will give the attendees a sense of the enormous data exploration capabilities offered by unifying spreadsheets and databases.