Decision Problems in Information Theory
Constraints on entropies are considered to be the laws of information theory. Even though the pursuit of their discovery has been a central theme of research in information theory, the algorithmic aspects of constraints on entropies remain largely unexplored. Here, we initiate an investigation of decision problems about constraints on entropies by placing several different such problems into levels of the arithmetical hierarchy. We establish the following results on checking the validity over all almost-entropic functions: first, validity of a Boolean information constraint arising from a monotone Boolean formula is co-recursively enumerable; second, validity of "tight" conditional information constraints is in $\Pi^0_3$. Furthermore, under some restrictions, validity of conditional information constraints "with slack" is in $\Sigma^0_2$, and validity of information inequality constraints involving max is Turing equivalent to validity of information inequality constraints (with no max involved). We also prove that the classical implication problem for conditional independence statements is co-recursively enumerable.
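To make the notion of an information inequality concrete, the sketch below computes the entropy vector of a small joint distribution and evaluates a linear inequality on it. It is only an illustration: the function names `entropy_vector` and `holds` and the XOR example are assumptions, not taken from the paper. Evaluating one distribution can refute an inequality but can never establish its validity over all (almost-)entropic functions, which is the problem the abstract is about.

```python
import itertools
import math
from collections import defaultdict

def entropy_vector(joint, n):
    """Map every subset S of {0..n-1} to the Shannon entropy H(X_S) in bits.

    joint: dict from n-tuples of outcomes to probabilities (hypothetical input format).
    """
    h = {}
    for r in range(n + 1):
        for S in itertools.combinations(range(n), r):
            marginal = defaultdict(float)
            for outcome, p in joint.items():
                marginal[tuple(outcome[i] for i in S)] += p
            h[frozenset(S)] = -sum(p * math.log2(p) for p in marginal.values() if p > 0)
    return h

def holds(h, coeffs, tol=1e-12):
    """Check sum_S c_S * h(S) >= 0 for one entropy vector h."""
    return sum(c * h[frozenset(S)] for S, c in coeffs.items()) >= -tol

# X and Y are independent fair bits, Z = X xor Y (variables 0, 1, 2).
joint = {(x, y, x ^ y): 0.25 for x in (0, 1) for y in (0, 1)}
h = entropy_vector(joint, 3)

# Submodularity instance: h(XZ) + h(YZ) - h(XYZ) - h(Z) >= 0, i.e. I(X;Y|Z) >= 0.
submodular = {(0, 2): 1, (1, 2): 1, (0, 1, 2): -1, (2,): -1}
# A false "inequality": h(Z) - h(XYZ) >= 0, refuted by this distribution.
bogus = {(2,): 1, (0, 1, 2): -1}

submodular_ok = holds(h, submodular)
bogus_ok = holds(h, bogus)
```

Here the submodularity instance holds while the second candidate is refuted, since h(Z) = 1 bit and h(XYZ) = 2 bits for this distribution.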
ProS: Data Series Progressive k-NN Similarity Search and Classification with Probabilistic Quality Guarantees
Existing systems dealing with the increasing volume of data series cannot
guarantee interactive response times, even for fundamental tasks such as
similarity search. Therefore, it is necessary to develop analytic approaches
that support exploration and decision making by providing progressive results,
before the final and exact ones have been computed. Prior works lack both
efficiency and accuracy when applied to large-scale data series collections. We
present and experimentally evaluate ProS, a new probabilistic learning-based
method that provides quality guarantees for progressive Nearest Neighbor (NN)
query answering. We develop our method for k-NN queries and demonstrate how it
can be applied with the two most popular distance measures, namely, Euclidean
and Dynamic Time Warping (DTW). We provide both initial and progressive
estimates of the final answer that improve during the similarity
search, as well as suitable stopping criteria for the progressive queries.
Moreover, we describe how this method can be used in order to develop a
progressive algorithm for data series classification (based on a k-NN
classifier), and we additionally propose a method designed specifically for the
classification task. Experiments with several diverse synthetic and real
datasets demonstrate that our prediction methods constitute the first practical
solutions to the problem, significantly outperforming competing approaches.
This paper was published in the VLDB Journal (2022)
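The idea of progressive query answering can be illustrated with a plain batched scan that reports its best-so-far k-NN answer after every batch. This is a generic sketch under stated assumptions, not ProS itself: it uses Euclidean distance only, offers no probabilistic guarantee or stopping criterion, and the names `progressive_knn` and `batch_size` are hypothetical.

```python
import heapq
import numpy as np

def progressive_knn(data, query, k, batch_size):
    """Yield (fraction_scanned, best_so_far) after each batch.

    best_so_far is a list of (distance, index) pairs sorted by distance.
    """
    heap = []  # holds the k best as (-distance, index); the root is the worst of them
    n = len(data)
    for start in range(0, n, batch_size):
        batch = data[start:start + batch_size]
        dists = np.linalg.norm(batch - query, axis=1)
        for offset, d in enumerate(dists):
            item = (-float(d), start + offset)
            if len(heap) < k:
                heapq.heappush(heap, item)
            elif item > heap[0]:
                heapq.heapreplace(heap, item)  # evict the current worst neighbour
        scanned = min(start + batch_size, n) / n
        yield scanned, sorted((-neg_d, i) for neg_d, i in heap)

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 16))
query = rng.normal(size=16)
steps = list(progressive_knn(data, query, k=3, batch_size=50))
```

Each intermediate answer is exact for the portion scanned so far, so the k-th distance can only shrink from one step to the next; a progressive method in the sense of the abstract would additionally estimate how far the current answer is from the final one.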
Graphical Conjunctive Queries
The Calculus of Conjunctive Queries (CCQ) has foundational status in database theory. A celebrated theorem of Chandra and Merlin states that CCQ query inclusion is decidable. Its proof transforms logical formulas to graphs: each query has a natural model - a kind of graph - and query inclusion reduces to the existence of a graph homomorphism between natural models.
We introduce the diagrammatic language Graphical Conjunctive Queries (GCQ) and show that it has the same expressivity as CCQ. GCQ terms are string diagrams, and their algebraic structure allows us to derive a sound and complete axiomatisation of query inclusion, which turns out to be exactly Carboni and Walters' notion of cartesian bicategory of relations. Our completeness proof exploits the combinatorial nature of string diagrams as (certain cospans of) hypergraphs: Chandra and Merlin's insights inspire a theorem that relates such cospans with spans. Completeness and decidability of the (in)equational theory of GCQ follow as a corollary. Categorically speaking, our contribution is a model-theoretic completeness theorem of free cartesian bicategories (on a relational signature) for the category of sets and relations.
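The Chandra and Merlin reduction mentioned above can be sketched directly: a query q1 is included in q2 exactly when there is a homomorphism from q2 into the natural model of q1 that maps the head of q2 onto the head of q1. The code below is a brute-force illustration under stated assumptions (queries contain variables only, no constants; the representation as `(head, body)` pairs and the name `contained_in` are hypothetical), and it does not use the diagrammatic machinery of GCQ.

```python
from itertools import product

def contained_in(q1, q2):
    """Return True if q1 is included in q2.

    A query is (head, body): head is a tuple of variables, body is a list of
    atoms (relation_name, tuple_of_variables).
    """
    head1, body1 = q1
    head2, body2 = q2
    if len(head1) != len(head2):
        return False
    facts = set(body1)  # natural model of q1: its own atoms, read as facts
    targets = sorted({v for _, args in body1 for v in args} | set(head1))
    variables = sorted({v for _, args in body2 for v in args} | set(head2))
    # Try every assignment of q2's variables to q1's variables (exponential).
    for image in product(targets, repeat=len(variables)):
        h = dict(zip(variables, image))
        if tuple(h[v] for v in head2) != tuple(head1):
            continue
        if all((rel, tuple(h[v] for v in args)) in facts for rel, args in body2):
            return True
    return False

# q_cycle(x) :- E(x, y), E(y, x)   "x lies on a 2-cycle"
q_cycle = (("x",), [("E", ("x", "y")), ("E", ("y", "x"))])
# q_edge(x) :- E(x, y)             "x has an outgoing edge"
q_edge = (("x",), [("E", ("x", "y"))])

cycle_in_edge = contained_in(q_cycle, q_edge)
edge_in_cycle = contained_in(q_edge, q_cycle)
```

Every node on a 2-cycle has an outgoing edge, so the first inclusion holds, while the converse fails because no homomorphism can send the atom E(y, x) into the single fact E(x, y) while fixing x.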
Large Scale Spectral Clustering Using Approximate Commute Time Embedding
Spectral clustering is a novel clustering method which can detect complex
shapes of data clusters. However, it requires the eigendecomposition of the
graph Laplacian matrix, which costs O(n^3) time for n data points and thus is not
suitable for large scale systems. Recently, many methods have been proposed to
accelerate spectral clustering. These approximate
methods usually involve sampling techniques, by which much of the information in the
original data may be lost. In this work, we propose a fast and accurate
spectral clustering approach using an approximate commute time embedding, which
is similar to the spectral embedding. The method requires neither a
sampling technique nor the computation of any eigenvectors. Instead, it uses random
projection and a linear-time solver to find the approximate embedding.
Experiments on several synthetic and real datasets show that the proposed
approach has better clustering quality and is faster than state-of-the-art
approximate spectral clustering methods.
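One standard way to obtain such an embedding without eigenvectors is to project the exact commute time embedding sqrt(vol) * W^{1/2} B L^+ onto a few random directions and recover the result by solving linear systems in the Laplacian L. The sketch below assumes this construction; the abstract does not spell out the details, so it should not be read as the authors' implementation. A dense least-squares solve stands in for the linear-time solver, and the names `approx_commute_embedding` and `dim` are hypothetical.

```python
import numpy as np

def approx_commute_embedding(W, dim, seed=0):
    """Return an (n x dim) embedding whose squared row distances approximate commute times.

    W: symmetric, nonnegative (n x n) weight matrix of a connected graph.
    """
    rng = np.random.default_rng(seed)
    n = W.shape[0]
    rows, cols = np.triu_indices(n, k=1)
    mask = W[rows, cols] > 0
    rows, cols = rows[mask], cols[mask]
    w = W[rows, cols]                      # one weight per undirected edge
    m = len(w)
    B = np.zeros((m, n))                   # signed edge-node incidence matrix
    B[np.arange(m), rows] = 1.0
    B[np.arange(m), cols] = -1.0
    L = B.T @ (w[:, None] * B)             # graph Laplacian, L = B^T W B
    Q = rng.choice([-1.0, 1.0], size=(dim, m)) / np.sqrt(dim)  # random projection
    Y = Q @ (np.sqrt(w)[:, None] * B)      # dim x n
    # L is singular, so take the minimum-norm solution of L Z = Y^T.
    Z = np.linalg.lstsq(L, Y.T, rcond=1e-10)[0]
    volume = 2.0 * w.sum()                 # sum of all node degrees
    return np.sqrt(volume) * Z

# Path graph 0 - 1 - 2 with unit weights; the exact commute time between
# nodes 0 and 2 is volume * effective resistance = 4 * 2 = 8.
W = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
Z = approx_commute_embedding(W, dim=2000)
d02 = float(np.sum((Z[0] - Z[2]) ** 2))
```

Clustering would then run an ordinary algorithm such as k-means on the rows of `Z`. The large `dim` here only makes the approximation easy to see on a tiny graph; in practice a far smaller projection dimension would be the point of the method.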