Mixing set and bag semantics
The conservativity theorem for nested relational calculus implies that query
expressions can freely use nesting and unnesting, yet as long as the query
result type is a flat relation, these capabilities do not lead to an increase
in expressiveness over flat relational queries. Moreover, Wong showed how such
queries can be translated to SQL via a constructive rewriting algorithm. While
this result holds for queries over either set or multiset semantics, to the
best of our knowledge, the questions of conservativity and normalization have
not been studied for queries that mix set and bag collections, or provide
duplicate-elimination operations such as SQL's DISTINCT. In this paper we
formalize the problem,
and present partial progress: specifically, we introduce a calculus with both
set and multiset collection types, along with natural mappings from sets to
bags and vice versa, present a set of valid rewrite rules for normalizing such
queries, and give an inductive characterization of a set of queries whose
normal forms can be translated to SQL. We also consider examples that do not
appear straightforward to translate to SQL, illustrating that the relative
expressiveness of flat and nested queries with mixed set and multiset semantics
remains an open question. Comment: DBPL 2019 -- short paper
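The two natural mappings between collection types that the abstract mentions can be sketched as follows. This is a minimal illustration, not the paper's calculus: bags are modelled as multiplicity maps, and the function names `dedup` and `promote` are chosen here for clarity.

```python
from collections import Counter

def dedup(bag: Counter) -> frozenset:
    """Bag -> Set: forget multiplicities (duplicate elimination, as in SQL's DISTINCT)."""
    return frozenset(bag)

def promote(s: frozenset) -> Counter:
    """Set -> Bag: each element occurs with multiplicity exactly 1."""
    return Counter(dict.fromkeys(s, 1))

# A small bag (multiset) relation with a duplicate row.
r = Counter({("a", 1): 2, ("b", 2): 1})

assert promote(dedup(r)) == Counter({("a", 1): 1, ("b", 2): 1})
# dedup is a left inverse of promote: round-tripping a set is the identity.
assert dedup(promote(dedup(r))) == dedup(r)
```

The second assertion illustrates one of the valid rewrite rules one would expect in such a calculus: eliminating duplicates after promoting a set is a no-op.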
Topic-based mixture language modelling
This paper describes an approach for constructing a mixture of language models based on simple statistical notions of semantics using probabilistic models developed for information retrieval. The approach encapsulates corpus-derived semantic information and is able to model varying styles of text. Using such information, the corpus texts are clustered in an unsupervised manner and a mixture of topic-specific language models is automatically created. The principal contribution of this work is to characterise the document space resulting from information retrieval techniques and to demonstrate the approach for mixture language modelling.
A comparison is made between manual and automatic clustering in order to elucidate how the global content information is expressed in the space. We also compare (in terms of association with manual clustering and language modelling accuracy) alternative term-weighting schemes and the effect of singular value decomposition dimension reduction (latent semantic analysis). Test set perplexity results using the British National Corpus indicate that the approach can improve the potential of statistical language modelling. Using an adaptive procedure, the conventional model may be tuned to track text data with a slight increase in computational cost.
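The core mixture idea can be sketched numerically. This is a toy example with invented topic distributions and weights (none of the values come from the paper), assuming unigram topic models p_k(w) combined as P(w) = Σ_k λ_k p_k(w) and evaluated by test-set perplexity:

```python
import math

# Toy topic-specific unigram models and mixture weights (illustrative values);
# a real system would estimate these from the unsupervised clustering step.
topic_models = [
    {"query": 0.5, "index": 0.3, "rank": 0.2},   # a hypothetical "IR" topic
    {"note": 0.6, "chord": 0.3, "rank": 0.1},    # a hypothetical "music" topic
]
weights = [0.7, 0.3]

def mixture_prob(word, eps=1e-12):
    """P(w) = sum_k lambda_k * p_k(w), floored for unseen words."""
    return sum(l * m.get(word, 0.0) for l, m in zip(weights, topic_models)) or eps

def perplexity(words):
    """Test-set perplexity: exp of the average negative log-probability."""
    return math.exp(-sum(math.log(mixture_prob(w)) for w in words) / len(words))

print(round(perplexity(["query", "rank"]), 2))  # -> 4.1
```

Lower perplexity on held-out text indicates a better fit; the adaptive procedure described in the abstract would correspond to re-estimating the weights as the text stream drifts.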
A Simple Method to Produce Algorithmic MIDI Music based on Randomness, Simple Probabilities and Multi-Threading
This paper introduces a simple method for producing multichannel MIDI music
that is based on randomness and simple probabilities. One distinctive feature
of the method is that it produces, and sends to the sound card in parallel,
more than one unsynchronized channel by exploiting the multi-threading
capabilities of general-purpose programming languages. As a consequence, the
derived sound offers a quite "full" and "unpredictable" acoustic experience to
the listener. Subsequently, the paper reports the results of an evaluation with
users. The results were very surprising: the majority of users responded that
they could tolerate this music on various occasions. Comment: 7 pages, 5 figures
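The multi-threaded scheme can be sketched as below. This is a minimal simulation, assuming one thread per MIDI channel; events are recorded in a list rather than sent through a real MIDI backend, and the channel count, note range, and probabilities are illustrative rather than the paper's values.

```python
import random
import threading
import time

events = []                 # (channel, note) pairs a real backend would emit
lock = threading.Lock()

def play_channel(channel, n_notes=5):
    """One unsynchronized channel: each step plays a random note or rests."""
    for _ in range(n_notes):
        if random.random() < 0.8:            # simple probability: play vs. rest
            note = random.randint(48, 72)    # a pitch in a two-octave range
            with lock:
                events.append((channel, note))
        time.sleep(random.uniform(0.0, 0.01))  # threads drift out of sync

threads = [threading.Thread(target=play_channel, args=(ch,)) for ch in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(events))  # at most 20 events across 4 unsynchronized channels
```

Because the threads sleep for independent random durations, the interleaving of channels differs on every run, which is what produces the "unpredictable" layered texture the abstract describes.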
Non-Compositional Term Dependence for Information Retrieval
Modelling term dependence in IR aims to identify co-occurring terms that are
too heavily dependent on each other to be treated as a bag of words, and to
adapt the indexing and ranking accordingly. Dependent terms are predominantly
identified using lexical frequency statistics, assuming that (a) if terms
co-occur often enough in some corpus, they are semantically dependent; (b) the
more often they co-occur, the more semantically dependent they are. This
assumption is not always correct: the frequency of co-occurring terms can
diverge from the strength of their semantic dependence. For example, "red tape"
might be less frequent overall than "tape measure" in some corpus, but this
does not mean that "red"+"tape" are less dependent than "tape"+"measure". This is
especially the case for non-compositional phrases, i.e. phrases whose meaning
cannot be composed from the individual meanings of their terms (such as the
phrase "red tape" meaning bureaucracy). Motivated by this lack of distinction
between the frequency and strength of term dependence in IR, we present a
principled approach for handling term dependence in queries, using both lexical
frequency and semantic evidence. We focus on non-compositional phrases,
extending a recent unsupervised model for their detection [21] to IR. Our
approach, integrated into ranking using Markov Random Fields [31], yields
effectiveness gains over competitive TREC baselines, showing that there is
still room for improvement in the very well-studied area of term dependence in
IR.
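The distinction the abstract draws can be illustrated numerically. This is a toy sketch with invented counts and 2-d embeddings (none of these numbers come from the paper or TREC), contrasting a frequency-based dependence score (pointwise mutual information) with a semantic non-compositionality score:

```python
import math

# Invented corpus statistics: "tape measure" co-occurs more often than "red tape".
N = 1_000_000
count = {"red": 900, "tape": 1200, "measure": 800,
         ("red", "tape"): 40, ("tape", "measure"): 90}

def pmi(w1, w2):
    """Lexical-frequency evidence: pointwise mutual information."""
    return math.log((count[(w1, w2)] / N) / ((count[w1] / N) * (count[w2] / N)))

def cos(u, v):
    return sum(a * b for a, b in zip(u, v)) / (math.hypot(*u) * math.hypot(*v))

# Toy embeddings: the phrase vector for "red tape" (the bureaucracy sense)
# sits far from the sum of its parts; "tape measure" sits close to its parts.
vec = {"red": [1.0, 0.0], "tape": [0.0, 1.0], "measure": [0.1, 0.9],
       "red tape": [-0.6, 0.2], "tape measure": [0.2, 1.8]}

def noncomp(w1, w2):
    """Semantic evidence: 1 - cos(vec(phrase), vec(w1) + vec(w2))."""
    composed = [a + b for a, b in zip(vec[w1], vec[w2])]
    return 1.0 - cos(vec[f"{w1} {w2}"], composed)

print(pmi("tape", "measure") > pmi("red", "tape"))          # True: frequency favours "tape measure"
print(noncomp("red", "tape") > noncomp("tape", "measure"))  # True: semantics favours "red tape"
```

The two scores disagree on which pair is more strongly dependent, which is exactly the gap between lexical frequency and semantic evidence that motivates combining both signals when ranking.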