Crosslingual Document Embedding as Reduced-Rank Ridge Regression
There has recently been much interest in extending vector-based word
representations to multiple languages, such that words can be compared across
languages. In this paper, we shift the focus from words to documents and
introduce a method for embedding documents written in any language into a
single, language-independent vector space. For training, our approach leverages
a multilingual corpus where the same concept is covered in multiple languages
(but not necessarily via exact translations), such as Wikipedia. Our method,
Cr5 (Crosslingual reduced-rank ridge regression), starts by training a
ridge-regression-based classifier that uses language-specific bag-of-words
features in order to predict the concept that a given document is about. We
show that, when constraining the learned weight matrix to be of low rank, it
can be factored to obtain the desired mappings from language-specific
bags-of-words to language-independent embeddings. As opposed to most prior
methods, which use pretrained monolingual word vectors, postprocess them to
make them crosslingual, and finally average word vectors to obtain document
vectors, Cr5 is trained end-to-end and is thus natively crosslingual as well as
document-level. Moreover, since our algorithm uses the singular value
decomposition as its core operation, it is highly scalable. Experiments show
that our method achieves state-of-the-art performance on a crosslingual
document retrieval task. Finally, although not trained for embedding sentences
and words, it also achieves competitive performance on crosslingual sentence
and word retrieval tasks. Comment: In The Twelfth ACM International Conference on Web Search and Data
Mining (WSDM '19).
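The low-rank factoring described in this abstract can be illustrated with a minimal sketch of generic reduced-rank ridge regression. This is not the authors' Cr5 code: the function name `reduced_rank_ridge`, the toy two-language data, and the parameters `lam` and `rank` are assumptions made for illustration, and the closed form used (ridge solution projected onto the top right singular vectors of the augmented fitted values) is the textbook reduced-rank construction, which may differ in detail from the paper.

```python
import numpy as np

def reduced_rank_ridge(X, Y, lam, rank):
    """Return factors (Phi, Psi) whose product is a rank-constrained ridge solution."""
    d = X.shape[1]
    # Full-rank ridge solution: (X^T X + lam * I)^{-1} X^T Y.
    gram = X.T @ X + lam * np.eye(d)
    W_ridge = np.linalg.solve(gram, X.T @ Y)
    # Ridge equals least squares on the augmented design [X; sqrt(lam) * I].
    X_aug = np.vstack([X, np.sqrt(lam) * np.eye(d)])
    fitted = X_aug @ W_ridge
    # Keep the top right singular vectors of the augmented fitted values.
    _, _, Vt = np.linalg.svd(fitted, full_matrices=False)
    V_r = Vt[:rank].T          # shape (concepts, rank)
    Phi = W_ridge @ V_r        # shape (vocabulary, rank): bag-of-words -> embedding
    Psi = V_r.T                # shape (rank, concepts): embedding -> concept scores
    return Phi, Psi

# Hypothetical toy corpus: 3 concepts, one document per concept in each of two
# languages with vocabulary sizes 4 and 3. Features are block-diagonal so each
# language only touches its own vocabulary columns.
rng = np.random.default_rng(0)
X_a = rng.poisson(1.0, size=(3, 4)).astype(float)
X_b = rng.poisson(1.0, size=(3, 3)).astype(float)
X = np.block([[X_a, np.zeros((3, 3))], [np.zeros((3, 4)), X_b]])
Y = np.vstack([np.eye(3), np.eye(3)])  # one-hot concept labels

Phi, Psi = reduced_rank_ridge(X, Y, lam=0.1, rank=2)
emb_a = X_a @ Phi[:4]   # language-A documents in the shared 2-d space
emb_b = X_b @ Phi[4:]   # language-B documents in the same space
```

The rows of `Phi` belonging to each language act as that language's mapping from bags-of-words into the shared space, which is the role the abstract assigns to the factors of the low-rank weight matrix.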
Method for Designing Semantic Annotation of Sepsis Signs in Clinical Text
Annotated clinical text corpora are essential for machine learning studies that model and predict care processes and disease progression. However, few studies describe the necessary experimental design of the annotation guideline and annotation phases. This makes replication, reuse, and adoption challenging.
Using clinical questions about sepsis, we designed a semantic annotation guideline to capture sepsis signs from clinical text. The clinical questions aid guideline design, application, and evaluation. Our method incrementally evaluates each change in the guideline by testing the resulting annotated corpus against the clinical questions. Additionally, our method uses inter-annotator agreement to judge annotator compliance and the quality of the guideline. We show that the method, combined with controlled design increments, is simple and allows the development and measurable improvement of a purpose-built semantic annotation guideline. We believe that our approach is useful for the incremental design of semantic annotation guidelines in general.
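The abstract does not say which inter-annotator agreement measure is used. As a minimal sketch, assuming Cohen's kappa for two annotators (a common choice, but an assumption here), agreement on a set of labelled text spans could be computed as follows; the function name and the example labels are hypothetical.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical annotations of four text spans by two annotators.
annotator_1 = ["fever", "none", "fever", "none"]
annotator_2 = ["fever", "none", "none", "none"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Tracking such a score across guideline revisions is one way to make the "measurable improvement" mentioned above concrete.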
Convex Rank Tests and Semigraphoids
Convex rank tests are partitions of the symmetric group which have desirable
geometric properties. The statistical tests defined by such partitions involve
counting all permutations in the equivalence classes. Each class consists of
the linear extensions of a partially ordered set specified by data. Our methods
refine existing rank tests of non-parametric statistics, such as the sign test
and the runs test, and are useful for exploratory analysis of ordinal data. We
establish a bijection between convex rank tests and probabilistic conditional
independence structures known as semigraphoids. The subclass of submodular rank
tests is derived from faces of the cone of submodular functions, or from
Minkowski summands of the permutohedron. We enumerate all small instances of
such rank tests. Of particular interest are graphical tests, which correspond
both to graphical models and to graph associahedra.
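The counting step mentioned in this abstract (permutations that are linear extensions of a poset) can be sketched by brute force. This is only an illustration for very small cases, not the paper's method; the function name and the example poset are hypothetical.

```python
from itertools import permutations

def linear_extensions(n, relations):
    """Yield permutations of range(n) that place a before b for every (a, b)."""
    for perm in permutations(range(n)):
        position = {item: idx for idx, item in enumerate(perm)}
        if all(position[a] < position[b] for a, b in relations):
            yield perm

# Hypothetical poset on 4 items with two disjoint chains: 0 < 1 and 2 < 3.
count = sum(1 for _ in linear_extensions(4, [(0, 1), (2, 3)]))
total = 24  # 4! permutations in the symmetric group S_4
fraction = count / total
```

Here `fraction` is the share of all permutations that fall in the class defined by the poset, which is the kind of quantity a permutation-counting test would compare against. Enumeration grows factorially with `n`, so this approach is only feasible for small instances.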
A Categorical Framework for Learning Generalised Tree Automata
Automata learning is a popular technique used to automatically construct an
automaton model from queries. Much research went into devising ad hoc
adaptations of algorithms for different types of automata. The CALF project
seeks to unify these using category theory in order to ease correctness proofs
and guide the design of new algorithms. In this paper, we extend CALF to cover
learning of algebraic structures that may not have a coalgebraic presentation.
Furthermore, we provide a detailed algorithmic account of an abstract version
of the popular L* algorithm, which was missing from CALF. We instantiate the
abstract theory to a large class of Set functors, by which we recover for the
first time practical tree automata learning algorithms from an abstract
framework and at the same time obtain new algorithms to learn algebras of
quotiented polynomial functors.
Automated and Sound Synthesis of Lyapunov Functions with SMT Solvers
In this paper we employ SMT solvers to soundly synthesise Lyapunov functions
that assert the stability of a given dynamical model. The search for a Lyapunov
function is framed as the satisfiability of a second-order logical formula,
asking whether there exists a function satisfying a desired specification
(stability) for all possible initial conditions of the model. We synthesise
Lyapunov functions for linear, non-linear (polynomial), and for parametric
models. For non-linear models, the algorithm also determines a region of
validity for the Lyapunov function. We exploit an inductive framework to
synthesise Lyapunov functions, starting from parametric templates. The
inductive framework comprises two elements: a learner proposes a Lyapunov
function, and a verifier checks its validity; when the candidate is invalid, the verifier returns a
counterexample (a point in the state space) for further use by the learner.
Whilst the verifier uses the SMT solver Z3, thus ensuring the overall soundness
of the procedure, we examine two alternatives for the learner: a numerical
approach based on the optimisation tool Gurobi, and a sound approach based
again on Z3. The overall technique is evaluated over a broad set of benchmarks,
which shows that this methodology not only scales to 10-dimensional models
within reasonable computational time, but also offers a novel soundness proof
for the generated Lyapunov functions and their domains of validity.
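The second-order formula mentioned in this abstract can be written out using the standard textbook Lyapunov conditions. The notation is an assumption rather than the paper's own: the model is taken to be $\dot{x} = f(x)$ with an equilibrium at the origin, and $\mathcal{D}$ denotes the domain (region of validity) over which the conditions are required.

```latex
\exists V \;\; \forall x \in \mathcal{D} \setminus \{0\} : \quad
V(x) > 0 \;\wedge\; \dot{V}(x) = \nabla V(x) \cdot f(x) < 0,
\qquad V(0) = 0 .
```

Under this reading, a counterexample returned by the verifier for a candidate $V$ is a point $x^{*} \in \mathcal{D} \setminus \{0\}$ with $V(x^{*}) \le 0$ or $\nabla V(x^{*}) \cdot f(x^{*}) \ge 0$, which the learner can then use to rule out that candidate.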