3,665 research outputs found
The Unsupervised Acquisition of a Lexicon from Continuous Speech
We present an unsupervised learning algorithm that acquires a
natural-language lexicon from raw speech. The algorithm is based on the optimal
encoding of symbol sequences in an MDL framework, and uses a hierarchical
representation of language that overcomes many of the problems that have
stymied previous grammar-induction procedures. The forward mapping from symbol
sequences to the speech stream is modeled using features based on articulatory
gestures. We present results on the acquisition of lexicons and language models
from raw speech, text, and phonetic transcripts, and demonstrate that our
algorithm compares very favorably to other reported results with respect to
segmentation performance and statistical efficiency.
Comment: 27 page technical repor
Frequency vs. Association for Constraint Selection in Usage-Based Construction Grammar
A usage-based Construction Grammar (CxG) posits that slot-constraints
generalize from common exemplar constructions. But what is the best model of
constraint generalization? This paper evaluates competing frequency-based and
association-based models across eight languages using a metric derived from the
Minimum Description Length paradigm. The experiments show that
association-based models produce better generalizations across all languages by
a significant margin.
Technical Report: CSVM Ecosystem
The CSVM format is derived from the CSV format and allows the storage of
tabular-like data with a limited but extensible amount of metadata. This approach could
help computer scientists because all information needed to subsequently use
the data is included in the CSVM file, and it is particularly well suited for
handling RAW data in many scientific fields and for use as a canonical
format. The use of CSVM has shown that it greatly facilitates: the data
management independently of using databases; the data exchange; the integration
of RAW data in dataflows or calculation pipes; the search for best practices in
RAW data management. The efficiency of this format is closely related to its
plasticity: a generic frame is given for all kinds of data, and the CSVM parsers
do not interpret data types. This task is done by the
application layer, so it is possible to use the same format and the same parser code
for many purposes. In this document, some implementations of the CSVM format, used over
ten years in different laboratories, are presented. Some programming
examples are also shown: a Python toolkit for using, manipulating,
and querying the format is available. A first specification of this format (CSVM-1) is now
defined, as well as some derivatives such as CSVM dictionaries used for data
interchange. CSVM is an Open Format and could be used as a support for Open
Data and long-term conservation of RAW or unpublished data.
Comment: 31 pages including 2p of Anne
Image and interpretation using artificial intelligence to read ancient Roman texts
The ink and stylus tablets discovered at the Roman Fort of Vindolanda are a unique resource for scholars of ancient history. However, the stylus tablets have proved particularly difficult to read. This paper describes a system that assists expert papyrologists in the interpretation of the Vindolanda writing tablets. A model-based approach is taken that relies on models of the written form of characters, and statistical modelling of language, to produce plausible interpretations of the documents. Fusion of the contributions from the language, character, and image feature models is achieved by utilizing the GRAVA agent architecture, which uses Minimum Description Length as the basis for information fusion across semantic levels. A system is developed that reads in image data and outputs plausible interpretations of the Vindolanda tablets.