20,057 research outputs found
TEI and LMF crosswalks
The present paper explores various arguments in favour of making the Text
Encoding Initia-tive (TEI) guidelines an appropriate serialisation for ISO
standard 24613:2008 (LMF, Lexi-cal Mark-up Framework) . It also identifies the
issues that would have to be resolved in order to reach an appropriate
implementation of these ideas, in particular in terms of infor-mational
coverage. We show how the customisation facilities offered by the TEI
guidelines can provide an adequate background, not only to cover missing
components within the current Dictionary chapter of the TEI guidelines, but
also to allow specific lexical projects to deal with local constraints. We
expect this proposal to be a basis for a future ISO project in the context of
the on going revision of LMF
Stabilizing knowledge through standards - A perspective for the humanities
It is usual to consider that standards generate mixed feelings among
scientists. They are often seen as not really reflecting the state of the art
in a given domain and a hindrance to scientific creativity. Still, scientists
should theoretically be at the best place to bring their expertise into
standard developments, being even more neutral on issues that may typically be
related to competing industrial interests. Even if it could be thought of as
even more complex to think about developping standards in the humanities, we
will show how this can be made feasible through the experience gained both
within the Text Encoding Initiative consortium and the International
Organisation for Standardisation. By taking the specific case of lexical
resources, we will try to show how this brings about new ideas for designing
future research infrastructures in the human and social sciences
A Lightweight Regression Method to Infer Psycholinguistic Properties for Brazilian Portuguese
Psycholinguistic properties of words have been used in various approaches to
Natural Language Processing tasks, such as text simplification and readability
assessment. Most of these properties are subjective, involving costly and
time-consuming surveys to be gathered. Recent approaches use the limited
datasets of psycholinguistic properties to extend them automatically to large
lexicons. However, some of the resources used by such approaches are not
available to most languages. This study presents a method to infer
psycholinguistic properties for Brazilian Portuguese (BP) using regressors
built with a light set of features usually available for less resourced
languages: word length, frequency lists, lexical databases composed of school
dictionaries and word embedding models. The correlations between the properties
inferred are close to those obtained by related works. The resulting resource
contains 26,874 words in BP annotated with concreteness, age of acquisition,
imageability and subjective frequency.Comment: Paper accepted for TSD201
A Unified Multilingual Handwriting Recognition System using multigrams sub-lexical units
We address the design of a unified multilingual system for handwriting
recognition. Most of multi- lingual systems rests on specialized models that
are trained on a single language and one of them is selected at test time.
While some recognition systems are based on a unified optical model, dealing
with a unified language model remains a major issue, as traditional language
models are generally trained on corpora composed of large word lexicons per
language. Here, we bring a solution by con- sidering language models based on
sub-lexical units, called multigrams. Dealing with multigrams strongly reduces
the lexicon size and thus decreases the language model complexity. This makes
pos- sible the design of an end-to-end unified multilingual recognition system
where both a single optical model and a single language model are trained on
all the languages. We discuss the impact of the language unification on each
model and show that our system reaches state-of-the-art methods perfor- mance
with a strong reduction of the complexity.Comment: preprin
- …