FASTSUBS: An Efficient and Exact Procedure for Finding the Most Likely Lexical Substitutes Based on an N-gram Language Model
Lexical substitutes have found use in areas such as paraphrasing, text simplification, machine translation, word sense disambiguation, and part-of-speech induction. However, the computational complexity of accurately identifying the most likely substitutes for a word has made large-scale experiments difficult. In this paper I introduce a new search algorithm, FASTSUBS, that is guaranteed to find the K most likely lexical substitutes for a given word in a sentence based on an n-gram language model. The computation is sub-linear in both K and the vocabulary size V. An implementation of the algorithm and a dataset with the top 100 substitutes of each token in the WSJ section of the Penn Treebank are available at http://goo.gl/jzKH0.

Comment: 4 pages, 1 figure, to appear in IEEE Signal Processing Letters
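The paper's contribution is avoiding a brute-force scan over the vocabulary. As a point of reference, here is a minimal sketch of that naive O(V) baseline, assuming a hypothetical lm_logprob function that scores a token sequence under an n-gram model; this sketch is not the FASTSUBS algorithm itself.

```python
import heapq

def naive_top_k_substitutes(tokens, position, vocab, lm_logprob, k=100):
    """Exhaustively score every vocabulary word as a substitute and keep
    the k best. FASTSUBS avoids this O(V) scan; the baseline is shown
    only to illustrate the problem the paper solves."""
    scored = []
    for word in vocab:
        candidate = tokens[:position] + [word] + tokens[position + 1:]
        # lm_logprob is a hypothetical n-gram LM scoring function.
        scored.append((lm_logprob(candidate), word))
    return heapq.nlargest(k, scored)
```

Because this baseline touches every word in V for every query, sub-linearity in both K and V is exactly what makes large-scale substitute extraction (e.g., the top 100 for every WSJ token) practical.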
Simplification-induced transformations: typology and some characteristics
The purpose of automatic text simplification is to transform technical or difficult-to-understand texts into a more accessible version. The semantics must be preserved during this transformation. Automatic text simplification can be done at different levels (lexical, syntactic, semantic, stylistic...) and relies on the corresponding knowledge and resources (lexicons, rules...). Our objective is to propose methods and material for the creation of transformation rules from a small set of parallel sentences differentiated by their degree of technicality. We also propose a typology of these transformations and quantify them. We work with French-language data related to the medical domain, although we assume that the method can be exploited on texts in any language and from any domain.
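To make the notion of a lexical-level transformation rule concrete, here is a minimal sketch in which rules map technical terms to simpler equivalents. The rule format, the example entries, and the matching strategy are illustrative assumptions, not the format the authors derive.

```python
# Illustrative lexical simplification rules: technical term -> simpler
# equivalent. These entries are examples, not the paper's rule set.
SIMPLIFICATION_RULES = {
    "myocardial infarction": "heart attack",
    "hypertension": "high blood pressure",
}

def apply_lexical_rules(text, rules=SIMPLIFICATION_RULES):
    """Apply each substitution rule; longest patterns first so that
    multi-word terms are replaced before any of their sub-parts."""
    for pattern in sorted(rules, key=len, reverse=True):
        text = text.replace(pattern, rules[pattern])
    return text

print(apply_lexical_rules("Patients with hypertension risk myocardial infarction."))
```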
Effects of lexical properties on viewing time per word in autistic and neurotypical readers
Eye tracking studies from the past few decades have shaped the way we think of word complexity and cognitive load: words that are long, rare, and ambiguous are more difficult to read. However, online processing techniques have scarcely been applied to investigating the reading difficulties of people with autism and what vocabulary is challenging for them. We present parallel gaze data obtained from adult readers with autism and a control group of neurotypical readers and show that the former required higher cognitive effort to comprehend the texts, as evidenced by three gaze-based measures. We divide all words into four classes based on their viewing times for both groups and investigate the relationship between longer viewing times and word length, word frequency, and four cognitively-based measures (word concreteness, familiarity, age of acquisition, and imageability).

Funding: University of Wolverhampton and German Research Foundation (DFG)
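As an illustration of what a four-way viewing-time split could look like, here is a quartile-based binning sketch. The abstract does not specify the binning procedure, so the quartile split, the function name, and the data layout are all assumptions.

```python
import numpy as np

def viewing_time_classes(words, viewing_times_ms, n_classes=4):
    """Assign each word to one of four classes by viewing-time quartile
    (0 = fastest .. 3 = slowest). The quartile split is an illustrative
    assumption; the study's exact binning is not given in the abstract."""
    times = np.asarray(viewing_times_ms, dtype=float)
    bounds = np.quantile(times, [0.25, 0.5, 0.75])
    classes = np.searchsorted(bounds, times)
    return dict(zip(words, classes))
```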
Evaluating prose style transfer with the Bible
In the prose style transfer task, a system provided with input text and a target prose style produces output that preserves the meaning of the input text but alters the style. These systems require parallel data for evaluating results and usually make use of parallel data for training. Currently, there are few publicly available corpora for this task. In this work, we identify a high-quality source of aligned, stylistically distinct text: different versions of the Bible. We provide a standardized split of the public domain versions in our corpus into training, development, and testing data. This corpus is highly parallel since many Bible versions are included, and sentences are aligned because chapter and verse numbers appear in all versions of the text. In addition to the corpus, we present the results, as measured by the BLEU and PINC metrics, of several models trained on our data, which can serve as baselines for future research. While we present these data as a style transfer corpus, we believe they are of unmatched quality and may be useful for other natural language tasks as well.
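Because every version shares the same book, chapter, and verse numbering, building the parallel corpus amounts to joining verses on that key. A minimal sketch, assuming each version is a mapping from (book, chapter, verse) to text; this layout is an assumption for illustration, not the authors' released format.

```python
def align_versions(versions):
    """Given {version_name: {(book, chapter, verse): text}}, return
    parallel tuples for every verse ID present in all versions."""
    common_ids = set.intersection(*(set(v) for v in versions.values()))
    names = sorted(versions)
    return [
        tuple(versions[name][vid] for name in names)
        for vid in sorted(common_ids)
    ]

# Toy usage with invented verse text, keyed by (book, chapter, verse):
kjv = {("John", 3, 16): "For God so loved the world..."}
bbe = {("John", 3, 16): "For God had such love for the world..."}
pairs = align_versions({"BBE": bbe, "KJV": kjv})
```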
An Efficient Implementation of the Head-Corner Parser
This paper describes an efficient and robust implementation of a
bi-directional, head-driven parser for constraint-based grammars. This parser
is developed for the OVIS system: a Dutch spoken dialogue system in which
information about public transport can be obtained by telephone.
After a review of the motivation for head-driven parsing strategies, and
head-corner parsing in particular, a non-deterministic version of the
head-corner parser is presented. A memoization technique is applied to obtain a
fast parser. A goal-weakening technique is introduced which greatly improves
average case efficiency, both in terms of speed and space requirements.
I argue in favor of such a memoization strategy with goal-weakening over ordinary chart parsers, because it can be applied selectively and therefore enormously reduces the space requirements of the parser, while no practical loss in time efficiency is observed. On the
contrary, experiments are described in which head-corner and left-corner
parsers implemented with selective memoization and goal weakening outperform
`standard' chart parsers. The experiments include the grammar of the OVIS
system and the Alvey NL Tools grammar.
Head-corner parsing is a mix of bottom-up and top-down processing. Certain
approaches towards robust parsing require purely bottom-up processing.
Therefore, it seems that head-corner parsing is unsuitable for such robust
parsing techniques. However, it is shown how underspecification (which arises
very naturally in a logic programming environment) can be used in the
head-corner parser to allow such robust parsing techniques. A particular robust
parsing model is described which is implemented in OVIS.

Comment: 31 pages, uses cl.sty
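To see what selective memoization of parser goals looks like, here is a toy recognizer that caches results only for a chosen set of categories. The grammar, names, and top-down control structure are all illustrative; this shows the space-saving idea of memoizing selectively, not the actual head-corner algorithm, which starts from the head daughter and works outward.

```python
# Only categories listed here are memoized; everything else is
# recomputed on demand, trading time for table space selectively.
MEMOIZED_CATEGORIES = {"np", "vp"}
_table = {}

def parse(category, start, end, grammar, words):
    key = (category, start, end)
    if category in MEMOIZED_CATEGORIES and key in _table:
        return _table[key]
    result = _derive(category, start, end, grammar, words)
    if category in MEMOIZED_CATEGORIES:
        _table[key] = result
    return result

def _derive(category, start, end, grammar, words):
    """Toy top-down recognizer over binary rules and a lexicon."""
    if end - start == 1 and (category, words[start]) in grammar["lexicon"]:
        return True
    for lhs, (left, right) in grammar["rules"]:
        if lhs == category:
            for mid in range(start + 1, end):
                if parse(left, start, mid, grammar, words) and \
                   parse(right, mid, end, grammar, words):
                    return True
    return False

grammar = {
    "lexicon": {("det", "the"), ("n", "dog"), ("vp", "barks")},
    "rules": [("s", ("np", "vp")), ("np", ("det", "n"))],
}
print(parse("s", 0, 3, grammar, ["the", "dog", "barks"]))  # True
```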
Into Intelligible Pronunciation Features of Thai English in English as a Lingua Franca Context
Regardless of whether or not Thai English, also known as ‘Tinglish’, has acquired the status of a ‘new variety of English’, it is undoubted that a ‘Thai English accent’ exists among Thai people and involves unique Thai English phonological properties. This research paper examined pronunciation features of Thai English collected from 30 students at a private university in Thailand and compared the Thai English phonological properties to those in the Lingua Franca Core (LFC) proposed by Jenkins (2000). Participants were required to perform different tasks in which they could use English naturally. The findings show that Thai English speakers typically operate with a smaller set of consonants (ThE: 17, RP: 24). In particular, there is no voicing contrast in fricative sounds, while most of the other consonant phonemes remain. In addition, most of the English vowels were replaced with Thai regional qualities, and similar sets of vowels were observed (ThE: 19, RP: 20). After comparison with the LFC, six features were identified as problematic and could lead to intelligibility failure: 1) consonant substitution, 2) final consonant devoicing, 3) deletion and substitution of [ɬ], 4) conflation of /l/ and /r/, 5) initial cluster simplification, and 6) non-tonic stress. On the other hand, six other Thai English features were considered intelligible in ELF: non-rhotic pronunciation, vowel substitution, monophthongization, syllable-timed stress, non-intonation patterns, and tone transfer. The Thai English pronunciation core from this research could be especially useful in English pronunciation teaching in Thailand, where learners can comfortably accommodate their English to achieve successful communication in international contexts.
Automatic case acquisition from texts for process-oriented case-based reasoning
This paper introduces a method for the automatic acquisition of a rich case representation from free text for process-oriented case-based reasoning. Case engineering is among the most complicated and costly tasks in implementing a case-based reasoning system. This is especially so for process-oriented case-based reasoning, where more expressive case representations are generally used and, in our opinion, actually required for satisfactory case adaptation. In this context, the ability to acquire cases automatically from procedural texts is a major step toward reasoning about processes. We therefore detail a methodology that makes case acquisition from processes described as free text possible, with special attention given to assembly instruction texts. This methodology extends the techniques we used to extract actions from cooking recipes. We argue that techniques taken from natural language processing are required for this task, and that they give satisfactory results. An evaluation based on our implemented prototype, which extracts workflows from recipe texts, is provided.

Comment: In press, publication expected in 201
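As a rough illustration of the target representation, here is a minimal sketch that turns imperative instruction text into a sequence of action records. The first-word-as-verb heuristic is a crude stand-in assumption for the NLP techniques the paper employs; only the shape of the output (an ordered workflow of actions with arguments) reflects the abstract.

```python
import re

def extract_workflow(instructions):
    """Split instruction text into steps and record each step as an
    {action, arguments} pair. Treating the first token of a step as
    the action verb is an illustrative heuristic, not the paper's
    extraction method."""
    steps = re.split(r"(?<=[.!;])\s+", instructions.strip())
    workflow = []
    for step in steps:
        tokens = step.rstrip(".!;").split()
        if not tokens:
            continue
        workflow.append({"action": tokens[0].lower(),
                         "arguments": " ".join(tokens[1:])})
    return workflow

recipe = "Preheat the oven to 180C. Mix flour and sugar. Bake for 30 minutes."
for act in extract_workflow(recipe):
    print(act)
```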