7,334 research outputs found
What to do about non-standard (or non-canonical) language in NLP
Real world data differs radically from the benchmark corpora we use in
natural language processing (NLP). As soon as we apply our technologies to the
real world, performance drops. The reason for this problem is obvious: NLP
models are trained on samples from a limited set of canonical varieties that
are considered standard, most prominently English newswire. However, there are
many dimensions, e.g., socio-demographics, language, genre, and sentence type,
on which texts can differ from the standard. The solution is not obvious: we
cannot control for all factors, and it is not clear how to best go beyond the
current practice of training on homogeneous data from a single domain and
language.
In this paper, I review the notion of canonicity, and how it shapes our
community's approach to language. I argue for leveraging what I call fortuitous
data, i.e., non-obvious data that has hitherto been neglected, hidden in plain
sight, or left raw and in need of refinement. If we embrace the variety of this
heterogeneous data by combining it with proper algorithms, we will not only
produce more robust models, but will also enable adaptive language technology
capable of addressing natural language variation.
Comment: KONVENS 201
Semi-automatic annotation process for procedural texts: An application on cooking recipes
Taaable is a case-based reasoning system that adapts cooking recipes to user
constraints. Within it, the preparation part of recipes is formalised as a
graph. This graph is a semantic representation of the sequence of instructions
composing the cooking process and is used to compute the procedure adaptation,
jointly with the textual adaptation. It is composed of cooking actions and
ingredients, among others, represented as vertices, and of semantic relations
between them, represented as arcs; it is built automatically using natural
language processing. The result of the automatic annotation process is often a
disconnected graph, reflecting an incomplete annotation, and may contain
errors. A validation and correction step is therefore required. In this paper,
we present an existing graphical tool named Kcatos, designed for representing
and editing decision trees, and show how it has been adapted and integrated into
WikiTaaable, the semantic wiki in which the knowledge used by Taaable is
stored. This interface provides the wiki users with a way to correct the case
representation of the cooking process, improving at the same time the quality
of the knowledge about cooking procedures stored in WikiTaaable.
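To make the graph representation concrete, here is a minimal sketch of such a procedure graph in Python using networkx. The node names, relation labels, and connectivity check are illustrative assumptions, not Taaable's actual annotation vocabulary or implementation.

```python
# Minimal sketch of a recipe procedure graph: actions and ingredients as
# typed vertices, semantic relations as labelled arcs. All labels here are
# illustrative, not Taaable's actual annotation vocabulary.
import networkx as nx

g = nx.DiGraph()

# Vertices: cooking actions and ingredients, tagged with their kind.
g.add_node("beat", kind="action")
g.add_node("mix", kind="action")
g.add_node("egg", kind="ingredient")
g.add_node("flour", kind="ingredient")

# Arcs: semantic relations between instructions and their arguments.
g.add_edge("beat", "egg", relation="patient")    # "beat the eggs"
g.add_edge("mix", "flour", relation="patient")   # "mix in the flour"
g.add_edge("beat", "mix", relation="precedes")   # ordering of the two steps

# An incomplete automatic annotation surfaces as a disconnected graph,
# which is what the validation and correction step must repair.
print(nx.is_weakly_connected(g))  # False would flag a gap to fix
```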
Implementing a Portable Clinical NLP System with a Common Data Model - a Lisp Perspective
This paper presents a Lisp architecture for a portable NLP system, termed
LAPNLP, for processing clinical notes. LAPNLP integrates multiple standard,
customized and in-house developed NLP tools. Our system facilitates portability
across different institutions and data systems by incorporating an enriched
Common Data Model (CDM) to standardize necessary data elements. It utilizes
UMLS to perform domain adaptation when integrating general-domain NLP tools. It
also features stand-off annotations that are specified by positional reference
to the original document. We built an interval-tree-based search engine to
efficiently query and retrieve the stand-off annotations by specifying
positional requirements. We also developed a utility to convert an inline
annotation format to stand-off annotations to enable the reuse of clinical text
datasets with inline annotations. We experimented with our system on several
NLP-facilitated tasks, including computational phenotyping for lymphoma patients
and semantic relation extraction for clinical notes. These experiments
showcased the broader applicability and utility of LAPNLP.
Comment: 6 pages, accepted by IEEE BIBM 2018 as a regular paper
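The stand-off scheme described above keeps the source text untouched and stores annotations as character offsets into it, with an interval tree answering positional queries. Below is a Python sketch of that idea using the intervaltree package; LAPNLP itself is implemented in Lisp, so the text, labels, and query range here are illustrative assumptions only.

```python
# Sketch of stand-off annotation lookup: annotations are (start, end, label)
# offsets into the untouched source text; an interval tree retrieves whatever
# overlaps a positional query. Labels and spans are illustrative.
from intervaltree import IntervalTree

text = "Patient diagnosed with diffuse large B-cell lymphoma in 2017."

tree = IntervalTree()
for span, label in [("diffuse large B-cell lymphoma", "DIAGNOSIS"),
                    ("2017", "DATE")]:
    start = text.index(span)            # stand-off: offsets, not inline markup
    tree[start:start + len(span)] = label

# Positional query: retrieve every annotation overlapping characters 40-60.
for iv in sorted(tree.overlap(40, 60)):
    print(iv.data, repr(text[iv.begin:iv.end]))
```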
A Legal Perspective on Training Models for Natural Language Processing
A significant concern in processing natural language data is the often unclear legal status of the input and output data/resources. In this paper, we investigate this problem by discussing a typical activity in Natural Language Processing: the training of a machine learning
model from an annotated corpus. We examine which legal rules apply at relevant steps and how they affect the legal status of the results, especially in terms of copyright and copyright-related rights.
Automatic case acquisition from texts for process-oriented case-based reasoning
This paper introduces a method for the automatic acquisition of a rich case
representation from free text for process-oriented case-based reasoning. Case
engineering is among the most complicated and costly tasks in implementing a
case-based reasoning system. This is especially so for process-oriented
case-based reasoning, where more expressive case representations are generally
used and, in our opinion, actually required for satisfactory case adaptation.
In this context, the ability to acquire cases automatically from procedural
texts is a major step toward reasoning on processes. We therefore detail a
methodology that makes case acquisition possible from processes described as
free text, with special attention given to assembly instruction texts.
This methodology extends the techniques we used to extract actions from cooking
recipes. We argue that techniques taken from natural language processing are
required for this task, and that they give satisfactory results. An evaluation
based on our implemented prototype extracting workflows from recipe texts is
provided.
Comment: In press, publication expected in 201
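As a toy illustration of this kind of action extraction from procedural text (the paper's own pipeline is richer), here is a Python sketch assuming spaCy with the en_core_web_sm model installed. Imperative instructions typically have a root verb whose direct object is the affected ingredient or part.

```python
# Toy action extractor for procedural text: take each sentence's root verb
# as the action and its direct objects as the things acted upon. This is an
# illustrative simplification, not the paper's actual methodology.
import spacy

nlp = spacy.load("en_core_web_sm")
instructions = "Beat the eggs. Chop the onion. Pour the batter into the pan."

workflow = []
for sent in nlp(instructions).sents:
    action = sent.root                 # the main (imperative) verb
    patients = [t.text for t in action.children if t.dep_ == "dobj"]
    workflow.append((action.lemma_, patients))

print(workflow)
# roughly: [('beat', ['eggs']), ('chop', ['onion']), ('pour', ['batter'])]
```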
Holaaa!! Writin like u talk is kewl but kinda hard 4 NLP
We present work in progress aiming to build tools for the normalization of User-Generated Content (UGC). As we will see, the task requires revisiting the initial steps of NLP processing, since UGC (micro-blog, blog, and, generally, Web 2.0 user texts) presents a number of non-standard communicative and linguistic characteristics, and is in fact much closer to oral and colloquial language than to edited text. We present and characterize a corpus of UGC text in Spanish from three different sources: Twitter, consumer reviews, and blogs. We motivate the need for UGC text normalization by analyzing the problems found when processing this type of text through a conventional language processing pipeline, particularly in the tasks of lemmatization and morphosyntactic tagging, and finally we propose a strategy for automatically normalizing UGC using a selector of correct forms on top of a pre-existing spell-checker.
Postprint (published version)
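As a rough, self-contained sketch of the proposed strategy, the Python snippet below pairs a spell-checker-style candidate generator with a selector over the candidates. The tiny lexicon, its frequency counts, and the unigram selector are illustrative stand-ins; the paper's actual selector operates over a pre-existing spell-checker's output with richer, context-aware evidence.

```python
# Self-contained sketch: generate in-lexicon candidates within one edit of a
# non-standard token, then select the most plausible normalization. Lexicon,
# frequencies, and alphabet are toy stand-ins for the paper's real resources.
from itertools import chain

LEXICON = {"hola": 120, "que": 300, "bueno": 80, "guapo": 15}  # toy counts
ALPHABET = "abcdefghijklmnopqrstuvwxyzáéíóúñ"

def edits1(word):
    """All strings one edit away: deletes, transposes, replaces, inserts."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = (l + r[1:] for l, r in splits if r)
    transposes = (l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1)
    replaces = (l + c + r[1:] for l, r in splits if r for c in ALPHABET)
    inserts = (l + c + r for l, r in splits for c in ALPHABET)
    return set(chain(deletes, transposes, replaces, inserts))

def normalize(token):
    """Keep known tokens; otherwise pick the best in-lexicon candidate."""
    if token in LEXICON:
        return token
    candidates = edits1(token) & LEXICON.keys()
    # Selector: plain unigram frequency here; the paper's selector chooses
    # among spell-checker candidates using context, which this omits.
    return max(candidates, key=LEXICON.get, default=token)

print(normalize("holaa"))   # -> "hola"
print(normalize("buenoo"))  # -> "bueno"
```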