On-Line Error Detection of Annotated Corpus Using Modular Neural Networks
Abstract. This paper proposes an on-line error detecting method for a manually annotated corpus using min-max modular (M3) neural networks. The basic idea of the method is to use guaranteed convergence of the M3 network to detect errors in learning data. To confirm the effectiveness of the method, a preliminary computer experiment was performed on a small Japanese corpus containing 217 sentences. The results show that the method can not only detect errors within a corpus, but may also discover some kinds of knowledge or rules useful for natural language processing.
MoNoise: Modeling Noise Using a Modular Normalization System
We propose MoNoise, a normalization model focused on generalizability and
efficiency that aims to be easily reusable and adaptable. Normalization is
the task of translating texts from a non-canonical domain to a more canonical
domain, in our case: from social media data to standard language. Our proposed
model is based on a modular candidate generation in which each module is
responsible for a different type of normalization action. The most important
generation modules are a spelling correction system and a word embeddings
module. Depending on the definition of the normalization task, a static lookup
list can be crucial for performance. We train a random forest classifier to
rank the candidates, which generalizes well to all different types of
normalization actions. Most features for the ranking originate from the
generation modules; besides these features, N-gram features prove to be an
important source of information. We show that MoNoise beats the
state-of-the-art on different normalization benchmarks for English and Dutch,
which all define the task of normalization slightly differently.
Comment: Source code: https://bitbucket.org/robvanderg/monois
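The modular design described in the MoNoise abstract (independent generation modules feeding one candidate ranker) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy vocabulary, the lookup list, the module names, and in particular the hand-set weights, which stand in for the trained random forest ranker and are not the MoNoise implementation. The spelling module simply uses `difflib.get_close_matches` from the Python standard library.

```python
from difflib import get_close_matches

# Hypothetical toy resources for illustration; not taken from MoNoise.
VOCAB = ["tomorrow", "you", "see", "great", "tonight"]
LOOKUP = {"u": "you", "2moro": "tomorrow"}


def lookup_module(word):
    """Static lookup list: return the known replacement, if any."""
    return [LOOKUP[word]] if word in LOOKUP else []


def spelling_module(word):
    """Spelling-correction stand-in: close matches from the vocabulary."""
    return get_close_matches(word, VOCAB, n=3, cutoff=0.7)


def identity_module(word):
    """Keeping the original word is always a candidate."""
    return [word]


MODULES = {
    "lookup": lookup_module,
    "spelling": spelling_module,
    "identity": identity_module,
}

# Hand-set weights standing in for a trained ranker (assumption: the
# abstract describes a random forest classifier, which is not used here).
WEIGHTS = {"lookup": 3.0, "spelling": 2.0, "identity": 1.0}


def generate(word):
    """Map each candidate to the set of modules that proposed it."""
    candidates = {}
    for name, module in MODULES.items():
        for cand in module(word):
            candidates.setdefault(cand, set()).add(name)
    return candidates


def normalize_word(word):
    """Pick the candidate with the highest summed module weight."""
    candidates = generate(word)
    return max(candidates, key=lambda c: sum(WEIGHTS[m] for m in candidates[c]))


def normalize(sentence):
    """Normalize a whitespace-tokenized sentence word by word."""
    return " ".join(normalize_word(w) for w in sentence.split())
```

Because each module only returns candidates and the ranker only sees which modules proposed them, a module can be added or removed without touching the rest of the sketch.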
DeepScribe: Localization and Classification of Elamite Cuneiform Signs Via Deep Learning
Twenty-five hundred years ago, the paperwork of the Achaemenid Empire was
recorded on clay tablets. In 1933, archaeologists from the University of
Chicago's Oriental Institute (OI) found tens of thousands of these tablets and
fragments during the excavation of Persepolis. Many of these tablets have been
painstakingly photographed and annotated by expert cuneiformists, and now
provide a rich dataset consisting of over 5,000 annotated tablet images and
100,000 cuneiform sign bounding boxes. We leverage this dataset to develop
DeepScribe, a modular computer vision pipeline capable of localizing cuneiform
signs and providing suggestions for the identity of each sign. We investigate
the difficulty of learning subtasks relevant to cuneiform tablet transcription
on ground-truth data, finding that a RetinaNet object detector can achieve a
localization mAP of 0.78 and a ResNet classifier can achieve a top-5 sign
classification accuracy of 0.89. The end-to-end pipeline achieves a top-5
classification accuracy of 0.80. As part of the classification module,
DeepScribe groups cuneiform signs into morphological clusters. We consider how
this automatic clustering approach differs from the organization of standard,
printed sign lists and what we may learn from it. These components, trained
individually, are sufficient to produce a system that can analyze photos of
cuneiform tablets from the Achaemenid period and provide useful transliteration
suggestions to researchers. We evaluate the model's end-to-end performance on
locating and classifying signs, providing a roadmap to a linguistically-aware
transliteration system, then consider the model's potential utility when
applied to other periods of cuneiform writing.
Comment: Currently under review in the ACM JOCC
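For readers unfamiliar with the top-5 classification accuracy reported in the DeepScribe abstract, the metric counts a prediction as correct when the true class is among the five highest-scoring classes. The following is a minimal sketch of that general definition. The function name and the plain-list input format are assumptions for illustration and are not taken from the DeepScribe code.

```python
def top_k_accuracy(scores, labels, k=5):
    """Fraction of examples whose true label is among the k highest-scoring classes.

    scores: list of per-example lists, one score per class index.
    labels: list of true class indices, one per example.
    """
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)
```

With `k=1` this reduces to ordinary accuracy, which is why a top-5 figure is always at least as high as the corresponding top-1 figure.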
Survey on Evaluation Methods for Dialogue Systems
In this paper we survey the methods and concepts developed for the evaluation
of dialogue systems. Evaluation is a crucial part of the development
process. Often, dialogue systems are evaluated by means of human evaluations
and questionnaires. However, this tends to be very cost- and time-intensive.
Thus, much work has been put into finding methods that reduce the
involvement of human labour. In this survey, we present the main concepts and
methods. For this, we differentiate between the various classes of dialogue
systems (task-oriented dialogue systems, conversational dialogue systems, and
question-answering dialogue systems). We cover each class by introducing the
main technologies developed for the dialogue systems and then by presenting the
evaluation methods for this class.
ORCA-SPOT: An Automatic Killer Whale Sound Detection Toolkit Using Deep Learning
Large bioacoustic archives of wild animals are an important source for identifying reappearing communication patterns, which can then be related to recurring behavioral patterns to advance the current understanding of intra-specific communication of non-human animals. A main challenge remains that most large-scale bioacoustic archives contain only a small percentage of animal vocalizations and a large amount of environmental noise, which makes it extremely difficult to manually retrieve sufficient vocalizations for further analysis – particularly important for species with advanced social systems and complex vocalizations. In this study, deep neural networks were trained on 11,509 killer whale (Orcinus orca) signals and 34,848 noise segments. The resulting toolkit ORCA-SPOT was tested on a large-scale bioacoustic repository – the Orchive – comprising roughly 19,000 hours of killer whale underwater recordings. An automated segmentation of the entire Orchive recordings (about 2.2 years) took approximately 8 days. It achieved a time-based precision or positive-predictive-value (PPV) of 93.2% and an area-under-the-curve (AUC) of 0.9523. This approach enables an automated annotation procedure of large bioacoustic databases to extract killer whale sounds, which are essential for subsequent identification of significant communication patterns. The code will be publicly available in October 2019 to support the application of deep learning to bioacoustic research. ORCA-SPOT can be adapted to other animal species.
Protected Health Information filter (Philter): accurately and securely de-identifying free-text clinical notes.
There is a great and growing need to ascertain the exact state of a patient, in terms of disease progression, actual care practices, pathology, adverse events, and much more, beyond the paucity of data available in structured medical record data. Ascertaining these harder-to-reach data elements is now critical for the accurate phenotyping of complex traits, detection of adverse outcomes, efficacy of off-label drug use, and longitudinal patient surveillance. Clinical notes often contain the most detailed and relevant digital information about individual patients, the nuances of their diseases, the treatment strategies selected by physicians, and the resulting outcomes. However, notes remain largely unused for research because they contain Protected Health Information (PHI), which is synonymous with individually identifying data. Previous clinical note de-identification approaches have been rigid and still too inaccurate to see any substantial real-world use, primarily because they have been trained on medical text corpora that are too small. To build a new de-identification tool, we created the largest manually annotated clinical note corpus for PHI and developed a customizable open-source de-identification software called Philter ("Protected Health Information filter"). Here we describe the design and evaluation of Philter, and show how it offers substantial real-world improvements over prior methods.
SCREEN: Learning a Flat Syntactic and Semantic Spoken Language Analysis Using Artificial Neural Networks
In this paper, we describe a so-called screening approach for learning robust
processing of spontaneously spoken language. A screening approach is a flat
analysis which uses shallow sequences of category representations for analyzing
an utterance at various syntactic, semantic and dialog levels. Rather than
using a deeply structured symbolic analysis, we use a flat connectionist
analysis. This screening approach aims at supporting speech and language
processing by using (1) data-driven learning and (2) robustness of
connectionist networks. In order to test this approach, we have developed the
SCREEN system which is based on this new robust, learned and flat analysis.
In this paper, we focus on a detailed description of SCREEN's architecture,
the flat syntactic and semantic analysis, the interaction with a speech
recognizer, and a detailed evaluation analysis of the robustness under the
influence of noisy or incomplete input. The main result of this paper is that
flat representations allow more robust processing of spontaneous spoken
language than deeply structured representations. In particular, we show how the
fault-tolerance and learning capability of connectionist networks can support a
flat analysis for providing more robust spoken-language processing within an
overall hybrid symbolic/connectionist framework.
Comment: 51 pages, Postscript. To be published in Journal of Artificial
Intelligence Research 6(1), 199
Effective Feature Representation for Clinical Text Concept Extraction
Crucial information about the practice of healthcare is recorded only in
free-form text, which creates an enormous opportunity for high-impact NLP.
However, annotated healthcare datasets tend to be small and expensive to
obtain, which raises the question of how to make maximally efficient use of
the available data. To this end, we develop an LSTM-CRF model for combining
unsupervised word representations and hand-built feature representations
derived from publicly available healthcare ontologies. We show that this
combined model yields superior performance on five datasets of diverse kinds of
healthcare text (clinical, social, scientific, commercial). Each involves the
labeling of complex, multi-word spans that pick out different healthcare
concepts. We also introduce a new labeled dataset for identifying the treatment
relations between drugs and diseases.