Knowledge Base Population using Semantic Label Propagation
A crucial aspect of a knowledge base population system that extracts new
facts from text corpora is the generation of training data for its relation
extractors. In this paper, we present a method that maximizes the effectiveness
of newly trained relation extractors at a minimal annotation cost. Manual
labeling can be significantly reduced by Distant Supervision, which is a method
to construct training data automatically by aligning a large text corpus with
an existing knowledge base of known facts. For example, all sentences
mentioning both 'Barack Obama' and 'US' may serve as positive training
instances for the relation born_in(subject,object). However, distant
supervision typically results in a highly noisy training set: many training
sentences do not really express the intended relation. We propose to combine
distant supervision with minimal manual supervision in a technique called
feature labeling to eliminate noise from the large and noisy initial training
set, resulting in a significant increase in precision. We further improve on
this approach by introducing the Semantic Label Propagation method, which uses
the similarity between low-dimensional representations of candidate training
instances to extend the training set in order to increase recall while
maintaining high precision. Our proposed strategy for generating training data
is studied and evaluated on an established test collection designed for
knowledge base population tasks. The experimental results show that the
Semantic Label Propagation strategy leads to substantial performance gains when
compared to existing approaches, while requiring an almost negligible manual
annotation effort.
Comment: Submitted to Knowledge Based Systems, special issue on Knowledge
Bases for Natural Language Processing
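As a rough illustration of the two ideas in the abstract above, the following Python sketch first aligns sentences with a toy knowledge base (distant supervision) and then extends a small trusted set by vector similarity. It is not the authors' implementation: the example sentences, the substring matching, the two-dimensional vectors, the cosine measure, and the 0.9 threshold are all assumptions made for illustration, and the feature labeling step is not shown.

```python
from math import sqrt

# Toy knowledge base of known facts: (subject, object) -> relation.
KB = {("Barack Obama", "US"): "born_in"}

# Hypothetical corpus; only the first sentence truly expresses born_in.
SENTENCES = [
    "Barack Obama was born in Hawaii, US.",
    "Barack Obama returned to the US after a state visit.",
    "Angela Merkel visited Paris.",
]

def distant_labels(sentences, kb):
    """Label every sentence that mentions both entities of a known fact."""
    labeled = []
    for sent in sentences:
        for (subj, obj), rel in kb.items():
            # Naive substring match, an assumption made for brevity.
            if subj in sent and obj in sent:
                labeled.append((sent, rel))
    return labeled

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def propagate(trusted_vectors, candidates, threshold=0.9):
    """Return names of candidates that lie close to any trusted instance."""
    return [
        name
        for name, vec in candidates.items()
        if any(cosine(vec, t) >= threshold for t in trusted_vectors)
    ]

noisy = distant_labels(SENTENCES, KB)
# Both Obama sentences are labeled born_in, although the second is noise.

# Made-up low-dimensional representations of candidate instances.
trusted = [(1.0, 0.0)]
candidates = {"a": (0.9, 0.1), "b": (0.0, 1.0)}
extended = propagate(trusted, candidates)
```

Here `noisy` contains two sentences labeled `born_in`, one of which is a false positive, which is the noise problem the abstract describes. The similarity step then admits only candidate `a`, whose vector points in nearly the same direction as the trusted instance.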
What to do about non-standard (or non-canonical) language in NLP
Real world data differs radically from the benchmark corpora we use in
natural language processing (NLP). As soon as we apply our technologies to the
real world, performance drops. The reason for this problem is obvious: NLP
models are trained on samples from a limited set of canonical varieties that
are considered standard, most prominently English newswire. However, there are
many dimensions, e.g., socio-demographics, language, genre, sentence type, etc.,
on which texts can differ from the standard. The solution is not obvious: we
cannot control for all factors, and it is not clear how to best go beyond the
current practice of training on homogeneous data from a single domain and
language.
In this paper, I review the notion of canonicity, and how it shapes our
community's approach to language. I argue for leveraging what I call fortuitous
data, i.e., non-obvious data that is hitherto neglected, hidden in plain sight,
or raw data that needs to be refined. If we embrace the variety of this
heterogeneous data by combining it with proper algorithms, we will not only
produce more robust models, but will also enable adaptive language technology
capable of addressing natural language variation.
Comment: KONVENS 201
Deep Learning for Audio Signal Processing
Given the recent surge in developments of deep learning, this article
provides a review of the state-of-the-art deep learning techniques for audio
signal processing. Speech, music, and environmental sound processing are
considered side-by-side, in order to point out similarities and differences
between the domains, highlighting general methods, problems, key references,
and potential for cross-fertilization between areas. The dominant feature
representations (in particular, log-mel spectra and raw waveform) and deep
learning models are reviewed, including convolutional neural networks, variants
of the long short-term memory architecture, as well as more audio-specific
neural network models. Subsequently, prominent deep learning application areas
are covered, i.e. audio recognition (automatic speech recognition, music
information retrieval, environmental sound detection, localization and
tracking) and synthesis and transformation (source separation, audio
enhancement, generative models for speech, sound, and music synthesis).
Finally, key issues and future questions regarding deep learning applied to
audio signal processing are identified.
Comment: 15 pages, 2 pdf figures