Optimal Hyperparameters for Deep LSTM-Networks for Sequence Labeling Tasks
Selecting optimal parameters for a neural network architecture can often make
the difference between mediocre and state-of-the-art performance. However,
little is published about which parameters and design choices should be evaluated or
selected, making correct hyperparameter optimization often a "black art that
requires expert experiences" (Snoek et al., 2012). In this paper, we evaluate
the importance of different network design choices and hyperparameters for five
common linguistic sequence tagging tasks (POS, Chunking, NER, Entity
Recognition, and Event Detection). We evaluated over 50,000 different setups
and found that some parameters, like the pre-trained word embeddings or the
last layer of the network, have a large impact on the performance, while other
parameters, for example the number of LSTM layers or the number of recurrent
units, are of minor importance. We give a recommendation on a configuration
that performs well across different tasks.
Comment: 34 pages. A 9-page version of this paper was published at EMNLP 2017.
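For illustration only, the following sketch shows the kind of BiLSTM sequence tagger whose design choices the paper compares; the layer sizes, dropout value, and the plain softmax output layer are assumptions picked for brevity, not the recommended configuration.

```python
# Minimal BiLSTM sequence tagger sketch (PyTorch). Layer sizes, dropout, and
# the softmax output layer are illustrative assumptions only.
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, num_tags, emb_dim=100,
                 hidden_dim=100, num_layers=2, dropout=0.25):
        super().__init__()
        # In the paper's setting, embeddings would be initialized from
        # pre-trained word vectors (one of the most impactful choices).
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=num_layers,
                            bidirectional=True, batch_first=True,
                            dropout=dropout if num_layers > 1 else 0.0)
        self.dropout = nn.Dropout(dropout)
        # The output layer (softmax here; a CRF is the common alternative)
        # is another design choice the paper identifies as important.
        self.out = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) -> per-token tag scores
        embedded = self.dropout(self.embedding(token_ids))
        lstm_out, _ = self.lstm(embedded)
        return self.out(self.dropout(lstm_out))

# Example usage with toy dimensions.
tagger = BiLSTMTagger(vocab_size=10000, num_tags=9)
scores = tagger(torch.randint(0, 10000, (4, 20)))  # (4, 20, 9)
```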
Alternative Weighting Schemes for ELMo Embeddings
ELMo embeddings (Peters et al., 2018) had a huge impact on the NLP community
and many recent publications use these embeddings to boost the performance of
downstream NLP tasks. However, integrating ELMo embeddings into existing NLP
architectures is not straightforward. In contrast to traditional word
embeddings, like GloVe or word2vec embeddings, the bi-directional language
model of ELMo produces three 1024 dimensional vectors per token in a sentence.
Peters et al. proposed to learn a task-specific weighting of these three
vectors for downstream tasks. However, this proposed weighting scheme is not
feasible for certain tasks, and, as we will show, it does not necessarily yield
optimal performance. We evaluate different methods that combine the three
vectors from the language model in order to achieve the best possible
performance in downstream NLP tasks. We notice that the third layer of the
published language model often decreases the performance. By learning a
weighted average of only the first two layers, we are able to improve the
performance for many datasets. Due to the reduced complexity of the language
model, we have a training speed-up of 19-44% for the downstream task.
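To make the weighting idea concrete, here is a minimal sketch of a learnable, task-specific weighted average over ELMo-style layer outputs; restricting the mixture to the first two layers mirrors the finding above, while the module and parameter names are illustrative assumptions.

```python
# Sketch of a learnable, task-specific weighted average over ELMo-style
# layer outputs. Mixing only the first two of the three layers (as the
# paper finds helpful for many datasets) is controlled by `num_layers`.
import torch
import torch.nn as nn

class ScalarMixture(nn.Module):
    def __init__(self, num_layers=2, dim=1024):
        super().__init__()
        # One scalar weight per layer, softmax-normalized, plus a global scale.
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))
        self.gamma = nn.Parameter(torch.ones(1))

    def forward(self, layer_outputs):
        # layer_outputs: (num_available_layers, batch, seq_len, dim);
        # only the first len(self.layer_logits) layers are mixed.
        weights = torch.softmax(self.layer_logits, dim=0)
        mixed = sum(w * layer_outputs[i] for i, w in enumerate(weights))
        return self.gamma * mixed

# Example: mix the first two of three 1024-dimensional layer outputs.
layers = torch.randn(3, 8, 15, 1024)
mix = ScalarMixture(num_layers=2)
combined = mix(layers)  # (8, 15, 1024)
```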
Modeling Semantics with Gated Graph Neural Networks for Knowledge Base Question Answering
Most approaches to Knowledge Base Question Answering are based on
semantic parsing. In this paper, we address the problem of learning vector
representations for complex semantic parses that consist of multiple entities
and relations. Previous work largely focused on selecting the correct semantic
relations for a question and disregarded the structure of the semantic parse:
the connections between entities and the directions of the relations. We
propose to use Gated Graph Neural Networks to encode the graph structure of the
semantic parse. We show on two data sets that the graph networks outperform all
baseline models that do not explicitly model the structure. The error analysis
confirms that our approach can successfully process complex semantic parses.
Comment: Accepted as COLING 2018 Long Paper, 12 pages.
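For readers unfamiliar with gated graph neural networks, the following is a minimal sketch of the standard propagation scheme over an adjacency matrix (messages aggregated along edges, node states updated with a GRU cell); the dimensions and toy graph are assumptions and do not reproduce the paper's exact architecture.

```python
# Minimal sketch of gated graph neural network propagation: neighbor states
# are transformed, aggregated along graph edges, and used to update node
# states with a GRU cell. Dimensions and the toy graph are illustrative.
import torch
import torch.nn as nn

class GGNNLayer(nn.Module):
    def __init__(self, dim=64, num_steps=3):
        super().__init__()
        self.message = nn.Linear(dim, dim)   # transform neighbor states
        self.gru = nn.GRUCell(dim, dim)      # gated node-state update
        self.num_steps = num_steps

    def forward(self, node_states, adjacency):
        # node_states: (num_nodes, dim); adjacency: (num_nodes, num_nodes)
        for _ in range(self.num_steps):
            messages = adjacency @ self.message(node_states)
            node_states = self.gru(messages, node_states)
        return node_states

# Example: a tiny parse graph with 4 nodes (entities/relations) and 3 edges.
adj = torch.tensor([[0., 1., 0., 0.],
                    [0., 0., 1., 0.],
                    [0., 0., 0., 1.],
                    [0., 0., 0., 0.]])
states = torch.randn(4, 64)
encoded = GGNNLayer()(states, adj)  # (4, 64)
```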
Reporting Score Distributions Makes a Difference: Performance Study of LSTM-networks for Sequence Tagging
In this paper we show that reporting a single performance score is
insufficient to compare non-deterministic approaches. We demonstrate for common
sequence tagging tasks that the seed value for the random number generator can
result in statistically significant (p < 10^-4) differences for
state-of-the-art systems. For two recent systems for NER, we observe an
absolute difference of one percentage point F1-score depending on the selected
seed value, making these systems perceived either as state-of-the-art or
mediocre. Instead of publishing and reporting single performance scores, we
propose to compare score distributions based on multiple executions. Based on
the evaluation of 50,000 LSTM-networks for five sequence tagging tasks, we
present network architectures that both produce superior performance and are
more stable with respect to the remaining hyperparameters.
Comment: Accepted at EMNLP 2017.
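As a simple illustration of the proposed practice, the sketch below compares the score distributions of two systems across several seeded runs instead of two single scores; the F1 values are invented and Welch's t-test is just one possible choice of test.

```python
# Sketch: compare score *distributions* from multiple seeded runs instead of
# single scores. Welch's t-test is one possible test; the F1-scores below
# are made-up numbers for illustration only.
from statistics import mean, stdev
from scipy import stats

# F1-scores of two systems, each trained with several random seeds.
system_a = [90.1, 90.8, 89.9, 90.5, 90.3, 89.7, 90.6]
system_b = [90.4, 90.2, 89.8, 90.7, 90.0, 90.5, 89.9]

print(f"A: {mean(system_a):.2f} +/- {stdev(system_a):.2f}")
print(f"B: {mean(system_b):.2f} +/- {stdev(system_b):.2f}")

# A single-run comparison could easily declare either system "better";
# testing the two distributions guards against seed-induced differences.
t_stat, p_value = stats.ttest_ind(system_a, system_b, equal_var=False)
print(f"Welch's t-test: t = {t_stat:.3f}, p = {p_value:.3f}")
```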
Bringing Structure into Summaries: Crowdsourcing a Benchmark Corpus of Concept Maps
Concept maps can be used to concisely represent important information and
bring structure into large document collections. Therefore, we study a variant
of multi-document summarization that produces summaries in the form of concept
maps. However, suitable evaluation datasets for this task are currently
missing. To close this gap, we present a newly created corpus of concept maps
that summarize heterogeneous collections of web documents on educational
topics. It was created using a novel crowdsourcing approach that allows us to
efficiently determine important elements in large document collections. We
release the corpus along with a baseline system and proposed evaluation
protocol to enable further research on this variant of summarization.
Comment: Published at EMNLP 2017.
Why Comparing Single Performance Scores Does Not Allow to Draw Conclusions About Machine Learning Approaches
Developing state-of-the-art approaches for specific tasks is a major driving
force in our research community. Depending on the prestige of the task,
publishing a new state of the art can come with a lot of visibility. This
raises the question: how reliable are our evaluation methodologies for
comparing approaches?
One common methodology to identify the state-of-the-art is to partition data
into a train, a development and a test set. Researchers can train and tune
their approach on some part of the dataset and then select the model that
worked best on the development set for a final evaluation on unseen test data.
Test scores from different approaches are compared, and performance differences
are tested for statistical significance.
In this publication, we show that there is a high risk that a statistically
significant difference in this type of evaluation is not due to a superior
learning approach but to chance. For example, on the CoNLL 2003 NER dataset we
observed type I errors (false positives) in up to 26% of the cases with a
threshold of p < 0.05, i.e., we falsely concluded a statistically significant
difference between two identical approaches.
We prove that this evaluation setup is unsuitable to compare learning
approaches. We formalize alternative evaluation setups based on score
distributions.
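The following simulation sketches this failure mode under simple assumptions: two "identical" approaches share the same expected accuracy, each trained run deviates from it because of the random seed, and a per-item significance test is applied to a single run per approach. All numbers and the choice of a McNemar-style test are illustrative, not the paper's exact setup.

```python
# Sketch: why a significance test on one run per system can mislead. Both
# "approaches" have the same expected accuracy, but every trained run lands
# at a slightly different accuracy due to the seed. A per-item test on a
# single run per approach only accounts for item-level sampling noise, so it
# reports "significant" differences far more often than the nominal 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_items, mean_acc, seed_std = 3000, 0.90, 0.005
n_trials, alpha = 2000, 0.05

false_positives = 0
for _ in range(n_trials):
    # Two runs of the *same* approach, differing only in random seed.
    acc_a = rng.normal(mean_acc, seed_std)
    acc_b = rng.normal(mean_acc, seed_std)
    run_a = rng.random(n_items) < acc_a
    run_b = rng.random(n_items) < acc_b
    # McNemar-style exact test on the discordant test items.
    a_only = int(np.sum(run_a & ~run_b))
    b_only = int(np.sum(~run_a & run_b))
    if a_only + b_only == 0:
        continue
    p = stats.binomtest(a_only, a_only + b_only, 0.5).pvalue
    false_positives += p < alpha

print(f"False positive rate between identical approaches: "
      f"{false_positives / n_trials:.1%}")
```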
Parsing Argumentation Structures in Persuasive Essays
In this article, we present a novel approach for parsing argumentation
structures. We identify argument components using sequence labeling at the
token level and apply a new joint model for detecting argumentation structures.
The proposed model globally optimizes argument component types and
argumentative relations using integer linear programming. We show that our
model considerably improves the performance of base classifiers and
significantly outperforms challenging heuristic baselines. Moreover, we
introduce a novel corpus of persuasive essays annotated with argumentation
structures. We show that our annotation scheme and annotation guidelines
successfully guide human annotators to substantial agreement. This corpus and
the annotation guidelines are freely available to ensure reproducibility and
to encourage future research in computational argumentation.
Comment: Under review in Computational Linguistics. First submission: 26 October 2015. Revised submission: 15 July 201
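For illustration, the sketch below shows the joint-decoding idea in miniature: hypothetical base-classifier scores for component types and relations are combined under global structural constraints and solved with an off-the-shelf ILP library (PuLP). The scores, labels, and constraints are simplified assumptions, not the paper's exact formulation.

```python
# Sketch of joint decoding with integer linear programming: base-classifier
# scores for component types and relations are combined under global
# structural constraints. Scores and constraints are made-up illustrations.
import pulp

components = [0, 1, 2]
types = ["MajorClaim", "Claim", "Premise"]
# Hypothetical scores from base classifiers.
type_score = {(0, "MajorClaim"): 0.7, (0, "Claim"): 0.2, (0, "Premise"): 0.1,
              (1, "MajorClaim"): 0.1, (1, "Claim"): 0.6, (1, "Premise"): 0.3,
              (2, "MajorClaim"): 0.1, (2, "Claim"): 0.3, (2, "Premise"): 0.6}
rel_score = {(2, 1): 0.8, (1, 0): 0.7, (2, 0): 0.2,
             (0, 1): 0.1, (0, 2): 0.1, (1, 2): 0.2}

prob = pulp.LpProblem("argument_structure", pulp.LpMaximize)
t = {k: pulp.LpVariable(f"type_{k[0]}_{k[1]}", cat="Binary") for k in type_score}
r = {k: pulp.LpVariable(f"rel_{k[0]}_{k[1]}", cat="Binary") for k in rel_score}

# Objective: agree with the base classifiers as much as possible.
prob += (pulp.lpSum(type_score[k] * t[k] for k in type_score)
         + pulp.lpSum(rel_score[k] * r[k] for k in rel_score))

for c in components:
    # Each component receives exactly one type.
    prob += pulp.lpSum(t[(c, ty)] for ty in types) == 1
    # Each component supports at most one target (tree-like structure).
    prob += pulp.lpSum(r[(c, d)] for d in components if (c, d) in rel_score) <= 1
for (src, dst) in rel_score:
    # Illustrative structural constraint: relations never leave a major claim.
    prob += r[(src, dst)] <= 1 - t[(src, "MajorClaim")]

prob.solve(pulp.PULP_CBC_CMD(msg=False))
chosen_types = {c: ty for (c, ty) in type_score if t[(c, ty)].value() == 1}
chosen_rels = [k for k in rel_score if r[k].value() == 1]
print(chosen_types, chosen_rels)
```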
Argumentation Mining in User-Generated Web Discourse
The goal of argumentation mining, an evolving research field in computational
linguistics, is to design methods capable of analyzing people's argumentation.
In this article, we go beyond the state of the art in several ways. (i) We deal
with actual Web data and take up the challenges given by the variety of
registers, multiple domains, and unrestricted noisy user-generated Web
discourse. (ii) We bridge the gap between normative argumentation theories and
argumentation phenomena encountered in actual data by adapting an argumentation
model tested in an extensive annotation study. (iii) We create a new gold
standard corpus (90k tokens in 340 documents) and experiment with several
machine learning methods to identify argument components. We offer the data,
source codes, and annotation guidelines to the community under free licenses.
Our findings show that argumentation mining in user-generated Web discourse is
a feasible but challenging task.
Comment: Cite as: Habernal, I. & Gurevych, I. (2017). Argumentation Mining in User-Generated Web Discourse. Computational Linguistics 43(1), pp. 125-179.
Multimodal Grounding for Language Processing
This survey discusses how recent developments in multimodal processing
facilitate conceptual grounding of language. We categorize the information flow
in multimodal processing with respect to cognitive models of human information
processing and analyze different methods for combining multimodal
representations. Based on this methodological inventory, we discuss the benefit
of multimodal grounding for a variety of language processing tasks and the
challenges that arise. We particularly focus on multimodal grounding of verbs,
which play a crucial role in the compositional power of language.
Comment: The paper has been published in the Proceedings of the 27th International Conference on Computational Linguistics (COLING 2018). Please refer to this version for citations: https://www.aclweb.org/anthology/papers/C/C18/C18-1197
Aspect-Controlled Neural Argument Generation
We rely on arguments in our daily lives to deliver our opinions and base them
on evidence, making them more convincing in turn. However, finding and
formulating arguments can be challenging. In this work, we train a language
model for argument generation that can be controlled on a fine-grained level to
generate sentence-level arguments for a given topic, stance, and aspect. We
define argument aspect detection as a necessary method to allow this
fine-grained control and crowdsource a dataset with 5,032 arguments annotated
with aspects. Our evaluation shows that our generation model is able to
generate high-quality, aspect-specific arguments. Moreover, these arguments can
be used to improve the performance of stance detection models via data
augmentation and to generate counter-arguments. We publish all datasets and
code to fine-tune the language model.
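To illustrate what fine-grained control can look like in practice, the sketch below prepends a control prefix of topic, stance, and aspect to a generic causal language model; the prefix format and the use of an off-the-shelf GPT-2 model are assumptions for illustration, not the authors' released model or training data.

```python
# Sketch of control-code conditioning for argument generation: topic, stance,
# and aspect are prepended as a control prefix and a causal language model
# continues from it. The prefix format and plain GPT-2 are illustrative only.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Hypothetical control prefix: [topic] [stance] [aspect]
prompt = "nuclear energy CON waste disposal :"
inputs = tokenizer(prompt, return_tensors="pt")

output_ids = model.generate(
    **inputs,
    max_new_tokens=40,
    do_sample=True,
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```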