Why We Need New Evaluation Metrics for NLG
The majority of NLG evaluation relies on automatic metrics, such as BLEU. In
this paper, we motivate the need for novel, system- and data-independent
automatic evaluation methods: We investigate a wide range of metrics, including
state-of-the-art word-based and novel grammar-based ones, and demonstrate that
they only weakly reflect human judgements of system outputs as generated by
data-driven, end-to-end NLG. We also show that metric performance is data- and
system-specific. Nevertheless, our results also suggest that automatic metrics
perform reliably at system-level and can support system development by finding
cases where a system performs poorly.
Comment: accepted to EMNLP 2017
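As a rough illustration of the kind of correlation analysis such a study performs, the sketch below scores a handful of system outputs with sentence-level BLEU and checks their rank correlation against human ratings. NLTK's smoothed sentence_bleu, Spearman's rho, and the toy data are assumptions made for illustration, not the paper's actual metric suite or evaluation data.

```python
# A minimal sketch of correlating an automatic metric with human judgements;
# smoothed sentence-level BLEU and Spearman's rho are illustrative choices,
# not the paper's full set of word-based and grammar-based metrics.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from scipy.stats import spearmanr

references = [
    ["the hotel is near the river"],
    ["a cheap italian restaurant in the centre"],
    ["there is a moderately priced pub in the north"],
]
outputs = [
    "the hotel is close to the river",
    "cheap italian restaurant in centre",
    "a moderately priced pub can be found in the north",
]
human_scores = [4.5, 3.0, 5.0]  # e.g. mean Likert ratings per output

smooth = SmoothingFunction().method1
bleu_scores = [
    sentence_bleu([r.split() for r in refs], out.split(),
                  smoothing_function=smooth)
    for refs, out in zip(references, outputs)
]

# A weak correlation at this output level is the paper's central concern.
rho, p = spearmanr(bleu_scores, human_scores)
print(f"Spearman rho = {rho:.2f} (p = {p:.2f})")
```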
Bag-Level Aggregation for Multiple Instance Active Learning in Instance Classification Problems
A growing number of applications, e.g. video surveillance and medical image
analysis, require training recognition systems from large amounts of weakly
annotated data while some targeted interactions with a domain expert are
allowed to improve the training process. In such cases, active learning (AL)
can reduce labeling costs for training a classifier by querying the expert to
provide the labels of most informative instances. This paper focuses on AL
methods for instance classification problems in multiple instance learning
(MIL), where data is arranged into sets, called bags, that are weakly labeled.
Most AL methods focus on single instance learning problems. These methods are
not suitable for MIL problems because they cannot account for the bag structure
of data. In this paper, new methods for bag-level aggregation of instance
informativeness are proposed for multiple instance active learning (MIAL). The
"aggregated informativeness" method identifies the most informative
instances based on classifier uncertainty, and queries bags incorporating the
most information. The other proposed method, called "cluster-based
aggregative sampling", clusters data hierarchically in the instance space. The
informativeness of instances is assessed by considering bag labels, inferred
instance labels, and the proportion of labels that remain to be discovered in
clusters. Both proposed methods significantly outperform reference methods in
extensive experiments using benchmark data from several application domains.
Results indicate that using an appropriate strategy to address MIAL problems
yields a significant reduction in the number of queries needed to achieve the
same level of performance as single instance AL methods.
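To make the bag-level aggregation idea concrete, here is a minimal sketch under stated assumptions: instance informativeness is measured as predictive entropy under the current classifier, aggregated per bag by averaging, and the highest-scoring bag is queried. The entropy measure, the mean aggregation, and the synthetic data are illustrative choices, not necessarily the paper's exact formulation.

```python
# A minimal sketch of bag-level aggregation for MIL active learning:
# score unlabeled instances by classifier uncertainty, aggregate per bag,
# and query the most informative bag. Entropy-based uncertainty and mean
# aggregation are assumptions made for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_labeled = rng.normal(size=(40, 5))
y_labeled = (X_labeled[:, 0] > 0).astype(int)

# Unlabeled pool arranged into bags of instances, as in MIL.
bags = [rng.normal(size=(rng.integers(3, 8), 5)) for _ in range(10)]

clf = LogisticRegression().fit(X_labeled, y_labeled)

def bag_informativeness(bag):
    proba = clf.predict_proba(bag)
    # Predictive entropy of each instance, averaged over the bag.
    entropy = -np.sum(proba * np.log(proba + 1e-12), axis=1)
    return entropy.mean()

scores = [bag_informativeness(bag) for bag in bags]
query_idx = int(np.argmax(scores))
print(f"Query bag {query_idx} for expert labeling "
      f"(aggregated informativeness {scores[query_idx]:.3f})")
```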
On stopwords, filtering and data sparsity for sentiment analysis of Twitter
Sentiment classification over Twitter is usually affected by the noisy nature (abbreviations, irregular forms) of tweet data. A popular procedure to reduce the noise of textual data is to remove stopwords by using pre-compiled stopword lists or more sophisticated methods for dynamic stopword identification. However, the effectiveness of removing stopwords in the context of Twitter sentiment classification has been debated in the last few years. In this paper we investigate whether removing stopwords helps or hampers the effectiveness of Twitter sentiment classification methods. To this end, we apply six different stopword identification methods to Twitter data from six different datasets and observe how removing stopwords affects two well-known supervised sentiment classification methods. We assess the impact of removing stopwords by observing fluctuations in the level of data sparsity, the size of the classifier's feature space and its classification performance. Our results show that using pre-compiled lists of stopwords negatively impacts the performance of Twitter sentiment classification approaches. On the other hand, the dynamic generation of stopword lists, by removing those infrequent terms appearing only once in the corpus, appears to be the optimal method for maintaining high classification performance while reducing the data sparsity and substantially shrinking the feature space.
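The dynamic strategy the abstract identifies as optimal, dropping terms that occur only once in the corpus, is straightforward to sketch. The toy tweets and whitespace tokenization below are illustrative assumptions.

```python
# A minimal sketch of dynamic stopword generation by singleton removal:
# terms appearing only once in the corpus are treated as stopwords.
from collections import Counter

tweets = ["great phone love it", "love this phone", "battery awful awful"]
counts = Counter(token for tweet in tweets for token in tweet.split())

singletons = {term for term, freq in counts.items() if freq == 1}
filtered = [" ".join(t for t in tweet.split() if t not in singletons)
            for tweet in tweets]
print(filtered)  # shrinks the feature space while keeping frequent terms
```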
When is multitask learning effective? Semantic sequence prediction under varying data conditions
Multitask learning (MTL) has been applied successfully to a range of tasks,
mostly morphosyntactic. However, little is known about when MTL works and
whether there are data characteristics that help to determine its success. In
this paper we evaluate a range of semantic sequence labeling tasks in an MTL
setup. We examine different auxiliary tasks, including a novel setup, and
correlate their impact with data-dependent conditions. Our results show that
MTL is not always effective: significant improvements are obtained for only 1
out of 5 tasks.
When successful, auxiliary tasks with compact and more uniform label
distributions are preferable.
Comment: In EACL 2017
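A common way to realize such an MTL setup for sequence labeling is hard parameter sharing: a shared encoder with one output head per task. The PyTorch sketch below is a minimal illustration under assumed layer sizes, task names, and dummy data; it is not the authors' exact architecture.

```python
# A minimal sketch of hard-parameter-sharing multitask sequence labeling:
# a shared BiLSTM encoder and one softmax head per task. Sizes, task names,
# and the alternating training loop are illustrative assumptions.
import torch
import torch.nn as nn

class MultitaskTagger(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim, label_sizes):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                               bidirectional=True)
        # One output head per task; the encoder parameters are shared.
        self.heads = nn.ModuleDict({
            task: nn.Linear(2 * hidden_dim, n_labels)
            for task, n_labels in label_sizes.items()
        })

    def forward(self, token_ids, task):
        states, _ = self.encoder(self.embed(token_ids))
        return self.heads[task](states)  # (batch, seq_len, n_labels)

model = MultitaskTagger(vocab_size=10000, emb_dim=64, hidden_dim=128,
                        label_sizes={"main": 5, "aux": 3})
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.Adam(model.parameters())

# Alternating batches from the main and auxiliary tasks lets gradients
# from both tasks update the shared encoder.
for task, n_labels in [("main", 5), ("aux", 3)]:
    tokens = torch.randint(0, 10000, (8, 20))   # dummy token batch
    gold = torch.randint(0, n_labels, (8, 20))  # dummy gold labels
    logits = model(tokens, task)
    loss = loss_fn(logits.reshape(-1, n_labels), gold.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
```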