Semi-Supervised Learning for Neural Keyphrase Generation
We study the problem of generating keyphrases that summarize the key points
for a given document. While sequence-to-sequence (seq2seq) models have achieved
remarkable performance on this task (Meng et al., 2017), model training often
relies on large amounts of labeled data, which is only applicable to
resource-rich domains. In this paper, we propose semi-supervised keyphrase
generation methods by leveraging both labeled data and large-scale unlabeled
samples for learning. Two strategies are proposed. First, unlabeled documents
are tagged with synthetic keyphrases obtained from unsupervised keyphrase
extraction methods or a self-learning algorithm, and then combined with labeled
samples for training. Furthermore, we investigate a multi-task learning
framework to jointly learn to generate keyphrases as well as the titles of the
articles. Experimental results show that our semi-supervised learning-based
methods outperform a state-of-the-art model trained with labeled data only.
Comment: To appear in EMNLP 2018 (12 pages, 7 figures, 6 tables)
Predicting the Effectiveness of Self-Training: Application to Sentiment Classification
The goal of this paper is to investigate the connection between the
performance gain that can be obtained by self-training and the similarity
between the corpora used in this approach. Self-training is a semi-supervised
technique designed to increase the performance of machine learning algorithms
by automatically classifying instances of a task and adding these as additional
training material to the same classifier. In the context of language processing
tasks, this training material is mostly an (annotated) corpus. Unfortunately,
self-training does not always lead to a performance increase, and whether it
will is largely unpredictable. We show that the similarity between corpora can
be used to identify those setups for which self-training can be beneficial. We
regard this research as a step toward developing a classifier that can adapt
itself to each new test corpus it is presented with.
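The self-training loop described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' setup: the one-dimensional nearest-centroid classifier, the function names, and the `min_margin` confidence rule are all assumptions chosen to keep the example self-contained.

```python
from collections import Counter


def train_centroids(samples):
    """Fit a toy classifier: one mean value (centroid) per label."""
    sums, counts = {}, Counter()
    for x, y in samples:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] += 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(centroids, x):
    """Return (nearest label, margin between the two nearest centroids)."""
    ranked = sorted(centroids, key=lambda y: abs(x - centroids[y]))
    best = ranked[0]
    if len(ranked) == 1:
        return best, float("inf")
    margin = abs(x - centroids[ranked[1]]) - abs(x - centroids[best])
    return best, margin


def self_train(labeled, unlabeled, min_margin=1.0, max_rounds=5):
    """Repeatedly label confident instances and add them to the training set.

    min_margin is a hypothetical confidence rule: an instance is accepted
    only if it is clearly closer to one centroid than to the other.
    """
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        model = train_centroids(labeled)
        confident, rest = [], []
        for x in pool:
            label, margin = predict(model, x)
            (confident if margin >= min_margin else rest).append((x, label))
        if not confident:
            break  # nothing new to learn from; stop early
        labeled.extend(confident)
        pool = [x for x, _ in rest]
    return train_centroids(labeled), labeled, pool


seed = [(0.0, "neg"), (10.0, "pos")]
model, grown, leftover = self_train(seed, [1.0, 9.0, 5.2, 2.0])
```

In this toy run the ambiguous instance near the decision boundary is never pseudo-labeled, which hints at why the benefit of self-training depends on how closely the unlabeled data resembles the labeled data.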
Knowledge-Enhanced Multi-Label Few-Shot Product Attribute-Value Extraction
Existing attribute-value extraction (AVE) models require large quantities of
labeled data for training. However, new products with new attribute-value pairs
enter the market every day in real-world e-Commerce. Thus, we formulate AVE as a
multi-label few-shot learning (FSL) problem, aiming to extract unseen attribute-value
pairs based on a small number of training examples. We propose a
Knowledge-Enhanced Attentive Framework (KEAF) based on prototypical networks,
leveraging the generated label description and category information to learn
more discriminative prototypes. In addition, KEAF integrates hybrid attention
to reduce noise and capture more informative semantics for each class by
calculating the label-relevant and query-related weights. To achieve
multi-label inference, KEAF further learns a dynamic threshold by integrating
the semantic information from both the support set and the query set. Extensive
experiments with ablation studies conducted on two datasets demonstrate that
KEAF outperforms other SOTA models for information extraction in FSL. The code
can be found at: https://github.com/gjiaying/KEAF
Comment: 6 pages, 2 figures, published in CIKM 202
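For readers unfamiliar with prototypical networks, the core idea of multi-label inference over class prototypes can be sketched as below. This is only an assumed simplification, not the KEAF implementation: the embeddings are hand-written vectors, the function names are hypothetical, and a fixed `threshold` stands in for the dynamic threshold that KEAF learns.

```python
import math


def prototype(vectors):
    """Class prototype: the element-wise mean of the support embeddings."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict_labels(query, support, threshold):
    """Return every label whose prototype lies within `threshold` of the query.

    `threshold` is a fixed, hypothetical value here; the paper instead
    learns it from the support and query sets.
    """
    protos = {label: prototype(vecs) for label, vecs in support.items()}
    return sorted(
        label for label, p in protos.items() if euclidean(query, p) <= threshold
    )


# Toy support set: a few embeddings per attribute label (made-up values).
support = {
    "color": [[1.0, 0.0], [1.0, 0.2]],
    "size": [[0.0, 1.0], [0.2, 1.0]],
    "brand": [[5.0, 5.0]],
}
both = predict_labels([0.5, 0.5], support, threshold=1.0)
single = predict_labels([1.0, 0.0], support, threshold=1.0)
```

A query that sits between two prototypes receives both labels, while a query close to a single prototype receives only that one, which is the behavior a multi-label threshold is meant to provide.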
Innovative technologies for under-resourced language documentation: The BULB Project
The project Breaking the Unwritten Language Barrier (BULB), which brings together linguists and computer scientists, aims at supporting linguists in documenting unwritten languages. In order to achieve this we will develop tools tailored to the needs of documentary linguists by building upon technology and expertise from the area of natural language processing, most prominently automatic speech recognition and machine translation. As a development and test bed for this we have chosen three less-resourced African languages from the Bantu family: Basaa, Myene and Embosi. Work within the project is divided into three main steps: 1) Collection of a large corpus of speech (100h per language) at a reasonable cost. After initial recording, the data is re-spoken by a reference speaker to enhance the signal quality and orally translated into French. 2) Automatic transcription of the Bantu languages at phoneme level and the French translation at word level. The recognized Bantu phonemes and French words will then be automatically aligned. 3) Tool development. In close cooperation and discussion with the linguists, the speech and language technologists will design and implement tools that will support the linguists in their work, taking into account the linguists' needs and technology's capabilities. The data collection has begun for the three languages. For this we use standard mobile devices and dedicated software, LIG-AIKUMA, which offers a range of different speech collection modes (recording, respeaking, translation and elicitation). LIG-AIKUMA's improved features include smart generation and handling of speaker metadata as well as respeaking and parallel audio data mapping.