Analysing Data-To-Text Generation Benchmarks
Recently, several datasets associating data with text have been created to
train data-to-text surface realisers. It is unclear, however, to what extent
the surface realisation task exercised by these datasets is linguistically
challenging. Do these datasets provide enough variety to encourage the
development of generic, high-quality data-to-text surface realisers? In this
paper, we argue that these datasets have important drawbacks. We back up our
claim using statistics, metrics and manual evaluation. We conclude by eliciting
a set of criteria for the creation of a data-to-text benchmark which could
better support the development, evaluation and comparison of linguistically
sophisticated data-to-text surface realisers.
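The abstract does not spell out which statistics were computed; as a hypothetical illustration of the kind of diversity measurement it refers to, a type/token ratio is one of the simplest ways to quantify the lexical variety of a benchmark's texts:

```python
def type_token_ratio(texts):
    """Ratio of distinct word types to total tokens: a crude but
    common proxy for the lexical variety of a text collection."""
    tokens = [tok for text in texts for tok in text.lower().split()]
    return len(set(tokens)) / len(tokens)

# A repetitive corpus scores low; a varied one scores high.
print(type_token_ratio(["the cat sat", "the cat sat"]))   # 0.5
print(type_token_ratio(["the cat sat", "a dog ran off"]))  # 1.0
```

A low ratio on a large benchmark would support the paper's concern that size alone does not guarantee linguistic variety.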
Learning Embeddings to lexicalise RDF Properties
A difficult task when generating text from knowledge bases (KB) consists in finding appropriate lexicalisations for KB symbols. We present an approach for lexicalising knowledge base relations and apply it to DBpedia data. Our model learns low-dimensional embeddings of words and RDF resources and uses these representations to score RDF properties against candidate lexicalisations. Training our model using (i) pairs of RDF triples and automatically generated verbalisations of these triples and (ii) pairs of paraphrases extracted from various resources yields competitive results on DBpedia data.
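The learned embeddings and training procedure are not reproduced here; the following sketch, using made-up three-dimensional vectors and the illustrative property name `dbo:birthPlace`, only shows how a shared embedding space lets one rank candidate lexicalisations against an RDF property by cosine similarity:

```python
import numpy as np

# Toy embedding table: a hypothetical stand-in for the learned
# low-dimensional embeddings of words and RDF resources.
EMB = {
    "dbo:birthPlace": np.array([0.9, 0.1, 0.0]),
    "born in":        np.array([0.8, 0.2, 0.1]),
    "married to":     np.array([0.0, 0.9, 0.3]),
}

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def rank_lexicalisations(prop, candidates):
    """Score candidate verbalisations of an RDF property by cosine
    similarity in the shared embedding space, best first."""
    scored = [(c, cosine(EMB[prop], EMB[c])) for c in candidates]
    return sorted(scored, key=lambda x: x[1], reverse=True)

print(rank_lexicalisations("dbo:birthPlace", ["married to", "born in"]))
```

With these toy vectors, "born in" outranks "married to" for `dbo:birthPlace`, mirroring the kind of ranking the model produces.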
A Statistical, Grammar-Based Approach to Microplanning
While there has been much work in recent years on data-driven natural language generation, little attention has been paid to the fine-grained interactions that arise during micro-planning between aggregation, surface realisation and sentence segmentation. In this paper, we propose a hybrid symbolic/statistical approach to jointly model these interactions. Our approach integrates a small handwritten grammar, a statistical hypertagger and a surface realisation algorithm. It is applied to the verbalisation of knowledge base queries and tested on 13 knowledge bases to demonstrate domain independence. We evaluate our approach in several ways. A quantitative analysis shows that the hybrid approach outperforms a purely symbolic approach in terms of both speed and coverage. Results from a human study indicate that users find the output of this hybrid statistical/symbolic system more fluent than both a template-based and a purely symbolic grammar-based approach. Finally, we illustrate by means of examples that our approach can account for various factors impacting aggregation, sentence segmentation and surface realisation.
Deep Graph Convolutional Encoders for Structured Data to Text Generation
Most previous work on neural text generation from graph-structured data
relies on standard sequence-to-sequence methods. These approaches linearise the
input graph to be fed to a recurrent neural network. In this paper, we propose
an alternative encoder based on graph convolutional networks that directly
exploits the input structure. We report results on two graph-to-sequence
datasets that empirically show the benefits of explicitly encoding the input
graph structure.
Comment: INLG 2018
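The paper's actual encoder is not reproduced here; as a minimal NumPy sketch with random toy weights, a single graph-convolutional layer shows the core idea: each node aggregates its neighbours' features directly through the adjacency matrix, instead of the graph being linearised and fed to a recurrent network:

```python
import numpy as np

def gcn_layer(H, A, W):
    """One graph-convolutional layer (sketch): add self-loops,
    row-normalise the adjacency matrix, aggregate neighbour features,
    then apply a linear map followed by a ReLU."""
    A_hat = A + np.eye(A.shape[0])             # self-loops
    D_inv = np.diag(1.0 / A_hat.sum(axis=1))   # degree normalisation
    return np.maximum(0.0, D_inv @ A_hat @ H @ W)

# Toy 3-node chain graph 0 - 1 - 2, 4-dim node features, 2-dim output.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
H = np.random.rand(3, 4)   # node feature matrix
W = np.random.rand(4, 2)   # layer weights
print(gcn_layer(H, A, W).shape)  # (3, 2)
```

Stacking such layers lets information propagate along graph edges, which is what "directly exploits the input structure" refers to.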
Creating Training Corpora for NLG Micro-Planning
In this paper, we focus on how to create data-to-text corpora which can support the learning of wide-coverage micro-planners, i.e., generation systems that handle lexicalisation, aggregation, surface realisation, sentence segmentation and referring expression generation. We start by reviewing common practice in designing training benchmarks for Natural Language Generation. We then present a novel framework for semi-automatically creating linguistically challenging NLG corpora from existing Knowledge Bases. We apply our framework to DBpedia data and compare the resulting dataset with (Wen et al., 2016)'s dataset. We show that while (Wen et al., 2016)'s dataset is more than twice as large as ours, it is less diverse both in terms of input and in terms of text. We thus propose our corpus generation framework as a novel method for creating challenging datasets from which NLG models capable of generating text from KB data can be learned.
Bootstrapping Generators from Noisy Data
A core step in statistical data-to-text generation concerns learning
correspondences between structured data representations (e.g., facts in a
database) and associated texts. In this paper we aim to bootstrap generators
from large scale datasets where the data (e.g., DBPedia facts) and related
texts (e.g., Wikipedia abstracts) are loosely aligned. We tackle this
challenging task by introducing a special-purpose content selection mechanism.
We use multi-instance learning to automatically discover correspondences
between data and text pairs and show how these can be used to enhance the
content signal while training an encoder-decoder architecture. Experimental
results demonstrate that models trained with content-specific objectives
improve upon a vanilla encoder-decoder which solely relies on soft attention.
Comment: NAACL 2018
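The paper discovers data-text correspondences with multi-instance learning; the toy heuristic below is not the authors' method but a much simpler word-overlap stand-in that illustrates the underlying task: deciding which sentence of a loosely aligned text expresses which fact:

```python
def align_facts(facts, sentences):
    """For each fact (given as a set of content tokens), pick the
    sentence with the greatest token overlap -- a toy stand-in for
    multi-instance correspondence discovery."""
    best = {}
    for name, toks in facts.items():
        best[name] = max(sentences,
                         key=lambda s: len(toks & set(s.lower().split())))
    return best

facts = {"birthPlace": {"born", "paris"},
         "occupation": {"novelist"}}
sentences = ["she was born in paris .",
             "she worked as a novelist ."]
print(align_facts(facts, sentences))
```

A learned content selection mechanism replaces this crude overlap score, but the output, per-fact sentence alignments, plays the same role in strengthening the training signal.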
Using FB-LTAG Derivation Trees to Generate Transformation-Based Grammar Exercises
Using a Feature-Based Lexicalised Tree Adjoining Grammar (FB-LTAG), we present an approach for generating pairs of sentences that are related by a syntactic transformation, and we apply this approach to create language learning exercises. We argue that the derivation trees of an FB-LTAG provide a good level of representation for capturing syntactic transformations. We relate our approach to previous work on sentence reformulation, question generation and grammar exercise generation. We evaluate precision and linguistic coverage, and we demonstrate the genericity of the proposal by applying it to a range of transformations, including the passive/active transformation, the pronominalisation of an NP, the assertion/yes-no question relation and the assertion/wh-question transformation.
Generating Grammar Exercises
Grammar exercises for language learning fall into two distinct classes: those that are based on "real-life sentences" extracted from existing documents or from the web, and those that seek to facilitate language acquisition by presenting the learner with exercises whose syntax is as simple as possible and whose vocabulary is restricted to that contained in the textbook being used. In this paper, we introduce a framework (called gramex) which permits generating the second type of grammar exercise. Using generation techniques, we show that a grammar can be used to semi-automatically generate grammar exercises which target a specific learning goal, are made of short, simple sentences, and whose vocabulary is restricted to that used in a given textbook.
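gramex itself is grammar-based and its implementation is not shown here; as a hypothetical fragment, the vocabulary restriction it enforces can be pictured as filtering candidate sentences against a textbook word list:

```python
def within_textbook_vocab(sentences, vocab):
    """Keep only candidate sentences whose every word appears in the
    textbook word list -- a sketch of the restriction gramex enforces,
    not its actual implementation."""
    return [s for s in sentences
            if set(s.lower().rstrip(" .").split()) <= vocab]

vocab = {"the", "cat", "sleeps", "a", "dog"}
candidates = ["The cat sleeps .", "A dog barks ."]
print(within_textbook_vocab(candidates, vocab))  # ['The cat sleeps .']
```

In the actual system the grammar generates only in-vocabulary sentences rather than filtering afterwards, but the constraint being satisfied is the same.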
Using Regular Tree Grammars to enhance Sentence Realisation
Feature-based regular tree grammars (FRTG) can be used to generate the derivation trees of a feature-based tree adjoining grammar (FTAG). We make use of this fact to specify and implement both an FTAG-based sentence realiser and a benchmark generator for this realiser. We argue furthermore that the FRTG encoding enables us to improve on other proposals based on a grammar of TAG derivation trees in several ways: it preserves the compositional semantics that can be encoded in feature-based TAGs; it increases efficiency and restricts overgeneration; and it provides a uniform resource for generation, benchmark construction and parsing.
Building RDF Content for Data-to-Text Generation
In Natural Language Generation (NLG), one important limitation is the lack of common benchmarks on which to train, evaluate and compare data-to-text generators. In this paper, we make one step in that direction and introduce a method for automatically creating an arbitrarily large repertoire of data units that could serve as input for generation. Using both automated metrics and a human evaluation, we show that the data units produced by our method are both diverse and coherent.