A Continuously Growing Dataset of Sentential Paraphrases
A major challenge in paraphrase research is the lack of parallel corpora. In this paper, we present a new method to collect large-scale sentential paraphrases from Twitter by linking tweets through shared URLs. The main advantage of our method is its simplicity: it removes the need for the classifier or human in the loop that previous work required to select data before annotation and the subsequent application of paraphrase identification algorithms. We present the largest human-labeled paraphrase corpus to date, consisting of 51,524 sentence pairs, and the first cross-domain benchmarking for automatic paraphrase identification. In addition, we show that more than 30,000 new sentential paraphrases can be easily and continuously captured every month at ~70% precision, and we demonstrate their utility for downstream NLP tasks through phrasal paraphrase extraction. We make our code and data freely available.
Comment: 11 pages, accepted to EMNLP 2017
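
As an illustration of the URL-linking idea described in this abstract, the following is a minimal sketch in Python: tweets that share the same (expanded) URL are grouped, and pairs of distinct tweets within a group become candidate sentential paraphrases. The input format and field names are assumptions for illustration, not the authors' pipeline.

    # Minimal sketch of collecting paraphrase candidates by shared URL.
    # Assumes tweets arrive as (text, url) tuples with URLs already expanded.
    from collections import defaultdict
    from itertools import combinations

    def candidate_pairs(tweets):
        """Yield (tweet_a, tweet_b, url) candidate paraphrase triples."""
        by_url = defaultdict(list)
        for text, url in tweets:
            by_url[url].append(text)
        for url, texts in by_url.items():
            # Every pair of distinct tweets pointing at the same article
            # is a candidate sentential paraphrase.
            for a, b in combinations(set(texts), 2):
                yield a, b, url

In practice the candidates would then be filtered or human-labeled, since two tweets about the same article are only paraphrases roughly 70% of the time, per the precision figure above.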
Extending the adverbial coverage of a NLP oriented resource for French
This paper presents work on extending the adverbial entries of LGLex, an NLP-oriented syntactic resource for French. Adverbs were extracted from the Lexicon-Grammar tables of both simple adverbs ending in -ment '-ly' (Molinier and Levrier, 2000) and compound adverbs (Gross, 1986; 1990). This work relies on the exploitation of fine-grained linguistic information provided in existing resources: various features encoded in the LG tables have not yet been exploited. These features describe the relations of deletion, permutation, intensification and paraphrasing that link, on the one hand, simple and compound adverbs and, on the other hand, different types of compound adverbs. The resulting syntactic resource has been manually evaluated and is freely available under the LGPL-LR license.
Comment: Proceedings of the 5th International Joint Conference on Natural Language Processing (IJCNLP'11), Chiang Mai, Thailand (2011)
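
To make the kinds of features involved concrete, here is a purely illustrative sketch of one way such an adverb entry could be represented, with the relation types named in the abstract (deletion, permutation, intensification, paraphrase links). The field names are assumptions, not the actual LGLex schema.

    # Illustrative representation of an LGLex-style adverb entry.
    # Schema and field names are assumed for this sketch.
    from dataclasses import dataclass, field

    @dataclass
    class AdverbEntry:
        lemma: str                    # e.g. "rapidement", a '-ment' adverb
        table: str                    # Lexicon-Grammar table identifier
        deletable: bool = False       # adverb can be deleted from the sentence
        permutable: bool = False      # adverb can move within the sentence
        intensifiable: bool = False   # accepts intensifiers ("très rapidement")
        paraphrases: list = field(default_factory=list)  # linked compound adverbs

    entry = AdverbEntry("rapidement", "ADV-PC", deletable=True, permutable=True,
                        intensifiable=True, paraphrases=["à toute vitesse"])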
Common Sense or World Knowledge? Investigating Adapter-Based Knowledge Injection into Pretrained Transformers
Following the major success of neural language models (LMs) such as BERT or GPT-2 on a variety of language understanding tasks, recent work has focused on injecting (structured) knowledge from external resources into these models. While joint pretraining (i.e., training from scratch, adding objectives based on external knowledge to the primary LM objective) may be prohibitively computationally expensive, post-hoc fine-tuning on external knowledge may lead to catastrophic forgetting of distributional knowledge. In this work, we investigate models for complementing the distributional knowledge of BERT with conceptual knowledge from ConceptNet and its corresponding Open Mind Common Sense (OMCS) corpus, respectively, using adapter training. While the overall results on the GLUE benchmark paint an inconclusive picture, a deeper analysis reveals that our adapter-based models substantially outperform BERT (by up to 15-20 performance points) on inference tasks that require the type of conceptual knowledge explicitly present in ConceptNet and OMCS.
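
For readers unfamiliar with adapter training, here is a minimal sketch of the bottleneck adapter architecture this line of work builds on (Houlsby et al., 2019): a small residual module inserted into each transformer layer and trained on the external-knowledge data while the pretrained weights stay frozen. Dimensions are illustrative; this is not the paper's exact configuration.

    # Minimal bottleneck adapter sketch (sizes are illustrative).
    import torch
    import torch.nn as nn

    class Adapter(nn.Module):
        def __init__(self, hidden_size=768, bottleneck=64):
            super().__init__()
            self.down = nn.Linear(hidden_size, bottleneck)  # down-projection
            self.up = nn.Linear(bottleneck, hidden_size)    # up-projection

        def forward(self, hidden_states):
            # The residual connection preserves the frozen model's
            # distributional knowledge; only the adapter's small number
            # of parameters is updated during knowledge injection.
            return hidden_states + self.up(torch.relu(self.down(hidden_states)))

Because only the adapters are trained, this sidesteps both problems the abstract names: the cost of joint pretraining and the catastrophic forgetting risked by full fine-tuning.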
Question Paraphrase Generation for Question Answering System
The queries to a practical Question Answering (QA) system range from keywords and phrases to badly written questions and, occasionally, grammatically perfect questions. Among the different question analysis approaches, pattern matching works well in analyzing such queries. Building a pattern matching module is costly, however, because tremendous manual labor is needed to expand its coverage to the many variations in natural language questions. This thesis proposes that this costly manual labor can be saved through paraphrase generation, a technique that automatically generates semantically similar paraphrases of a natural language question. Previous approaches to paraphrase generation either require a large-scale corpus and a dependency parser, or only handle simple relation-entity question queries. By introducing a method for inferring transformation operations between paraphrases, together with a description of sentence structure, this thesis develops a paraphrase generation method and implements it for Chinese with a very limited amount of corpus data. The evaluation results show the implementation's ability to help humans efficiently create a pattern matching module for QA systems: it greatly outperforms human editors in coverage of natural language questions, with acceptable precision in the generated paraphrases.
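
A hedged sketch of the core idea of inferring and re-applying transformation operations follows. The operation inventory here (single-token substitution learned from aligned pairs) is a deliberate simplification in English; the thesis works on Chinese and also models sentence structure, which this sketch omits.

    # Sketch: learn token substitutions from paraphrase pairs, re-apply them.
    def infer_substitutions(pairs):
        """Collect (old, new) substitutions observed between paraphrase pairs."""
        ops = set()
        for src, tgt in pairs:
            s, t = src.split(), tgt.split()
            # Naive single-token diff: record the one position that differs.
            if len(s) == len(t):
                diff = [(a, b) for a, b in zip(s, t) if a != b]
                if len(diff) == 1:
                    ops.add(diff[0])
        return ops

    def generate(question, ops):
        tokens = question.split()
        for old, new in ops:
            if old in tokens:
                yield " ".join(new if tok == old else tok for tok in tokens)

    ops = infer_substitutions([("how tall is Yao Ming", "how high is Yao Ming")])
    print(list(generate("how tall is the tower", ops)))
    # -> ['how high is the tower']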
Mining Social Science Publications for Survey Variables
Research in the social sciences is usually based on survey data, where individual research questions relate to observable concepts (variables). However, due to a lack of standards for data citation, reliably identifying the variables used is often difficult. In this paper, we present a work-in-progress study that seeks to solve this variable detection task with supervised machine learning algorithms, using a linguistic analysis pipeline to extract a rich feature set that includes terminological concepts and similarity metric scores. Further, we present preliminary results on a small dataset specifically designed for this task, yielding modest improvements over the baseline.
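
One plausible shape of such a supervised setup is sketched below: classify whether a sentence from a publication mentions a given survey variable, using a similarity score and lexical overlap as features. The feature choices and toy data are assumptions for illustration, not the authors' exact pipeline.

    # Sketch of supervised variable detection with two hand-built features.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics.pairwise import cosine_similarity
    import numpy as np

    def features(sentences, variables, vectorizer):
        S = vectorizer.transform(sentences)
        V = vectorizer.transform(variables)
        sim = cosine_similarity(S, V).diagonal()  # sentence-variable similarity
        overlap = np.array([len(set(s.lower().split()) & set(v.lower().split()))
                            for s, v in zip(sentences, variables)])
        return np.column_stack([sim, overlap])

    sents = ["Respondents were asked about trust in parliament.",
             "The weather was sunny that day."]
    vars_ = ["trust in political institutions", "trust in political institutions"]
    labels = [1, 0]  # does the sentence mention the variable?
    vec = TfidfVectorizer().fit(sents + vars_)
    clf = LogisticRegression().fit(features(sents, vars_, vec), labels)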
Learning Language from a Large (Unannotated) Corpus
A novel approach to the fully automated, unsupervised extraction of dependency grammars and associated syntax-to-semantic-relationship mappings from large text corpora is described. The suggested approach builds on the authors' prior work with the Link Grammar, RelEx and OpenCog systems, as well as on a number of prior papers and approaches from the statistical language learning literature. If successful, this approach would enable the mining of all the information needed to power a natural language comprehension and generation system directly from a large, unannotated corpus.
Comment: 29 pages, 5 figures, research proposal
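
As a taste of the statistical machinery such pipelines typically start from, here is an illustrative sketch of scoring co-occurring word pairs by pointwise mutual information (PMI), a common first step in unsupervised grammar induction. This is not the authors' Link Grammar / OpenCog implementation, just the underlying statistic.

    # Illustrative PMI scoring of adjacent word pairs from raw sentences.
    import math
    from collections import Counter

    def pmi_scores(sentences, window=1):
        words, pairs = Counter(), Counter()
        for sent in sentences:
            toks = sent.lower().split()
            words.update(toks)
            for i, w in enumerate(toks):
                for v in toks[i + 1:i + 1 + window]:
                    pairs[(w, v)] += 1
        n_w, n_p = sum(words.values()), sum(pairs.values())
        # PMI(w, v) = log( p(w, v) / (p(w) * p(v)) )
        return {(w, v): math.log((c / n_p) / ((words[w] / n_w) * (words[v] / n_w)))
                for (w, v), c in pairs.items()}

High-PMI pairs indicate words that attract each other far more than chance, which is the raw signal from which link structures and, eventually, grammatical categories can be induced.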