A large annotated corpus for learning natural language inference
Understanding entailment and contradiction is fundamental to understanding
natural language, and inference about entailment and contradiction is a
valuable testing ground for the development of semantic representations.
However, machine learning research in this area has been dramatically limited
by the lack of large-scale resources. To address this, we introduce the
Stanford Natural Language Inference corpus, a new, freely available collection
of labeled sentence pairs, written by humans doing a novel grounded task based
on image captioning. At 570K pairs, it is two orders of magnitude larger than
all other resources of its type. This increase in scale allows lexicalized
classifiers to outperform some sophisticated existing entailment models, and it
allows a neural network-based model to perform competitively on natural
language inference benchmarks for the first time.
Comment: To appear at EMNLP 2015. The data will be posted shortly before the
conference (the week of 14 Sep) at http://nlp.stanford.edu/projects/snli
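The corpus described above consists of premise–hypothesis pairs, each labeled with one of three relations. A minimal sketch of that data shape (the sentences here are invented for illustration; only the three-way label scheme comes from the abstract):

```python
# SNLI-style labeled sentence pairs: each example pairs a premise with a
# hypothesis and one of the three inference labels. The example sentences
# below are hypothetical, not drawn from the corpus itself.
snli_style_pairs = [
    {
        "premise": "A man is playing a guitar on stage.",
        "hypothesis": "A musician is performing.",
        "label": "entailment",
    },
    {
        "premise": "A man is playing a guitar on stage.",
        "hypothesis": "The man is asleep at home.",
        "label": "contradiction",
    },
    {
        "premise": "A man is playing a guitar on stage.",
        "hypothesis": "The man is playing his favorite song.",
        "label": "neutral",
    },
]

VALID_LABELS = {"entailment", "contradiction", "neutral"}


def label_counts(pairs):
    """Count how many examples carry each NLI label."""
    counts = {label: 0 for label in VALID_LABELS}
    for pair in pairs:
        counts[pair["label"]] += 1
    return counts
```

The "lexicalized classifiers" the abstract mentions operate directly on word-pair features of such premise–hypothesis pairs, which is why scale matters: with 570K examples there are enough pair observations for lexical features to generalize.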
The crustal dynamics intelligent user interface anthology
The National Space Science Data Center (NSSDC) has initiated an Intelligent Data Management (IDM) research effort which has, as one of its components, the development of an Intelligent User Interface (IUI). The intent of the IUI is to develop a friendly and intelligent user interface service based on expert systems and natural language processing technologies. The purpose of such a service is to support the large number of potential scientific and engineering users that have need of space and land-related research and technical data, but have little or no experience in query languages or understanding of the information content or architecture of the databases of interest. This document presents the design concepts, development approach and evaluation of the performance of a prototype IUI system for the Crustal Dynamics Project Database, which was developed using a microcomputer-based expert system tool (M.1), the natural language query processor THEMIS, and the graphics software system GSS. The IUI design is based on a multiple view representation of a database from both the user and database perspective, with intelligent processes to translate between the views.
The Unstoppable Rise of Computational Linguistics in Deep Learning
In this paper, we trace the history of neural networks applied to natural
language understanding tasks, and identify key contributions which the nature
of language has made to the development of neural network architectures. We
focus on the importance of variable binding and its instantiation in
attention-based models, and argue that Transformer is not a sequence model but
an induced-structure model. This perspective leads to predictions of the
challenges facing research in deep learning architectures for natural language
understanding.
Comment: 13 pages. Accepted for publication at ACL 2020, in the theme track.
A Family of Pretrained Transformer Language Models for Russian
Nowadays, Transformer language models (LMs) represent a fundamental component
of the NLP research methodologies and applications. However, the development of
such models specifically for the Russian language has received little
attention. This paper presents a collection of 13 Russian Transformer LMs based
on the encoder (ruBERT, ruRoBERTa, ruELECTRA), decoder (ruGPT-3), and
encoder-decoder (ruT5, FRED-T5) models in multiple sizes. Access to these
models is readily available via the HuggingFace platform. We provide a report
of the model architecture design and pretraining, and the results of evaluating
their generalization abilities on Russian natural language understanding and
generation datasets and benchmarks. By pretraining and releasing these
specialized Transformer LMs, we hope to broaden the scope of the NLP research
directions and enable the development of industrial solutions for the Russian
language.
CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark
Artificial Intelligence (AI), along with the recent progress in biomedical
language understanding, is gradually changing medical practice. With the
development of biomedical language understanding benchmarks, AI applications
are widely used in the medical field. However, most benchmarks are limited to
English, which makes it challenging to replicate many of the successes in
English for other languages. To facilitate research in this direction, we
collect real-world biomedical data and present the first Chinese Biomedical
Language Understanding Evaluation (CBLUE) benchmark: a collection of natural
language understanding tasks including named entity recognition, information
extraction, clinical diagnosis normalization, single-sentence/sentence-pair
classification, and an associated online platform for model evaluation,
comparison, and analysis. To establish a baseline for these tasks, we report
empirical results with 11 current pre-trained Chinese language models; these
experiments show that state-of-the-art neural models still perform far worse
than the human ceiling. Our benchmark is released at
https://tianchi.aliyun.com/dataset/dataDetail?dataId=95414&lang=en-us
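Among the CBLUE tasks, named entity recognition is typically cast as span annotation over clinical text. A minimal sketch of that representation (the example record and schema below are hypothetical, not the benchmark's actual data format):

```python
# Span-style named entity annotation, the kind of clinical NER task a
# biomedical benchmark includes. The record and entity schema here are
# illustrative assumptions, not CBLUE's actual format.
example = {
    "text": "Patient presents with type 2 diabetes and hypertension.",
    "entities": [
        {"start": 22, "end": 37, "type": "disease"},  # "type 2 diabetes"
        {"start": 42, "end": 54, "type": "disease"},  # "hypertension"
    ],
}


def extract_entity_strings(record):
    """Recover entity surface strings from character-offset spans."""
    return [record["text"][e["start"]:e["end"]] for e in record["entities"]]
```

Character-offset spans like these make evaluation straightforward: a predicted entity counts as correct only when its span boundaries and type both match the gold annotation.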
KorNLI and KorSTS: New Benchmark Datasets for Korean Natural Language Understanding
Natural language inference (NLI) and semantic textual similarity (STS) are
key tasks in natural language understanding (NLU). Although several benchmark
datasets for those tasks have been released in English and a few other
languages, there are no publicly available NLI or STS datasets in the Korean
language. Motivated by this, we construct and release new datasets for Korean
NLI and STS, dubbed KorNLI and KorSTS, respectively. Following previous
approaches, we machine-translate existing English training sets and manually
translate development and test sets into Korean. To accelerate research on
Korean NLU, we also establish baselines on KorNLI and KorSTS. Our datasets are
publicly available at https://github.com/kakaobrain/KorNLUDatasets.
Comment: Findings of EMNLP 2020. Datasets available at
https://github.com/kakaobrain/KorNLUDatasets