70 research outputs found
The Processing of Verb-Argument Constructions is Sensitive to Form, Function, Frequency, Contingency, and Prototypicality.
Peer Reviewed
https://deepblue.lib.umich.edu/bitstream/2027.42/139741/1/CognitiveLinguisticsUMichOffprint.pd
Empirical Sufficiency Lower Bounds for Language Modeling with Locally-Bootstrapped Semantic Structures
In this work we build upon negative results from an attempt at language
modeling with predicted semantic structure, in order to establish empirical
lower bounds on what could have made the attempt successful. More specifically,
we design a concise binary vector representation of semantic structure at the
lexical level and evaluate in-depth how good an incremental tagger needs to be
in order to achieve better-than-baseline performance with an end-to-end
semantic-bootstrapping language model. We envision such a system as consisting
of a (pretrained) sequential-neural component and a hierarchical-symbolic
component working together to generate text with low surprisal and high
linguistic interpretability. We find that (a) dimensionality of the semantic
vector representation can be dramatically reduced without losing its main
advantages and (b) lower bounds on prediction quality cannot be established via
a single score alone, but need to take the distributions of signal and noise
into account.
Comment: To appear at *SEM 2023, Toronto.
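The "concise binary vector representation of semantic structure at the lexical level" can be pictured as a fixed feature inventory with one bit per lexical-semantic property. The sketch below is illustrative only; the feature names are hypothetical stand-ins, not the paper's actual inventory:

```python
# One bit per feature in a fixed inventory; the feature names here are
# hypothetical stand-ins, not the paper's actual encoding.
FEATURES = ["is_predicate", "is_argument", "scopes_left", "scopes_right",
            "opens_clause", "closes_clause"]

def encode(active_features):
    """Map a token's set of active feature names to a fixed-width 0/1 vector."""
    return [1 if f in active_features else 0 for f in FEATURES]

print(encode({"is_predicate", "scopes_left"}))  # [1, 0, 1, 0, 0, 0]
```

Dimensionality reduction then amounts to pruning bits from the inventory while preserving the distinctions the language model actually uses.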
Unseen Word Representation by Aligning Heterogeneous Lexical Semantic Spaces.
Word embedding techniques heavily rely on the abundance of training data for individual words. Given the Zipfian distribution of words in natural language texts, a large number of words do not usually appear frequently, or at all, in the training data. In this paper we put forward a technique that exploits the knowledge encoded in lexical resources, such as WordNet, to induce embeddings for unseen words. Our approach adapts graph embedding and cross-lingual vector space transformation techniques in order to merge lexical knowledge encoded in ontologies with that derived from corpus statistics. We show that the approach can provide consistent performance improvements across multiple evaluation benchmarks: in vitro, on multiple rare word similarity datasets, and in vivo, in two downstream text classification tasks.
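A standard way to merge a graph-derived space with a corpus-derived one is to learn an orthogonal map between them from words present in both. The sketch below uses orthogonal Procrustes on synthetic data; it illustrates the general alignment technique, not the paper's exact transformation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: graph-derived embeddings (e.g. induced from WordNet)
# and corpus-derived embeddings for a seed vocabulary of 100 words present
# in both spaces, dimension 50. For this demo the corpus space is an exact
# rotation of the graph space, so the learned map can be checked.
d = 50
seed_graph = rng.normal(size=(100, d))
true_rotation = np.linalg.qr(rng.normal(size=(d, d)))[0]
seed_corpus = seed_graph @ true_rotation

# Orthogonal Procrustes: W minimising ||seed_graph @ W - seed_corpus||_F
# over orthogonal matrices, via the SVD of the cross-covariance matrix.
u, _, vt = np.linalg.svd(seed_graph.T @ seed_corpus)
W = u @ vt

# Induce a corpus-space embedding for a word unseen in the corpus but
# present in the lexical resource, from its graph embedding alone.
unseen_graph = rng.normal(size=d)
induced = unseen_graph @ W
```

The same recipe underlies much cross-lingual embedding work: once a linear map is learned from a seed dictionary, any vector in the source space can be projected into the target space.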
Typilus: Neural Type Hints
Type inference over partial contexts in dynamically typed languages is
challenging. In this work, we present a graph neural network model that
predicts types by probabilistically reasoning over a program's structure,
names, and patterns. The network uses deep similarity learning to learn a
TypeSpace -- a continuous relaxation of the discrete space of types -- and how
to embed the type properties of a symbol (i.e. identifier) into it.
Importantly, our model can employ one-shot learning to predict an open
vocabulary of types, including rare and user-defined ones. We realise our
approach in Typilus, a tool for Python that combines the TypeSpace with an
optional type checker. We show that Typilus accurately predicts types. Typilus
confidently predicts types for 70% of all annotatable symbols; when it predicts
a type, that type optionally type checks 95% of the time. Typilus can also find
incorrect type annotations; two important and popular open source libraries,
fairseq and allennlp, accepted our pull requests that fixed the annotation
errors Typilus discovered.
Comment: Accepted to PLDI 2020.
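The open-vocabulary prediction rule can be sketched as nearest-neighbour search over type centroids in the learned TypeSpace. Everything below (dimension, type names, random embeddings) is illustrative, standing in for what the paper's graph-neural-network encoder would produce:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8  # toy embedding dimension

# Centroids for a few known types in the TypeSpace; random vectors stand
# in for embeddings a trained graph neural network would produce.
type_space = {
    "int": rng.normal(size=d),
    "str": rng.normal(size=d),
    "List[int]": rng.normal(size=d),
}

def predict_type(symbol_embedding, space):
    """Predict the type whose centroid is nearest by cosine similarity."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(space, key=lambda t: cos(symbol_embedding, space[t]))

# One-shot open vocabulary: a single embedded example of a new,
# user-defined type becomes a centroid, making it immediately predictable.
type_space["MyConfig"] = rng.normal(size=d)

# A symbol embedded close to the "int" centroid is predicted as int.
query = type_space["int"] + 0.05 * rng.normal(size=d)
print(predict_type(query, type_space))
```

Because prediction is a lookup in a continuous space rather than a softmax over a closed label set, adding a rare or user-defined type costs one vector, not retraining.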
Semi-supervised lexical acquisition for wide-coverage parsing
State-of-the-art parsers suffer from incomplete lexicons, as evidenced by the fact
that they all contain built-in methods for dealing with out-of-lexicon items at parse
time. Since new labelled data is expensive to produce and no amount of it will conquer
the long tail, we attempt to address this problem by leveraging the enormous amount of
raw text available for free, and expanding the lexicon offline, with a semi-supervised
word learner. We accomplish this with a method similar to self-training, where a fully
trained parser is used to generate new parses with which the next generation of parser
is trained.
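Schematically, that self-training loop looks like the sketch below. The ToyParser is a deliberately simple stand-in (it tags words from a learned lexicon), not the thesis's StatCCG parser:

```python
class ToyParser:
    """Stand-in parser: 'parsing' is tagging words from a learned lexicon."""

    def __init__(self):
        self.lexicon = {}

    def train(self, labelled):
        for words, cats in labelled:
            self.lexicon.update(zip(words, cats))

    def parse(self, words):
        cats = tuple(self.lexicon.get(w) for w in words)
        confidence = sum(c is not None for c in cats) / len(words)
        return cats, confidence

def self_train(parser, labelled, raw_text, generations=2, threshold=1.0):
    """Retrain, then fold the parser's confident analyses of raw text back in."""
    for _ in range(generations):
        parser.train(labelled)
        for words in raw_text:
            cats, confidence = parser.parse(words)
            if confidence >= threshold and (words, cats) not in labelled:
                labelled.append((words, cats))
    return parser

labelled = [(("dogs", "bark"), ("NP", "S\\NP")),
            (("cats", "sleep"), ("NP", "S\\NP"))]
# "cats bark" recombines known words, so its parse is absorbed;
# "birds sing" contains unknown words and is left alone.
parser = self_train(ToyParser(), labelled,
                    raw_text=[("cats", "bark"), ("birds", "sing")])
```

The toy also exposes the weakness the thesis targets: a self-trained parser only confidently parses sentences made of words it already knows, which is why a dedicated word learner is needed for the long tail.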
This thesis introduces Chart Inference (CI), a two-phase word-learning method
for Combinatory Categorial Grammar (CCG), operating on the level of the partial
parse as produced by a trained parser. CI uses the parsing model and lexicon to identify
the CCG category type for one unknown word in a context of known words by inferring
the type of the sentence using a model of end punctuation, then traversing the chart
from the top down, filling in each empty cell as a function of its mother and its sister.
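The cell-filling step can be illustrated by inverting CCG function application: given the mother's category and the known sister's category, the unknown daughter's category is determined. This toy function is hypothetical (it ignores composition, type-raising, and the parsing model), but shows the idea:

```python
def _wrap(cat):
    """Parenthesise complex categories for unambiguous notation."""
    return f"({cat})" if ("\\" in cat or "/" in cat) else cat

def infer_category(mother, sister, sister_side):
    """Invert CCG function application for an unknown daughter.

    If the known sister stands to the left of the unknown word, the unknown
    word must be a backward-looking functor mother\\sister; if to the right,
    a forward-looking functor mother/sister.
    """
    if sister_side == "left":
        return f"{_wrap(mother)}\\{_wrap(sister)}"
    return f"{_wrap(mother)}/{_wrap(sister)}"

# "dogs bark.": the sentence spans S, "dogs" is a known NP on the left,
# so the unknown "bark" is inferred to be S\NP (an intransitive verb).
print(infer_category("S", "NP", "left"))        # S\NP
# Under a mother S\NP with a known NP sister on the right, the unknown
# word gets the transitive-verb category (S\NP)/NP.
print(infer_category("S\\NP", "NP", "right"))   # (S\NP)/NP
```

Applied top-down from the inferred sentence category, each such step narrows the unknown word's category until its chart cell is reached.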
We first specify the CI algorithm, and then compare it to two baseline word-learning
systems over a battery of learning tasks. CI is shown to outperform the
baselines in every task, and to function in a number of applications, including grammar
acquisition and domain adaptation. This method performs consistently better than
self-training, and improves upon the standard POS-backoff strategy employed by the
baseline StatCCG parser by adding new entries to the lexicon.
The first learning task establishes lexical convergence over a toy corpus, showing
that CI’s ability to accurately model a target lexicon is more robust to initial conditions
than either of the baseline methods. We then introduce a novel natural language corpus
based on children’s educational materials, which is fully annotated with CCG derivations.
We use this corpus as a testbed to establish that CI is capable in principle of
recovering the whole range of category types necessary for a wide-coverage lexicon.
The complexity of the learning task is then increased, using the CCGbank corpus,
a CCG version of the Penn Treebank, and showing that CI improves as its initial seed corpus
is increased. The next experiment uses CCGbank as the seed and attempts to recover
missing question-type categories in the TREC question answering corpus. The final
task extends the coverage of the CCGbank-trained parser by running CI over the raw
text of the Gigaword corpus. Where appropriate, a fine-grained error analysis is also
undertaken to supplement the quantitative evaluation of the parser performance with
deeper reasoning as to the linguistic properties of the lexicon and parsing model.
- …