70 research outputs found
The Processing of Verb-Argument Constructions is Sensitive to Form, Function, Frequency, Contingency, and Prototypicality.
Peer Reviewed
https://deepblue.lib.umich.edu/bitstream/2027.42/139741/1/CognitiveLinguisticsUMichOffprint.pd
Empirical Sufficiency Lower Bounds for Language Modeling with Locally-Bootstrapped Semantic Structures
In this work we build upon negative results from an attempt at language
modeling with predicted semantic structure, in order to establish empirical
lower bounds on what could have made the attempt successful. More specifically,
we design a concise binary vector representation of semantic structure at the
lexical level and evaluate in-depth how good an incremental tagger needs to be
in order to achieve better-than-baseline performance with an end-to-end
semantic-bootstrapping language model. We envision such a system as consisting
of a (pretrained) sequential-neural component and a hierarchical-symbolic
component working together to generate text with low surprisal and high
linguistic interpretability. We find that (a) dimensionality of the semantic
vector representation can be dramatically reduced without losing its main
advantages and (b) lower bounds on prediction quality cannot be established via
a single score alone, but need to take the distributions of signal and noise
into account.
Comment: To appear at *SEM 2023, Toronto.
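The "concise binary vector representation of semantic structure at the lexical level" can be pictured as a fixed feature inventory with one bit per lexical-semantic property. The sketch below is illustrative only; the feature names are hypothetical stand-ins, not the paper's actual inventory:

```python
# One bit per feature in a fixed inventory; the feature names here are
# hypothetical stand-ins, not the paper's actual encoding.
FEATURES = ["is_predicate", "is_argument", "scopes_left", "scopes_right",
            "opens_clause", "closes_clause"]

def encode(active_features):
    """Map a token's set of active feature names to a fixed-width 0/1 vector."""
    return [1 if f in active_features else 0 for f in FEATURES]

print(encode({"is_predicate", "scopes_left"}))  # [1, 0, 1, 0, 0, 0]
```

Dimensionality reduction then amounts to pruning bits from the inventory while preserving the distinctions the language model actually uses.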
Unseen Word Representation by Aligning Heterogeneous Lexical Semantic Spaces.
Word embedding techniques heavily rely on the abundance of training data for individual words. Given the Zipfian distribution of words in natural language texts, a large number of words do not usually appear frequently, or at all, in the training data. In this paper we put forward a technique that exploits the knowledge encoded in lexical resources, such as WordNet, to induce embeddings for unseen words. Our approach adapts graph embedding and cross-lingual vector space transformation techniques in order to merge lexical knowledge encoded in ontologies with that derived from corpus statistics. We show that the approach can provide consistent performance improvements across multiple evaluation benchmarks: in vitro, on multiple rare word similarity datasets, and in vivo, in two downstream text classification tasks.
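A standard way to merge a graph-derived space with a corpus-derived one is to learn an orthogonal map between them from words present in both. The sketch below uses orthogonal Procrustes on synthetic data; it illustrates the general alignment technique, not the paper's exact transformation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: graph-derived embeddings (e.g. induced from WordNet)
# and corpus-derived embeddings for a seed vocabulary of 100 words present
# in both spaces, dimension 50. For this demo the corpus space is an exact
# rotation of the graph space, so the learned map can be checked.
d = 50
seed_graph = rng.normal(size=(100, d))
true_rotation = np.linalg.qr(rng.normal(size=(d, d)))[0]
seed_corpus = seed_graph @ true_rotation

# Orthogonal Procrustes: W minimising ||seed_graph @ W - seed_corpus||_F
# over orthogonal matrices, via the SVD of the cross-covariance matrix.
u, _, vt = np.linalg.svd(seed_graph.T @ seed_corpus)
W = u @ vt

# Induce a corpus-space embedding for a word unseen in the corpus but
# present in the lexical resource, from its graph embedding alone.
unseen_graph = rng.normal(size=d)
induced = unseen_graph @ W
```

The same recipe underlies much cross-lingual embedding work: once a linear map is learned from a seed dictionary, any vector in the source space can be projected into the target space.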
Typilus: Neural Type Hints
Type inference over partial contexts in dynamically typed languages is
challenging. In this work, we present a graph neural network model that
predicts types by probabilistically reasoning over a program's structure,
names, and patterns. The network uses deep similarity learning to learn a
TypeSpace -- a continuous relaxation of the discrete space of types -- and how
to embed the type properties of a symbol (i.e. identifier) into it.
Importantly, our model can employ one-shot learning to predict an open
vocabulary of types, including rare and user-defined ones. We realise our
approach in Typilus, a tool for Python that combines the TypeSpace with an
optional type checker. We show that Typilus accurately predicts types. Typilus
confidently predicts types for 70% of all annotatable symbols; when it predicts
a type, that type optionally type checks 95% of the time. Typilus can also find
incorrect type annotations; two important and popular open source libraries,
fairseq and allennlp, accepted our pull requests that fixed the annotation
errors Typilus discovered.
Comment: Accepted to PLDI 2020.
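The open-vocabulary prediction rule can be sketched as nearest-neighbour search over type centroids in the learned TypeSpace. Everything below (dimension, type names, random embeddings) is illustrative, standing in for what the paper's graph-neural-network encoder would produce:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8  # toy embedding dimension

# Centroids for a few known types in the TypeSpace; random vectors stand
# in for embeddings a trained graph neural network would produce.
type_space = {
    "int": rng.normal(size=d),
    "str": rng.normal(size=d),
    "List[int]": rng.normal(size=d),
}

def predict_type(symbol_embedding, space):
    """Predict the type whose centroid is nearest by cosine similarity."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(space, key=lambda t: cos(symbol_embedding, space[t]))

# One-shot open vocabulary: a single embedded example of a new,
# user-defined type becomes a centroid, making it immediately predictable.
type_space["MyConfig"] = rng.normal(size=d)

# A symbol embedded close to the "int" centroid is predicted as int.
query = type_space["int"] + 0.05 * rng.normal(size=d)
print(predict_type(query, type_space))
```

Because prediction is a lookup in a continuous space rather than a softmax over a closed label set, adding a rare or user-defined type costs one vector, not retraining.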
Semi-supervised lexical acquisition for wide-coverage parsing
State-of-the-art parsers suffer from incomplete lexicons, as evidenced by the fact
that they all contain built-in methods for dealing with out-of-lexicon items at parse
time. Since new labelled data is expensive to produce and no amount of it will conquer
the long tail, we attempt to address this problem by leveraging the enormous amount of
raw text available for free, and expanding the lexicon offline, with a semi-supervised
word learner. We accomplish this with a method similar to self-training, where a fully
trained parser is used to generate new parses with which the next generation of parser
is trained.
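Schematically, that self-training loop looks like the sketch below. The ToyParser is a deliberately simple stand-in (it tags words from a learned lexicon), not the thesis's StatCCG parser:

```python
class ToyParser:
    """Stand-in parser: 'parsing' is tagging words from a learned lexicon."""

    def __init__(self):
        self.lexicon = {}

    def train(self, labelled):
        for words, cats in labelled:
            self.lexicon.update(zip(words, cats))

    def parse(self, words):
        cats = tuple(self.lexicon.get(w) for w in words)
        confidence = sum(c is not None for c in cats) / len(words)
        return cats, confidence

def self_train(parser, labelled, raw_text, generations=2, threshold=1.0):
    """Retrain, then fold the parser's confident analyses of raw text back in."""
    for _ in range(generations):
        parser.train(labelled)
        for words in raw_text:
            cats, confidence = parser.parse(words)
            if confidence >= threshold and (words, cats) not in labelled:
                labelled.append((words, cats))
    return parser

labelled = [(("dogs", "bark"), ("NP", "S\\NP")),
            (("cats", "sleep"), ("NP", "S\\NP"))]
# "cats bark" recombines known words, so its parse is absorbed;
# "birds sing" contains unknown words and is left alone.
parser = self_train(ToyParser(), labelled,
                    raw_text=[("cats", "bark"), ("birds", "sing")])
```

The toy also exposes the weakness the thesis targets: a self-trained parser only confidently parses sentences made of words it already knows, which is why a dedicated word learner is needed for the long tail.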
This thesis introduces Chart Inference (CI), a two-phase word-learning method
for Combinatory Categorial Grammar (CCG), operating on the level of the partial
parse as produced by a trained parser. CI uses the parsing model and lexicon to identify
the CCG category type for one unknown word in a context of known words by inferring
the type of the sentence using a model of end punctuation, then traversing the chart
from the top down, filling in each empty cell as a function of its mother and its sister.
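The cell-filling step can be illustrated by inverting CCG function application: given the mother's category and the known sister's category, the unknown daughter's category is determined. This toy function is hypothetical (it ignores composition, type-raising, and the parsing model), but shows the idea:

```python
def _wrap(cat):
    """Parenthesise complex categories for unambiguous notation."""
    return f"({cat})" if ("\\" in cat or "/" in cat) else cat

def infer_category(mother, sister, sister_side):
    """Invert CCG function application for an unknown daughter.

    If the known sister stands to the left of the unknown word, the unknown
    word must be a backward-looking functor mother\\sister; if to the right,
    a forward-looking functor mother/sister.
    """
    if sister_side == "left":
        return f"{_wrap(mother)}\\{_wrap(sister)}"
    return f"{_wrap(mother)}/{_wrap(sister)}"

# "dogs bark.": the sentence spans S, "dogs" is a known NP on the left,
# so the unknown "bark" is inferred to be S\NP (an intransitive verb).
print(infer_category("S", "NP", "left"))        # S\NP
# Under a mother S\NP with a known NP sister on the right, the unknown
# word gets the transitive-verb category (S\NP)/NP.
print(infer_category("S\\NP", "NP", "right"))   # (S\NP)/NP
```

Applied top-down from the inferred sentence category, each such step narrows the unknown word's category until its chart cell is reached.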
We first specify the CI algorithm, and then compare it to two baseline word-learning
systems over a battery of learning tasks. CI is shown to outperform the
baselines in every task, and to function in a number of applications, including grammar
acquisition and domain adaptation. This method performs consistently better than
self-training, and improves upon the standard POS-backoff strategy employed by the
baseline StatCCG parser by adding new entries to the lexicon.
The first learning task establishes lexical convergence over a toy corpus, showing
that CI’s ability to accurately model a target lexicon is more robust to initial conditions
than either of the baseline methods. We then introduce a novel natural language corpus
based on children’s educational materials, which is fully annotated with CCG derivations.
We use this corpus as a testbed to establish that CI is capable in principle of
recovering the whole range of category types necessary for a wide-coverage lexicon.
The complexity of the learning task is then increased, using the CCGbank corpus,
a CCG version of the Penn Treebank, and showing that CI improves as its initial seed corpus
is increased. The next experiment uses CCGbank as the seed and attempts to recover
missing question-type categories in the TREC question answering corpus. The final
task extends the coverage of the CCGbank-trained parser by running CI over the raw
text of the Gigaword corpus. Where appropriate, a fine-grained error analysis is also
undertaken to supplement the quantitative evaluation of the parser performance with
deeper reasoning as to the linguistic properties of the lexicon and parsing model.
- …