Syntax and Semantics Meet in the "Middle": Probing the Syntax-Semantics Interface of LMs Through Agentivity
Recent advances in large language models have prompted researchers to examine
their abilities across a variety of linguistic tasks, but little has been done
to investigate how models handle the interactions in meaning across words and
larger syntactic forms -- i.e. phenomena at the intersection of syntax and
semantics. We present the semantic notion of agentivity as a case study for
probing such interactions. We created a novel evaluation dataset by utilizing
the unique linguistic properties of a subset of optionally transitive English
verbs. This dataset was used to prompt varying sizes of three model classes to
see if they are sensitive to agentivity at the lexical level, and if they can
appropriately employ these word-level priors given a specific syntactic
context. Overall, GPT-3 text-davinci-003 performs extremely well across all
experiments, outperforming all other models tested by far. In fact, the results
are even better correlated with human judgements than both syntactic and
semantic corpus statistics. This suggests that LMs may potentially serve as
more useful tools for linguistic annotation, theory testing, and discovery than
select corpora for certain tasks.
CMULAB: An Open-Source Framework for Training and Deployment of Natural Language Processing Models
Effectively using Natural Language Processing (NLP) tools in under-resourced
languages requires a thorough understanding of the language itself, familiarity
with the latest models and training methodologies, and technical expertise to
deploy these models. This could present a significant obstacle for language
community members and linguists to use NLP tools. This paper introduces the CMU
Linguistic Annotation Backend, an open-source framework that simplifies model
deployment and continuous human-in-the-loop fine-tuning of NLP models. CMULAB
enables users to leverage the power of multilingual models to quickly adapt and
extend existing tools for speech recognition, OCR, translation, and syntactic
analysis to new languages, even with limited training data. We describe various
tools and APIs that are currently available and how developers can easily add
new models/functionality to the framework. Code is available at
https://github.com/neulab/cmulab along with a live demo at https://cmulab.dev.
Wav2Gloss: Generating Interlinear Glossed Text from Speech
Thousands of the world's languages are in danger of extinction--a tremendous
threat to cultural identities and human language diversity. Interlinear Glossed
Text (IGT) is a form of linguistic annotation that can support documentation
and resource creation for these languages' communities. IGT typically consists
of (1) transcriptions, (2) morphological segmentation, (3) glosses, and (4)
free translations to a majority language. We propose Wav2Gloss: a task in which
these four annotation components are extracted automatically from speech, and
introduce the first dataset to this end, Fieldwork: a corpus of speech with all
these annotations, derived from the work of field linguists, covering 37
languages, with standard formatting, and train/dev/test splits. We provide
various baselines to lay the groundwork for future research on IGT generation
from speech, such as end-to-end versus cascaded, monolingual versus
multilingual, and single-task versus multi-task approaches.
Comment: ACL 2024 camera-ready version