146 research outputs found

Grammar induction for mildly context sensitive languages using variational Bayesian inference

Author: Bergen Leon
Bruno Chris
Harasim Daniel
O'Donnell Timothy J.
Portelance Eva
Publication date: 06/08/2014

The following technical report presents a formal approach to probabilistic minimalist grammar induction. We describe a formalization of a minimalist grammar. Based on this grammar, we define a generative model for minimalist derivations. We then present a generalized algorithm for applying variational Bayesian inference to lexicalized grammars for mildly context-sensitive languages, which in this paper is applied to the previously defined minimalist grammar.

arXiv.org e-Print Archive

Dryad Digital Repository (Duke University)

Learning Structured Preferences

Author: Bergen Leon
Evans Owain Rhys
Tenenbaum Joshua B
Publication venue: Cognitive Science Society
Publication date: 08/12/2017

Learning the preferences of other people is crucial for predicting future behavior. Both children and adults make inferences about others’ preferences from sparse data and in situations where the preferences have complex internal structures. We present a computational model of learning structured preferences which integrates Bayesian inference and utility-based models of preference from economics. We experimentally test this model with adult participants, and compare the model to alternative heuristic models.

DSpace@MIT

Nonliteral understanding of number words

Author: Bergen Leon
Goodman Noah D.
Kao Justine T.
Wu Jean Y.
Publication venue: Proceedings of the National Academy of Sciences
Publication date: 01/04/2014

One of the most puzzling and important facts about communication is that people do not always mean what they say; speakers often use imprecise, exaggerated, or otherwise literally false descriptions to communicate experiences and attitudes. Here, we focus on the nonliteral interpretation of number words, in particular hyperbole (interpreting unlikely numbers as exaggerated and conveying affect) and pragmatic halo (interpreting round numbers imprecisely). We provide a computational model of number interpretation as social inference regarding the communicative goal, meaning, and affective subtext of an utterance. We show that our model predicts humans’ interpretation of number words with high accuracy. Our model is the first to our knowledge to incorporate principles of communication and empirically measured background knowledge to quantitatively predict hyperbolic and pragmatic halo effects in number interpretation. This modeling framework provides a unified approach to nonliteral language understanding more generally. National Science Foundation (U.S.). Graduate Research Fellowship Progra

DSpace@MIT

Crossref

PubMed Central

DORIS-MAE: Scientific Document Retrieval using Multi-level Aspect-based Queries

Author: Bergen Leon
Naidu Prudhviraj
Paturi Ramamohan
Wang Jianyou
Wang Kaicheng
Wang Xiaoyue
Publication date: 28/10/2023

In scientific research, the ability to effectively retrieve relevant documents based on complex, multifaceted queries is critical. Existing evaluation datasets for this task are limited, primarily due to the high cost and effort required to annotate resources that effectively represent complex queries. To address this, we propose a novel task, Scientific DOcument Retrieval using Multi-level Aspect-based quEries (DORIS-MAE), which is designed to handle the complex nature of user queries in scientific research. We developed a benchmark dataset within the field of computer science, consisting of 100 human-authored complex query cases. For each complex query, we assembled a collection of 100 relevant documents and produced annotated relevance scores for ranking them. Recognizing the significant labor of expert annotation, we also introduce Anno-GPT, a scalable framework for validating the performance of Large Language Models (LLMs) on expert-level dataset annotation tasks. LLM annotation of the DORIS-MAE dataset resulted in a 500x reduction in cost, without compromising quality. Furthermore, due to the multi-tiered structure of these complex queries, the DORIS-MAE dataset can be extended to over 4,000 sub-query test cases without requiring additional annotation. We evaluated 17 recent retrieval methods on DORIS-MAE, observing notable performance drops compared to traditional datasets. This highlights the need for better approaches to handle complex, multifaceted queries in scientific research. Our dataset and codebase are available at https://github.com/Real-Doris-Mae/Doris-Mae-Dataset. Comment: To appear in NeurIPS 2023 Datasets and Benchmarks Trac

arXiv.org e-Print Archive

Simplicity and learning to distinguish arguments from modifiers

Author: Edward Gibson
Leon Bergen
Timothy J. O'Donnell
Publication venue: Institute of Computer Science, Polish Academy of Sciences
Publication date: 01/04/2023

We present a learnability analysis of the argument-modifier distinction, asking whether there is information in the distribution of English constituents that could allow learners to identify which constituents are arguments and which are modifiers. We first develop a general description of some of the ways in which arguments and modifiers differ in distribution. We then identify two models from the literature that can capture these differences, which we call the argument-only model and the argument-modifier model. We employ these models using a common learning framework based on two simplicity biases which trade off against one another. The first bias favors a small lexicon with highly reusable lexical items, and the second, opposing, bias favors simple derivations of individual forms – those using small numbers of lexical items. Our first empirical study shows that the argument-modifier model is able to recover the argument-modifier status of many individual constituents when evaluated against a gold standard. This provides evidence in favor of our general account of the distributional differences between arguments and modifiers. It also suggests a kind of lower bound on the amount of information that a suitably equipped learner could use to identify which phrases are arguments or modifiers. We then present a series of analyses investigating how and why the argument-modifier model is able to recover the argument-modifier status of some constituents. In particular, we show that the argument-modifier model is able to provide a simpler description of the input corpus than the argument-only model, both in terms of lexicon size and in terms of the complexity of individual derivations. Intuitively, the argument-modifier model is able to do this because it is able to ignore spurious modifier structure when learning the lexicon. These analyses further support our general account of the differences between arguments and modifiers, as well as our simplicity-based approach to learning.

Directory of Open Access Journals

How efficiency shapes human language

Author: Bergen Leon
Dautriche Isabelle
Futrell Richard
Gibson Edward
Levy Roger
Mahowald Kyle
Piantadosi Steven T.
Publication venue: Elsevier BV
Publication date: 18/04/2019

Crossref

Edinburgh Research Explorer

Color naming across languages reflects color use

Author: Bergen Leon
Conway Bevil R.
Futrell Richard Landy Jones
Gibson Edward A
Gibson Mitchell
Jara-Ettinger Julian
Mahowald Kyle Adam
Piantadosi Steven T.
Ratnasingam Sivalogeswaran
Publication venue: Proceedings of the National Academy of Sciences
Publication date: 18/09/2017

What determines how languages categorize colors? We analyzed results of the World Color Survey (WCS) of 110 languages to show that despite gross differences across languages, communication of chromatic chips is always better for warm colors (yellows/reds) than cool colors (blues/greens). We present an analysis of color statistics in a large databank of natural images curated by human observers for salient objects and show that objects tend to have warm rather than cool colors. These results suggest that the cross-linguistic similarity in color-naming efficiency reflects colors of universal usefulness and provide an account of a principle (color use) that governs how color categories come about. We show that potential methodological issues with the WCS do not corrupt information-theoretic analyses, by collecting original data using two extreme versions of the color-naming task, in three groups: the Tsimane’, a remote Amazonian hunter-gatherer isolate; Bolivian-Spanish speakers; and English speakers. These data also enabled us to test another prediction of the color-usefulness hypothesis: that differences in color categorization between languages are caused by differences in overall usefulness of color to a culture. In support, we found that color naming among Tsimane’ had relatively low communicative efficiency, and the Tsimane’ were less likely to use color terms when describing familiar objects. Color-naming among Tsimane’ was boosted when naming artificially colored objects compared with natural objects, suggesting that industrialization promotes color usefulness. National Science Foundation (U.S.) (Award 1534318

DSpace@MIT

Crossref

Accelerating slip rates on the Puente Hills blind thrust fault system beneath metropolitan Los Angeles, California, USA

Author: Daniel J. Ponti
Edward J. Rhodes
Eric Morrow
James F. Dolan
John H. Shaw
Kristian J. Bergen
Lewis A. Owen
Lorraine A. Leon
Madhav K. Murari
Thomas L. Pratt
Wendy Barrera
Publication venue: Geological Society of America
Publication date: 09/01/2017

Slip rates represent the average displacement across a fault over time and are essential to estimating earthquake recurrence for probabilistic seismic hazard assessments. We demonstrate that the slip rate on the western segment of the Puente Hills blind thrust fault system, which is beneath downtown Los Angeles, California (USA), has accelerated from ∼0.22 mm/yr in the late Pleistocene to ∼1.33 mm/yr in the Holocene. Our analysis is based on syntectonic strata derived from the Los Angeles River, which has continuously buried a fold scarp above the blind thrust. Slip on the fault beneath our field site began during the late-middle Pleistocene and progressively increased into the Holocene. This increase in rate implies that the magnitudes and/or the frequency of earthquakes on this fault segment have increased over time. This challenges the characteristic earthquake model and presents an evolving and potentially increasing seismic hazard to metropolitan Los Angeles.

Crossref

White Rose Research Online