Similarity-Based Models of Word Cooccurrence Probabilities
In many applications of natural language processing (NLP) it is necessary to
determine the likelihood of a given word combination. For example, a speech
recognizer may need to determine which of the two word combinations ``eat a
peach'' and ``eat a beach'' is more likely. Statistical NLP methods determine
the likelihood of a word combination from its frequency in a training corpus.
However, the nature of language is such that many word combinations are
infrequent and do not occur in any given corpus. In this work we propose a
method for estimating the probability of such previously unseen word
combinations using available information on ``most similar'' words.
We describe probabilistic word association models based on distributional
word similarity, and apply them to two tasks, language modeling and pseudo-word
disambiguation. In the language modeling task, a similarity-based model is used
to improve probability estimates for unseen bigrams in a back-off language
model. The similarity-based method yields a 20% perplexity improvement in the
prediction of unseen bigrams and statistically significant reductions in
speech-recognition error.
We also compare four similarity-based estimation methods against back-off and
maximum-likelihood estimation methods on a pseudo-word sense disambiguation
task in which we controlled for both unigram and bigram frequency to avoid
giving too much weight to easy-to-disambiguate high-frequency configurations.
The similarity-based methods perform up to 40% better on this particular task.
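To make the estimation idea concrete, here is a minimal sketch of a similarity-based estimate for an unseen bigram: the conditional probability P(w2 | w1) is approximated by a similarity-weighted average of the observed conditionals P(w2 | w1') over words w1' distributionally similar to w1. The neighbour sets, similarity weights, and conditional probabilities below are hypothetical toy values, not the paper's trained models.

```python
# Sketch of similarity-based estimation for unseen bigrams (toy data).

def similarity_estimate(w1, w2, neighbours, sim, cond_prob):
    """P_sim(w2 | w1): similarity-weighted average of P(w2 | w1')
    over words w1' distributionally similar to w1."""
    norm = sum(sim[(w1, w1p)] for w1p in neighbours[w1])
    return sum(
        sim[(w1, w1p)] / norm * cond_prob.get((w1p, w2), 0.0)
        for w1p in neighbours[w1]
    )

# Hypothetical example: the bigram ("peach", "pie") was never observed,
# but words similar to "peach" were followed by "pie" in training.
neighbours = {"peach": ["apple", "pear"]}
sim = {("peach", "apple"): 0.8, ("peach", "pear"): 0.6}
cond_prob = {("apple", "pie"): 0.3, ("pear", "pie"): 0.2}

print(similarity_estimate("peach", "pie", neighbours, sim, cond_prob))
# -> a nonzero estimate for the unseen bigram, instead of zero probability
```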
A comparison of homonym meaning frequency estimates derived from movie and television subtitles, free association, and explicit ratings
Most words are ambiguous, with interpretation dependent on context. Advancing theories of ambiguity resolution is important for any general theory of language processing, and for resolving inconsistencies in observed ambiguity effects across experimental tasks. Focusing on homonyms (words such as bank with unrelated meanings EDGE OF A RIVER vs. FINANCIAL INSTITUTION), the present work advances theories and methods for estimating the relative frequency of their meanings, a factor that shapes observed ambiguity effects. We develop a new method for estimating meaning frequency based on the meaning of a homonym evoked in lines of movie and television subtitles according to human raters. We also replicate and extend a measure of meaning frequency derived from the classification of free associates. We evaluate the internal consistency of these measures, compare them to published estimates based on explicit ratings of each meaning's frequency, and compare each set of norms in predicting performance in lexical and semantic decision mega-studies. All measures have high internal consistency and show agreement, but each is also associated with unique variance, which may be explained by integrating cognitive theories of memory with the demands of different experimental methodologies. To derive frequency estimates, we collected manual classifications of 533 homonyms over 50,000 lines of subtitles, and of 357 homonyms across over 5000 homonym-associate pairs. This database, publicly available at www.blairarmstrong.net/homonymnorms/, constitutes a novel resource for computational cognitive modeling and computational linguistics, and we offer suggestions around good practices for its use in training and testing models on labeled data
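As an illustration of how per-line classifications turn into meaning-frequency estimates, the sketch below counts rater labels per homonym and normalizes them into relative frequencies. The classification data here are hypothetical toy entries; the actual norms are available at www.blairarmstrong.net/homonymnorms/.

```python
from collections import Counter

# Hypothetical (homonym, meaning label) classifications from raters.
classifications = [
    ("bank", "FINANCIAL INSTITUTION"),
    ("bank", "FINANCIAL INSTITUTION"),
    ("bank", "EDGE OF A RIVER"),
]

def meaning_frequencies(classifications):
    """Relative frequency of each meaning, per homonym."""
    counts = {}
    for word, meaning in classifications:
        counts.setdefault(word, Counter())[meaning] += 1
    return {
        word: {m: n / sum(c.values()) for m, n in c.items()}
        for word, c in counts.items()
    }

print(meaning_frequencies(classifications))
# e.g. {'bank': {'FINANCIAL INSTITUTION': 0.67, 'EDGE OF A RIVER': 0.33}}
```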
A broad-coverage distributed connectionist model of visual word recognition
In this study we describe a distributed connectionist model of morphological processing, covering a realistically sized sample of the English language. The purpose of this model is to explore how effects of discrete, hierarchically structured morphological paradigms can arise as a result of the statistical sub-regularities in the mapping between word forms and word meanings. We present a model that learns to produce at its output a realistic semantic representation of a word, on presentation of a distributed representation of its orthography. After training, in three experiments, we compare the outputs of the model with the lexical decision latencies for large sets of English nouns and verbs. We show that the model has developed detailed representations of morphological structure, giving rise to effects analogous to those observed in visual lexical decision experiments. In addition, we show how the association between word form and word meaning also gives rise to recently reported differences between regular and irregular verbs, even in their completely regular present-tense forms. We interpret these results as underlining the key importance for lexical processing of the statistical regularities in the mappings between form and meaning
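A minimal sketch of the general setup, assuming toy random vectors in place of real orthographic and semantic representations: a single-hidden-layer network is trained by gradient descent to map distributed orthographic inputs onto distributed semantic targets. This is not the reported model's architecture or training data, only an illustration of the form-to-meaning mapping such a model learns.

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, n_orth, n_hidden, n_sem = 100, 50, 40, 60
X = rng.standard_normal((n_words, n_orth))   # toy orthographic input patterns
Y = rng.standard_normal((n_words, n_sem))    # toy target semantic patterns

W1 = rng.standard_normal((n_orth, n_hidden)) * 0.1
W2 = rng.standard_normal((n_hidden, n_sem)) * 0.1

lr = 0.01
for epoch in range(200):
    H = np.tanh(X @ W1)        # hidden layer
    Y_hat = H @ W2             # predicted semantic representation
    err = Y_hat - Y
    # Backpropagate the mean-squared error through both weight matrices.
    W2 -= lr * H.T @ err / n_words
    W1 -= lr * X.T @ ((err @ W2.T) * (1 - H**2)) / n_words
```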
A plea for more interactions between psycholinguistics and natural language processing research
A new development in psycholinguistics is the use of regression analyses on tens of thousands of words, known as the megastudy approach. This development has led to the collection of processing times and subjective ratings (of age of acquisition, concreteness, valence, and arousal) for most of the existing words in English and Dutch. In addition, a crowdsourcing study in the Dutch language has resulted in information about how well 52,000 lemmas are known. This information is likely to be of interest to NLP researchers and computational linguists. At the same time, large-scale measures of word characteristics developed in the latter traditions are likely to be pivotal in bringing the megastudy approach to the next level
The unexplained nature of reading.
The effects of properties of words on their reading aloud response times (RTs) are a major source of evidence about the reading process. The precision with which such RTs could potentially be predicted by word properties is critical to evaluating our understanding of reading, but is often underestimated due to contamination from individual differences. We estimated this precision without such contamination individually for 4 people who each read 2,820 words 50 times each. These estimates were compared to the precision achieved by a 31-variable regression model that outperforms current cognitive models on variance-explained criteria. Most (around 2/3) of the meaningful (non-first-phoneme, non-noise) word-level variance remained unexplained by this model. Considerable empirical and theoretical-computational effort has been expended on this area of psychology, but the high level of systematic variance remaining unexplained raises doubts about contemporary accounts of the details of the mechanisms of reading at the level of the word. Future assessment of models can take advantage of the availability of our precise participant-level database
The dynamics of syntax acquisition: facilitation between syntactic structures
This paper sets out to show how facilitation between different clause structures operates over time in syntax acquisition. The phenomenon of facilitation within given structures has been widely documented, yet inter-structure facilitation has rarely been reported so far. Our findings are based on the naturalistic production corpora of six toddlers learning Hebrew as their first language. We use regression analysis, a method that has not been used to study this phenomenon. We find that the proportion of errors among the earliest produced clauses in a structure is related to the degree of acceleration of that structure's learning curve; that with the accretion of structures the proportion of errors among the first clauses of new structures declines, as does the acceleration of their learning curves. We interpret our findings as showing that learning new syntactic structures is made easier, or facilitated, by previously acquired ones
A Bayesian mixture model for term re-occurrence and burstiness
This paper proposes a model for term re-occurrence in a text collection based on the gaps between successive occurrences of a term. These gaps are modeled using a mixture of exponential distributions. Parameter estimation is based on a Bayesian framework that allows us to fit a flexible model. The model provides measures of a term's re-occurrence rate and within-document burstiness. The model works for all kinds of terms, be it a rare content word, a medium-frequency term, or a frequent function word. A measure is proposed to account for the term's importance based on its distribution pattern in the corpus
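To illustrate the modelling idea, the sketch below fits a two-component exponential mixture to the gaps between successive occurrences of a term, with one rate capturing bursty within-document re-occurrence and the other the background rate. The paper estimates parameters in a Bayesian framework; this toy version uses plain expectation-maximisation on simulated gap data.

```python
import numpy as np

rng = np.random.default_rng(1)
# Simulated gaps: short "bursty" gaps mixed with long background gaps.
gaps = np.concatenate([rng.exponential(5, 200),
                       rng.exponential(200, 100)])

pi, lam = 0.5, np.array([1 / 10.0, 1 / 100.0])   # mixing weight, rates
for _ in range(100):
    # E-step: responsibility of the "bursty" component for each gap.
    p1 = pi * lam[0] * np.exp(-lam[0] * gaps)
    p2 = (1 - pi) * lam[1] * np.exp(-lam[1] * gaps)
    r = p1 / (p1 + p2)
    # M-step: update mixing weight and the two exponential rates.
    pi = r.mean()
    lam[0] = r.sum() / (r * gaps).sum()
    lam[1] = (1 - r).sum() / ((1 - r) * gaps).sum()

print(pi, 1 / lam)   # recovered mixing weight and mean gap lengths
```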
From Frequency to Meaning: Vector Space Models of Semantics
Computers understand very little of the meaning of human language. This
profoundly limits our ability to give instructions to computers, the ability of
computers to explain their actions to us, and the ability of computers to
analyse and process text. Vector space models (VSMs) of semantics are beginning
to address these limits. This paper surveys the use of VSMs for semantic
processing of text. We organize the literature on VSMs according to the
structure of the matrix in a VSM. There are currently three broad classes of
VSMs, based on term-document, word-context, and pair-pattern matrices, yielding
three classes of applications. We survey a broad range of applications in these
three categories and we take a detailed look at a specific open source project
in each category. Our goal in this survey is to show the breadth of
applications of VSMs for semantics, to provide a new perspective on VSMs for
those who are already familiar with the area, and to provide pointers into the
literature for those who are less familiar with the field
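As a small illustration of the term-document flavour of VSM surveyed above, the sketch below builds a term-by-document count matrix from a few hypothetical documents and compares documents by cosine similarity; the word-context and pair-pattern variants differ mainly in what the rows and columns index.

```python
import numpy as np

# Hypothetical toy documents.
docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "stocks fell sharply"]

vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}

# Term-by-document count matrix.
M = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        M[index[w], j] += 1

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(M[:, 0], M[:, 1]))  # high: the two "sat on" sentences overlap
print(cosine(M[:, 0], M[:, 2]))  # zero: no shared terms
```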