Porting a lexicalized-grammar parser to the biomedical domain
This paper introduces a state-of-the-art, linguistically motivated statistical parser to the biomedical text mining community, and proposes a method of adapting it to the biomedical domain requiring only limited resources for data annotation. The parser was originally developed using the Penn Treebank and is therefore tuned to newspaper text. Our approach takes advantage of a lexicalized grammar formalism, Combinatory Categorial Grammar (CCG), to train the parser at a lower level of representation than full syntactic derivations. The CCG parser uses three levels of representation: a first level consisting of part-of-speech (POS) tags; a second level consisting of more fine-grained CCG lexical categories; and a third, hierarchical level consisting of CCG derivations. We find that simply retraining the POS tagger on biomedical data leads to a large improvement in parsing performance, and that using annotated data at the intermediate lexical category level of representation improves parsing accuracy further. We describe the procedure involved in evaluating the parser, and obtain accuracies for biomedical data in the same range as those reported for newspaper text, and higher than those previously reported for the biomedical resource on which we evaluate. Our conclusion is that porting newspaper parsers to the biomedical domain, at least for parsers which use lexicalized grammars, may not be as difficult as first thought.
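As a rough illustration of the three levels of representation described above (not the authors' code), the sketch below shows the kind of per-token annotation involved; the example sentence, class name, and category strings are assumptions made for illustration.

```python
# A minimal sketch (not the authors' code) of per-token annotation at the two
# lower levels of representation; the example and category strings are ours.
from dataclasses import dataclass

@dataclass
class TokenAnnotation:
    word: str      # surface token
    pos: str       # level 1: part-of-speech tag (cheapest to annotate)
    supertag: str  # level 2: CCG lexical category (more fine-grained)

# Level 3, the full CCG derivation, is the most expensive level to annotate,
# which is why adaptation focuses on retraining the lower levels in-domain.
sentence = [
    TokenAnnotation("MAPK", "NN", "N"),
    TokenAnnotation("phosphorylates", "VBZ", r"(S[dcl]\NP)/NP"),
    TokenAnnotation("ERK", "NN", "N"),
]

# Domain adaptation then means retraining the POS tagger (and optionally the
# supertagger) on biomedical annotations like these, while reusing the
# newspaper-trained model for full derivations.
```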
Using Sentence Plausibility to Learn the Semantics of Transitive Verbs
The functional approach to compositional distributional semantics considers transitive verbs to be linear maps that transform the distributional vectors representing nouns into a vector representing a sentence. We conduct an initial investigation that uses a matrix consisting of the parameters of a logistic regression classifier trained on a plausibility task as a transitive verb function. We compare our method to a commonly used corpus-based method for constructing a verb matrix and find that the plausibility training may be more effective for disambiguation tasks.
Comment: Full updated paper for NIPS learning semantics workshop, with some minor errata fixed.
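The sketch below illustrates one way to read the method described in the abstract: train a logistic regression classifier on subject-object plausibility and reuse its weights as the verb's matrix. The feature construction (flattened outer product of noun vectors), the dimensionality, and the toy data are assumptions for illustration, not the paper's exact setup.

```python
# A hedged sketch: treat the weights of a logistic regression classifier,
# trained to judge subject-verb-object plausibility, as the verb's matrix in a
# compositional model. Feature choice and dimensions are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

d = 50                                  # assumed noun-vector dimensionality
rng = np.random.default_rng(0)

# Toy training data for one verb: (subject, object) noun-vector pairs with a
# binary plausibility label (1 = plausible SVO triple, 0 = implausible).
subjects = rng.normal(size=(200, d))
objects = rng.normal(size=(200, d))
labels = rng.integers(0, 2, size=200)

# Features: flattened outer product subj (x) obj, so the learned weights form
# a d x d matrix that acts bilinearly on the two noun vectors.
features = np.einsum("ni,nj->nij", subjects, objects).reshape(200, d * d)

clf = LogisticRegression(max_iter=1000).fit(features, labels)
verb_matrix = clf.coef_.reshape(d, d)   # the "transitive verb function"

# Scoring a new noun pair with the learned bilinear map (higher = more plausible).
subj, obj = rng.normal(size=d), rng.normal(size=d)
score = subj @ verb_matrix @ obj
```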
A Natural Bias for Language Generation Models
After just a few hundred training updates, a standard probabilistic model for language generation has likely not yet learnt many semantic or syntactic rules of natural language, making it difficult to estimate the probability distribution over next tokens. Yet around this point, these models have identified a simple, loss-minimising behaviour: to output the unigram distribution of the target training corpus. The use of such a heuristic raises the question: Can we initialise our models with this behaviour and save precious compute resources and model capacity? Here we show that we can effectively endow standard neural language generation models with a separate module that reflects unigram frequency statistics as prior knowledge, simply by initialising the bias term in a model's final linear layer with the log-unigram distribution. We use neural machine translation as a test bed for this simple technique and observe that it: (i) improves learning efficiency; (ii) achieves better overall performance; and perhaps most importantly (iii) appears to disentangle strong frequency effects by encouraging the model to specialise in non-frequency-related aspects of language.
Comment: Main conference paper at ACL 2023.
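A minimal PyTorch sketch of the initialisation described above; the vocabulary size, hidden size, and toy counts are illustrative assumptions, not details from the paper.

```python
# Initialise the bias of the final projection layer with the log-unigram
# distribution of the target training corpus (sizes and counts are toy values).
import torch
import torch.nn as nn

vocab_size, hidden_size = 32000, 512

# Token counts over the target-side training corpus (random stand-ins here).
counts = torch.randint(1, 1000, (vocab_size,)).float()
log_unigram = torch.log(counts / counts.sum())          # log p(token)

# Final linear layer mapping decoder states to vocabulary logits.
output_projection = nn.Linear(hidden_size, vocab_size)
with torch.no_grad():
    output_projection.bias.copy_(log_unigram)

# Before any training, a zero hidden state now yields logits equal to the bias,
# i.e. the corpus unigram distribution after the softmax: exactly the heuristic
# the abstract says models otherwise spend their early updates learning.
```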
D6.2 Integrated Final Version of the Components for Lexical Acquisition
The PANACEA project has addressed one of the most critical bottlenecks that threaten the development of technologies to support multilingualism in Europe, and to process the huge quantity of multilingual data produced annually. Any attempt at automated language processing, particularly Machine Translation (MT), depends on the availability of language-specific resources. Such Language Resources (LRs) contain information about the language's lexicon, i.e. the words of the language and the characteristics of their use. In Natural Language Processing (NLP), LRs contribute information about the syntactic and semantic behaviour of words - i.e. their grammar and their meaning - which informs downstream applications such as MT.
To date, many LRs have been generated by hand, requiring significant manual labour from linguistic experts. Proceeding manually, however, it is impossible to supply the LRs needed by MT developers for every possible pair of European languages, textual domain, and genre. Moreover, an LR for a given language can never be considered complete or final, because natural language continually changes, especially spurred on by the emergence of new knowledge domains and new technologies. PANACEA has addressed this challenge by building a factory of LRs that progressively automates the stages involved in the acquisition, production, updating and maintenance of the LRs required by MT systems. The existence of such a factory will significantly cut down the cost, time and human effort required to build LRs.
WP6 has addressed the lexical acquisition component of the LR factory, that is, the techniques for automated extraction of key lexical information from texts, and the automatic collation of lexical information into LRs in a standardized format. The goal of WP6 has been to take existing techniques capable of acquiring syntactic and semantic information from corpus data, improve upon them, adapt and apply them to multiple languages, and turn them into powerful and flexible techniques capable of supporting massive applications.
One focus for improving the scalability and portability of lexical acquisition techniques has been to extend existing techniques with more powerful, less "supervised" methods. In NLP, the amount of supervision refers to the amount of manual annotation which must be applied to a text corpus before machine learning or other techniques are applied to the data to compile a lexicon. More manual annotation means more accurate training data, and thus a more accurate LR. However, given that it is impractical from a cost and time perspective to manually annotate the vast amounts of data required for multilingual MT across domains, it is important to develop techniques which can learn from corpora with less supervision. Less supervised methods are capable of supporting both large-scale acquisition and efficient domain adaptation, even in domains where data is scarce.
Another focus of lexical acquisition in PANACEA has been the need for LR users to tune the accuracy level of LRs. Some applications may require increased precision, where the application demands a high degree of confidence in the lexical information used. At other times a greater level of coverage may be required, with information about more words at the expense of some degree of accuracy. Lexical acquisition in PANACEA has therefore investigated confidence thresholds for lexical acquisition, to ensure that the ultimate users of LRs can generate lexical data from the PANACEA factory at the desired level of accuracy.
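As a hedged illustration of the confidence-threshold idea (the entry format and scores below are invented, not PANACEA's actual data model), a higher threshold yields a smaller, higher-precision lexicon, while a lower threshold trades some accuracy for coverage:

```python
# Filtering automatically acquired lexical entries by acquisition confidence.
# The (lemma, subcategorisation frame, confidence) triples are toy examples.
acquired = [
    ("give", "NP-NP-NP", 0.92),
    ("give", "NP-PP", 0.81),
    ("sleep", "NP-NP", 0.35),   # likely a noisy acquisition
    ("sleep", "NP", 0.97),
]

def build_lexicon(entries, threshold):
    """Keep only entries whose acquisition confidence meets the threshold."""
    return [(lemma, frame) for lemma, frame, conf in entries if conf >= threshold]

high_precision = build_lexicon(acquired, threshold=0.9)   # smaller, cleaner LR
high_coverage = build_lexicon(acquired, threshold=0.3)    # larger, noisier LR
```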
Words, concepts, and the geometry of analogy
In Proceedings SLPCS 2016, arXiv:1608.01018. This paper presents a geometric approach to the problem of modelling the relationship between words and concepts, focusing in particular on analogical phenomena in language and cognition. Grounded in recent theories regarding geometric conceptual spaces, we begin with an analysis of existing static distributional semantic models and move on to an exploration of a dynamic approach to using high dimensional spaces of word meaning to project subspaces where analogies can potentially be solved in an online, contextualised way. The crucial element of this analysis is the positioning of statistics in a geometric environment replete with opportunities for interpretation.
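The sketch below contrasts, in simplified form, the static vector-offset view of analogy with a toy dynamic variant that projects word vectors onto a context-dependent subspace; the dimension-selection rule and the random stand-in vectors are illustrative assumptions rather than the authors' method.

```python
# Static vs. toy "dynamic" analogy solving over word vectors (stand-in data).
import numpy as np

rng = np.random.default_rng(1)
vocab = ["king", "queen", "man", "woman", "throne"]
vectors = {w: rng.normal(size=100) for w in vocab}   # stand-in word vectors

def offset_analogy(a, b, c):
    """Static view: answer b - a + c by cosine nearest neighbour."""
    target = vectors[b] - vectors[a] + vectors[c]
    sims = {w: np.dot(v, target) / (np.linalg.norm(v) * np.linalg.norm(target))
            for w, v in vectors.items() if w not in {a, b, c}}
    return max(sims, key=sims.get)

def contextual_subspace(context_words, k=10):
    """Toy dynamic view: project onto the k dimensions most salient to the
    context words, then reason inside that lower-dimensional subspace."""
    salience = sum(vectors[w] for w in context_words)
    dims = np.argsort(-np.abs(salience))[:k]
    return {w: v[dims] for w, v in vectors.items()}

print(offset_analogy("man", "king", "woman"))   # with real embeddings: "queen"
subspace = contextual_subspace(["king", "woman"], k=20)
```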