Multilingual collocation extraction with a syntactic parser
An impressive amount of work has been devoted over the past few decades to collocation extraction. The state of the art shows that there is a sustained interest in the morphosyntactic preprocessing of texts in order to better identify candidate expressions; however, the treatment performed is, in most cases, limited (lemmatization, POS-tagging, or shallow parsing). This article presents a collocation extraction system based on the full parsing of source corpora, which supports four languages: English, French, Spanish, and Italian. The performance of the system is compared against that of the standard mobile-window method. The evaluation experiment investigates several levels of the significance lists, uses a fine-grained annotation schema, and covers all the languages supported. Consistent results were obtained for these languages: parsing, even if imperfect, leads to a significant improvement in the quality of results, in terms of collocational precision (between 16.4 and 29.7%, depending on the language; 20.1% overall), MWE precision (between 19.9 and 35.8%; 26.1% overall), and grammatical precision (between 47.3 and 67.4%; 55.6% overall). This positive result is of high importance, especially with a view to the subsequent integration of extraction results in other NLP applications.
Using linguistic data for English and Spanish verb-noun combination identification
We present a linguistic analysis of a set of English and Spanish verb+noun combinations (VNCs), and a method to use this information to improve VNC identification. Firstly, a sample of frequent VNCs is analysed in depth and tagged along lexico-semantic and morphosyntactic dimensions, obtaining satisfactory inter-annotator agreement scores. Then, a VNC identification experiment is undertaken, where the analysed linguistic data is combined with chunking information and syntactic dependencies. A comparison between the results of the experiment and the results obtained by a basic detection method shows that VNC identification can be greatly improved by using linguistic information, as a large number of additional occurrences are detected with high precision.
A MWE Acquisition and Lexicon Builder Web Service
This paper describes the development of a web-service tool for the automatic extraction of multi-word expression lexicons, which has been integrated in a distributed platform for the automatic creation of linguistic resources. The main purpose of the work described is thus to provide a (computationally "light") tool that produces a full lexical resource: multi-word terms/items with relevant and useful attached information that can be used for more complex processing tasks and applications (e.g. parsing, MT, IE, query expansion, etc.). The output of our tool is a MW lexicon formatted and encoded in XML according to the Lexical Mark-up Framework. The tool is already functional and available as a service. Evaluation experiments show that the tool's precision is about 80%.
Dynamic resonance and social reciprocity in language change: The case of Good morrow
Entrenchment (i.e. Langacker, 1987) does not necessarily lead to predictable behaviour. This study aims at complementing the usage-based model of language change by operationalising the role of dialogic creativity as a mechanism that can be in competition with conventionalization and grammaticalization. We provide a distinctive collexeme analysis (i.e. Hilpert, 2006) focussing on the constructionalization of the dialogic pair [A: good morrow B e B: (good) morrow (A)] from the 15th up to the 18th century. After reaching the highest degree of entrenchment and automatisation, the dialogic pair will show an increasing tendency to be creatively re-modelled with ad-hoc meanings during online exchanges by means of dynamic resonance (Du Bois, 2014) and non-reciprocal behaviour. We define this creative process of large-scale alteration as entrenchment inhibition. From our data it will emerge that entrenchment inhibition is triggered by spontaneous attempts of producing a creative ‘surplus’ over the expected social reciprocity (Gouldner, 1960) of conventionalized exchanges. This tendency will be shown to be driven by marked attempts of polite and impolite behaviour.
D6.1: Technologies and Tools for Lexical Acquisition
This report describes the technologies and tools to be used for Lexical Acquisition in PANACEA. It includes descriptions of existing technologies and tools which can be built on and improved within PANACEA, as well as of new technologies and tools to be developed and integrated in the PANACEA platform. The report also specifies the Lexical Resources to be produced. Four main areas of lexical acquisition are included: Subcategorization frames (SCFs), Selectional Preferences (SPs), Lexical-semantic Classes (LCs), for both nouns and verbs, and Multi-Word Expressions (MWEs).