272 research outputs found

    Automatic Acquisition of Knowledge About Multiword Predicates

    PACLIC 19 / Taipei, Taiwan / December 1-3, 200

    Multiword expressions at length and in depth

    The annual workshop on multiword expressions has taken place since 2001 in conjunction with major computational linguistics conferences and attracts the attention of an ever-growing community working on a variety of languages, linguistic phenomena and related computational processing issues. MWE 2017 took place in Valencia, Spain, and represented a vibrant panorama of the current research landscape on the computational treatment of multiword expressions, featuring many high-quality submissions. Furthermore, MWE 2017 included the first shared task on multilingual identification of verbal multiword expressions. The shared task, with extended communal work, has developed important multilingual resources and mobilised several research groups in computational linguistics worldwide. This book contains extended versions of selected papers from the workshop. Authors worked hard to include detailed explanations, broader and deeper analyses, and exciting new results, which were thoroughly reviewed by an internationally renowned committee. We hope that this distinctly joint effort will provide a meaningful and useful snapshot of the multilingual state of the art in multiword expression modelling and processing, and will be a point of reference for future work.

    A Study of Metrics of Distance and Correlation Between Ranked Lists for Compositionality Detection

    Compositionality in language refers to how much the meaning of a phrase can be decomposed into the meanings of its constituents and the way these constituents are combined. Based on the premise that substitution by synonyms is meaning-preserving, compositionality can be approximated as the semantic similarity between a phrase and a version of that phrase in which words have been replaced by their synonyms. Different ways of representing such phrases exist (e.g., vectors [1] or language models [2]), and the choice of representation affects the measurement of semantic similarity. We propose a new compositionality detection method that represents phrases as ranked lists of term weights. Our method approximates the semantic similarity between two ranked list representations using a range of well-known distance and correlation metrics. In contrast to most state-of-the-art approaches in compositionality detection, our method is completely unsupervised. Experiments with a publicly available dataset of 1048 human-annotated phrases show that, compared to strong supervised baselines, our approach provides superior measurement of compositionality using any of the distance and correlation metrics considered.
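    The idea of comparing two ranked lists of term weights can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say which metrics or weighting scheme were used, so Spearman's rank correlation is chosen here as one well-known example, and the function names and the handling of terms missing from one list (weight 0) are assumptions made for illustration.

```python
from math import sqrt


def average_ranks(weights):
    """Map each term to its rank (1 = highest weight); ties share the average rank."""
    ordered = sorted(weights, key=lambda term: -weights[term])
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of terms tied with ordered[i].
        while j + 1 < len(ordered) and weights[ordered[j + 1]] == weights[ordered[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[ordered[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(weights_a, weights_b):
    """Spearman correlation between two term-weight lists.

    Assumption: a term absent from one list gets weight 0.0 there, so both
    lists are ranked over the same set of terms.
    """
    terms = sorted(set(weights_a) | set(weights_b))
    ranks_a = average_ranks({t: weights_a.get(t, 0.0) for t in terms})
    ranks_b = average_ranks({t: weights_b.get(t, 0.0) for t in terms})
    xs = [ranks_a[t] for t in terms]
    ys = [ranks_b[t] for t in terms]
    n = len(terms)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined when one list is all ties
    return cov / sqrt(var_x * var_y)


# Hypothetical term weights for a phrase and its synonym-substituted variant.
original = {"market": 0.9, "price": 0.6, "trade": 0.3}
substituted = {"market": 0.8, "price": 0.5, "trade": 0.4}
similarity = spearman_rho(original, substituted)
```

    Under the abstract's premise, a high correlation between the two lists would suggest that synonym substitution preserved the meaning, which is the signal taken to indicate compositionality.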

    Discovering multiword expressions

    In this paper, we provide an overview of research on multiword expressions (MWEs) from a natural language processing perspective. We examine methods developed for modelling MWEs that capture some of their linguistic properties, discussing their use for MWE discovery and for idiomaticity detection. We concentrate on their collocational and contextual preferences, along with their fixedness in terms of canonical forms and their lack of word-for-word translatability. We also discuss a sample of the MWE resources that have been used in intrinsic evaluation setups for these methods.

    Extended papers from the MWE 2017 workshop

    The annual workshop on multiword expressions has taken place since 2001 in conjunction with major computational linguistics conferences and attracts the attention of an ever-growing community working on a variety of languages, linguistic phenomena and related computational processing issues. MWE 2017 took place in Valencia, Spain, and represented a vibrant panorama of the current research landscape on the computational treatment of multiword expressions, featuring many high-quality submissions. Furthermore, MWE 2017 included the first shared task on multilingual identification of verbal multiword expressions. The shared task, with extended communal work, has developed important multilingual resources and mobilised several research groups in computational linguistics worldwide. This book contains extended versions of selected papers from the workshop. Authors worked hard to include detailed explanations, broader and deeper analyses, and exciting new results, which were thoroughly reviewed by an internationally renowned committee. We hope that this distinctly joint effort will provide a meaningful and useful snapshot of the multilingual state of the art in multiword expression modelling and processing, and will be a point of reference for future work.

    Automatic Extraction Of Malay Compound Nouns Using A Hybrid Of Statistical And Machine Learning Methods

    Identifying compound nouns is important for a wide spectrum of applications in natural language processing, such as machine translation and information retrieval. Extraction of compound nouns requires deep or shallow syntactic preprocessing tools and large corpora. This paper investigates several methods for extracting noun compounds from Malay text corpora. First, we present the empirical results of sixteen statistical association measures for extracting Malay <N+N> compound nouns. Second, we introduce the possibility of integrating multiple association measures. Third, this work provides a standard dataset intended as a common platform for evaluating research on the identification of compound nouns in the Malay language. The standard dataset contains 7,235 unique N-N candidates, of which 2,970 are N-N compound noun collocations. The extraction algorithms are evaluated against this reference dataset. The experimental results demonstrate that a group of association measures (T-test, Piatetsky-Shapiro (PS), C_value, FGM and the rank combination method) are the best association measures and outperform the others for <N+N> collocations in the Malay corpus. Finally, we describe several classification methods for combining the scores of the basic association measures, followed by their evaluation. Evaluation results show that classification algorithms significantly outperform individual association measures. The experimental results obtained are quite satisfactory in terms of precision, recall and F-score.
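    Two of the association measures named above can be sketched from raw corpus counts. This is an illustrative sketch only: the abstract does not give the exact formulas used, so the t-score below is the common textbook approximation (observed minus expected, divided by the square root of the observed count), the Piatetsky-Shapiro score is the usual P(xy) - P(x)P(y), and the candidate pairs and counts are hypothetical.

```python
from math import sqrt


def t_score(f_xy, f_x, f_y, n):
    """Approximate t-score for a bigram: (observed - expected) / sqrt(observed)."""
    if f_xy <= 0:
        raise ValueError("the bigram must occur at least once")
    expected = f_x * f_y / n  # expected count if the two words were independent
    return (f_xy - expected) / sqrt(f_xy)


def piatetsky_shapiro(f_xy, f_x, f_y, n):
    """Piatetsky-Shapiro score: P(xy) - P(x) * P(y)."""
    return f_xy / n - (f_x / n) * (f_y / n)


def rank_candidates(counts, n, measure):
    """Order candidate pairs from strongest to weakest association."""
    scored = {
        pair: measure(f_xy, f_x, f_y, n)
        for pair, (f_xy, f_x, f_y) in counts.items()
    }
    return sorted(scored, key=scored.get, reverse=True)


# Hypothetical counts: pair -> (bigram count, first-word count, second-word count).
corpus_size = 10_000
candidates = {
    ("kereta", "api"): (30, 100, 50),
    ("rumah", "kereta"): (5, 200, 400),
}
by_t = rank_candidates(candidates, corpus_size, t_score)
by_ps = rank_candidates(candidates, corpus_size, piatetsky_shapiro)
```

    A simple way to integrate several measures would be to average each candidate's rank across the per-measure orderings; whether this matches the rank combination method evaluated in the paper is not stated in the abstract.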

    Collocational processing in typologically different languages, English and Turkish: Evidence from corpora and psycholinguistic experimentation

    Unlike the traditional words-and-rules approach to language processing (Pinker, 1999), usage-based models of language have emphasised the role of multi-word sequences (Christiansen & Chater, 2016b; Ellis, 2002). Various psycholinguistic experiments have demonstrated that multi-word sequences (MWS) are processed quantitatively faster than novel phrases by both L1 and L2 speakers (e.g. Arnon & Snider, 2010; Wolter & Yamashita, 2018). Collocations, a specific type of MWS, hold a prominent position in psycholinguistics, corpus linguistics and language pedagogy research (Gablasova, Brezina, & McEnery, 2017a). In this dissertation, I explored the processing of adjective-noun collocations in Turkish and English by L1 speakers of these languages through a corpus-based study and psycholinguistic experiments. Turkish is an agglutinating language with a rich morphology; it is therefore valid to ask whether the agglutinating structure of Turkish affects collocational processing in L1 Turkish and whether the same factors affect the processing of collocations in English and Turkish. In addition, this study looked at L1 and L2 processing of collocations in English. The thesis first investigated the frequency counts and association statistics of English and Turkish adjective-noun collocations through a corpus-based analysis of general reference corpora of English and Turkish. The corpus study showed that unlemmatised collocations, which do not take into account the inflected forms of the collocations, have similar mean frequency and association counts in both languages. This suggests that the base (uninflected) forms of the collocations in English and Turkish do not differ notably in their frequency and association counts. To test the effect of the agglutinating structure of Turkish on the collocability of adjectives and nouns, the lemmatised forms of the collocations in both languages were examined.
    In other words, collocations in the two languages were lemmatised. Lemmatisation brings the benefit of including the frequency counts of both the base and inflected forms of the collocations. The findings indicated that the vast majority (75%) of the lemmatised Turkish adjective-noun combinations occur at a higher frequency than their English equivalents. In addition, the agglutinating structure of Turkish appears to increase the association scores of adjective-noun collocations in both frequency bands, since the vast majority of Turkish collocations reach higher collocational strength scores than their unlemmatised forms. After the corpus study, I designed psycholinguistic experiments to explore the sensitivity of speakers of these languages to the frequency of adjectives, nouns and whole collocations in acceptability judgment tasks in English and Turkish. Mixed-effects regression modelling revealed that collocations which have similar collocational frequency and association scores are processed at comparable speeds in English and Turkish by L1 speakers of these languages. That is to say, both Turkish and English speakers are sensitive to collocation frequency counts. This finding is in line with many previous empirical studies showing that language users process MWS quantitatively faster than control phrases (e.g. Arnon & Snider, 2010; McDonald & Shillcock, 2003; Vilkaite, 2016). However, lemmatised collocation frequency counts affected the processing of Turkish and English collocations differently, and Turkish speakers appeared to attend to the word-level frequency counts of collocations to a lesser extent than English speakers. These findings suggest that different mechanisms underlie L1 processing of English and Turkish collocations. The present study also looked at the sensitivity of L1 and advanced L2 speakers to the frequency of adjectives, nouns and whole collocations in English.
    Mixed-effects regression modelling revealed that advanced L2 speakers are sensitive to collocation frequency counts in the same way as L1 English speakers: as the collocation frequency counts increased, L1 Turkish, L2 English speakers responded to the English collocations more quickly, as L1 English speakers did. The results indicated that both groups showed sensitivity to noun frequency counts, and advanced L2 English speakers did not appear to rely on noun frequency scores more heavily than the L1 English group while processing adjective-noun collocations. These findings conflict with claims that L2 speakers process MWS differently from L1 speakers (Wray, 2002).