
588 research outputs found

Human Associations Help to Detect Conventionalized Multiword Expressions

Author: Gerasimova Anastasia
Loukachevitch Natalia
Publication date: 12/09/2017

In this paper we show that human evidence about the conventionalization of a phrase can be obtained by asking native speakers for their associations to the phrase and to its component words. If the component words of a phrase have each other as frequent associations, the phrase can be considered conventionalized. Another type of conventionalized phrase can be revealed using two factors: low entropy of the phrase associations and low intersection between the component-word associations and the phrase associations. The association experiments were performed for the Russian language.
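The two factors named in the abstract above can be illustrated with a minimal sketch. It assumes that association responses are available as plain lists of strings, that "entropy" means Shannon entropy over the response distribution, and that "intersection" is measured as Jaccard overlap of the distinct responses. The function names and the choice of Jaccard overlap are assumptions made for illustration, not the authors' published procedure.

```python
import math
from collections import Counter


def association_entropy(responses):
    """Shannon entropy (in bits) of the distribution of association responses.

    Assumption: each element of `responses` is one speaker's answer.
    """
    counts = Counter(responses)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def association_overlap(phrase_responses, component_responses):
    """Jaccard overlap between the distinct associations of a phrase and
    those of its component words (an assumed stand-in for "intersection")."""
    phrase_set = set(phrase_responses)
    component_set = set(component_responses)
    union = phrase_set | component_set
    if not union:
        return 0.0
    return len(phrase_set & component_set) / len(union)
```

Under these assumptions, a phrase whose responses concentrate on a few answers (low entropy) and share little with the responses to its component words (low overlap) would be a candidate for the second type of conventionalized phrase described in the abstract.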

arXiv.org e-Print Archive

Crossref

Towards Comprehensive Computational Representations of Arabic Multiword Expressions

Author: Abdul Conteh (2667454)
Alessandro Lamorte (3761290)
Enrico Boero (3761296)
Marco Foletti (3761299)
Paola Crida (3761293)
Paolo Narcisi (3761287)
Publication venue: Springer
Publication date: 01/10/2016

A successful computational treatment of multiword expressions (MWEs) in natural languages leads to a robust NLP system that addresses the long-standing problem of language ambiguity caused primarily by this complex linguistic phenomenon. The first step in addressing this challenge is building an extensive, reliable MWE language resource (LR) with comprehensive computational representations across all linguistic levels. This forms the cornerstone in understanding the heterogeneous linguistic behaviour of MWEs in their various manifestations. This paper presents a detailed framework for computational representations of Arabic MWEs (ArMWEs) across all linguistic levels, based on the state-of-the-art lexical mark-up framework (LMF) with the necessary modifications to suit the distinctive properties of Modern Standard Arabic (MSA). This work forms part of a larger project that aims to develop a comprehensive computational lexicon of ArMWEs for NLP and language pedagogy (LP) (the JOMAL project).

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

White Rose Research Online

FigShare

Towards Comprehensive Computational Representations of Arabic Multiword Expressions

Author: G Francopoulo
IA Sag
J Odijk
K Bar
L Wanner
M Butt
M Palmer
MA Attia
T Arts
T Tanabe
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2017

Crossref

White Rose Research Online

Multiword expression processing: A survey

Author: Gülşen Eryiğit
Publication date: 01/12/2017

Multiword expressions (MWEs) are a class of linguistic forms spanning conventional word boundaries that are both idiosyncratic and pervasive across different languages. The structure of linguistic processing that depends on the clear distinction between words and phrases has to be re-thought to accommodate MWEs. The issue of MWE handling is crucial for NLP applications, where it raises a number of challenges. The emergence of solutions in the absence of guiding principles motivates this survey, whose aim is not only to provide a focused review of MWE processing, but also to clarify the nature of interactions between MWE processing and downstream applications. We propose a conceptual framework within which challenges and research contributions can be positioned. It offers a shared understanding of what is meant by "MWE processing," distinguishing the subtasks of MWE discovery and identification. It also elucidates the interactions between MWE processing and two use cases: parsing and machine translation. Many of the approaches in the literature can be differentiated according to how MWE processing is timed with respect to underlying use cases. We discuss how such orchestration choices affect the scope of MWE-aware systems. For each of the two MWE processing subtasks and for each of the two use cases, we conclude on open issues and research perspectives.

Open Access Repository

On the Role of Text Preprocessing in Neural Network Architectures: An Evaluation Study on Text Categorization and Sentiment Analysis

Author: Camacho-Collados Jose
Pilehvar Mohammad Taher
Publication date: 01/01/2018

Text preprocessing is often the first step in the pipeline of a Natural Language Processing (NLP) system, with potential impact on its final performance. Despite its importance, text preprocessing has not received much attention in the deep learning literature. In this paper we investigate the impact of simple text preprocessing decisions (particularly tokenizing, lemmatizing, lowercasing and multiword grouping) on the performance of a standard neural text classifier. We perform an extensive evaluation on standard benchmarks from text categorization and sentiment analysis. While our experiments show that a simple tokenization of input text is generally adequate, they also highlight significant degrees of variability across preprocessing techniques. This reveals the importance of paying attention to this usually overlooked step in the pipeline, particularly when comparing different models. Finally, our evaluation provides insights into the best preprocessing practices for training word embeddings. Comment: Blackbox EMNLP 2018, 7 pages.
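As a rough illustration of three of the preprocessing decisions listed in the abstract above (tokenizing, lowercasing and multiword grouping), here is a minimal sketch. The regular-expression tokenizer, the `preprocess` function, the underscore joining convention and the caller-supplied multiword list are all assumptions made for illustration, not the pipeline evaluated in the paper. Lemmatizing is left out because it needs an external lexicon or tagger.

```python
import re


def preprocess(text, multiwords=(), lowercase=True):
    """Tokenize, optionally lowercase, and group known multiword expressions.

    `multiwords` is a hypothetical list of space-separated expressions,
    e.g. ["New York"]; matched expressions are joined with underscores.
    """
    # Simple tokenization: runs of word characters, or single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    if lowercase:
        tokens = [t.lower() for t in tokens]

    # Normalize the multiword list the same way and try longer entries first.
    patterns = []
    for expression in multiwords:
        parts = expression.lower().split() if lowercase else expression.split()
        if parts:
            patterns.append(tuple(parts))
    patterns.sort(key=len, reverse=True)

    grouped = []
    i = 0
    while i < len(tokens):
        for pattern in patterns:
            if tuple(tokens[i:i + len(pattern)]) == pattern:
                grouped.append("_".join(pattern))
                i += len(pattern)
                break
        else:
            # No multiword expression starts at this position.
            grouped.append(tokens[i])
            i += 1
    return grouped
```

Calling `preprocess("I love New York!", multiwords=["New York"])` under these assumptions yields a token list in which the two words of the expression are merged into a single token, which is the kind of input variation the study compares.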

arXiv.org e-Print Archive

Crossref

A Computational Lexicon and Representational Model for Arabic Multiword Expressions

Author: Alghamdi Ayman Ahmad O.
Publication venue: University of Leeds
Publication date: 01/10/2018

The phenomenon of multiword expressions (MWEs) is increasingly recognised as a serious and challenging issue that has attracted the attention of researchers in various language-related disciplines. Research in these many areas has emphasised the primary role of MWEs in the process of analysing and understanding language, particularly in the computational treatment of natural languages. Ignoring MWE knowledge in any NLP system reduces the possibility of achieving high precision outputs. However, despite the enormous wealth of MWE research and language resources available for English and some other languages, research on Arabic MWEs (AMWEs) still faces multiple challenges, particularly in key computational tasks such as extraction, identification, evaluation, language resource building, and lexical representations. This research aims to remedy this deficiency by extending knowledge of AMWEs and making noteworthy contributions to the existing literature in three related research areas on the way towards building a computational lexicon of AMWEs. First, this study develops a general understanding of AMWEs by establishing a detailed conceptual framework that includes a description of an adopted AMWE concept and its distinctive properties at multiple linguistic levels. Second, in the use of AMWE extraction and discovery tasks, the study employs a hybrid approach that combines knowledge-based and data-driven computational methods for discovering multiple types of AMWEs. Third, this thesis presents a representative system for AMWEs which consists of multilayer encoding of extensive linguistic descriptions. This project also paves the way for further in-depth AMWE-aware studies in NLP and linguistics to gain new insights into this complicated phenomenon in standard Arabic. 
The implications of this research are related to the vital role of the AMWE lexicon, as a new lexical resource, in the improvement of various ANLP tasks and the potential opportunities this lexicon provides for linguists to analyse and explore AMWE phenomena.

White Rose E-theses Online

Multiword Expressions We Live by: A Validated Usage-based Dataset from Corpora of Written Italian

Author: Castagnoli Sara
Masini Francesca
Micheli M. Silvia
Nissim Malvina
Zaninello Andrea
Publication date: 01/01/2020

The paper describes the creation of a manually validated dataset of Italian multiword expressions, building on candidates automatically extracted from corpora of written Italian. The main features of the resource, such as POS-pattern and lemma distribution, are also discussed, together with possible applications.

Archivio istituzionale della ricerca - Università di Macerata

Edition 1.2 of the PARSEME Shared Task on Semi-supervised Identification of Verbal Multiword Expressions

Author: Abigail Walsh
Agata Savary
Archna Bhatia
Ashwini Vaidya
Bruno Guillaume
Carlos Ramisch
Chaya Liebeskind
Hongzhi Xu
Jakub Waszczuk
Marie Candito
Menghan Jiang
Monti Johanna
Renata Ramisch
Sara Stymne
Timm Lichte
Tunga Gungor
Uxoa Iñ
Verginica Barbu Mititelu
Voula Giouli
Publication venue: Association for Computational Linguistics (ACL)
Publication date: 01/01/2020

Università degli Studi di Napoli L'Orientale: CINECA IRIS

Multiword Expressions We Live by: A Validated Usage-based Dataset from Corpora of Written Italian

Author: Castagnoli Sara
Masini Francesca
Micheli M. Silvia
Nissim Malvina
Zaninello Andrea
Publication venue: CEUR-WS.org
Publication date: 01/01/2020

The paper describes the creation of a manually validated dataset of Italian multiword expressions, building on candidates automatically extracted from corpora of written Italian. The main features of the resource, such as POS-pattern and lemma distribution, are also discussed, together with possible applications.

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

OpenEdition

Dissertations of the University of Groningen