417 research outputs found

    A Computational Lexicon and Representational Model for Arabic Multiword Expressions

    The phenomenon of multiword expressions (MWEs) is increasingly recognised as a serious and challenging issue that has attracted the attention of researchers in various language-related disciplines. Research in these areas has emphasised the primary role of MWEs in analysing and understanding language, particularly in the computational treatment of natural languages. Ignoring MWE knowledge in any NLP system reduces the possibility of achieving high-precision outputs. However, despite the wealth of MWE research and language resources available for English and some other languages, research on Arabic MWEs (AMWEs) still faces multiple challenges, particularly in key computational tasks such as extraction, identification, evaluation, language resource building, and lexical representation. This research aims to remedy this deficiency by extending knowledge of AMWEs and contributing to the existing literature in three related research areas on the way towards building a computational lexicon of AMWEs. First, the study develops a general understanding of AMWEs by establishing a detailed conceptual framework that includes a description of an adopted AMWE concept and its distinctive properties at multiple linguistic levels. Second, for AMWE extraction and discovery tasks, the study employs a hybrid approach that combines knowledge-based and data-driven computational methods for discovering multiple types of AMWEs. Third, the thesis presents a representation system for AMWEs which consists of multilayer encoding of extensive linguistic descriptions. This project also paves the way for further in-depth AMWE-aware studies in NLP and linguistics to gain new insights into this complicated phenomenon in standard Arabic. The implications of this research relate to the vital role of the AMWE lexicon, as a new lexical resource, in improving various Arabic NLP (ANLP) tasks, and to the opportunities this lexicon provides for linguists to analyse and explore AMWE phenomena.

    Lexicons and grammars for language processing: industrial or handcrafted products?

    Lexicon and Grammar: From Meanings to the Construction of Signification
    In recent years, the use of linguistic data for language processing (semantic ambiguity resolution, translation...) has increased steadily. Such data are now commonly called language resources. A few years ago, nearly all the language resources used for this purpose were collections of texts such as the Brown Corpus and the Penn Treebank, but the use of electronic lexicons (WordNet, FrameNet, VerbNet, ComLex, Lexicon-Grammar...) and formal grammars (TAG...) has developed since. This development is slow because most processes of construction of lexicons and grammars are manual, whereas the construction of corpora has always been highly automated. However, more and more language processing specialists realize that the information content of lexicons and grammars is richer than that of corpora, and hence that the former make more elaborate processing possible. The difference in construction time is likely to be connected with the difference in information content: the handcrafting of lexicons and grammars by linguists would make them more informative than automatically generated data. This situation can evolve in two directions: either specialists of language technology get progressively used to handling manually constructed resources, which are more informative and more complex, or the process of construction of lexicons and grammars is automated and industrialized, which is the mainstream perspective. Both evolutions are already in progress, and a tension exists between them. The relation between linguists and computer scientists depends on the future of these evolutions, since the first implies training and hiring numerous linguists, whereas the second depends essentially on solutions elaborated by computer engineers. The aim of this article is to analyse practical examples of the language resources in question, and to discuss which of the two trends, handcrafting or generating industrially, or a combination of both, can give the best results or is the most realistic.

    Multiword expressions at length and in depth

    The annual workshop on multiword expressions has taken place since 2001 in conjunction with major computational linguistics conferences and attracts the attention of an ever-growing community working on a variety of languages, linguistic phenomena and related computational processing issues. MWE 2017 took place in Valencia, Spain, and represented a vibrant panorama of the current research landscape on the computational treatment of multiword expressions, featuring many high-quality submissions. Furthermore, MWE 2017 included the first shared task on multilingual identification of verbal multiword expressions. The shared task, with extended communal work, has developed important multilingual resources and mobilised several research groups in computational linguistics worldwide. This book contains extended versions of selected papers from the workshop. Authors worked hard to include detailed explanations, broader and deeper analyses, and new exciting results, which were thoroughly reviewed by an internationally renowned committee. We hope that this distinctly joint effort will provide a meaningful and useful snapshot of the multilingual state of the art in multiword expression modelling and processing, and will be a point of reference for future work.

    Representation and parsing of multiword expressions

    This book consists of contributions related to the definition, representation and parsing of MWEs. These reflect current trends in the representation and processing of MWEs. They cover various categories of MWEs (such as verbal, adverbial and nominal MWEs), various linguistic frameworks (e.g. tree-based and unification-based grammars), various languages (including English, French, Modern Greek, Hebrew and Norwegian), and various applications (namely MWE detection, parsing and automatic translation), using both symbolic and statistical approaches.

    Can human association norm evaluate latent semantic analysis?

    This paper presents a comparison of a word association norm created through a psycholinguistic experiment with association lists generated by algorithms operating on text corpora. We compare lists generated by the Church and Hanks algorithm with lists generated by the LSA algorithm. An argument is presented on how those automatically generated lists reflect real semantic relations.

    Formulaic language

    The notion of formulaicity has received increasing attention in disciplines and areas as diverse as linguistics, literary studies, art theory and art history. In recent years, linguistic studies of formulaicity have been flourishing and the very notion of formulaicity has been approached from various methodological and theoretical perspectives and with various purposes in mind. The linguistic approach to formulaicity is still in a state of rapid development and the objective of the current volume is to present the current explorations in the field. Papers collected in the volume make numerous suggestions for further development of the field and they are arranged into three complementary parts. The first part, with three chapters, presents new theoretical and methodological insights as well as their practical application in the development of custom-designed software tools for identification and exploration of formulaic language in texts. Two papers in the second part explore formulaic language in the context of language learning. Finally, the third part, with three chapters, showcases descriptive research on formulaic language conducted primarily from the perspectives of corpus linguistics and translation studies. The volume will be of interest to anyone involved in the study of formulaic language either from a theoretical or a practical perspective

    Personality and culture in the Arab-Levant
