36 research outputs found

    A Bigger Fish to Fry: Scaling up the Automatic Understanding of Idiomatic Expressions

    In this thesis, we are concerned with idiomatic expressions and how to handle them within NLP. Idiomatic expressions are a type of multiword phrase whose meaning is not a direct combination of the meanings of its parts, e.g. 'at a crossroads' and 'move the goalposts'. In Part I, we provide a general introduction to idiomatic expressions and an overview of observations regarding idioms based on corpus data. In addition, we discuss existing research on idioms from an NLP perspective, providing an overview of existing tasks, approaches, and datasets. In Part II, we focus on building a large idiom corpus, which consists of developing a system for the automatic extraction of potentially idiomatic expressions (PIEs) and building a large corpus of idioms using crowdsourced annotation. Finally, in Part III, we improve an existing unsupervised classifier and compare it to other existing classifiers. Given the relatively poor performance of this unsupervised classifier, we also develop a supervised deep neural network-based system and find that a model involving two separate modules looking at different information sources yields the best performance, surpassing previous state-of-the-art approaches. In conclusion, this work shows the feasibility of building a large corpus of sense-annotated potentially idiomatic expressions, and the benefits such a corpus provides for further research. It provides the possibility for quick testing of hypotheses about the distribution and usage of idioms, it enables the training of data-hungry machine learning methods for PIE disambiguation systems, and it permits fine-grained, reliable evaluation of such systems.

    APPLICATION OF LINK GRAMMAR IN SEMI-SUPERVISED NAMED ENTITY RECOGNITION FOR ACCIDENT DOMAIN

    Accident documents typically contain crucial information that may be useful for future accident investigations, e.g. the date and time when the accident happened, the location where it occurred, and the persons involved. These documents are largely available as free text, in the form of newswire articles or accident reports. Although it is possible to identify the information manually, the high volume of data involved makes this task time-consuming and prone to error. Information Extraction (IE) has been identified as a potential solution to this problem. IE can extract crucial information from unstructured texts and convert it into a more structured representation. This research explores Named Entity Recognition (NER), one of the important tasks in IE research, which aims to identify and classify entities in text documents into predefined categories. Numerous related research works on IE and NER have been published and commercialized. However, to the best of our knowledge, there exist only a handful of IE research works that really focus on the accident domain. In addition, none of these works have attempted to either explore or focus on NER, which is the main motivation for this research. The work presented in this thesis proposes an NER approach for accident documents that applies syntactical and word features in combination with the Self-Training algorithm. To satisfy the research objectives, this thesis makes three main contributions. The first contribution is the identification of the entity boundary. Entity segmentation, or identification of the entity boundary, is required since a named entity may consist of one or more words. We adopted the Stanford Part-of-Speech (POS) tagger for the word POS tags and connectors from the Link Grammar (LG) parser to determine the starting and stopping words.
The second contribution is the extraction pattern construction. Each named entity candidate is assigned an extraction pattern constructed from a set of word and syntactical features. Current NER systems use restricted syntactical features, which are associated with a number of limitations. It is therefore a great challenge to propose a new NER approach using syntactical features that can capture all syntactical structure in a sentence. For the third contribution, we have applied the Self-Training algorithm, which is one of the semi-supervised machine learning techniques. The algorithm is used to predict labels for a huge set of unlabelled data, given a small amount of labelled data. In our research, extraction patterns from the first module are fed to this algorithm and used to predict the category of each named entity candidate. The Self-Training algorithm greatly benefits semi-supervised learning, as it allows classification of entities given only a small amount of labelled data. The algorithm reduces the training effort and generates results nearly similar to those of conventional supervised learning techniques. The proposed system was tested on 100 accident news articles from Reuters to recognize three different named entities: date, person and location, which are universally accepted categories in most NER applications. The Exact Match evaluation method, which consists of three evaluation metrics (precision, recall and F-measure), is used to measure the performance of the proposed system against three existing NER systems. The proposed system outperforms one of those systems with an overall F-measure of approximately 9%, but on the other hand it shows a slight decrease compared to the other two systems identified in our benchmarking. However, we believe that this difference is due to the different nature and techniques used in the three systems.
We consider our semi-supervised approach a promising method even though only two features are utilized: syntactical and word features. Further manual inspection during the experiments suggested that using complete word and syntactical features, or combining these features with other features such as semantic features, would yield improved results.

    Accepting Preposition-Stranding under Sluicing Cross-linguistically; a Noisy-Channel Approach

    This thesis investigates the representation and processing of sluicing, a type of ellipsis where an interrogative CP is reduced to its initial wh-element (the remnant), e.g. Mary danced with someone, but I can't remember (with) who. It is debated whether remnants from within a PP (with who) must appear with this P or whether they can appear without it ('P-stranding'). Existing theoretical literature (Merchant, 2001; a.o.) argues that only languages allowing overt CPs to move wh-elements without their embedding P will allow P-stranding remnants (P-Stranding Generalisation/PSG). Anecdotally, many languages appear to defy this pattern, allowing P-stranding remnants despite disallowing P-stranding overtly. None of these examples, however, are supported by adequate experimental evidence, nor do they offer a cross-linguistically generalisable explanation. This thesis addresses both these issues. Novel large-scale acceptability data show that both Greek and German, previously proposed robust PSG-examples, do indeed defy it. This behaviour is explained by proposing that ellipsis is a type of 'noisy channel' (Shannon, 1948; Gibson, Bergen & Piantadosi, 2013), through which the parser must estimate the probability of the intended (elided) message. The parser simultaneously considers the prior likelihood of the intended message (a remnant as part of a full PP) as well as the likelihood of this message being corrupted through 'noise' (a deleted P). P-stranding is thus considered a form of deletion, given that deletion has been shown to be a likely corruption in noisy channels. A series of reading time studies aimed at supporting this noisy channel model in online processing found results overall consistent with this approach, but also discovered that previous work on the processing of sluicing was inaccurate in concluding that it is actively predicted by the parser.
Collectively, the work argues for a theory of sluicing involving syntactic structure at the e-site together with sluicing being treated as a noisy channel by the parser.

    Multiword expressions at length and in depth

    The annual workshop on multiword expressions has taken place since 2001 in conjunction with major computational linguistics conferences and attracts the attention of an ever-growing community working on a variety of languages, linguistic phenomena and related computational processing issues. MWE 2017 took place in Valencia, Spain, and represented a vibrant panorama of the current research landscape on the computational treatment of multiword expressions, featuring many high-quality submissions. Furthermore, MWE 2017 included the first shared task on multilingual identification of verbal multiword expressions. The shared task, with extended communal work, has developed important multilingual resources and mobilised several research groups in computational linguistics worldwide. This book contains extended versions of selected papers from the workshop. Authors worked hard to include detailed explanations, broader and deeper analyses, and new exciting results, which were thoroughly reviewed by an internationally renowned committee. We hope that this distinctly joint effort will provide a meaningful and useful snapshot of the multilingual state of the art in multiword expressions modelling and processing, and will be a point of reference for future work.

    A Survey on Semantic Processing Techniques

    Semantic processing is a fundamental research domain in computational linguistics. In the era of powerful pre-trained language models and large language models, the advancement of research in this domain appears to be decelerating. However, the study of semantics is multi-dimensional in linguistics. The research depth and breadth of computational semantic processing can be largely improved with new technologies. In this survey, we analyze five semantic processing tasks, namely word sense disambiguation, anaphora resolution, named entity recognition, concept extraction, and subjectivity detection. We study relevant theoretical research in these fields, advanced methods, and downstream applications. We connect the surveyed tasks with downstream applications because this may inspire future scholars to fuse these low-level semantic processing tasks with high-level natural language processing tasks. The review of theoretical research may also inspire new tasks and technologies in the semantic processing domain. Finally, we compare the different semantic processing techniques and summarize their technical trends, application trends, and future directions. Comment: Published at Information Fusion, Volume 101, 2024, 101988, ISSN 1566-2535. The equal contribution mark is missing in the published version due to the publication policies. Please contact Prof. Erik Cambria for details.

    Head-Driven Phrase Structure Grammar

    Head-Driven Phrase Structure Grammar (HPSG) is a constraint-based or declarative approach to linguistic knowledge, which analyses all descriptive levels (phonology, morphology, syntax, semantics, pragmatics) with feature-value pairs, structure sharing, and relational constraints. In syntax it assumes that expressions have a single relatively simple constituent structure. This volume provides a state-of-the-art introduction to the framework. Various chapters discuss basic assumptions and formal foundations, describe the evolution of the framework, and go into the details of the main syntactic phenomena. Further chapters are devoted to non-syntactic levels of description. The book also considers related fields and research areas (gesture, sign languages, computational linguistics) and includes chapters comparing HPSG with other frameworks (Lexical Functional Grammar, Categorial Grammar, Construction Grammar, Dependency Grammar, and Minimalism).

    Extended papers from the MWE 2017 workshop

    The annual workshop on multiword expressions has taken place since 2001 in conjunction with major computational linguistics conferences and attracts the attention of an ever-growing community working on a variety of languages, linguistic phenomena and related computational processing issues. MWE 2017 took place in Valencia, Spain, and represented a vibrant panorama of the current research landscape on the computational treatment of multiword expressions, featuring many high-quality submissions. Furthermore, MWE 2017 included the first shared task on multilingual identification of verbal multiword expressions. The shared task, with extended communal work, has developed important multilingual resources and mobilised several research groups in computational linguistics worldwide. This book contains extended versions of selected papers from the workshop. Authors worked hard to include detailed explanations, broader and deeper analyses, and new exciting results, which were thoroughly reviewed by an internationally renowned committee. We hope that this distinctly joint effort will provide a meaningful and useful snapshot of the multilingual state of the art in multiword expressions modelling and processing, and will be a point of reference for future work.

    Intelligent Systems

    This book is dedicated to intelligent systems of broad-spectrum application, such as personal and social biosafety or the use of intelligent sensory micro-nanosystems such as "e-nose", "e-tongue" and "e-eye". In addition, effective information acquisition, knowledge management and improved knowledge transfer in any media, as well as modeling of its information content using meta- and hyper-heuristics and semantic reasoning, all benefit from the systems covered in this book. Intelligent systems can also be applied in education and in generating intelligent distributed eLearning architecture, as well as in a large number of technical fields, such as industrial design, manufacturing and utilization, e.g., in precision agriculture, cartography, electric power distribution systems, intelligent building management systems, drilling operations etc. Furthermore, decision making using fuzzy logic models, computational recognition of comprehension uncertainty and the joint synthesis of goals and means of intelligent behavior biosystems, as well as diagnostic and human support in the healthcare environment, have also been made easier.