4,845 research outputs found

    A Survey on Semantic Processing Techniques

    Semantic processing is a fundamental research domain in computational linguistics. In the era of powerful pre-trained language models and large language models, the advancement of research in this domain appears to be decelerating. However, the study of semantics is multi-dimensional in linguistics, and the depth and breadth of computational semantic processing research can be greatly improved with new technologies. In this survey, we analyze five semantic processing tasks, namely word sense disambiguation, anaphora resolution, named entity recognition, concept extraction, and subjectivity detection. We study relevant theoretical research in these fields, advanced methods, and downstream applications. We connect the surveyed tasks with downstream applications because this may inspire future scholars to fuse these low-level semantic processing tasks with high-level natural language processing tasks. The review of theoretical research may also inspire new tasks and technologies in the semantic processing domain. Finally, we compare the different semantic processing techniques and summarize their technical trends, application trends, and future directions.

    Comment: Published in Information Fusion, Volume 101, 2024, 101988, ISSN 1566-2535. The equal-contribution mark is missing from the published version due to publication policies. Please contact Prof. Erik Cambria for details.

    Gradient Metaphoricity of the Preposition in: A Corpus-based Approach to Chinese Academic Writing in English

    In Cognitive Linguistics, a conceptual metaphor is a systematic set of correspondences between two domains of experience (Kövecses 2020: 2). To gain an extensive understanding of metaphors, metaphoricity (Müller and Tag 2010; Dunn 2011; Jensen and Cuffari 2014; Nacey and Jensen 2017) has been emphasized to address one of the properties of metaphors in language usage: gradience (Hanks 2006; Dunn 2011, 2014), which indicates that metaphorical expressions can be measured. Despite many noteworthy contributions, studies of metaphoricity are often accused of subjectivity (Müller 2008; Jensen and Cuffari 2014; Jensen 2017), which is why this study uses a large corpus as its database. The main aim of this dissertation is therefore to measure the gradient senses of the preposition in in an objective way, thus mapping their highly systematic semantic extension. Based on these gradient senses, the semantic and syntactic features of the preposition in produced by advanced Chinese English-major learners are investigated, combining quantitative and qualitative research methods.

    First, a quantitative analysis of the literal sense and the ten metaphorical senses of the preposition in is carried out. By accounting for the five factors that influence the image schema of each sense ("scale of Landmark", "visibility", "path", "inclusion" and "boundary"), a formula for measuring the degree of metaphoricity is deduced: Metaphoricity = ([#Visibility] + [#Path] + [#Inclusion] + [#Boundary]) * [#Scale of Landmark]. The primary sense has the highest value (12), and the extended senses have values ranging down to zero: the more features a sense shares with the proto-scene, the higher its value and the less metaphorical it is. EVENT and PERSON are "least metaphoric" (value = 9-11); SITUATION, NUMBER, CONTENT and FIELD are "weak metaphoric" (value = 6-8); SEGMENTATION, TIME and MANNER are "strong metaphoric" (value = 3-5); PURPOSE shares the fewest features with the proto-scene and has the lowest value, so it is "most metaphoric" (value = 0-2).

    A corpus-based approach is then employed, offering a model for corpus-based work in Cognitive Linguistics. It compares two compiled sub-corpora: the Chinese Master Academic Writing Corpus and the Chinese Doctorate Academic Writing Corpus. The findings show that, on the semantic level, Chinese English-major students overuse in with a low level of metaphoricity; even advanced learners rarely use in in its most metaphorical senses. In terms of syntactic behaviours, the most frequent nouns in the [in + noun] construction are weakly metaphoric, whilst the nouns in the [in the noun of] construction carry the EVENT sense, which is least metaphorical. Moreover, action verbs tend to be used in the constructions [verb + in] and [in doing sth.] in both the master and doctorate groups.

    In the qualitative study, the divergent usages of the preposition in are explored. The preposition in is often substituted with other prepositions, such as on and at. The fundamental reason for the Chinese learners’ weakness is negative transfer from their mother tongue (Wang 2001; Gong 2007; Zhang 2010). Although in and its Chinese equivalent zai...li (在...里) share the same proto-scene, there are discrepancies: the metaphorical senses of the preposition in are TIME, PURPOSE, NUMBER, CONTENT, FIELD, EVENT, SITUATION, SEGMENTATION, MANNER and PERSON, while zai...li (在...里) has only five: TIME, CONTENT, EVENT, SITUATION and PERSON. Thus the image schemata of the senses cannot be mapped onto each other across the two languages. This study also provides evidence for the universality and variation of spatial metaphors on the grounds of cultural models. Philosophically, it supports the standpoint of Embodiment philosophy that abstract concepts are constructed on the basis of spatial metaphors that are grounded in physical and cultural experience.
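    A minimal sketch of how the metaphoricity formula above could be computed, assuming, for illustration only, that the four shared-feature factors are scored 0 or 1 and the scale-of-Landmark factor 0-3 (scales consistent with the reported maximum of 12 for the primary sense, but not stated in the abstract):

        def metaphoricity(visibility, path, inclusion, boundary, scale_of_landmark):
            # Metaphoricity = (Visibility + Path + Inclusion + Boundary) * Scale of Landmark
            return (visibility + path + inclusion + boundary) * scale_of_landmark

        # Primary (proto-scene) sense: all features shared -> highest value under the assumed scales.
        print(metaphoricity(1, 1, 1, 1, 3))   # 12, the "least metaphoric" end of the scale

        # A hypothetical extended sense sharing fewer features with the proto-scene.
        print(metaphoricity(0, 1, 0, 1, 1))   # 2, toward the "most metaphoric" end of the scale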

    Neural Combinatory Constituency Parsing

    Doctoral thesis, Tokyo Metropolitan University, Doctor of Philosophy (Information Science).

    Structured Named Entities

    The names of people, locations, and organisations play a central role in language, and named entity recognition (NER) has been widely studied and successfully incorporated into natural language processing (NLP) applications. The most common variant of NER involves identifying and classifying proper noun mentions of these and miscellaneous entities as linear spans in text. Unfortunately, this version of NER is no closer to a detailed treatment of named entities than chunking is to a full syntactic analysis. NER, so construed, reflects neither the syntactic nor the semantic structure of NE mentions, and provides insufficient categorical distinctions to represent that structure. Representing this nested structure, where a mention may contain mention(s) of other entities, is critical for applications such as coreference resolution. The lack of this structure creates spurious ambiguity in the linear approximation.

    Research in NER has been shaped by the size and detail of the available annotated corpora. The existing structured named entity corpora are either small, in specialist domains, or in languages other than English. This thesis presents our Nested Named Entity (NNE) corpus of named entities and numerical and temporal expressions, taken from the WSJ portion of the Penn Treebank (PTB, Marcus et al., 1993). We use the BBN Pronoun Coreference and Entity Type Corpus (Weischedel and Brunstein, 2005a) as our basis, manually annotating it with a principled, fine-grained, nested annotation scheme and detailed annotation guidelines. The corpus comprises over 279,000 entities across 49,211 sentences (1,173,000 words), including 118,495 top-level entities.

    Our annotations were designed using twelve high-level principles that guided the development of the annotation scheme and the difficult decisions faced by annotators. We also monitored the semantic grammar that was being induced during annotation, seeking to identify and reinforce common patterns to maintain consistent, parsimonious annotations. The result is a scheme of 118 hierarchical fine-grained entity types and nesting rules, covering all capitalised mentions of entities, and numerical and temporal expressions. Unlike many corpora, we have developed detailed guidelines, including extensive discussion of edge cases, in an ongoing dialogue with our annotators, which is critical for consistency and reproducibility. We annotated independently of the PTB bracketing, allowing annotators to choose spans that were inconsistent with PTB conventions and errors, and referring back to the PTB only to resolve genuine ambiguity consistently. We merged our NNE with the PTB, requiring some systematic and one-off changes to both annotations. This allows the NNE corpus to complement other PTB resources, such as PropBank, and to inform PTB-derived corpora for other formalisms, such as CCG and HPSG. We compare this corpus against BBN.

    We consider several approaches to integrating the PTB and NNE annotations, which affect the sparsity of grammar rules and the visibility of syntactic and NE structure. We explore their impact on parsing the NNE and merged variants using the Berkeley parser (Petrov et al., 2006), which performs surprisingly well without specialised NER features. We experiment with flattening the NNE annotations into linear NER variants with stacked categories, and explore the ability of a maximum entropy and a CRF NER system to reproduce them. The CRF performs substantially better, but is infeasible to train on the enormous stacked category sets. The flattened output of the Berkeley parser is almost competitive with the CRF. Our results demonstrate that the NNE corpus is feasible for statistical models to reproduce. We invite researchers to explore new, richer models of (joint) parsing and NER on this complex and challenging task. Our nested named entity corpus will improve a wide range of NLP tasks, such as coreference resolution and question answering, allowing automated systems to understand and exploit the true structure of named entities.
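    As an illustration of the flattening step described above, the sketch below collapses nested entity spans into one stacked label per token; the label format (outer and inner categories joined with "|") and the example spans are assumptions made for illustration, not the corpus's actual encoding:

        def flatten_nested_entities(tokens, spans):
            # spans: list of (start, end, label) with end exclusive, outer spans listed first.
            stacked = [[] for _ in tokens]
            for start, end, label in spans:
                for i in range(start, end):
                    stacked[i].append(label)
            # Join nested labels outer-to-inner; "O" marks tokens outside any entity.
            return ["|".join(labels) if labels else "O" for labels in stacked]

        tokens = ["New", "York", "Stock", "Exchange", "officials"]
        spans = [(0, 4, "ORG"), (0, 2, "CITY")]   # a CITY mention nested inside an ORG mention
        print(flatten_nested_entities(tokens, spans))
        # ['ORG|CITY', 'ORG|CITY', 'ORG', 'ORG', 'O']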

    Towards Multilingual Coreference Resolution

    The current work investigates the problems that occur when coreference resolution is considered as a multilingual task. We assess the issues that arise when a framework using the mention-pair coreference resolution model and memory-based learning for the resolution process is used. Along the way, we revise three essential subtasks of coreference resolution: mention detection, mention head detection and feature selection. For each of these aspects we propose various multilingual solutions, including heuristic, rule-based and machine learning methods. We carry out a detailed analysis that includes eight different languages (Arabic, Catalan, Chinese, Dutch, English, German, Italian and Spanish) for which datasets were provided by the only two multilingual shared tasks on coreference resolution held so far: SemEval-2 and CoNLL-2012. Our investigation shows that, although complex, the coreference resolution task can be targeted in a multilingual and even language-independent way. We propose machine learning methods for each of the subtasks that are affected by the transition, and evaluate and compare them with the performance of rule-based and heuristic approaches. Our results confirm that machine learning provides the needed flexibility for the multilingual task and that the minimal requirement for a language-independent system is a part-of-speech annotation layer provided for each of the approached languages. We also show that the performance of the system can be improved by introducing other layers of linguistic annotation, such as syntactic parses (in the form of either constituency or dependency parses), named entity information, predicate-argument structure, etc. Additionally, we discuss the problems occurring in the proposed approaches and suggest possibilities for their improvement.
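    A schematic sketch of the mention-pair setup described above: each mention is paired with its preceding mentions, a feature vector is built for the pair, and a binary classifier decides whether the two corefer; the features shown here are illustrative placeholders rather than the system's actual feature set or its memory-based learner:

        def pair_features(antecedent, mention):
            # Toy feature vector for one mention pair (illustrative only).
            return {
                "same_head": antecedent["head"].lower() == mention["head"].lower(),
                "distance": mention["index"] - antecedent["index"],
                "both_proper": antecedent["pos"] == "PROPN" and mention["pos"] == "PROPN",
            }

        def mention_pairs(mentions):
            # Generate (antecedent, mention, features) instances for a binary coreference classifier.
            for j, mention in enumerate(mentions):
                for antecedent in mentions[:j]:
                    yield antecedent, mention, pair_features(antecedent, mention)

        mentions = [
            {"index": 0, "head": "Obama", "pos": "PROPN"},
            {"index": 1, "head": "president", "pos": "NOUN"},
            {"index": 2, "head": "Obama", "pos": "PROPN"},
        ]
        for antecedent, mention, features in mention_pairs(mentions):
            print(antecedent["head"], "<-", mention["head"], features)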

    Investigating Multilingual Coreference Resolution by Universal Annotations

    Multilingual coreference resolution (MCR) has been a long-standing and challenging task. With the newly proposed multilingual coreference dataset, CorefUD (Nedoluzhko et al., 2022), we conduct an investigation into the task by using its harmonized universal morphosyntactic and coreference annotations. First, we study coreference by examining the ground truth data at different linguistic levels, namely the mention, entity and document levels, and across different genres, to gain insights into the characteristics of coreference across multiple languages. Second, we perform an error analysis of the most challenging cases that the SotA system fails to resolve in the CRAC 2022 shared task, using the universal annotations. Last, based on this analysis, we extract features from the universal morphosyntactic annotations and integrate them into a baseline system to assess their potential benefits for the MCR task. Our results show that our best configuration of features improves the baseline by 0.9% F1 score.

    Comment: Accepted at Findings of EMNLP 2023.
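    A hedged sketch of the kind of feature extraction the last step describes: reading universal morphosyntactic annotations (here a reduced, CoNLL-U-like token representation) and attaching part-of-speech and dependency-relation features to a mention's head token; the field names and feature choices are illustrative assumptions, not the paper's actual configuration:

        def mention_features(sentence, head_index):
            # sentence: list of token dicts with 'form', 'upos' and 'deprel' fields
            # (a reduced view of CoNLL-U columns); head_index points at the mention head.
            head = sentence[head_index]
            return {"head_upos": head["upos"], "head_deprel": head["deprel"]}

        sentence = [
            {"form": "The", "upos": "DET", "deprel": "det"},
            {"form": "bank", "upos": "NOUN", "deprel": "nsubj"},
            {"form": "collapsed", "upos": "VERB", "deprel": "root"},
        ]
        print(mention_features(sentence, 1))   # {'head_upos': 'NOUN', 'head_deprel': 'nsubj'}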

    Investigating and extending the methods in automated opinion analysis through improvements in phrase based analysis

    Opinion analysis is an area of research which deals with the computational treatment of opinion statements and subjectivity in textual data. It has emerged over the past couple of decades as an active area of research, as it provides solutions to the issues raised by information overload. The problem of information overload has emerged with advancements in communication technologies, which gave rise to exponential growth in the user-generated subjective data available online. Opinion analysis has a rich set of applications, which are used to enable opportunities for organisations, such as tracking user opinions about products and social issues in communities, through to engagement in political participation.

    The opinion analysis area has been highly active in recent years, and research at different levels of granularity has been, and is being, undertaken. However, there are limitations in the state of the art, especially as dealing with each level of granularity on its own does not solve current research issues. Therefore a novel sentence-level opinion analysis approach utilising clause- and phrase-level analysis is proposed. This approach uses linguistic and syntactic analysis of sentences to understand the interdependence of words within sentences, and further uses rule-based analysis at the phrase level to calculate the opinion at each level of a sentence's hierarchical structure. The proposed opinion analysis approach requires lexical and contextual resources for implementation. In the context of this Thesis the approach is further presented as part of an extended unifying framework for opinion analysis, resulting in the design and construction of a novel corpus. The above contributions to the field (approach, framework and corpus) are evaluated within the Thesis and are found to address existing limitations in the field, particularly with regard to the automation of opinion analysis. Further work is required in integrating a mechanism for greater word sense disambiguation and in lexical resource development.
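    As a rough illustration of the clause- and phrase-level composition described above, the sketch below scores a sentence by recursing over a toy phrase structure and applying a simple negation rule; the lexicon, the tree encoding and the rule are invented for illustration and are not the thesis's actual resources or rules:

        # Toy prior-polarity lexicon and negator list (illustrative only).
        LEXICON = {"good": 1.0, "excellent": 2.0, "bad": -1.0, "terrible": -2.0}
        NEGATORS = {"not", "never", "no"}

        def phrase_opinion(node):
            # A node is either a word (str) or a list of child nodes (a phrase or clause).
            if isinstance(node, str):
                return LEXICON.get(node.lower(), 0.0)
            total = sum(phrase_opinion(child) for child in node)
            # Rule: a negator directly inside the phrase flips the phrase's opinion.
            words = [child for child in node if isinstance(child, str)]
            if any(word.lower() in NEGATORS for word in words):
                total = -total
            return total

        # "The service was not good" as a toy clause/phrase structure.
        sentence_tree = [["The", "service"], ["was", ["not", "good"]]]
        print(phrase_opinion(sentence_tree))   # -1.0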

    Mapping product and service innovation: A bibliometric analysis and a typology

    Research conducted in the innovation field lags behind organizations’ general technological development and innovativeness. The literature that previously depicted innovation types in developed markets differs markedly from the increasingly publicized innovation types of emerging markets. While capital-abundant firms tend to engage in pioneering and incremental innovation loops, resource-constrained firms and firms in emerging countries may partially free-ride on existing products and services through innovations such as copycat and frugal innovation. To date, there have been no attempts to holistically consolidate product and service innovation types into one overarching typology. Using novel methods of text mining and co-citation analysis, this study systematically maps three decades of product and service innovation scholarship to provide a typology of eight major product and service innovation types. This is further supported by case study analysis to demonstrate how these innovation types fit into a cost vs. market-novelty matrix. The study is unique in its methodological proposition to systematically review the innovation scholarship of more than 1,400 articles through comprehensive, quantified, and objective methods that offer transparent and reproducible results. The study provides some clarity regarding the classifications and characteristics of the innovation typology.

    Doctor of Philosophy

    Dissertation. Events are an important type of information found throughout text. Event extraction is an information extraction (IE) task that involves identifying entities and objects (mainly noun phrases) that represent important roles in events of a particular type. However, the extraction performance of current event extraction systems is limited because they mainly consider local context (mostly isolated sentences) when making each extraction decision. My research aims to improve both the coverage and the accuracy of event extraction by explicitly identifying event contexts before extracting individual facts.

    First, I introduce new event extraction architectures that incorporate discourse information across a document to seek out and validate pieces of event descriptions within the document. TIER is a multilayered event extraction architecture that performs text analysis at multiple granularities to progressively "zoom in" on relevant event information. LINKER is a unified discourse-guided approach that includes a structured sentence classifier to sequentially read a story and determine which sentences contain event information based on both the local and the preceding contexts. Experimental results on two distinct event domains show that, compared to previous event extraction systems, TIER can find more event information while maintaining good extraction accuracy, and LINKER can further improve extraction accuracy.

    Finding documents that describe a specific type of event is also highly challenging because of the wide variety and ambiguity of event expressions. In this dissertation, I present a multifaceted event recognition approach that uses event-defining characteristics (facets), in addition to event expressions, to effectively resolve the complexity of event descriptions. I also present a novel bootstrapping algorithm to automatically learn event expressions as well as facets of events, which requires minimal human supervision. Experimental results show that the multifaceted event recognition approach can effectively identify documents that describe a particular type of event and make event extraction systems more precise.
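    A hedged sketch of the discourse-guided idea described above: a sentence classifier reads a document sequentially and decides, from the current sentence plus the preceding context, which sentences contain event information, and extraction is then applied only to those sentences; the cue-word classifier and the selection step below are placeholders, not the actual TIER or LINKER components:

        EVENT_CUES = {"attack", "bombing", "killed", "exploded"}   # toy cue list (assumed)

        def is_event_sentence(sentence, preceding_relevant):
            # Toy stand-in for a structured sentence classifier: combines cues in the
            # current sentence with whether the preceding sentence was event-relevant.
            words = [word.lower().strip(".,") for word in sentence.split()]
            return any(word in EVENT_CUES for word in words) or (preceding_relevant and "it" in words)

        def select_event_sentences(document):
            # Sequentially read sentences and keep the event-relevant ones;
            # a real system would then extract role fillers from the kept sentences.
            relevant, previous = [], False
            for sentence in document:
                previous = is_event_sentence(sentence, previous)
                if previous:
                    relevant.append(sentence)
            return relevant

        document = [
            "A bombing struck the market on Friday.",
            "It killed three people.",
            "The weather was sunny that afternoon.",
        ]
        print(select_event_sentences(document))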