
794 research outputs found

A Hybrid Extraction Model for Chinese Noun/Verb Synonymous bi-gram Collocations

Authors: Li Wanyin, Lu Qin
Publication venue: Institute of Digital Enhancement of Cognitive Processing, Waseda University
Publication date: 01/01/2011

A Corpus-based Language Network Analysis of Near-synonyms in a Specialized Corpus

Author: LU WENYU
Publication venue: Graduate School, Korea Maritime and Ocean University (한국해양대학교 대학원)
Publication date: 01/08/2017

As the international medium of communication for seafarers throughout the world, English has long been recognized as important in the maritime industry. Many studies have been conducted on Maritime English teaching and learning; however, although the language contains many near-synonyms, few studies have examined near-synonyms used in the maritime industry. The objective of this study is to answer the following three questions. First, what are the differences and similarities between different near-synonyms in English? Second, can collocation network analysis provide a new perspective to explain the distinctions between near-synonyms at the microscopic level? Third, is semantic domain network analysis useful for distinguishing one near-synonym from another at the macroscopic level? In pursuit of these research questions, I first reviewed how the idea of incorporating collocates in corpus linguistics, Maritime English, near-synonyms, semantic domains and language networks has been studied. Second, important concepts such as Maritime English, English for Specific Purposes, corpus linguistics, synonymy, collocation, semantic domains and language network analysis were introduced. Third, I compiled a 2.5-million-word specialized Maritime English Corpus and proposed a new method of tagging English multi-word compounds, comparing the corpus with and without multi-word compound tagging in terms of tokens, types, standardized type-token ratio (STTR) and mean word length. Fourth, I examined collocates of five groups of near-synonyms, i.e., ship vs. vessel, maritime vs. marine, ocean vs. sea, safety vs. security, and harbor vs. port, drawing data through WordSmith 6.0, tagging semantic domains in Wmatrix 3.0, and conducting network analyses using NetMiner 4.0. In the final stage, the results and discussion allowed me to answer the research questions. First, maritime near-synonyms generally show a clear preference for specific collocates.
Because Maritime English is a specialized variety, general definitions do not help distinguish its near-synonyms, so a new perspective is needed to view the behavior of maritime words. Second, as a special visualization method, collocation network analysis can give learners a direct view of the relationships between words. Compared with traditional collocation tables, it lets learners identify collocates more quickly and find the relationships between several node words. In addition, it is much easier for learners to find the collocates exclusive to a specific word, which helps them understand the meaning specific to that word. Third, whereas the collocation network shows learners the relationships between words, the semantic domain network offers cognitive guidance: given a specific word, how a person can process it mentally and so find the more appropriate synonym to collocate with. Main semantic domain network analysis shows the domains exclusive to a certain near-synonym and therefore defines the concepts exclusive to that near-synonym. Furthermore, main semantic domain network analysis and sub-semantic domain network analysis together show how collocates prefer or tend toward one synonym rather than another, even when the synonyms share semantic domains. The options for identifying relationships between near-synonyms can be presented through the classic metaphor of "the forest and the trees." Generally speaking, we see only the vein of a tree leaf through traditional sentence-level analysis. We see the full leaf through collocation network analysis. We see the tree, even the whole forest, through semantic domain network analysis.

Contents
Chapter 1. Introduction: 1.1 Focus of Inquiry; 1.2 Outline of the Thesis
Chapter 2. Literature Review: 2.1 A Brief Synopsis; 2.2 Maritime English as an English for Specific Purposes (ESP); 2.2.1 What is ESP?; 2.2.2 Maritime English as ESP; 2.2.3 ESP and Corpus Linguistics; 2.3 Synonymy; 2.3.1 Definition of Synonymy; 2.3.2 Synonymy as a Matter of Degree; 2.3.3 Criteria for Synonymy Differentiation; 2.3.4 Near-synonyms in Corpus Linguistics; 2.4 Collocation; 2.4.1 Definition of Collocation; 2.4.2 Collocation in Corpus Linguistics; 2.4.2.1 Definition of Collocation in Corpus Linguistics; 2.4.2.2 Collocation vs. Colligation; 2.4.3 Lexical Priming of Collocation in Psychology; 2.5 Language Network Analysis; 2.5.1 Definition; 2.5.2 Classification; 2.5.3 Basic Concepts; 2.5.4 Previous Studies; 2.6 Semantic Domain Analysis; 2.6.1 Concepts of Semantic Domains; 2.6.2 Previous Studies on Semantic Domain Analysis
Chapter 3. Data and Methodology: 3.1 Maritime English Corpus; 3.1.1 What is a Corpus?; 3.1.2 Characteristics of a Corpus; 3.1.2.1 Corpus-driven vs. Corpus-based research; 3.1.2.2 Specialized Corpora for Specialized Discourse; 3.1.3 Maritime English Corpus (MEC); 3.1.3.1 Sampling of the MEC; 3.1.3.2 Size, Balance, and Representativeness; 3.1.3.3 Multi-word Compounds in the MEC; 3.1.3.4 Basic Information of the MEC; 3.2 Methodology for Collocates Extraction; 3.3 Methodology for Networks Visualization; 3.4 Methodology for Semantic Tagging; 3.5 Process of Data Analysis
Chapter 4. Collocation Network Analysis of Near-synonyms: 4.1 Meaning Differences; 4.1.1 Ship vs. Vessel; 4.1.2 Maritime vs. Marine; 4.1.3 Sea vs. Ocean; 4.1.4 Safety vs. Security; 4.1.5 Port vs. Harbor; 4.2 Similarity Degree of Groups of Near-synonyms; 4.2.1 Similarity Degree Based on Number of Shared Collocates; 4.2.2 Similarity Degree Based on MI3 Cosine Similarity; 4.3 Collocation Network Analysis; 4.3.1 Ship vs. Vessel; 4.3.2 Maritime vs. Marine; 4.3.3 Sea vs. Ocean; 4.3.4 Safety vs. Security; 4.3.5 Port vs. Harbor; 4.4 Advantages and Limitations of Collocation Network Analysis
Chapter 5. Semantic Domain Network Analysis of Near-synonyms: 5.1 Comparison between Collocation and Semantic Domain Analysis; 5.2 Semantic Domain Network Analysis of Exclusiveness; 5.2.1 Ship vs. Vessel; 5.2.2 Maritime vs. Marine; 5.2.3 Sea vs. Ocean; 5.2.4 Safety vs. Security; 5.2.5 Port vs. Harbor; 5.3 Analysis of Shared Semantic Domains; 5.4 Advantages and Limitations of Semantic Domain Network Analysis
Chapter 6. Conclusion: 6.1 Summary; 6.2 Limitations and Implications
References
Appendix: Collocates of Near-synonyms
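The contents of this thesis list two similarity measures for near-synonym groups: the number of shared collocates and a cosine similarity over MI3 scores. The listing does not give the thesis's exact formulas, so the following is only a minimal sketch under stated assumptions: MI3 is taken in its common form, log2(O^3 / E) with E = f(node) * f(collocate) / N and no window-span adjustment, and all scores in the example profiles are invented for illustration.

```python
import math


def mi3(cooccurrence, node_freq, collocate_freq, corpus_size):
    """Cubed mutual information: log2(O^3 / E), with E = f(node) * f(collocate) / N.

    Assumption: no adjustment for the collocation window span.
    """
    expected = node_freq * collocate_freq / corpus_size
    return math.log2(cooccurrence ** 3 / expected)


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {collocate: score} dicts."""
    dot = sum(score * v[word] for word, score in u.items() if word in v)
    norm_u = math.sqrt(sum(s * s for s in u.values()))
    norm_v = math.sqrt(sum(s * s for s in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Invented MI3 profiles for illustration only; not data from the thesis.
ship = {"cargo": 14.2, "owner": 12.8, "cruise": 11.5}
vessel = {"cargo": 13.1, "fishing": 12.9, "owner": 10.4}

shared_collocates = sorted(set(ship) & set(vessel))  # collocates common to both words
similarity = cosine(ship, vessel)                    # MI3-weighted profile similarity
```

Collocates that appear in only one profile contribute nothing to the dot product but still enlarge that word's norm, so words with many exclusive collocates score as less similar.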

Korea Maritime and Ocean University (KMOU)

Using Corpus-based Linguistic Approaches in Sense Prediction Study

Authors: Ahrens Kathleen, Hong Jia-Fei, Huang Chu-Ren, Ker Sue-Jin
Publication venue: Institute of Digital Enhancement of Cognitive Processing, Waseda University
Publication date: 01/01/2011

Waseda University Repository

A hybrid extraction model for Chinese noun/verb synonym bi-gram

Authors: Li W, Lu Q
Publication venue: Institute for Digital Enhancement of Cognitive Development, Waseda University
Publication date: 11/12/2014


CiteSeerX

PolyU Institutional Repository

A Similarity Detection Method Based on Distance Matrix Model with Row-Column Order penalty Factor

Authors: Han Yaqing, Li Jun, Niu Yan
Publication venue: Institute of Advanced Engineering and Science
Publication date: 01/12/2014

Paper detection involves multiple disciplines, and making a comprehensive and correct evaluation of academic misconduct is a complex and sensitive issue. The main existing detection models have several problems, such as incomplete specification of segmentation preprocessing, the impact of semantic order on detection, near-synonym evaluation, and slow paper backtracking. This paper presents a sentence-level paper similarity comparison model with segmentation preprocessing based on special identifiers. The model integrates the characteristics of vector detection, Hamming distance and the longest common substring, and it detects near-synonyms, word deletion and changes in word order by redefining the distance matrix and adding ordinal measures, making sentence similarity detection more effective in terms of semantics and backbone word segmentation. Compared with traditional paper similarity retrieval, the present method adopts modulo-2 arithmetic with low computational cost. A reliable and highly efficient paper detection method is of great academic significance for word segmentation, similarity detection and document summarization.
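Two of the building blocks this abstract names, Hamming distance and the longest common substring, can be illustrated at the sentence level. The sketch below is not the paper's distance matrix model or its row-column order penalty; it assumes whitespace-tokenized sentences, binary term-presence vectors over the joint vocabulary, and invented example sentences. The function names are hypothetical.

```python
def longest_common_substring(a, b):
    """Longest run of consecutive tokens shared by token lists a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at the previous token pair.
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]


def hamming_distance(a, b):
    """Hamming distance between binary term-presence vectors over the joint vocabulary."""
    set_a, set_b = set(a), set(b)
    vocab = sorted(set_a | set_b)
    # (x + y) % 2 is XOR: 1 exactly where the two vectors disagree.
    return sum(((w in set_a) + (w in set_b)) % 2 for w in vocab)


# Invented example sentences, for illustration only.
s1 = "the model detects changes in word order".split()
s2 = "this model detects changes in order".split()

common_run = longest_common_substring(s1, s2)
distance = hamming_distance(s1, s2)
```

The modulo-2 sum in `hamming_distance` only needs a bitwise comparison per vocabulary term, which is one way to read the abstract's remark about low computation.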

Bulletin of Electrical Engineering and Informatics

Neliti

Crossref

Multilingual Lexicon Extraction under Resource-Poor Language Pairs

Author: 서형원
Publication venue: Korea Maritime and Ocean University (한국해양대학교)
Publication date: 01/08/2015

In general, bilingual and multilingual lexicons are important resources in many natural language processing fields such as information retrieval and machine translation. Such lexicons are usually extracted from bilingual (e.g., parallel or comparable) corpora with external seed dictionaries. However, few such corpora and bilingual seed dictionaries are publicly available for many language pairs such as Korean–French. It is important that such resources for these language pairs be publicly available or easily accessible when a monolingual resource is considered. This thesis presents efficient approaches for extracting bilingual single-/multi-word lexicons for resource-poor language pairs such as Korean–French and Korean–Spanish. The goal of this thesis is to present several efficient statistical methods for extracting translations of single words and multi-word expressions from bilingual corpora. Three approaches for single words and one approach for multi-words are proposed. The first approach is the pivot context-based approach (PCA). The PCA uses a pivot language to connect source and target languages. It builds context vectors from two parallel corpora sharing one pivot language and calculates their similarity scores to choose the best translation equivalents. By using parallel corpora rather than comparable corpora, the approach reduces the effort otherwise required to translate with a seed dictionary. The second approach is the extended pivot context-based approach (EPCA). This approach gathers similar context vectors for each source word to augment its context. The approach assumes that similar vectors can enrich contexts. For example, young and youth can augment the context of baby. In the investigation described here, such similar vectors were collected by similarity measures such as cosine similarity. The third approach for single words uses a competitive neural network algorithm (i.e., self-organizing maps, SOM). The SOM-based approach (SA) uses synonym vectors rather than context vectors to train two different SOMs (i.e., source and target SOMs) in different ways. A source SOM is trained in an unsupervised way, while a target SOM is trained in a supervised way. The fourth approach is the constituent-based approach (CTA), which deals with multi-word expressions (MWEs). This approach reinforces the PCA for multi-words (PCAM). It extracts bilingual MWEs taking all constituents of the source MWEs into consideration. The PCAM first identifies MWE candidates by pointwise mutual information and then adds them to the input data as single units so that the PCA can be used directly. The experimental results show that the proposed approaches generally perform well for resource-poor language pairs, particularly Korean–French and Korean–Spanish. The PCA and SA have demonstrated good performance for such language pairs. The EPCA did not perform as strongly as expected. The CTA performs well even when word contexts are insufficient. Overall, the experimental results show that the CTA significantly outperforms the PCAM. In the future, homonyms (i.e., homographs such as lead or tear) should be considered. In particular, the domains of bilingual corpora should be identified. In addition, more parts of speech such as verbs, adjectives, or adverbs could be tested. In this thesis, only nouns are discussed for simplicity. Finally, thorough error analysis should also be conducted.

Contents
Abstract; List of Abbreviations; List of Tables; List of Figures; Acknowledgement
Chapter 1 Introduction: 1.1 Multilingual Lexicon Extraction; 1.2 Motivations and Goals; 1.3 Organization
Chapter 2 Background and Literature Review: 2.1 Extraction of Bilingual Translations of Single-words; 2.1.1 Context-based approach; 2.1.2 Extended approach; 2.1.3 Pivot-based approach; 2.2 Extraction of Bilingual Translations of Multi-Word Expressions; 2.2.1 MWE identification; 2.2.2 MWE alignment; 2.3 Self-Organizing Maps; 2.4 Evaluation Measures
Chapter 3 Pivot Context-Based Approach: 3.1 Concept of Pivot-Based Approach; 3.2 Experiments; 3.2.1 Resources; 3.2.2 Results; 3.3 Summary
Chapter 4 Extended Pivot Context-Based Approach: 4.1 Concept of Extended Pivot Context-Based Approach; 4.2 Experiments; 4.2.1 Resources; 4.2.2 Results; 4.3 Summary
Chapter 5 SOM-Based Approach: 5.1 Concept of SOM-Based Approach; 5.2 Experiments; 5.2.1 Resources; 5.2.2 Results; 5.3 Summary
Chapter 6 Constituent-Based Approach: 6.1 Concept of Constituent-Based Approach; 6.2 Experiments; 6.2.1 Resources; 6.2.2 Results; 6.3 Summary
Chapter 7 Conclusions and Future Work: 7.1 Conclusions; 7.2 Future Work
Reference
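The pivot context-based idea in this abstract (context vectors built over a shared pivot language, then ranked by similarity) can be sketched in a few lines. This is an illustrative reading, not the thesis's implementation: it assumes sentence-aligned, pre-tokenized parallel data, raw co-occurrence counts as vector weights, and cosine similarity for ranking. The function names and the toy Korean–English and French–English sentence pairs are invented.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentence_pairs):
    """Map each word to counts of pivot-language words in its aligned sentences."""
    vectors = defaultdict(Counter)
    for words, pivot_words in sentence_pairs:
        for word in words:
            vectors[word].update(pivot_words)
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_translations(source_word, source_vectors, target_vectors):
    """Return target words ordered from most to least similar pivot context."""
    scored = [
        (cosine(source_vectors[source_word], vector), target)
        for target, vector in target_vectors.items()
    ]
    return [target for score, target in sorted(scored, key=lambda p: (-p[0], p[1]))]


# Invented toy corpora sharing English as the pivot language.
ko_en = [(["배"], ["ship"]), (["바다"], ["sea"]), (["배", "바다"], ["ship", "sea"])]
fr_en = [(["navire"], ["ship"]), (["mer"], ["sea"]), (["navire", "mer"], ["ship", "sea"])]

ko_vectors = context_vectors(ko_en)
fr_vectors = context_vectors(fr_en)
candidates = rank_translations("배", ko_vectors, fr_vectors)
```

Because both vector spaces are indexed by pivot-language words, source and target vectors are directly comparable without a seed dictionary, which is the property the abstract attributes to the PCA.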

Korea Maritime and Ocean University (KMOU)