115 research outputs found

    METRICC: Harnessing Comparable Corpora for Multilingual Lexicon Development

    Research on comparable corpora has grown in recent years, bringing about the possibility of developing multilingual lexicons by exploiting comparable corpora to create corpus-driven multilingual dictionaries. To date, this issue has not been widely addressed. This paper focuses on the use of the mechanism of collocational networks proposed by Williams (1998) for exploiting comparable corpora. The paper first describes the METRICC project, which is aimed at the automatic creation of comparable corpora, along with one of the crawlers developed for building comparable corpora, and then discusses the power of collocational networks for multilingual corpus-driven dictionary development.

    Attaining Fluency in English through Collocations

    A Corpus-based Language Network Analysis of Near-synonyms in a Specialized Corpus

    As the international medium of communication for seafarers throughout the world, the importance of English has long been recognized in the maritime industry. Many studies have been conducted on Maritime English teaching and learning. Nevertheless, although the language contains many near-synonyms, few studies have examined the near-synonyms used in the maritime industry. The objective of this study is to answer the following three questions. First, what are the differences and similarities between different near-synonyms in English? Second, can collocation network analysis provide a new perspective to explain the distinctions between near-synonyms at the microscopic level? Third, is semantic domain network analysis useful for distinguishing one near-synonym from another at the macroscopic level? In pursuit of these research questions, I first illustrated how the idea of incorporating collocates in corpus linguistics, Maritime English, near-synonyms, semantic domains and language networks has been studied. Second, important concepts such as Maritime English, English for Specific Purposes, corpus linguistics, synonymy, collocation, semantic domains and language network analysis were introduced. Third, I compiled a 2.5-million-word specialized Maritime English Corpus and proposed a new method of tagging English multi-word compounds, comparing the corpus with and without multi-word compounds with regard to tokens, types, STTR and mean word length. Fourth, I examined collocates of five groups of near-synonyms, i.e., ship vs. vessel, maritime vs. marine, ocean vs. sea, safety vs. security, and harbor vs. port, drawing data through WordSmith 6.0, tagging semantic domains in Wmatrix 3.0, and conducting network analyses using NetMiner 4.0. In the final stage, from the results and discussions, I was able to answer the research questions. First, maritime near-synonyms generally show a clear preference for specific collocates.
Because Maritime English is specialized, general definitions do not help distinguish near-synonyms; a new perspective is therefore needed to view the behavior of maritime words. Second, as a special visualization method, collocation network analysis can give learners a direct view of the relationships between words. Compared with traditional collocation tables, it lets learners identify collocates more quickly and find the relationships between several node words. In addition, it is much easier for learners to find the collocates exclusive to a specific word, which helps them understand the meaning specific to that word. Third, if the collocation network shows learners the relationships between words, the semantic domain network offers cognitive guidance: given a specific word, how a person can process it mentally and thereby find the more appropriate synonym to collocate with it. Main semantic domain network analysis shows the domains exclusive to a certain near-synonym, and therefore defines the concepts exclusive to that near-synonym. Furthermore, main semantic domain network analysis and sub-semantic domain network analysis together can tell us how near-synonyms show a preference or tendency for one synonym rather than another, even when they share semantic domains. The options for identifying relationships between near-synonyms can be presented through the classic metaphor of "the forest and the trees." Generally speaking, we see only the vein of a tree leaf through the traditional way of sentence-level analysis. We see the full leaf through collocation network analysis. We see the tree, even the whole forest, through semantic domain network analysis.

Contents
Chapter 1. Introduction: 1.1 Focus of Inquiry; 1.2 Outline of the Thesis
Chapter 2. Literature Review: 2.1 A Brief Synopsis; 2.2 Maritime English as an English for Specific Purposes (ESP); 2.2.1 What is ESP?; 2.2.2 Maritime English as ESP; 2.2.3 ESP and Corpus Linguistics; 2.3 Synonymy; 2.3.1 Definition of Synonymy; 2.3.2 Synonymy as a Matter of Degree; 2.3.3 Criteria for Synonymy Differentiation; 2.3.4 Near-synonyms in Corpus Linguistics; 2.4 Collocation; 2.4.1 Definition of Collocation; 2.4.2 Collocation in Corpus Linguistics; 2.4.2.1 Definition of Collocation in Corpus Linguistics; 2.4.2.2 Collocation vs. Colligation; 2.4.3 Lexical Priming of Collocation in Psychology; 2.5 Language Network Analysis; 2.5.1 Definition; 2.5.2 Classification; 2.5.3 Basic Concepts; 2.5.4 Previous Studies; 2.6 Semantic Domain Analysis; 2.6.1 Concepts of Semantic Domains; 2.6.2 Previous Studies on Semantic Domain Analysis
Chapter 3. Data and Methodology: 3.1 Maritime English Corpus; 3.1.1 What is a Corpus?; 3.1.2 Characteristics of a Corpus; 3.1.2.1 Corpus-driven vs. Corpus-based research; 3.1.2.2 Specialized Corpora for Specialized Discourse; 3.1.3 Maritime English Corpus (MEC); 3.1.3.1 Sampling of the MEC; 3.1.3.2 Size, Balance, and Representativeness; 3.1.3.3 Multi-word Compounds in the MEC; 3.1.3.4 Basic Information of the MEC; 3.2 Methodology for Collocates Extraction; 3.3 Methodology for Networks Visualization; 3.4 Methodology for Semantic Tagging; 3.5 Process of Data Analysis
Chapter 4. Collocation Network Analysis of Near-synonyms: 4.1 Meaning Differences; 4.1.1 Ship vs. Vessel; 4.1.2 Maritime vs. Marine; 4.1.3 Sea vs. Ocean; 4.1.4 Safety vs. Security; 4.1.5 Port vs. Harbor; 4.2 Similarity Degree of Groups of Near-synonyms; 4.2.1 Similarity Degree Based on Number of Shared Collocates; 4.2.2 Similarity Degree Based on MI3 Cosine Similarity; 4.3 Collocation Network Analysis; 4.3.1 Ship vs. Vessel; 4.3.2 Maritime vs. Marine; 4.3.3 Sea vs. Ocean; 4.3.4 Safety vs. Security; 4.3.5 Port vs. Harbor; 4.4 Advantages and Limitations of Collocation Network Analysis
Chapter 5. Semantic Domain Network Analysis of Near-synonyms: 5.1 Comparison between Collocation and Semantic Domain Analysis; 5.2 Semantic Domain Network Analysis of Exclusiveness; 5.2.1 Ship vs. Vessel; 5.2.2 Maritime vs. Marine; 5.2.3 Sea vs. Ocean; 5.2.4 Safety vs. Security; 5.2.5 Port vs. Harbor; 5.3 Analysis of Shared Semantic Domains; 5.4 Advantages and Limitations of Semantic Domain Network Analysis
Chapter 6. Conclusion: 6.1 Summary; 6.2 Limitations and Implications
References
Appendix: Collocates of Near-synonyms
Docto
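
The thesis outline above lists two similarity measures for near-synonym groups: the number of shared collocates and an MI3 cosine similarity. The abstract does not give the formulas used, so the following is only a minimal sketch under stated assumptions: MI3 is taken as log2(O^3 / E), with the expected frequency simplified to E = f(node) * f(collocate) / N (no window-size correction). The function names and the toy sentence are hypothetical illustrations, not the author's method or data.

```python
import math
from collections import Counter

def collocate_counts(tokens, node, window=2):
    """Count words occurring within +/- `window` tokens of each `node` hit."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[tokens[j]] += 1
    return counts

def mi3_profile(tokens, node, window=2):
    """Map each collocate of `node` to an MI3 score (assumed simplified form)."""
    freq = Counter(tokens)
    n = len(tokens)
    profile = {}
    for word, observed in collocate_counts(tokens, node, window).items():
        expected = freq[node] * freq[word] / n  # assumption: no span correction
        profile[word] = math.log2(observed ** 3 / expected)
    return profile

def cosine(p, q):
    """Cosine similarity between two sparse collocate profiles."""
    dot = sum(p[w] * q[w] for w in set(p) & set(q))
    norm = math.sqrt(sum(v * v for v in p.values())) * \
           math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

# Invented toy text, for illustration only.
tokens = ("the ship entered the port the vessel entered the port "
          "the ship left the harbor").split()
ship = mi3_profile(tokens, "ship")
vessel = mi3_profile(tokens, "vessel")
shared = set(ship) & set(vessel)          # shared collocates of the pair
similarity = cosine(ship, vessel)         # MI3-weighted cosine similarity
```

On a real corpus the profiles would normally be filtered by a minimum co-occurrence frequency before comparison, since MI-type scores are unstable for rare collocates.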

    Towards Interactive Multidimensional Visualisations for Corpus Linguistics

    We propose the novel application of dynamic and interactive visualisation techniques to support the iterative and exploratory investigations typical of the corpus linguistics methodology. Very large scale text analysis is already carried out in corpus-based language analysis by employing methods such as frequency profiling, keywords, concordancing, collocations and n-grams. However, at present only basic visualisation methods are utilised. In this paper, we describe case studies of multiple types of key word clouds, explorer tools for collocation networks, and compare network and language distance visualisations for online social networks. These are shown to fit better with the iterative data-driven corpus methodology, and permit some level of scalability to cope with ever increasing corpus size and complexity. In addition, they will allow corpus linguistic methods to be used more widely in the digital humanities and social sciences since the learning curve with visualisations is shallower for non-expert
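
Two of the corpus methods named in this abstract, n-grams and concordancing, can be illustrated in a few lines. This is a minimal sketch assuming a pre-tokenized text; the helper names `ngrams` and `kwic` are hypothetical and are not taken from the paper or from any particular corpus tool.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def kwic(tokens, keyword, width=3):
    """Key-word-in-context lines: (left context, keyword, right context)."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

# Invented toy sentence, for illustration only.
tokens = "the ship left the port".split()
bigram_freq = Counter(ngrams(tokens, 2))   # frequency profile of bigrams
concordance = kwic(tokens, "the", width=1)
```

The frequency table and concordance lines produced this way are the kind of tabular output that the proposed visualisations are meant to complement.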

    Systematic Analysis of the Factors Contributing to the Variation and Change of the Microbiome

    Understanding changes and trends in biomedical knowledge is crucial for individuals, groups, and institutions, as biomedicine improves people’s lives, supports national economies, and facilitates innovation. However, as knowledge changes, what evidence illustrates those changes? In the case of the microbiome, a multi-dimensional concept from biomedicine, there are significant increases in publications, citations, funding, collaborations, and other explanatory variables or contextual factors. What is observed in the microbiome, or in any historical evolution of a scientific field or scientific knowledge, is that these changes are related to changes in knowledge, but what is not understood is how to measure and track changes in knowledge. This investigation highlights how contextual factors from the language and social context of the microbiome are related to changes in the usage, meaning, and scientific knowledge of the microbiome. Two interconnected studies, which integrate qualitative and quantitative evidence to examine the variation and change of the microbiome, are presented. First, the concepts microbiome, metagenome, and metabolome are compared to determine the boundaries of the microbiome concept in relation to other concepts whose conceptual boundaries have been cited as overlapping. A collection of publications, or corpus, for each concept is presented, with a focus on how to create, collect, curate, and analyze large data collections. This study concludes with suggestions on how to analyze biomedical concepts using a hybrid approach that combines results from the larger language context and individual words. Second, the results of a systematic review that describes the variation and change of microbiome research, funding, and knowledge are examined. A corpus of approximately 28,000 articles on the microbiome is characterized, and a spectrum of microbiome interpretations is suggested based on differences related to context.
The collective results suggest the microbiome is a separate concept from the metagenome and metabolome, and the variation and change to the microbiome concept was influenced by contextual factors. These results provide insight into how concepts with extensive resources behave within biomedicine and suggest the microbiome is possibly representative of conceptual change or a preview of new dynamics within science that are expected in the future. Dissertation/Thesis. Doctoral Dissertation. Biology 201

    Systematic Exploration of Collocation Profiles

    The central issue in corpus-driven linguistics is the detection and description of patterns in language usage. The features that constitute the notion of a pattern can be computed to a certain extent by statistical (collocation) methods, but a crucial part of the notion may vary depending on applications and users. Thus, typically, any computed collocation cluster will have to be interpreted hermeneutically. Often it might be captured by a generalized, more abstract pattern. We present a generic process model that supports the recognition, interpretation, and expression of the patterns inside and of the relations between clusters. By this, clusters can be merged virtually according to any notion of a 'pattern', and their relations can be exploited for different application

    Language and Linguistics in a Complex World Data, Interdisciplinarity, Transfer, and the Next Generation. ICAME41 Extended Book of Abstracts

    This is a collection of papers, work-in-progress reports, and other contributions that were part of the ICAME41 digital conference
