1,685 research outputs found

    Refining the use of the web (and web search) as a language teaching and learning resource

    The web is a potentially useful corpus for language study because it provides examples of language that are contextualized and authentic, and is large and easily searchable. However, web contents are heterogeneous in the extreme, uncontrolled and hence 'dirty,' and exhibit features different from the written and spoken texts in other linguistic corpora. This article explores the use of the web and web search as a resource for language teaching and learning. We describe how a particular derived corpus containing a trillion word tokens in the form of n-grams has been filtered by word lists and syntactic constraints and used to create three digital library collections, linked with other corpora and the live web, that exploit the affordances of web text and mitigate some of its constraints

    Multi-facet rating of online hotel reviews: issues, methods and experiments

    Online product reviews are becoming increasingly popular, and consumers use them more and more frequently to choose among competing products. Tools that rank competing products in terms of the satisfaction of consumers who have previously purchased them are thus also becoming popular. We tackle the problem of rating (i.e., attributing a numerical score of satisfaction to) consumer reviews based on their textual content. In this work we focus on multi-facet rating of hotel reviews, i.e., on the case in which the review of a hotel must be rated several times, according to several aspects (e.g., cleanliness, dining facilities, centrality of location). We explore several aspects of the problem, including the vectorial representation of the text based on sentiment analysis, collocation analysis, and feature selection for ordinal-regression learning. We present the results of experiments conducted on a corpus of approximately 15,000 hotel reviews that we have crawled from a popular hotel review site

    Computational Linguistics and Natural Language Processing

    This chapter provides an introduction to computational linguistics methods, with a focus on their applications to the practice and study of translation. It covers computational models, methods and tools for collection, storage, indexing and analysis of linguistic data in the context of translation, and discusses the main methodological issues and challenges in this field. While an exhaustive review of existing computational linguistics methods and tools is beyond the scope of this chapter, we describe the most representative approaches, and illustrate them with descriptions of typical applications. Comment: This is the unedited author's copy of a text which appeared as a chapter in "The Routledge Handbook of Translation and Methodology", edited by F Zanettin and C Rundle (2022)

    The semantics of sustainable development: A corpus-assisted, ecological analysis of discourse across languages

    In European societies, sustainable development is often mentioned by politicians and media. But what do politicians and media mean when they use the expression sustainable development? Linguistic research has shown that sustainable development is frequently intended as an unspecified condition that needs to be achieved with an anthropocentric attitude (Alexander 2002, Mahlberg 2007, Naeem et al. 2016). The present research aims at outlining the discursive construction of sustainable development in the political discourse of the United Nations’ 2030 Agenda for Sustainable Development and in news discourse that appeared after the release of the UN’s resolution. The discursive construction of sustainable development is explored in English, Hungarian and Italian, and it is identified by means of two concepts: cultural keywords, namely politically, socially and culturally salient lexemes (Williams 1983); and meaning by collocation, namely the semantics that lexemes acquire by co-occurring with a limited set of words belonging to certain word classes, fitting a common semantic area and sharing a mutual connotation (Firth 1957, Sinclair 1991). The study of cultural keywords and meaning by collocation is carried out within the theoretical framework of corpus-assisted, ecological analysis of discourse with a cross-linguistic approach. The discursive construction of sustainable development is investigated in two corpora: the 2030 Agenda Corpus and the Sustainable development Corpus (or SusCorp). The 2030 Agenda Corpus is a multilingual, parallel corpus of political discourse including the English, Hungarian and Italian versions of the UN’s 2030 Agenda for Sustainable Development; the English, Hungarian and Italian sections of the corpus count between 15,000 and 18,000 tokens each. 
The SusCorp is a multilingual, comparable corpus of news discourse containing broadsheet articles published between 2016 and 2018 in the English, Hungarian and Italian press; the English, Hungarian and Italian sections of the corpus count between 250,000 and 450,000 tokens each. The two corpora are analysed in turn in search of cultural keywords and meaning by collocation. Cultural keywords are found among the most frequent and statistically salient lexemes of the English, Hungarian and Italian subcorpora. Meaning by collocation is outlined by extracting the collocational patterns of the English lexical items sustainable, sustainability, sustainable development and their Hungarian and Italian translational equivalents for the 2030 Agenda Corpus, and by collecting the collocational patterns of the English lexeme sustainable and its Hungarian and Italian translational equivalents for the SusCorp. The cultural keywords identified in both corpora and for all languages mainly refer to sustainable development and to the sustainability goals recommended by the UN’s 2030 Agenda. The international dimension of sustainability is also tackled cross-linguistically by cultural keywords in both corpora. In addition, in the SusCorp environmental concerns like climate change feature among the cultural keywords of English and Italian, while Hungarian cultural keywords include issues like migration. The meaning by collocation extracted for the adjective sustainable in both corpora and for all languages makes the lexeme represent a positive quality associated with other positive qualities and characterising material processes of change, depletion, improving and supporting. The meaning by collocation of the noun sustainability in the 2030 Agenda makes the noun a property bound especially to economic matters. 
The meaning by collocation of sustainable development in the 2030 Agenda makes it a condition that needs to be achieved for the wellbeing of people worldwide thanks to the aid of the UN’s Agenda
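
    The collocational patterns mentioned in this abstract can be illustrated with a simple window-based count. The following is a minimal sketch only: the function name `collocates`, the window size, and the toy sentence are assumptions made for illustration, not the procedure or software used in the study.

```python
from collections import Counter

def collocates(tokens, node, window=4):
    """Count the words occurring within `window` tokens of each hit of `node`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            left = tokens[max(0, i - window):i]        # context before the node
            right = tokens[i + 1:i + 1 + window]       # context after the node
            counts.update(left + right)
    return counts

# Toy example (hypothetical text, not taken from either corpus)
text = "sustainable development requires sustainable growth and sustainable development goals"
profile = collocates(text.split(), "sustainable", window=1)
print(profile.most_common(3))
```

    Raw co-occurrence counts like these are usually only a first step; a study of this kind would normally rank candidates with a statistical association measure before interpreting them.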

    Exploratory Search on Mobile Devices

    The goal of this thesis is to provide a general framework (MobEx) for exploratory search, especially on mobile devices. The central part is the design, implementation, and evaluation of several core modules for on-demand unsupervised information extraction that are well suited for exploratory search on mobile devices and that make up the MobEx framework. These core processing elements, combined with a multitouch-capable user interface specially designed for two families of mobile devices, i.e. smartphones and tablets, have finally been implemented in a research prototype. The initial information request, in the form of a query topic description, is issued online by a user to the system. The system then retrieves web snippets by using standard search engines. These snippets are passed through a chain of NLP components which perform on-demand or ad-hoc interactive Query Disambiguation, Named Entity Recognition, and Relation Extraction tasks. By on-demand or ad-hoc we mean that the components are capable of performing their operations on an unrestricted open domain within special time constraints. The result of the whole process is a topic graph containing the detected associated topics as nodes and the extracted relationships as labelled edges between the nodes. The Topic Graph is presented to the user in different ways depending on the size of the device she is using. Various evaluations have been conducted that help us to understand the potential and limitations of the framework and the prototype
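
    The topic graph described in this abstract (topics as nodes, extracted relationships as labelled edges) could be represented roughly as follows. This is a hedged sketch: the class `TopicGraph`, its methods, the choice of undirected edges, and the sample topics are all hypothetical and are not taken from the MobEx implementation.

```python
from collections import defaultdict

class TopicGraph:
    """Topics as nodes; each edge carries the label of an extracted relation."""

    def __init__(self):
        # adjacency map: topic -> {neighbouring topic: relation label}
        self.edges = defaultdict(dict)

    def add_relation(self, topic_a, topic_b, label):
        # Assumption: relations are stored as undirected edges
        self.edges[topic_a][topic_b] = label
        self.edges[topic_b][topic_a] = label

    def neighbours(self, topic):
        """Return a copy of the labelled edges attached to `topic`."""
        return dict(self.edges.get(topic, {}))

# Hypothetical usage with made-up topics and labels
graph = TopicGraph()
graph.add_relation("Hildesheim", "KONVENS", "hosted")
graph.add_relation("Hildesheim", "Germany", "located in")
print(graph.neighbours("Hildesheim"))
```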

    Workshop Proceedings of the 12th edition of the KONVENS conference

    The 2014 issue of KONVENS is even more a forum for exchange: its main topic is the interaction between Computational Linguistics and Information Science, and the synergies such interaction, cooperation and integrated views can produce. This topic at the crossroads of different research traditions which deal with natural language as a container of knowledge, and with methods to extract and manage knowledge that is linguistically represented is close to the heart of many researchers at the Institut für Informationswissenschaft und Sprachtechnologie of Universität Hildesheim: it has long been one of the institute’s research topics, and it has received even more attention over the last few years

    A Corpus-based Language Network Analysis of Near-synonyms in a Specialized Corpus

    As the international medium of communication for seafarers throughout the world, English has long been recognized as important in the maritime industry. Many studies have been conducted on Maritime English teaching and learning; nevertheless, although the language contains many near-synonyms, few studies have examined near-synonyms used in the maritime industry. The objective of this study is to answer the following three questions. First, what are the differences and similarities between different near-synonyms in English? Second, can collocation network analysis provide a new perspective to explain the distinctions between near-synonyms at the microscopic level? Third, is semantic domain network analysis useful to distinguish one near-synonym from the other at the macroscopic level? In pursuit of these research questions, I first illustrated how the idea of incorporating collocates in corpus linguistics, Maritime English, near-synonyms, semantic domains and language network was studied. Then important concepts such as Maritime English, English for Specific Purposes, corpus linguistics, synonymy, collocation, semantic domains and language network analysis were introduced. Third, I compiled a 2.5-million-word specialized Maritime English Corpus and proposed a new method of tagging English multi-word compounds, comparing the corpus with and without multi-word compounds with regard to tokens, types, STTR and mean word length. Fourth, I examined collocates of five groups of near-synonyms, i.e., ship vs. vessel, maritime vs. marine, ocean vs. sea, safety vs. security, and harbor vs. port, drawing data through WordSmith 6.0, tagging semantic domains in Wmatrix 3.0, and conducting network analyses using NetMiner 4.0. In the final stage, from the results and discussions, I was able to answer the research questions. First, maritime near-synonyms generally show a clear preference for specific collocates. 
Due to the specialized nature of Maritime English, general definitions are not helpful for distinguishing near-synonyms; a new perspective is therefore needed to view the behavior of maritime words. Second, as a special visualization method, collocation network analysis can provide learners with a direct view of the relationships between words. Compared with traditional collocation tables, it lets learners identify the collocates more quickly and find the relationship between several node words. In addition, it is much easier for learners to find the collocates exclusive to a specific word, thereby helping them to understand the meaning specific to that word. Third, if the collocation network shows learners relationships between words, the semantic domain network is able to offer cognitive guidance: given a specific word, how a person can process it mentally and thus find the more appropriate synonym to collocate with it. Main semantic domain network analysis shows us the domains exclusive to a certain near-synonym, and therefore defines the concepts exclusive to that near-synonym; furthermore, main semantic domain network analysis and sub-semantic domain network analysis together are able to tell us how near-synonyms show preference or tendency for one synonym rather than another, even when they have shared semantic domains. The options in identifying relationships of near-synonyms can be presented through the classic metaphor of "the forest and the trees." Generally speaking, we see only the vein of a tree leaf through the traditional way of sentence-level analysis. We see the full leaf through collocation network analysis. We see the tree, even the whole forest, through semantic domain network analysis.
    Contents
    Chapter 1. Introduction: 1.1 Focus of Inquiry; 1.2 Outline of the Thesis
    Chapter 2. Literature Review: 2.1 A Brief Synopsis; 2.2 Maritime English as an English for Specific Purposes (ESP); 2.2.1 What is ESP?; 2.2.2 Maritime English as ESP; 2.2.3 ESP and Corpus Linguistics; 2.3 Synonymy; 2.3.1 Definition of Synonymy; 2.3.2 Synonymy as a Matter of Degree; 2.3.3 Criteria for Synonymy Differentiation; 2.3.4 Near-synonyms in Corpus Linguistics; 2.4 Collocation; 2.4.1 Definition of Collocation; 2.4.2 Collocation in Corpus Linguistics; 2.4.2.1 Definition of Collocation in Corpus Linguistics; 2.4.2.2 Collocation vs. Colligation; 2.4.3 Lexical Priming of Collocation in Psychology; 2.5 Language Network Analysis; 2.5.1 Definition; 2.5.2 Classification; 2.5.3 Basic Concepts; 2.5.4 Previous Studies; 2.6 Semantic Domain Analysis; 2.6.1 Concepts of Semantic Domains; 2.6.2 Previous Studies on Semantic Domain Analysis
    Chapter 3. Data and Methodology: 3.1 Maritime English Corpus; 3.1.1 What is a Corpus?; 3.1.2 Characteristics of a Corpus; 3.1.2.1 Corpus-driven vs. Corpus-based research; 3.1.2.2 Specialized Corpora for Specialized Discourse; 3.1.3 Maritime English Corpus (MEC); 3.1.3.1 Sampling of the MEC; 3.1.3.2 Size, Balance, and Representativeness; 3.1.3.3 Multi-word Compounds in the MEC; 3.1.3.4 Basic Information of the MEC; 3.2 Methodology for Collocates Extraction; 3.3 Methodology for Networks Visualization; 3.4 Methodology for Semantic Tagging; 3.5 Process of Data Analysis
    Chapter 4. Collocation Network Analysis of Near-synonyms: 4.1 Meaning Differences; 4.1.1 Ship vs. Vessel; 4.1.2 Maritime vs. Marine; 4.1.3 Sea vs. Ocean; 4.1.4 Safety vs. Security; 4.1.5 Port vs. Harbor; 4.2 Similarity Degree of Groups of Near-synonyms; 4.2.1 Similarity Degree Based on Number of Shared Collocates; 4.2.2 Similarity Degree Based on MI3 Cosine Similarity; 4.3 Collocation Network Analysis; 4.3.1 Ship vs. Vessel; 4.3.2 Maritime vs. Marine; 4.3.3 Sea vs. Ocean; 4.3.4 Safety vs. Security; 4.3.5 Port vs. Harbor; 4.4 Advantages and Limitations of Collocation Network Analysis
    Chapter 5. Semantic Domain Network Analysis of Near-synonyms: 5.1 Comparison between Collocation and Semantic Domain Analysis; 5.2 Semantic Domain Network Analysis of Exclusiveness; 5.2.1 Ship vs. Vessel; 5.2.2 Maritime vs. Marine; 5.2.3 Sea vs. Ocean; 5.2.4 Safety vs. Security; 5.2.5 Port vs. Harbor; 5.3 Analysis of Shared Semantic Domains; 5.4 Advantages and Limitations of Semantic Domain Network Analysis
    Chapter 6. Conclusion: 6.1 Summary; 6.2 Limitations and Implications
    References
    Appendix: Collocates of Near-synonyms
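
    This record lists a similarity degree based on MI3 cosine similarity. One common way to compute such a figure is sketched below, under stated assumptions: MI3 is taken in its usual form, log2 of the cubed co-occurrence count divided by the count expected under independence, and the similarity of two near-synonyms is the cosine between their collocate-to-MI3 profiles. The function names and the toy profiles are hypothetical; the thesis itself reports using WordSmith, Wmatrix and NetMiner, whose exact computations are not given here.

```python
import math

def mi3(f_xy, f_x, f_y, n):
    """MI3 score: log2(observed^3 / expected), with expected = f_x * f_y / n."""
    return math.log2(f_xy ** 3 * n / (f_x * f_y))

def cosine_similarity(profile_a, profile_b):
    """Cosine between two sparse vectors stored as {collocate: score} dicts."""
    shared = set(profile_a) & set(profile_b)
    dot = sum(profile_a[w] * profile_b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in profile_a.values()))
    norm_b = math.sqrt(sum(v * v for v in profile_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile has no defined direction
    return dot / (norm_a * norm_b)

# Toy MI3 profiles (made-up scores, not results from the thesis)
ship = {"cargo": 12.0, "owner": 9.5, "crew": 8.0}
vessel = {"cargo": 11.0, "fishing": 10.0, "crew": 7.5}
print(round(cosine_similarity(ship, vessel), 3))
```

    A value near 1 would indicate that two words share collocates with similar association strengths, while a value near 0 would indicate largely disjoint collocate sets.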

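    The treatment and representation of verb collocations in the specialized language of adventure tourism (El tratamiento y la representación de las colocaciones verbales en el lenguaje especializado del turismo de aventura)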

    A collocation is considered a frequent co-occurrence of two words which hold a syntactic relationship and whose elements enjoy a different status. Given their perception as a unit in language, access to the prominent word (base) involves immediate access to the other item (collocate). In terms of meaning, some combinations tend to be more transparent than others. The pervasiveness of these word associations in language has sparked a strong research interest in recent decades. A compelling reason for this interest may be the fact that they are naturally produced by native speakers but must be actively learned by non-native individuals. Not only has this reality led to their treatment in the general language, but it has also become a legitimate field of study in a wide range of specialized languages, such as the environment, computing, law or tourism, which is our object of study. As a consequence, specialized knowledge resources covering this type of word combination have appeared with the primary purpose of offering some extra help to people who deal with this type of language, for example, translators, linguists or other professionals. Nevertheless, there is still much to do in this respect. Taking this into account, it is hypothesized that verb collocations in the specialized language of adventure tourism convey specialized meaning that is worth collecting in terminological products. Therefore, this work endeavors, as its main purpose, to perform a deep analysis of verb collocations in this specialized domain and their implementation in the entries for motion verbs in DicoAdventure, a specialized dictionary of adventure tourism, whose inspirational idea was to highlight the significant role of verbs in the linguistic expression of concepts. 
Accordingly, the following theoretical objectives were set: first, to cover the linguistic branches which influence specialized lexicography; second, to define the concept of specialized collocation; and third, to examine a vast number of lexicographical and terminological resources so as to discover the items of information that would make an adequate representation of collocations in a specialized dictionary and, then, design a model for this task. Furthermore, the following practical objectives were formulated: first, to extract the motion verbs which would be the bases of the collocations implemented; second, to retrieve the lexical collocations of these verbs; and third, to classify the resulting list of collocations according to the meaning expressed, that is, actual motion or fictive (or metaphorical) motion. The practical steps taken in this research were based on the English monolingual specialized corpus ADVENCOR, which contains promotional texts about adventure tourism, and the use of corpus management software. The results of the theoretical work can be summarized as follows: (1) the specialized language of adventure tourism must be considered as specialized as any other; (2) collocations are not usually encoded in verb entries in dictionaries; and (3) a specialized collocation carries specialized knowledge which must be covered in terminological products. On the other hand, regarding the practical work, 12% of the verbs extracted were selected, as they were the ones expressing motion. However, only 46.61% of them produced collocations according to the extraction criteria established. Last, after applying stricter criteria for the collocation classification, only 25.42% of the verbs along with their collocations were collected in the dictionary. In addition to these results, the theory of Frame Semantics proved useful to understand the meaning of the verbs and their collocates. 
As for their implementation, which was the primary objective of this doctoral dissertation, the inclusion of verb collocations was of paramount importance for the identification of distinct meanings expressed by one verb in different contexts, as collocates conveyed subtle nuances of meaning. Finally, it was concluded that the incorporation of explanations about the combinations in lay terms facilitates the comprehension of the entries for any type of user, from experts to laypersons, which makes DicoAdventure a terminological product that can render valuable assistance to individuals with distinct specialized expertise.

    Outlier Detection in Automatic Collocation Extraction

    In this paper we analyse different association measures between words that are generally used for the automatic extraction of collocations from textual corpora. Specifically, the following are considered: relative frequency, mutual information, z-score, t-score and Dunning's test. The volume of the corpus handled (300,000,000 words) requires the usual approach to this matter to be reviewed, so a solution based on methods used to detect statistical outliers is proposed. The results show that many free combinations are extracted along with collocations, and that they arise from the comparison of words with very different frequencies of use. For this reason, the measures are applied on the assumption that each word generates a different sample, instead of generating rankings from a corpus treated as a single sample. The experiment is also performed on a corpus with far fewer words, and the results are contrasted with those obtained with the full corpus. The conclusions and contributions address the automatic extraction of collocations from a textual corpus regardless of its volume
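
    The association measures named in this abstract can be computed from four raw counts, as in the sketch below. Several points are assumptions rather than details from the paper: the standard textbook formulas are used, relative frequency is taken as the share of occurrences of x that co-occur with y, and the per-word outlier step uses a Tukey-style fence (Q3 + 1.5 × IQR), since the abstract does not say which outlier method the authors applied. The function names and sample figures are hypothetical.

```python
import math
import statistics

def association_scores(f_xy, f_x, f_y, n):
    """Score a word pair from raw counts (requires f_xy > 0).

    f_xy: co-occurrences of x and y; f_x, f_y: word frequencies; n: corpus size.
    """
    expected = f_x * f_y / n  # co-occurrences expected under independence
    # 2x2 contingency table for Dunning's log-likelihood ratio test
    observed = [f_xy, f_x - f_xy, f_y - f_xy, n - f_x - f_y + f_xy]
    rows, cols = (f_x, n - f_x), (f_y, n - f_y)
    expected_cells = [r * c / n for r in rows for c in cols]
    g2 = 2 * sum(o * math.log(o / e)
                 for o, e in zip(observed, expected_cells) if o > 0)
    return {
        "relative_frequency": f_xy / f_x,
        "mutual_information": math.log2(f_xy / expected),
        "t_score": (f_xy - expected) / math.sqrt(f_xy),
        "z_score": (f_xy - expected) / math.sqrt(expected),
        "log_likelihood": g2,
    }

def outlier_collocates(scores_for_word, k=1.5):
    """Flag collocates scoring above Q3 + k * IQR within one word's own sample."""
    q1, _, q3 = statistics.quantiles(scores_for_word.values(), n=4,
                                     method="inclusive")
    fence = q3 + k * (q3 - q1)
    return [w for w, s in scores_for_word.items() if s > fence]

# Hypothetical counts: the pair occurs 20 times, the words 100 and 200 times
print(association_scores(20, 100, 200, 10_000))
```

    Treating each word's collocate scores as a separate sample, as `outlier_collocates` does, reflects the idea stated in the abstract that a single corpus-wide ranking mixes words with very different frequencies.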