
    Corpus-Based Critical Discourse Analysis of Women’s Representation in Shen Bao (1872-1949) and People’s Daily (1950-2012)

    This thesis explores and analyses women’s representations in Shen Bao (1872–1949) and People’s Daily (1950–2012) in China over a period of 140 years (1872–2012). Combining quantitative corpus analysis of 1.9 million words of data with qualitative analysis using critical discourse analysis (CDA), it examines four distinct historical eras in the press portrayal of women: late imperial Qing (1872–1911), Republican (1912–1949), socialist (1950–1978) and post-socialist (1979–2012). During these 140 years, China experienced dramatic sociocultural shifts and political transformations under the guidance of different ideologies. Women were placed at the centre of this turmoil, and their roles were continuously renewed, recreated, defended and modified (Williams, 1977). The view that women were inferior to men was nothing more than a social construction. Women’s representations are embedded in ideological frameworks supported by the existing power relations of a patriarchal society. They operated in the symbolic world through a discursive construction that defines women in ways that shape the social understanding of their roles, status and identities; this construction of women by the dominant forces in society serves to sustain the existing patriarchal power relations. The thesis focuses on newspapers because of their central role in shaping public opinion, setting agendas, and maintaining power structures: broadsheet newspapers have the power to define key issues, topics, and situations, which gives them ideological power. CDA attends both to the macro-level of context through a top-down approach, and to the micro-level by analysing how ideologies, dominance and power relations are expressed in language. In contrast, Corpus Linguistics (CL) deals with large amounts of text by providing detailed information at the micro-level.
CL is essentially a bottom-up approach: it allows the data generated in a corpus to take the lead, and thereby limits bias. The data generated by corpus analytical tools is not handpicked by the analyst; it consists of typical and representative linguistic patterns extracted from a large amount of data. Women’s representations underwent significant transformations across the four historical eras as some women gained more economic independence and could challenge the power hierarchies. In the late Qing era, women were not described simply as the opposite sex of men, but were represented in Shen Bao as the weak, incompetent, decadent, and pathological symbol of premodernity. Articles in Shen Bao promoted representations of women as “Mothers of the Nation” and “Heroines”, variations of the traditional “good wife and mother” and “devoted to husband and son” ideals sugar-coated with modern nationalism. In the socialist era, women were mostly represented as strong, masculine, selfless, and ideologically correct workers in the labour force, and as emotionally and physically the same as men. Women lived and breathed for the state, and were willing to devote their lives, youth and efforts to communism and socialism. In the post-socialist era, women’s representations in the People’s Daily are more diverse. Throughout the 140 years, discourses on women acted as a tool to legitimise various national agendas. This study offers empirical evidence: a macro-level picture of the transformation of women’s representations across 140 years of history and the drives underpinning it, as well as a micro-level analysis of the conflicts and consistencies in the discourse on women over the four eras. Women’s studies have their origin outside of China, in the West.
I hope this study will shed some light on the scarcely researched localisation of Western women’s theories into Chinese terms, which I believe is the next important issue and the biggest challenge in women’s studies in China
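The corpus-comparison step described above is typically operationalised as keyword extraction: words whose frequency in the study corpus differs significantly from a reference corpus are surfaced automatically rather than handpicked by the analyst. A minimal Python sketch of one common keyness statistic, log-likelihood, is given below; the toy corpora and function name are illustrative, not the thesis's actual tooling:

```python
import math
from collections import Counter

def keywords(study_tokens, ref_tokens, top_n=5):
    """Rank words by log-likelihood keyness of a study corpus vs. a reference corpus."""
    sc, rc = Counter(study_tokens), Counter(ref_tokens)
    n1, n2 = len(study_tokens), len(ref_tokens)
    scores = {}
    for w in sc:
        a, b = sc[w], rc.get(w, 0)
        e1 = n1 * (a + b) / (n1 + n2)   # expected frequency in the study corpus
        e2 = n2 * (a + b) / (n1 + n2)   # expected frequency in the reference corpus
        scores[w] = 2 * (a * math.log(a / e1) + (b * math.log(b / e2) if b else 0))
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

Words that are over-represented relative to the reference corpus (here, candidate markers of how women are written about) surface at the top of the ranking regardless of the analyst's expectations, which is the bias-limiting property the thesis relies on.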

    The Use of English Transition Markers in Chinese and British University Student Writing

    Chinese students are the largest group of overseas students in the UK (Leedham 2015), so various studies have been conducted to compare their academic writing with that of native English speakers. Metadiscourse resources are important devices for showing how a writer responds to his or her potential readers (Hyland 2005; Ädel 2006), but little research has examined in detail how Chinese and English student writers employ them in their assignments, and fewer studies still have compared the writing of the two groups using highly matched texts. The present study investigates Chinese and English student writing using a corpus highly matched in terms of level, discipline, and genre family. It aimed to identify transitions and their use in student academic writing. The findings show similarities in the writing of the Chinese and English students. Both groups tended to use transitions more frequently in non-science disciplines (e.g. Law and Linguistics) and discursive genre families (e.g. Critique and Essay), and to employ them less frequently in science disciplines (e.g. Food Science and Biology) and technical genre families (e.g. Methodology Recount and Design Specification). As native English speakers who may have greater prior exposure to academic writing, the English students showed a better understanding of the transition items in terms of meaning and formality. The Chinese students, on the other hand, had taken more English grammar courses before their undergraduate education; as a result, their use of punctuation with transitions is more accurate. Furthermore, the English students appear more sophisticated in their use of co-occurring transitions (e.g. and thus, but nevertheless), a finding not previously revealed in the literature.
Both groups of students make appropriate and inappropriate uses of transitions that are worthy of note
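Because sub-corpora for different disciplines and genre families differ in size, frequency comparisons of the kind reported above are normally made per 1,000 words. A minimal Python sketch, assuming whitespace-tokenised text and an illustrative (not the study's) marker list:

```python
# Illustrative transition markers; the study's actual inventory is larger
TRANSITIONS = {"however", "therefore", "thus", "furthermore", "nevertheless"}

def transitions_per_1000(tokens):
    """Normalised frequency of transition markers per 1,000 words."""
    hits = sum(1 for t in tokens if t.lower() in TRANSITIONS)
    return 1000 * hits / len(tokens)
```

Computing this figure for each discipline or genre-family sub-corpus makes the "more frequent in discursive genres, less frequent in technical ones" pattern directly comparable across unequal text collections.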

    Compiling and annotating a learner corpus for a morphologically rich language: CzeSL, a corpus of non-native Czech

    Learner corpora, linguistic collections documenting a language as used by learners, provide an important empirical foundation for language acquisition research and teaching practice. This book presents CzeSL, a corpus of non-native Czech, against the background of theoretical and practical issues in the current learner corpus research. Languages with rich morphology and relatively free word order, including Czech, are particularly challenging for the analysis of learner language. The authors address both the complexity of learner error annotation, describing three complementary annotation schemes, and the complexity of description of non-native Czech in terms of standard linguistic categories. The book discusses in detail practical aspects of the corpus creation: the process of collection and annotation itself, the supporting tools, the resulting data, their formats and search platforms. The chapter on use cases exemplifies the usefulness of learner corpora for teaching, language acquisition research, and computational linguistics. Any researcher developing learner corpora will surely appreciate the concluding chapter listing lessons learned and pitfalls to avoid
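A tiered, target-hypothesis style of error annotation of the kind discussed can be sketched as a simple data structure; the tier layout and tag names below are illustrative simplifications, not CzeSL's actual scheme:

```python
from dataclasses import dataclass

@dataclass
class Edit:
    """One correction in a tiered scheme: an early tier repairs non-words
    (spelling/morphology), a later tier repairs grammar and usage in context.
    Tags here are invented for illustration, not CzeSL's labels."""
    tier: int
    original: str
    corrected: str
    tag: str

# Learner sentence "bydlim v Praha" ("I live in Prague"), annotated in two tiers
annotation = [
    Edit(tier=1, original="bydlim", corrected="bydlím", tag="spell"),  # missing diacritic
    Edit(tier=2, original="Praha", corrected="Praze", tag="case"),     # locative required
]

def apply_edits(tokens, edits):
    """Apply corrections tier by tier, yielding the target hypothesis."""
    for e in sorted(edits, key=lambda e: e.tier):
        tokens = [e.corrected if t == e.original else t for t in tokens]
    return tokens
```

The point of the tiering is exactly the difficulty the book addresses for morphologically rich languages: a single learner form may need both a formal repair and a separate grammatical one, and the scheme must keep the two apart.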

    The text classification pipeline: Starting shallow, going deeper

    Text Classification (TC) is an increasingly relevant and crucial subfield of Natural Language Processing (NLP), tackled in this PhD thesis from a computer science and engineering perspective. In this field too, the exceptional success of deep learning has sparked a boom over the past ten years. Text retrieval and categorization, information extraction and summarization all rely heavily on TC. The literature has presented numerous datasets, models, and evaluation criteria. Even though languages such as Arabic, Chinese and Hindi are employed in several works, from a computer science perspective the language most used and referred to in the TC literature is English, and it is also the language mainly referenced in the rest of this thesis. Even though numerous machine learning techniques have shown outstanding results, a classifier's effectiveness depends on its capability to comprehend intricate relations and non-linear correlations in texts. To achieve this level of understanding, attention must be paid not only to the architecture of a model but also to the other stages of the TC pipeline. In the NLP framework, a range of text representation techniques and model designs have emerged, including large language models, which can turn massive amounts of text into vector representations that effectively capture semantically significant information. Of crucial interest is the fact that this field has been investigated by numerous communities, including data mining, linguistics, and information retrieval; these communities frequently overlap, but are mostly separate and conduct their research independently. Bringing researchers from these groups together to improve the multidisciplinary comprehension of the field is one of the objectives of this dissertation, which also makes an effort to examine text mining from both a traditional and a modern perspective. 
This thesis covers the whole TC pipeline in detail, but its main contribution is to investigate the impact of every element of the pipeline on the final performance of a TC model. The pipeline discussed includes both traditional and the most recent deep learning-based models, and consists of the State-Of-The-Art (SOTA) datasets used as benchmarks in the literature, text preprocessing, text representation, machine learning models for TC, evaluation metrics and current SOTA results. In each chapter of this dissertation, I go over each of these steps, covering both the technical advancements and my most significant and recent findings from experiments and novel models. The advantages and disadvantages of the various options are also listed, along with a thorough comparison of the approaches. Each chapter closes with my contributions: experimental evaluations and discussions of the results obtained during my three-year PhD course. The experiments and analysis related to each chapter (i.e., each element of the TC pipeline) are the main contributions I provide, extending the basic knowledge of a regular survey on TC
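The pipeline stages listed above (preprocessing, representation, model, evaluation) can be illustrated end to end with a deliberately minimal bag-of-words Naive Bayes classifier. This is a pedagogical sketch of the "shallow" end of the pipeline, not one of the thesis's models:

```python
import math
from collections import Counter, defaultdict

def preprocess(text):
    """Minimal preprocessing stage: lowercase and split on whitespace."""
    return text.lower().split()

class NaiveBayesTC:
    """Multinomial Naive Bayes over a bag-of-words representation."""
    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)
        self.class_counts = Counter(labels)
        for text, y in zip(texts, labels):
            self.word_counts[y].update(preprocess(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        def score(y):
            counts, total = self.word_counts[y], sum(self.word_counts[y].values())
            prior = math.log(self.class_counts[y] / sum(self.class_counts.values()))
            return prior + sum(
                math.log((counts[w] + 1) / (total + len(self.vocab)))  # Laplace smoothing
                for w in preprocess(text) if w in self.vocab)
        return max(self.class_counts, key=score)
```

Every stage swapped out in the thesis has a counterpart here: `preprocess` stands in for tokenisation and cleaning, the per-class Counters for the text representation, and the Bayes rule for the model; deeper pipelines replace each in turn while keeping the same overall shape.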

    A corpus based, lexical analysis of patient information for radiography

    Despite the importance and the ubiquity of medical patient information in many healthcare systems in the world, we know very little about the lexical characteristics of the register. We do not know how patients perceive the information in the leaflets or whether the messages are transmitted effectively and fully understood. How a medical authority instructs and obliges patients in written information is also unclear. While the number of radiographic examinations performed globally increases year on year, studies consistently show that patients lack basic knowledge regarding the commonly-performed exams and show very poor understanding of the concomitant risks associated with radiation. There is, then, a pressing need to investigate radiography patient information in order to better understand why, and where, it is less effective. This thesis applies three approaches common to the field of corpus linguistics to uncover some of the lexical characteristics of patient information for radiography. The approaches used in this thesis are a keyword extraction, a lexical bundles analysis and an investigation of modal verbs used to express obligation. The findings suggest that patient information for radiography possesses characteristics more common to academic prose than conversation, although the high informational content of the register goes some way to explaining this and suggests that the reliance on these structures may, to a certain extent, be unavoidable. Results also suggest that the reliance on should to oblige and instruct is problematic as it may cause interpretation problems for certain patients, including those for whom English is not a primary language. 
Certain other characteristics of patient information revealed by the analyses may also cause comprehension problems. While further research is needed, none of these characteristics would be flagged as problematic by standard readability measures, furthering doubts about the suitability of such measures for the evaluation of medical information
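Of the three approaches named above, the lexical bundles analysis is the most mechanical: bundles are simply contiguous word sequences that recur above a frequency threshold. A minimal Python sketch (the 4-word span and the threshold of 2 are illustrative choices, not the thesis's parameters):

```python
from collections import Counter

def lexical_bundles(tokens, n=4, min_freq=2):
    """Contiguous n-grams (lexical bundles) meeting a frequency threshold."""
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return {" ".join(g): c for g, c in grams.items() if c >= min_freq}
```

Run over a register corpus, the surviving bundles characterise its phraseology, which is how a register heavy in noun-phrase bundles can be shown to resemble academic prose more than conversation.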

    Application of Common Sense Computing for the Development of a Novel Knowledge-Based Opinion Mining Engine

    The ways people express their opinions and sentiments have radically changed in the past few years thanks to the advent of social networks, web communities, blogs, wikis and other online collaborative media. The distillation of knowledge from this huge amount of unstructured information can be a key factor for marketers who want to create an image or identity in the minds of their customers for their product, brand, or organisation. These online social data, however, remain hardly accessible to computers, as they are specifically meant for human consumption. The automatic analysis of online opinions, in fact, involves a deep understanding of natural language text by machines, from which we are still very far. Hitherto, online information retrieval has been mainly based on algorithms relying on the textual representation of web-pages. Such algorithms are very good at retrieving texts, splitting them into parts, checking the spelling and counting their words. But when it comes to interpreting sentences and extracting meaningful information, their capabilities are known to be very limited. Existing approaches to opinion mining and sentiment analysis, in particular, can be grouped into three main categories: keyword spotting, in which text is classified into categories based on the presence of fairly unambiguous affect words; lexical affinity, which assigns arbitrary words a probabilistic affinity for a particular emotion; statistical methods, which calculate the valence of affective keywords and word co-occurrence frequencies on the base of a large training corpus. Early works aimed to classify entire documents as containing overall positive or negative polarity, or rating scores of reviews. Such systems were mainly based on supervised approaches relying on manually labelled samples, such as movie or product reviews where the opinionist’s overall positive or negative attitude was explicitly indicated. 
However, opinions and sentiments do not occur only at document level, nor are they limited to a single valence or target: contrary or complementary attitudes toward the same topic, or toward multiple topics, can be present across the span of a document. In more recent works, the granularity of text analysis has been taken down to segment and sentence level, e.g., by using the presence of opinion-bearing lexical items (single words or n-grams) to detect subjective sentences, or by exploiting association rule mining for a feature-based analysis of product reviews. These approaches, however, are still far from being able to infer the cognitive and affective information associated with natural language, as they mainly rely on knowledge bases that are still too limited to process text efficiently at sentence level. In this thesis, common sense computing techniques are further developed and applied to bridge the semantic gap between word-level natural language data and the concept-level opinions conveyed by it. In particular, the ensemble application of graph mining and multi-dimensionality reduction techniques on two common sense knowledge bases was exploited to develop a novel intelligent engine for open-domain opinion mining and sentiment analysis. The proposed approach, termed sentic computing, performs a clause-level semantic analysis of text, which allows the inference of both the conceptual and emotional information associated with natural language opinions and, hence, a more efficient passage from (unstructured) textual information to (structured) machine-processable data. The engine was tested on three different resources, namely a Twitter hashtag repository, a LiveJournal database and a PatientOpinion dataset, and its performance was compared both with results obtained using standard sentiment analysis techniques and with different state-of-the-art knowledge bases such as Princeton’s WordNet, MIT’s ConceptNet and Microsoft’s Probase. 
Unlike most currently available opinion mining services, the developed engine does not base its analysis on a limited set of affect words and their co-occurrence frequencies, but rather on common sense concepts and the cognitive and affective valence they convey. This allows the engine to be domain-independent and, hence, to be embedded in any opinion mining system for the development of intelligent applications in multiple fields such as the Social Web, HCI and e-health. Looking ahead, the combined novel use of different knowledge bases and of common sense reasoning techniques for opinion mining proposed in this work will eventually pave the way for the development of more bio-inspired approaches to the design of natural language processing systems capable of handling knowledge, retrieving it when necessary, making analogies and learning from experience
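The simplest of the three baseline categories surveyed above, keyword spotting, can be sketched in a few lines; the mini-lexicon is hypothetical, and real systems draw on much larger affect resources:

```python
# Hypothetical mini-lexicon of "fairly unambiguous affect words"
POSITIVE = {"good", "great", "excellent", "happy"}
NEGATIVE = {"bad", "awful", "poor", "sad"}

def keyword_spotting_polarity(text):
    """Document-level polarity by counting unambiguous affect words."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"
```

The brittleness the thesis targets is visible immediately: negation ("not great"), affect-free opinionated sentences, and domain-specific vocabulary all defeat a fixed word list, which is why the engine reasons over common sense concepts instead.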

    The Treatment and Representation of Verb Collocations in the Specialized Language of Adventure Tourism

    A collocation is a frequent co-occurrence of two words which hold a syntactic relationship and whose elements enjoy a different status. Given their perception as a unit of language, access to the prominent word (the base) involves immediate access to the other item (the collocate). In terms of meaning, some combinations tend to be more transparent than others. The pervasiveness of these word associations in language has sparked strong research interest in recent decades. A compelling reason for this interest may be the fact that they are produced naturally by native speakers but must be actively learned by non-native speakers. This reality has led not only to their treatment in the general language, but also to their becoming a legitimate field of study in a wide range of specialized languages, such as those of the environment, computing, law or tourism, the last of which is our object of study. As a consequence, specialized knowledge resources covering this type of word combination have appeared, with the primary purpose of offering extra help to people who deal with this type of language, for example translators, linguists or other professionals. Nevertheless, there is still much to do in this respect. Taking this into account, it is hypothesized that verb collocations in the specialized language of adventure tourism convey specialized meaning that is worth collecting in terminological products. This work therefore endeavors, as its main purpose, to perform a deep analysis of verb collocations in this specialized domain and of their implementation in the entries for motion verbs in DicoAdventure, a specialized dictionary of adventure tourism whose inspirational idea was to highlight the significant role of verbs in the linguistic expression of concepts. 
Accordingly, the following theoretical objectives were set: first, to cover the linguistic branches which influence specialized lexicography; second, to define the concept of specialized collocation; and third, to examine a vast number of lexicographical and terminological resources so as to discover the items of information that would make an adequate representation of collocations in a specialized dictionary and, then, to design a model for that task. Furthermore, the following practical objectives were formulated: first, to extract the motion verbs which would be the bases of the collocations implemented; second, to retrieve the lexical collocations of these verbs; and third, to classify the resulting list of collocations according to the meaning expressed, that is, actual motion or fictive (or metaphorical) motion. The practical steps taken in this research were based on the English monolingual specialized corpus ADVENCOR, which contains promotional texts about adventure tourism, and the use of corpus management software. The results of the theoretical work can be summarized as follows: (1) the specialized language of adventure tourism must be considered as specialized as any other; (2) collocations are not usually encoded in verb entries in dictionaries; and (3) a specialized collocation carries specialized knowledge which must be covered in terminological products. Regarding the practical work, 12% of the verbs extracted were selected, as they were the ones expressing motion. However, only 46.61% of them produced collocations according to the extraction criteria established. Finally, after applying stricter criteria for the collocation classification, only 25.42% of the verbs along with their collocations were collected in the dictionary. In addition to these results, the theory of Frame Semantics proved useful for understanding the meaning of the verbs and their collocates. 
As for their implementation, which was the primary objective of this doctoral dissertation, the inclusion of verb collocations was of paramount importance for the identification of distinct meanings expressed by one verb in different contexts, as collocates conveyed subtle nuances of meaning. Finally, it was concluded that the incorporation of explanations about the combinations in lay terms facilitates the comprehension of the entries to any type of user, from experts to laypersons, which makes DicoAdventure a terminological product that can render valuable assistance to individuals with distinct specialized expertise
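The collocation retrieval step described in the practical work can be sketched as window-based co-occurrence ranked by an association measure; pointwise mutual information (PMI) is used here for illustration, without implying it was the dissertation's actual extraction criterion:

```python
import math
from collections import Counter

def verb_collocates(tokens, verb, window=2, min_count=2):
    """Candidate collocates of `verb`: words co-occurring within ±window
    positions, ranked by pointwise mutual information (PMI)."""
    freq = Counter(tokens)
    co = Counter()
    for i, t in enumerate(tokens):
        if t == verb:
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    co[tokens[j]] += 1
    n = len(tokens)
    pmi = {w: math.log((c / n) / ((freq[verb] / n) * (freq[w] / n)))
           for w, c in co.items() if c >= min_count}
    return sorted(pmi, key=pmi.get, reverse=True)
```

Candidates surfaced this way still need the manual classification step the dissertation describes (actual vs. fictive motion) before entering the dictionary; the statistic only proposes, it does not decide.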
