
    Hierarchical Character-Word Models for Language Identification

    Social media messages' brevity and unconventional spelling pose a challenge to language identification. We introduce a hierarchical model that learns character-level and contextualized word-level representations for language identification. Our method performs well against strong baselines and can also reveal code-switching.
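    A minimal sketch of the kind of hierarchy described above, assuming PyTorch; the module names and dimensions are illustrative, not the paper's: a character-level encoder composes each word from its characters, and a word-level encoder then contextualizes those word vectors before predicting the message's language.

```python
# Illustrative sketch only (not the authors' code): hierarchical character-word
# language identification. All names and sizes are assumptions.
import torch
import torch.nn as nn

class HierCharWordLangID(nn.Module):
    def __init__(self, n_chars, n_langs, char_dim=32, word_dim=64, hidden=128):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        # Character-level BiLSTM composes each word from its characters.
        self.char_lstm = nn.LSTM(char_dim, word_dim // 2, bidirectional=True, batch_first=True)
        # Word-level BiLSTM contextualizes the composed word vectors.
        self.word_lstm = nn.LSTM(word_dim, hidden // 2, bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(hidden, n_langs)

    def forward(self, char_ids):
        # char_ids: (batch, n_words, n_chars) of character indices, 0-padded
        b, w, c = char_ids.shape
        chars = self.char_emb(char_ids.view(b * w, c))       # (b*w, c, char_dim)
        _, (h, _) = self.char_lstm(chars)                     # h: (2, b*w, word_dim/2)
        word_vecs = torch.cat([h[0], h[1]], dim=-1).view(b, w, -1)
        ctx, _ = self.word_lstm(word_vecs)                    # (b, w, hidden)
        sent = ctx.mean(dim=1)                                # pool over words
        return self.classifier(sent)                          # per-message language logits
```

    Applying the classifier to each position of ctx instead of the pooled vector would yield per-word language predictions, which is one way such a model can expose code-switching points.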

    Readability of hearing related internet information in traditional Chinese.

    Hearing impairment is a prevalent issue that affects many people. However, the consequences of hearing impairment may be prevented or managed through proper understanding of health information (El Dib & Mathew, 2012; Tsukada & Sakakibara, 2008). With the advances in technology, the Internet has become a popular and convenient source of health information (Fox, 2006; Siow et al., 2003; Y. Y. Yan, 2010). Despite the convenience the Internet brings, the majority of health information can be too difficult to understand (Friedman & Hoffman-Goetz, 2006; Laplante-Lévesque & Thorén, 2015). To examine this issue, readability scores have often been used to assess the reading difficulty of health information. Readability scores are calculated by readability formulas based on quantifiable textual features that contribute to reading difficulty (Dubay, 2004). For online hearing health information, most studies have focused on English websites and have found them to be written at levels too difficult for the general public (Laplante-Lévesque & Thorén, 2015; Svider et al., 2013). However, there are no studies on Chinese online hearing health information, even though Chinese has the largest number of native speakers in the world (Paul, 2016). Because of the limited Chinese readability formulas available, this study focused on Traditional Chinese online hearing health information using the Jing (Jing, 1995) and CRIE 1.0 (Sung et al., 2016) readability formulas. A panel of 39 people with no expertise in the hearing health professions, who spoke Mandarin as their primary language, was recruited to identify keywords for the Internet search. Keywords that were mentioned more than once and returned relevant results were 耳朵 (ear), 聽力 (hearing), 助聽器 (hearing aids), 重聽 (hard of hearing), and 聽不清楚 (can't hear properly). These keywords were entered into google.com.tw (Google Taiwan) and google.com.hk (Google Hong Kong) to obtain websites for readability analysis. After matching against the inclusion and exclusion criteria, 31 websites were included in the readability analysis. Health information is recommended to be written at the 6th-grade reading level; according to the CRIE 1.0 formula, 25% of the websites had a reading grade level greater than 6, whereas according to the Jing formula, 81% did. When websites were sorted by organization type, there was no significant difference in reading grade level between organization types. Readability can be improved primarily by reducing paragraph length and using more common characters and words. Future directions include performing readability analysis for online hearing health information written in Simplified Chinese, as the majority of Chinese speakers use Simplified Chinese; this is currently not feasible because there are no reliable readability formulas for Simplified Chinese. To supplement the findings from this study, the websites should also be assessed for their suitability and quality
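    As a purely illustrative sketch of the tabulation reported above (not the study's data, and not the Jing or CRIE 1.0 formulas themselves), the following shows how the share of websites exceeding the recommended 6th-grade level can be computed once per-site grade estimates are available; the grade values are placeholders.

```python
# Illustrative only: summarizing readability results against the recommended grade level.
RECOMMENDED_GRADE = 6

def share_above_recommended(grade_levels, threshold=RECOMMENDED_GRADE):
    """Return the fraction of websites whose estimated reading grade exceeds the threshold."""
    above = sum(1 for g in grade_levels if g > threshold)
    return above / len(grade_levels)

# Placeholder grade estimates, one per analysed website, for two hypothetical formulas:
formula_a_grades = [5.8, 7.2, 5.5, 4.9]
formula_b_grades = [8.1, 9.3, 7.0, 5.8]
print(f"Formula A: {share_above_recommended(formula_a_grades):.0%} above grade {RECOMMENDED_GRADE}")
print(f"Formula B: {share_above_recommended(formula_b_grades):.0%} above grade {RECOMMENDED_GRADE}")
```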

    Electoral Reform, Distributive Politics, and Parties in the Taiwanese Congress

    Estimating the preferences of parties and politicians is key to understanding interparty relations, polarisation, and electoral competition. This thesis investigates how Taiwan's 2008 electoral reform, from a single non-transferable vote (SNTV) system to single-member districts (SMD), affected legislators' behaviour and electoral strategies, by analysing historical roll calls and parliamentary questions. I present several pieces of empirical evidence to answer the following research questions: (1) Does the electoral reform mitigate intraparty competition and increase party cohesion? (2) Does the reform reduce the regional particularism expressed in parliamentary questions while increasing promises of universalist policies? (3) Many legislators changed careers to become municipal mayors because the reform reduced the number of seats: do mayors with longer congressional careers help their municipalities receive more distributive spending? The thesis applies ideal point estimation and natural language processing techniques currently deployed at the frontier of legislative studies, and strengthens the understanding of Taiwanese party politics and party competition inside the congress (the Legislative Yuan). Using unique legislative data covering the pre- and post-reform periods, I find that the reform did not immediately reduce intraparty competition but briefly polarised interparty relations, leaving congress in chaos during the transition. However, legislators' incentive to cultivate personal votes by asking particularistic parliamentary questions decreased after the reform, while attention to regulatory policies increased. Last, the thesis finds that municipalities whose mayors spent longer careers in the legislature are more likely to be allocated higher distributive benefits. The effect is even more substantial if those mayors had previously been connected to the legislative standing committees
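    A minimal sketch of one-dimensional ideal point estimation of the kind the thesis mentions, assuming NumPy and the standard two-parameter logistic item-response model P(yea_ij) = sigmoid(beta_j * x_i - alpha_j); it is not the thesis's implementation, and all names are illustrative.

```python
# Illustrative sketch: gradient-ascent ideal point estimation from roll-call votes.
import numpy as np

def estimate_ideal_points(votes, n_iter=2000, lr=0.01, seed=0):
    """votes: (n_legislators, n_rollcalls) array with 1 = yea, 0 = nay, np.nan = absent."""
    rng = np.random.default_rng(seed)
    n, m = votes.shape
    x = rng.normal(0, 0.1, n)        # legislator ideal points
    alpha = np.zeros(m)              # vote difficulty
    beta = rng.normal(0, 0.1, m)     # vote discrimination
    mask = ~np.isnan(votes)
    y = np.nan_to_num(votes)
    for _ in range(n_iter):
        logits = np.outer(x, beta) - alpha          # (n, m)
        p = 1.0 / (1.0 + np.exp(-logits))
        err = (y - p) * mask                        # gradient of the Bernoulli log-likelihood
        x += lr * (err @ beta) - lr * 0.01 * x      # small L2 term anchors the scale
        beta += lr * (err.T @ x) - lr * 0.01 * beta
        alpha += lr * (-err.sum(axis=0))
    return x, alpha, beta

# Toy usage: two blocs that mostly vote together within bloc.
votes = np.array([[1, 1, 0, 0], [1, 1, 0, 1], [0, 0, 1, 1], [0, 1, 1, 1]], dtype=float)
ideal_points, _, _ = estimate_ideal_points(votes)
print(ideal_points)
```

    The within-party spread of estimated ideal points can then serve as a rough cohesion measure, and comparing it across the pre- and post-reform periods speaks to research question (1).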

    The Taming of the Shrew - non-standard text processing in the Digital Humanities

    Natural language processing (NLP) has focused on the automatic processing of newspaper texts for many years. With the growing importance of text analysis in areas such as spoken language understanding, social media processing and the interpretation of text material from the humanities, techniques and methodologies have to be reviewed and redefined, since so-called non-standard texts pose challenges on the lexical and syntactic level, especially for machine-learning-based approaches. Automatic processing tools developed on the basis of newspaper texts show decreased performance on texts with divergent characteristics. Digital Humanities (DH), a field that has risen to prominence in recent decades, holds a variety of examples of this kind of text. Thus, the computational analysis of the relationships among Shakespeare's dramatic characters requires the adjustment of processing tools to English texts from the 16th century in dramatic form. Likewise, the investigation of narrative perspective in Goethe's ballads calls for methods that can handle German verse from the 18th century. In this dissertation, we put forward a methodology for NLP in a DH environment. We investigate how an interdisciplinary context, in combination with specific goals within projects, influences the general NLP approach. We suggest thoughtful collaboration and increased attention to the easy applicability of resulting tools as a solution for differences in the store of knowledge between project partners. Projects in DH are not only constituted by the automatic processing of texts but are usually framed by the investigation of a research question from the humanities. As a consequence, time limitations complicate the successful implementation of analysis techniques, especially since the diversity of texts impairs the transferability and reusability of tools beyond a specific project. We respond to this with modular, and thus easily adjustable, project workflows and system architectures. Several instances serve as examples of our methodology on different levels. We discuss modular architectures that balance time-saving solutions and problem-specific implementations, using the example of automatic post-correction of the output of an optical character recognition system. We address the problem of data diversity and low-resource situations by investigating different approaches to non-standard text processing. We examine two main techniques: text normalization and tool adjustment. Text normalization aims at transforming non-standard text in order to assimilate it to the standard, whereas tool adjustment works in the opposite direction, enabling tools to successfully handle a specific kind of text. We focus on the task of part-of-speech tagging to illustrate various approaches to processing historical texts as an instance of non-standard text. We discuss how the level of deviation from a standard form influences the performance of different methods. Our approaches shed light on the importance of data quality and quantity and emphasize the indispensability of annotations for effective machine learning. In addition, we highlight the advantages of problem-driven approaches where the purpose of a tool is clearly formulated through the research question. Another significant finding to emerge from this work is a summary of the experiences and knowledge gained through collaborative projects between computer scientists and humanists.
We reflect on various aspects of the elaboration and formalization of research questions in the DH and assess the limitations and possibilities of the computational modeling of humanistic research questions. An emphasis is placed on the interplay between expert knowledge of a subject of investigation and the implementation of tools for that purpose, and on the resulting advantages, such as the targeted improvement of digital methods through purposeful manual correction and error analysis. We show obstacles and opportunities and give prospects and directions for future development in this realm of interdisciplinary research
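    An illustrative sketch of the text-normalization route described above, assuming a toy spelling lexicon and a single rule for Early Modern English, with NLTK's off-the-shelf tagger standing in for a modern-text tool; nothing here is the dissertation's actual resource.

```python
# Illustrative sketch: normalize historical spelling, then tag with a tagger trained on modern text.
import re
import nltk  # requires the NLTK POS tagger model, e.g. nltk.download('averaged_perceptron_tagger')

NORMALIZATION_LEXICON = {
    "thou": "you", "thee": "you", "thy": "your", "hath": "has", "doth": "does",
}

def normalize_token(token):
    lowered = token.lower()
    if lowered in NORMALIZATION_LEXICON:
        return NORMALIZATION_LEXICON[lowered]
    # Toy rule: "-eth" verb endings -> modern "-s" (speaketh -> speaks)
    return re.sub(r"eth$", "s", lowered) if lowered.endswith("eth") else token

def tag_nonstandard(tokens):
    normalized = [normalize_token(t) for t in tokens]
    # Tag the normalized forms, but report tags against the original tokens.
    tags = nltk.pos_tag(normalized)
    return list(zip(tokens, [tag for _, tag in tags]))

print(tag_nonstandard(["Thou", "speaketh", "wisely"]))
```

    Tool adjustment would instead leave the text untouched and retrain or adapt the tagger on annotated non-standard material, which is why annotation quality and quantity matter so much in that setting.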

    Combining translation into the second language and second language learning: an integrated computational approach

    This thesis explores the area where translation and language learning intersect. However, this intersection is not one in the traditional sense of second language teaching, where translation is used as a means for learning a foreign language. This thesis treats translating into the foreign language as a separate entity, one that is as important as learning the foreign language itself. Thus the discussion in this thesis is especially relevant to an academic institution which contemplates training foreign language learners who can perform translation into the foreign language at a professional level. The thesis concentrates on developing a pedagogical model which can achieve the goal of fostering linguistic competence and translation competence at the same time. It argues that constructing such a model under a computerised framework is a viable approach, since the task of translation nowadays relies heavily on all kinds of

    Semi-Supervised Named Entity Recognition: Learning to Recognize 100 Entity Types with Little Supervision

    Named Entity Recognition (NER) aims to extract and to classify rigid designators in text such as proper names, biological species, and temporal expressions. There has been growing interest in this field of research since the early 1990s. In this thesis, we document a trend moving away from handcrafted rules, and towards machine learning approaches. Still, recent machine learning approaches have a problem with annotated data availability, which is a serious shortcoming in building and maintaining large-scale NER systems.
In this thesis, we present an NER system built with very little supervision. Human supervision is indeed limited to listing a few examples of each named entity (NE) type. First, we introduce a proof-of-concept semi-supervised system that can recognize four NE types. Then, we expand its capacities by improving key technologies, and we apply the system to an entire hierarchy comprised of 100 NE types.
Our work makes the following contributions: the creation of a proof-of-concept semi-supervised NER system; the demonstration of an innovative noise filtering technique for generating NE lists; the validation of a strategy for learning disambiguation rules using automatically identified, unambiguous NEs; and finally, the development of an acronym detection algorithm, thus solving a rare but very difficult problem in alias resolution.
We believe semi-supervised learning techniques are about to break new ground in the machine learning community. In this thesis, we show that limited supervision can build complete NER systems. On standard evaluation corpora, we report performances that compare to baseline supervised systems in the task of annotating NEs in texts.
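    A minimal sketch of acronym detection in the spirit of the contribution listed above, assuming a Schwartz-and-Hearst-style alignment of a parenthesized short form with the words that precede it; it is not the thesis's actual algorithm.

```python
# Illustrative sketch: detect (acronym, expansion) pairs like "Named Entity Recognition (NER)".
import re

def find_acronym_pairs(text, max_window=6):
    """Yield (acronym, expansion) pairs by matching acronym initials to preceding words."""
    for match in re.finditer(r"\(([A-Z]{2,10})\)", text):
        short = match.group(1)
        words = text[:match.start()].split()[-max_window:]
        # Try progressively shorter candidate expansions ending right before '('.
        for start in range(len(words)):
            candidate = words[start:]
            initials = "".join(w[0].upper() for w in candidate)
            if initials == short:
                yield short, " ".join(candidate)
                break

print(list(find_acronym_pairs("Named Entity Recognition (NER) aims to extract rigid designators.")))
```

    In an alias-resolution setting, each detected pair would then let the system treat the acronym and its expansion as mentions of the same entity.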

    Static and dynamic metaphoricity in U.S.-China trade discourse: A transdisciplinary perspective

    Metaphor scholars have widely explored metaphor use in political discourse. Nevertheless, current research does not account for gradable metaphoricity in political discourse analysis. This dissertation fills this gap by addressing the issue within two frameworks: (1) viewing political metaphor from a static and gradient perspective (source-target mapping; Conventional vs. Novel vs. Dead metaphors), and (2) viewing political metaphor from a gradable and dynamic perspective (a matter of salience and awareness of metaphoricity). A systematic literature review in chapter 2 points out that the static and dynamic perspectives differ significantly in underlying assumptions and organizing principles, although both are indistinctly referred to by metaphor scholars as constituting a 'gradable' view. The former treats metaphor as a static conceptual or lexical unit, whereas the latter accords a central role to the activation of metaphoricity in metaphorical expressions. To advance the theory of the dynamic view in political discourse, chapter 3 offers a usage-based model of gradable and dynamic metaphors, the YinYang Dynamics of Metaphoricity (YYDM). In addition, this thesis investigates political metaphors from an interdisciplinary angle, incorporating theory from the field of International Relations. An empirical evaluation of political (discourse) studies in chapter 4 shows that transdisciplinary perspectives are largely absent. Addressing these gaps, this dissertation reports on two empirical analyses of trade metaphors in a large corpus representing the official trade positions of the United States and China during the presidencies of Bill Clinton and Jiang Zemin (1993-1997) as well as Donald Trump and Xi Jinping (2017-2021). Based on a codebook of a cross-linguistic metaphor identification procedure in chapter 5, the first empirical part contributes to the static and gradient perspective and includes two corpus-based studies of metaphorical framing of trade (chapters 6-7). Chapter 6's socio-cognitive analysis of the diachronic and cross-linguistic use of source domains reveals that source domains are semantic fields that vary with trade discourse contexts (interests, power, and power relations). Chapter 7 shows that the use of trade metaphors (source domains of Conventional and Novel metaphors) to construct and legitimize political ideologies correlates with differences between political genres. The second part contributes to the gradable and dynamic view by applying the transdisciplinary model of YinYang Dynamics of Metaphoricity in chapters 8-10. In chapter 8, an evaluation of the new model in the Clinton-Jiang trade discourse shows that the dynamic cognitive process (transformation of metaphoricity) and rhetorical process (argumentation and persuasion) develop together with the evolution of the socio-political process (trade perspectives and trade events). Chapter 9 investigates the transformation of metaphoricity in the Trump-Xi trade discourse and finds that cognitive processes (patterns of metaphoricity activation) and affective processes (emotions or sentiments) develop together with the evolution of socio-political processes (trade perspectives and trade events).
Based on the findings in chapters 8-9, chapter 10 further shows several phenomena in the Clinton-Jiang and Trump-Xi trade discourses: the movement of metaphors on the metaphoricity spectrum, the bodily motivation of gradable and dynamic metaphoricity, and the interconnected political discourse systems. Drawing on all the theoretical and empirical insights revealed in the dissertation, the final section of the thesis outlines a future direction, i.e., moving towards a transdisciplinary and dynamic approach to metaphor in political discourse analysis