    Logographic Information Aids Learning Better Representations for Natural Language Inference

    Statistical language models conventionally implement representation learning based on the contextual distribution of words or other formal units, whereas any information related to the logographic features of written text is often ignored, on the assumption that it can be retrieved from co-occurrence statistics. On the other hand, as language models become larger and require more data to learn reliable representations, such assumptions may start to break down, especially under conditions of data sparsity. Many languages, including Chinese and Vietnamese, use logographic writing systems where surface forms are represented as a visual organization of smaller graphemic units, which often contain many semantic cues. In this paper, we present a novel study which explores the benefits of providing language models with logographic information in learning better semantic representations. We test our hypothesis on the natural language inference (NLI) task by evaluating the benefit of computing multi-modal representations that combine contextual information with glyph information. Our evaluation results in six languages with different typology and writing systems suggest significant benefits of using multi-modal embeddings in languages with logographic systems, especially for words with fewer occurrence statistics. Comment: Accepted to AACL Findings

    2kenize: Tying Subword Sequences for Chinese Script Conversion

    Simplified Chinese to Traditional Chinese character conversion is a common preprocessing step in Chinese NLP. Despite this, current approaches perform poorly because they do not take into account that a simplified Chinese character can correspond to multiple traditional characters. Here, we propose a model that can disambiguate between mappings and convert between the two scripts. The model is based on subword segmentation, two language models, and a method for mapping between subword sequences. We further construct benchmark datasets for topic classification and script conversion. Our proposed method outperforms previous Chinese character conversion approaches by 6 points in accuracy. These results are further confirmed in a downstream application, where 2kenize is used to convert the pretraining dataset for topic classification. An error analysis reveals that our method's particular strengths are in dealing with code-mixing and named entities. Comment: Accepted to ACL 202

    Improving the translation environment for professional translators

    When using computer-aided translation systems in a typical, professional translation workflow, there are several stages at which there is room for improvement. The SCATE (Smart Computer-Aided Translation Environment) project investigated several of these aspects, both from a human-computer interaction point of view and from a purely technological side. This paper describes the SCATE research with respect to improved fuzzy matching, parallel treebanks, the integration of translation memories with machine translation, quality estimation, terminology extraction from comparable texts, the use of speech recognition in the translation process, and human-computer interaction and interface design for the professional translation environment. For each of these topics, we describe the experiments we performed and the conclusions drawn, providing an overview of the highlights of the entire SCATE project.

    Word Reading, Reading Comprehension, and Eye Movements During Reading in Chinese Persons with Aphasia

    Individuals with aphasia (IWA) often exhibit challenges in single word reading as well as in reading comprehension. Recently, eye-tracking technology has become instrumental in delving deeper into reading behaviors. Specifically, it has illuminated the differences in word reading and comprehension abilities among aphasic English speakers. However, there is a noticeable scarcity of research focusing on these aspects among Chinese IWA. The current study aimed to contrast the abilities of Chinese IWA and neurotypical controls in reading single words, with an emphasis on types like regular, irregular, and pseudowords, and reading comprehension abilities. Further, this study investigated the patterns of eye movements during paragraph reading, paying special attention to measures such as fixation durations and saccades. This study also examined the association of these eye-tracking measures with reading comprehension across both cohorts. The results indicate that the control group read more accurately across all word types compared to the IWA. The results also indicated that the IWA group exhibited longer fixation durations, more frequent fixations, and shorter saccade amplitudes when compared to the control group. Moreover, the control group consistently demonstrated superior reading comprehension accuracy across both language assessment and eye-tracking tasks. Notably, among the IWA, there were significant correlations between reading comprehension and both regular and irregular word reading. This association persisted even after rigorous statistical corrections. However, such correlations were absent in the control group. Further multiple regression analysis revealed that, even after controlling for education level and months post-stroke, a composite of regular and irregular word reading accounted for 60% and 58.5% of the variance in reading comprehension for the IWA and controls, respectively. 
    The pronounced influence of regular and irregular word reading on comprehension in IWA suggests potential avenues for targeted reading strategies or interventions. In conclusion, this research highlights the complexity of reading comprehension, suggesting a need for a holistic approach in future studies to explore various factors influencing reading in Chinese IWA and neurotypical individuals.

    Parafoveal processing of orthographic, morphological, and semantic information during reading Arabic: A boundary paradigm investigation

    Evidence shows that skilled readers extract information about upcoming words in the parafovea. Using the boundary paradigm, we investigated native Arabic readers' processing of orthographic, morphological, and semantic information available parafoveally. Target words were embedded in frame sentences, and prior to readers fixating them, one of the following previews was made available: (a) Identity preview; (b) Preview that shared the pattern morpheme with the target; (c) Preview that shared the root morpheme with the target; (d) Preview that was a synonym of the target word; (e) Preview in which two of the root letters were transposed, thus creating a new root while preserving all letter identities of the target; (f) Preview in which two of the root letters were transposed, thus creating a pronounceable pseudo root while also preserving all letter identities of the target; and (g) Preview that was unrelated to the target word and shared no information with it. The results showed that identity, root-preserving, and synonymous preview conditions yielded preview benefit. On the other hand, no benefit was obtained from the pattern-preserving previews, and significant disruption to processing was obtained from the previews that contained transposed root letters, particularly when this letter transposition created a new real root. The results thus reflect Arabic readers' dependence on morphological and semantic information, and suggest that these levels of representation are accessed as early as orthographic information. Implications for theory- and model-building, and the need to accommodate early morphological and semantic processing activities in more comprehensive models are further discussed.