671 research outputs found

    Cross-lingual AMR Aligner: Paying Attention to Cross-Attention

    This paper introduces a novel aligner for Abstract Meaning Representation (AMR) graphs that can scale cross-lingually, and is thus capable of aligning units and spans in sentences of different languages. Our approach leverages modern Transformer-based parsers, which inherently encode alignment information in their cross-attention weights, allowing us to extract this information during parsing. This eliminates the need for English-specific rules or the Expectation Maximization (EM) algorithm that have been used in previous approaches. In addition, we propose a guided supervised method using alignment to further enhance the performance of our aligner. We achieve state-of-the-art results in the benchmarks for AMR alignment and demonstrate our aligner's ability to obtain them across multiple languages. Our code will be available at https://www.github.com/Babelscape/AMR-alignment. Comment: ACL 2023. Please cite authors correctly using both last names ("Martínez Lorenzo", "Huguet Cabot").
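    As a rough illustration of reading alignments off cross-attention weights, the sketch below averages toy attention matrices over heads and aligns each generated graph token to the source token with the highest weight. The function names, the head averaging, the argmax rule, and all numbers are assumptions made for this example; the abstract does not specify how the aligner selects or combines attention weights.

```python
from typing import List, Tuple

Matrix = List[List[float]]  # rows: generated graph tokens, columns: source tokens


def average_heads(heads: List[Matrix]) -> Matrix:
    """Element-wise mean of several attention matrices of identical shape."""
    n_heads = len(heads)
    n_rows, n_cols = len(heads[0]), len(heads[0][0])
    return [
        [sum(head[i][j] for head in heads) / n_heads for j in range(n_cols)]
        for i in range(n_rows)
    ]


def align_by_argmax(attention: Matrix, graph_tokens: List[str]) -> List[Tuple[str, int]]:
    """Pair each graph token with the index of its most-attended source token."""
    alignment = []
    for token, row in zip(graph_tokens, attention):
        best_source_index = max(range(len(row)), key=row.__getitem__)
        alignment.append((token, best_source_index))
    return alignment


# Toy data (hypothetical): sentence "the boy wants to go" and three graph tokens.
source_tokens = ["the", "boy", "wants", "to", "go"]
graph_tokens = ["want-01", "boy", "go-02"]
head_a = [
    [0.05, 0.10, 0.70, 0.10, 0.05],
    [0.10, 0.75, 0.05, 0.05, 0.05],
    [0.05, 0.05, 0.10, 0.20, 0.60],
]
head_b = [
    [0.10, 0.10, 0.60, 0.10, 0.10],
    [0.20, 0.60, 0.10, 0.05, 0.05],
    [0.05, 0.05, 0.10, 0.30, 0.50],
]

alignment = align_by_argmax(average_heads([head_a, head_b]), graph_tokens)
for token, index in alignment:
    print(f"{token} -> {source_tokens[index]}")
```

    Because nothing in this procedure depends on the language of the source sentence, the same extraction step applies unchanged to non-English input, which is the property the abstract highlights.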

    Comparing collocations in translated and learner language

    This paper compares use of collocations by Italian learners writing in and translating into English, conceptualising the two tasks as different modes of constrained language production and adopting Halverson’s (2017) Revised Gravitational Pull hypothesis as a theoretical model. A particular focus is placed on identifying a method for comparing datasets containing translations and essays, assembled opportunistically and varying in size and structure. The study shows that lexical association scores for dependency-defined word pairs are significantly higher in translations than essays. A qualitative analysis of a subset of collocations shared and unique to either mode shows that the former set features more collocations with direct cross-linguistic links (connectivity), and that the source/first language seems to affect both modes similarly. We tentatively conclude that second/target language salience effects are more visible in translation than second language use, while connectivity and source language salience affect both modes of bilingual processing similarly, regardless of the mediation variable.
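    To make "lexical association scores for dependency-defined word pairs" concrete, the sketch below computes pointwise mutual information (PMI) over a toy list of (head, dependent) pairs. PMI is an assumption chosen for illustration; the abstract does not name the association measure used, and the word pairs are invented.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (head lemma, dependent lemma) from a dependency parse


def pmi_scores(pairs: List[Pair]) -> Dict[Pair, float]:
    """PMI in bits: log2( p(head, dep) / (p(head) * p(dep)) ) for each observed pair."""
    total = len(pairs)
    pair_counts = Counter(pairs)
    head_counts = Counter(head for head, _ in pairs)
    dep_counts = Counter(dep for _, dep in pairs)
    return {
        (head, dep): math.log2(count * total / (head_counts[head] * dep_counts[dep]))
        for (head, dep), count in pair_counts.items()
    }


# Hypothetical verb-object pairs extracted from a small text sample.
toy_pairs = [
    ("make", "decision"),
    ("make", "decision"),
    ("make", "mistake"),
    ("take", "decision"),
    ("take", "break"),
    ("take", "break"),
]

scores = pmi_scores(toy_pairs)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 3))
```

    A positive score means the pair co-occurs more often than its parts would by chance. Comparing the distribution of such scores between two datasets is one way to contrast them even when they differ in size, though raw PMI is known to favour rare pairs.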

    Coordination in telephone-based remote interpreting

    Telephone-based remote interpreting has come into widespread use in multilingual encounters, all the more so in times of refugee crises and the large influx of asylum-seekers into Europe. Nevertheless, the linguistic practices in this mode of communication have not yet been examined comprehensively. This article therefore investigates selected aspects of turn-taking and clarification sequences during semi-authentic telephone-interpreted counselling sessions for refugees (Arabic–German). A quantitative analysis reveals that limited audibility makes it more difficult for interpreters to claim their turn successfully; in most cases, however, turn-taking occurs smoothly. The trouble sources that trigger queries are mainly content-related and interpreters vary greatly in the ways they deal with such difficulties. Contrary to what one might expect, the study shows that coordination fails only rarely during telephone-based remote interpreting.

    Predicate Matrix: an interoperable lexical knowledge base for predicates

    183 p. The Predicate Matrix is a new lexical-semantic resource resulting from the integration of multiple knowledge sources, including FrameNet, VerbNet, PropBank, and WordNet. The Predicate Matrix provides an extensive and robust lexicon that improves interoperability among these semantic resources. Its creation is based on the integration of Semlink and new mappings obtained with automatic methods that link semantic knowledge at the lexical and role levels. We have also extended the Predicate Matrix to cover nominal predicates (English, Spanish) and predicates in other languages (Spanish, Catalan, and Basque). As a result, the Predicate Matrix provides a multilingual lexicon that enables interoperable semantic analysis across multiple languages.

    Comparing the production of a formula with the development of L2 competence

    This pilot study investigates the production of a formula with the development of L2 competence over proficiency levels of a spoken learner corpus. The results show that the formula in beginner production data is likely being recalled holistically from learners’ phonological memory rather than generated online, identifiable by virtue of its fluent production in absence of any other surface structure evidence of the formula’s syntactic properties. As learners’ L2 competence increases, the formula becomes sensitive to modifications which show structural conformity at each proficiency level. The transparency between the formula’s modification and learners’ corresponding L2 surface structure realisations suggests that it is the independent development of L2 competence which integrates the formula into compositional language, and ultimately drives the SLA process forward.

    Ditransitives in Germanic languages. Synchronic and diachronic aspects

    This volume brings together twelve empirical studies on ditransitive constructions in Germanic languages and their varieties, past and present. Specifically, the volume includes contributions on a wide variety of Germanic languages, including English, Dutch, and German, but also Danish, Swedish, and Norwegian, as well as lesser-studied ones such as Faroese. While the first part of the volume focuses on diachronic aspects, the second part showcases a variety of synchronic aspects relating to ditransitive patterns. Methodologically, the volume covers both experimental and corpus-based studies. Questions addressed by the papers in the volume are, among others, issues like the cross-linguistic pervasiveness and cognitive reality of factors involved in the choice between different ditransitive constructions, or differences and similarities in the diachronic development of ditransitives. The volume’s broad scope and comparative perspective offer comprehensive insights into well-known phenomena and further our understanding of variation across languages of the same family.

    Mind the source data! : Translation equivalents and translation stimuli from parallel corpora

    Statements like ‘Word X of language A is translated with word Y of language B’ are incorrect, although they are quite common: words cannot be translated, as translation takes place on the level of sentences or higher. A better term for the correspondence between lexical items of source texts and their matches in target texts would be translation equivalence (Teq). In addition to Teq, there exists a reverse relation, translation stimulation (Tst), which is a correspondence between the lexical items of target texts and their matches (=stimuli) in source texts. Translation equivalents and translation stimuli must be studied separately and based on natural direct translations. It is not advisable to use pseudo-parallel texts, i.e. aligned pairs of translations from a ‘hub’ language, because such data do not reflect real translation processes. Both Teq and Tst are lexical functions, and they are not applicable to function words like prepositions, conjunctions, or particles, although it is technically possible to find Teq and Tst candidates for such words as well. The process of choosing function words when translating does not proceed in the same way as choosing lexical units: first, a relevant construction is chosen, and next, it is filled with relevant function words. In this chapter, the difference between Teq and Tst will be shown in examples from Russian–Finnish and Finnish–Russian parallel corpora. The use of Teq and Tst for translation studies and contrastive semantic research will be discussed, along with the importance of paying attention to the nature of the texts when analysing corpus findings. Accepted version. Peer reviewed.

    Translating Islamic Law: the postcolonial quest for minority representation

    This research sets out to investigate how culture-specific or signature concepts are rendered in English-language discourse on Islamic, or ‘shariʿa’ law, which has Arabic roots. A large body of literature has investigated Islamic law from a technical perspective. However, from the perspective of linguistics and translation studies, little attention has been paid to the lexicon that makes up this specialised discourse. Much of the commentary has so far been prescriptive, with limited empirical evidence. This thesis aims to bridge this gap by exploring how ‘culturalese’ (i.e., ostensive cultural discourse) travels through language, as evidenced in the self-built Islamic Law Corpus (ILC), a 9-million-word monolingual English corpus, covering diverse genres on Islamic finance and family law. Using a mixed methods design, the study first quantifies the different linguistic strategies used to render shariʿa-based concepts in English, in order to explore ‘translation’ norms based on linguistic frequency in the corpus. This quantitative analysis employs two models: profile-based correspondence analysis, which considers the probability of lexical variation in expressing a conceptual category, and logistic regression (using MATLAB programming software), which measures the influence of the explanatory variables ‘genre’, ‘legal function’ and ‘subject field’ on the choice between an Arabic loanword and an endogenous English lexeme, i.e., a close English equivalent. The findings are then interpreted qualitatively in the light of postcolonial translation agendas, which aim to preserve intangible cultural heritage and promote the representation of minoritised groups. 
    The research finds that the English-language discourse on Islamic law is characterised by linguistic borrowing and glossing, implying an ideologically driven variety of English that can be usefully labelled as a kind of ‘Islamgish’ (blending ‘Islamic’ and ‘English’) aimed at retaining symbols of linguistic hybridity. The regression analysis confirms the influence of the above-mentioned contextual factors on the use of an Arabic loanword versus English alternatives.

    Design and Evaluation of Machine Translation Systems for Practical Applications (実応用を志向した機械翻訳システムの設計と評価)

    Tohoku University. Doctoral thesis (Information Science).

    Advances in monolingual and crosslingual automatic disability annotation in Spanish

    Background: Unlike that of diseases, the automatic recognition of disabilities has received little attention in medical NLP. Progress in this direction is hampered by obstacles like the lack of annotated corpora. Neural architectures learn to translate sequences from spontaneous representations into their corresponding standard representations given a set of samples. The aim of this paper is to present the latest advances in monolingual (Spanish) and crosslingual (from English to Spanish and vice versa) automatic disability annotation. The task consists of identifying disability mentions in medical texts written in Spanish within a collection of abstracts from journal papers related to the biomedical domain. Results: To carry out the task, we have combined deep learning models that use different embedding granularities for sequence-to-sequence tagging with a simple acronym and abbreviation detection module to boost the coverage. Conclusions: Our monolingual experiments demonstrate that a good combination of different word embedding representations provides better results than single representations, significantly outperforming the state of the art in disability annotation in Spanish. Additionally, we have experimented with crosslingual (zero-shot) transfer for disability annotation between English and Spanish, with interesting results that might help overcome the data scarcity bottleneck, which is especially significant for disabilities. This work was partially funded by the Spanish Ministry of Science and Innovation (MCI/AEI/FEDER, UE, DOTT-HEALTH/PAT-MED PID2019-106942RB-C31), the Basque Government (IXA IT1570-22), MCIN/AEI/10.13039/501100011033 and European Union NextGeneration EU/PRTR (DeepR3, TED2021-130295B-C31) and the EU ERA-Net CHIST-ERA and the Spanish Research Agency (ANTIDOTE PCI2020-120717-2).