4 research outputs found

    Identifying Semantic Divergences Across Languages

    Get PDF
    Cross-lingual resources such as parallel corpora and bilingual dictionaries are cornerstones of multilingual natural language processing (NLP). They have been used to study the nature of translation, train automatic machine translation systems, as well as to transfer models across languages for an array of NLP tasks. However, the majority of work in cross-lingual and multilingual NLP assumes that translations recorded in these resources are semantically equivalent. This is often not the case---words and sentences that are considered to be translations of each other frequently divergein meaning, often in systematic ways. In this thesis, we focus on such mismatches in meaning in text that we expect to be aligned across languages. We term such mismatches as cross-lingual semantic divergences. The core claim of this thesis is that translation is not always meaning preserving which leads to cross-lingual semantic divergences that affect multilingual NLP tasks. Detecting such divergences requires ways of directly characterizing differences in meaning across languages through novel cross-lingual tasks, as well as models that account for translation ambiguity and do not rely on expensive, task-specific supervision. We support this claim through three main contributions. First, we show that a large fraction of data in multilingual resources (such as parallel corpora and bilingual dictionaries) is identified as semantically divergent by human annotators. Second, we introduce cross-lingual tasks that characterize differences in word meaning across languages by identifying the semantic relation between two words. We also develop methods to predict such semantic relations, as well as a model to predict whether sentences in different languages have the same meaning. Finally, we demonstrate the impact of divergences by applying the methods developed in the previous sections to two downstream tasks. We first show that our model for identifying semantic relations between words helps in separating equivalent word translations from divergent translations in the context of bilingual dictionary induction, even when the two words are close in meaning. We also show that identifying and filtering semantic divergences in parallel data helps in training a neural machine translation system twice as fast without sacrificing quality

    Tune your brown clustering, please

    Get PDF
    Brown clustering, an unsupervised hierarchical clustering technique based on ngram mutual information, has proven useful in many NLP applications. However, most uses of Brown clustering employ the same default configuration; the appropriateness of this configuration has gone predominantly unexplored. Accordingly, we present information for practitioners on the behaviour of Brown clustering in order to assist hyper-parametre tuning, in the form of a theoretical model of Brown clustering utility. This model is then evaluated empirically in two sequence labelling tasks over two text types. We explore the dynamic between the input corpus size, chosen number of classes, and quality of the resulting clusters, which has an impact for any approach using Brown clustering. In every scenario that we examine, our results reveal that the values most commonly used for the clustering are sub-optimal

    24th Nordic Conference on Computational Linguistics (NoDaLiDa)

    Get PDF

    Concordancing Software in Practice: An investigation of searches and translation problems across EU official languages

    Get PDF
    2011/2012The present work reports on an empirical study aimed at investigating translation problems across multiple language pairs. In particular, the analysis is aimed at developing a methodological approach to study concordance search logs taken as manifestations of translation problems and, in a wider perspective, information needs. As search logs are a relatively unexplored data type within translation process research, a controlled environment was needed in order to carry out this exploratory analysis without incurring in additional problems caused by an excessive amount of variables. The logs were collected at the European Commission and contain a large volume of searches from English into 20 EU languages that staff translators working for the EU translation services submitted to an internally available multilingual concordancer. The study attempts to (i) identify differences in the searches (i.e. problems) based on the language pairs; and (ii) group problems into types. Furthermore, the interactions between concordance users and the tool itself have been examined to provide a translation-oriented perspective on the domain of Human-Computer Interaction. The study draws on the literature on translation problems, Information Retrieval and Web search log analysis, moving from the assumption that in the perspective of concordance searching, translation problems are best interpreted as information needs for which the concordancer is chosen as a form of external support. The structure of a concordance search is examined in all its parts and is eventually broken down into two main components: the 'Search Strategy' component and the 'Problem Unit' component. The former was further analyzed using a mainly quantitative approach, whereas the latter was addressed from a more qualitative perspective. The analysis of the Problem Unit takes into account the length of the search strings as well as their content and linguistic form, each addressed with a different methodological approach. Based on the understanding of concordance searches as manifestations of translation problems, a user- centered classification of translation-oriented information needs is developed to account for as many "problem" scenarios as possible. According to the initial expectations, different languages should experience different problems. This assumption could not be verified: the 20 different language pairs considered in this study behaved consistently on many levels and, due to the specific research environment, no definite conclusions could be reached as regards the role of the language family criterion for problem identification. The analysis of the 'Problem Unit' component has highlighted automatized support for translating Named Entities as a possible area for further research in translation technology and the development of computer-based translation support tools. Finally, the study indicates (concordance) search logs as an additional data type to be used in experiments on the translation process and for triangulation purposes, while drawing attention on the concordancer as a type of translation aid to be further fine-tuned for the needs of professional translators. ***Il presente lavoro consiste in uno studio empirico sui problemi di traduzione che emergono quando si considerano diverse coppie di lingue e in particolare sviluppa una metodologia per analizzare i log di ricerche effettuate dai traduttori in un software di concordanza (concordancer) quali manifestazioni di problemi di traduzione che, visti in una prospettiva più ampia, si possono anche considerare dei "bisogni d'informazione" (information needs). I log di ricerca costituiscono una tipologia di dato ancora relativamente nuova e inesplorata nell'ambito delle ricerche sul processo di traduzione e pertanto è emersa la necessità di svolgere un'analisi di tipo esplorativo in un contesto controllato onde evitare le problematiche aggiuntive derivanti da un numero eccessivo di variabili. I log di ricerca sono stati raccolti presso la Commissione europea e contengono quantitativi ingenti di ricerche effettuate dai traduttori impiegati presso i servizi di traduzione dell'Unione europea in un concordancer multilingue disponibile come risorsa interna. L'analisi si propone di individuare le differenze nelle ricerche (e quindi nei problemi) a seconda della coppia di lingue selezionata e di raggruppare tali problemi in tipologie. Lo studio fornisce inoltre informazioni sulle modalità di interazione tra gli utenti e il software nell'ambito di un contesto traduttivo, contribuendo alla ricerca nel campo dell'interazione uomo-macchina (Human-Computer Interaction). Il presente studio trae spunto dalla letteratura sui problemi di traduzione, sull'estrazione d'informazioni (Information Retrieval) e sulle ricerche nel Web e si propone di considerare i problemi di traduzione associati all'impiego di uno strumento per le concordanze quali bisogni di informazione per i quali lo strumento di concordanze è stato scelto come forma di supporto esterna. Ogni singola ricerca è stata esaminata e scomposta in due elementi principali: la "strategia di ricerca" (Search Strategy) e l'"unità problematica" (Problem Unit) che vengono studiati rispettivamente usando approcci prevalentemente di tipo quantitativo e qualitativo. L'analisi dell'unità problematica prende in considerazione la lunghezza, il contenuto e la forma linguistica delle stringhe, analizzando ciascuna con una metodologia di lavoro appositamente studiata. Avendo interpretato le ricerche di concordanze quali manifestazioni di bisogni d'informazione, l'analisi prosegue con la definizione di una serie di categorie di bisogni d'informazione (o problemi) legati alla traduzione e incentrati sul singolo utente al fine di includere quanti più scenari di ricerca possibile. L'assunto iniziale in base al quale lingue diverse manifesterebbero problemi diversi non è stato verificato empiricamente in quanto le 20 coppie di lingue esaminate hanno mostrato comportamenti alquanto similari nei diversi livelli di analisi. Vista la peculiarità dei dati utilizzati e la specificità dell'Unione europea come contesto di ricerca, non è stato possibile ottenere conclusioni definitive in merito al ruolo delle famiglie linguistiche quali indicatori di problemi, rispetto ad altri criteri di classificazione. L'analisi dell'unità problematica ha evidenziato le entità denominate (Named Entities) quale possibile oggetto di futuri progetti di ricerca nell'ambito delle tecnologie della traduzione. Oltre a offrire un contributo per i futuri sviluppi nell'ambito dei supporti informatici alla traduzione, con il presente studio si è voluto altresì presentare i log delle ricerche (di concordanze) quale tipologia aggiuntiva di dati per lo studio del processo di traduzione e per la triangolazione dei risultati empirico-sperimentali, cercando anche di suggerire possibili tratti migliorativi dei software di concordanza sulla base dei bisogni di informazione riscontrati nei traduttori.XXV Ciclo198
    corecore