215 research outputs found

    Carvalho: an English-Galician statistical machine translation system built from the English-Portuguese EuroParl parallel corpus

    In order to build reliable Statistical Machine Translation (SMT) engines between two languages, it is essential to have sufficiently large parallel corpora. Since the available English-Galician parallel corpora are not yet sufficient, other strategies must be followed. Important Romanicists, such as Coseriu (1987) or Cunha & Cintra (2002), have theorized that Galician, European Portuguese and Brazilian Portuguese are three varieties of the same linguistic system. From a practical Computational Linguistics standpoint, this assumption opens a new line of research that potentially supplies Galician with a large amount of computational resources from both European and Brazilian Portuguese. Thus, drawing on the English-Portuguese Europarl parallel corpus, imaxin|software has built an English-Galician phrase-based SMT prototype. To achieve that, the English-Portuguese parallel corpus was first converted into English-Galician using an Opentrad Portuguese-Galician Rule-based Machine Translation (RBMT) engine and a spelling converter: words not recognized by the RBMT engine are passed to the converter, which maps the etymological and historical spelling used in Portuguese to the Castilianized spelling of Galician. Secondly, using Moses (Koehn et al., 2007) and GIZA++ (Och & Ney, 2003), we built the English-Galician translation and language models of our prototype. The results obtained allow us to conclude that SMT tools for Galician can be drawn from Portuguese resources, a task that would otherwise have been unthinkable given the lack of English-Galician parallel corpora. We can also conclude that this strategy can be used to develop a great variety of computational resources, tools and applications for normative (ILG-RAG) Galician.
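
The spelling-conversion step described above can be pictured as a small set of rewrite rules. The following Python sketch is only an illustration: the function name `to_galician_spelling` and the four rules are assumptions, chosen from well-known Portuguese-Galician spelling correspondences (nh/ñ, lh/ll, -ção/-ción), and are not the converter used in the project.

```python
import re

# Illustrative rules only (hypothetical, not the project's rule set).
# Order matters: the plural ending is rewritten before the singular one.
RULES = [
    (re.compile(r"ções\b"), "cións"),  # nações  -> nacións
    (re.compile(r"ção\b"), "ción"),    # nação   -> nación
    (re.compile(r"nh"), "ñ"),          # caminho -> camiño
    (re.compile(r"lh"), "ll"),         # filho   -> fillo
]


def to_galician_spelling(word: str) -> str:
    """Apply each rewrite rule in turn to a Portuguese-spelled word."""
    for pattern, replacement in RULES:
        word = pattern.sub(replacement, word)
    return word


if __name__ == "__main__":
    for portuguese in ["caminho", "filho", "nação", "nações"]:
        print(portuguese, "->", to_galician_spelling(portuguese))
```

A real converter would need far more rules and exception lists; here the rules are applied blindly to any word, whereas the abstract describes the converter as a fallback for words the RBMT engine does not recognize.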

    Report on first selection of resources

    The central objective of the Metanet4u project is to contribute to the establishment of a pan-European digital platform that makes available language resources and services, encompassing both datasets and software tools, for speech and language processing, and supports a new generation of exchange facilities for them. Peer reviewed. Preprint.

    Lexicon induction and part-of-speech tagging of non-resourced languages without any bilingual resources

    We introduce a generic approach for transferring part-of-speech annotations from a resourced language to a non-resourced but etymologically close language. We first infer a bilingual lexicon between the two languages with methods based on character similarity, frequency similarity and context similarity. We then assign part-of-speech tags to these bilingual lexicon entries and annotate the remaining words on the basis of suffix analogy. We evaluate our approach on five language pairs of the Iberian Peninsula, reaching up to 95% precision on the lexicon induction task and up to 85% tagging accuracy.
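
The character-similarity signal and the tag transfer can be sketched as follows. This is a minimal Python illustration under stated assumptions: `char_similarity`, `induce_lexicon`, `transfer_tags` and the 0.7 threshold are hypothetical, similarity is computed with the standard library's `difflib.SequenceMatcher`, and the frequency, context and suffix-analogy components of the approach are left out.

```python
from difflib import SequenceMatcher


def char_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] (difflib's matching ratio)."""
    return SequenceMatcher(None, a, b).ratio()


def induce_lexicon(source_words, target_words, threshold=0.7):
    """Pair each source word with its most similar target word.

    Pairs whose similarity falls below `threshold` (an assumed value)
    are left out of the lexicon.
    """
    lexicon = {}
    for src in source_words:
        best = max(target_words, key=lambda tgt: char_similarity(src, tgt))
        if char_similarity(src, best) >= threshold:
            lexicon[src] = best
    return lexicon


def transfer_tags(lexicon, source_tags):
    """Give each induced target word the tag of its source counterpart."""
    return {tgt: source_tags[src] for src, tgt in lexicon.items() if src in source_tags}


if __name__ == "__main__":
    spanish = ["libro", "casa"]              # resourced language (toy data)
    portuguese = ["livro", "casa", "porta"]  # treated as non-resourced here
    lexicon = induce_lexicon(spanish, portuguese)
    print(lexicon)
    print(transfer_tags(lexicon, {"libro": "NOUN", "casa": "NOUN"}))
```

Surface similarity alone will pair false friends and miss cognates with larger spelling shifts, which is presumably why the approach combines it with frequency and context similarity.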

    Development and Pedagogical Applications of an Audio-Textual English-Spanish Parallel Literary Corpus for the Study of English Phonology

    The field of Data-Driven Learning (DDL), an approach to second language learning in which the student interacts directly with corpus data, has made much progress in only a few decades. However, certain frontiers have so far remained underexplored, mostly as a result of the limited technological capabilities available for a good portion of the field's existence. Until now, DDL has mainly centered on text corpora, leaving aside such aspects of language learning as oral comprehension and speech production. This doctoral dissertation presents the LITTERA corpus, and examines in depth how this English-Spanish parallel literary speech corpus can be applied to language learning within the framework of DDL. The dissertation begins with a general overview of the current state of DDL, followed by a detailed description of the creation and design of the LITTERA corpus. Then a series of potential pedagogical exercises are presented, aimed at showing how LITTERA can be applied to the learning of English phonology by Spanish-speaking students. The exercises set out to examine how the different features of English prosody (co-articulatory phenomena such as linking, blending, assimilation, elision, resyllabification and palatalization, as well as vowel reduction) can be studied in the data to improve students' oral comprehension and speech production. Furthermore, possible DDL question prompts are proposed to explore the different features in the classroom.

    The role of syntactic dependencies in compositional distributional semantics

    This article provides a preliminary semantic framework for Dependency Grammar in which lexical words are semantically defined as contextual distributions (sets of contexts) while syntactic dependencies are compositional operations on word distributions. More precisely, any syntactic dependency uses the contextual distribution of the dependent word to restrict the distribution of the head, and makes use of the contextual distribution of the head to restrict that of the dependent word. The interpretation of composite expressions and sentences, which are analyzed as a tree of binary dependencies, is performed by restricting the contexts of words dependency by dependency in a left-to-right incremental way. Consequently, the meaning of the whole composite expression or sentence is not a single representation, but a list of contextualized senses, namely the restricted distributions of its constituent (lexical) words. We report the results of two large-scale corpus-based experiments on two different natural language processing applications: paraphrasing and compositional translation. This work is funded by Project TELPARES, Ministry of Economy and Competitiveness (FFI2014-51978-C2-1-R), and the program “Ayuda Fundación BBVA a Investigadores y Creadores Culturales 2016”.

    The Translation of Lexicalized Metaphors in Interlinguistic and Intercultural Communication of Financial Security Discourse: A Corpus-Based Analysis of English and Spanish Texts about Money Laundering

    [EN] Financial crime is a significant factor in most transnational crime in general and is wide-reaching. Many critical stakeholders use specific metaphors in their communications to communicate security threats. Metaphors are often idiomatic speech that does not transfer easily from one language to another because they originate from cultural concepts. Within the public safety, regulatory and compliance community, key stakeholders from different linguistic backgrounds use English as a contact language to interact with their counterparts, the media, the public, and stakeholders to ensure regulatory compliance. Translating metaphors requires a special set of skills acquired through deep cultural knowledge and experience in both source and target cultures. The beginning of our research emanated from observing how language played a crucial role in relationships between everyone involved in the criminal justice process, not limited to the United States but also in a multitude of Spanish-speaking countries and geographical regions. Highly effective communication is critical for those who regulate against money laundering, those involved in compliance initiatives, law enforcement, and the general public to better recognize and prevent it. This project’s genesis came from interpreting criminal cases, translating documents in United States federal court cases, and observing how investigators followed the money trail to uncover illegal activity. The first-hand view of communications in that realm revealed how language played a crucial role in relationships between everyone involved in the criminal justice process, not only in the United States but also in many Spanish-speaking countries and geographical regions. Before this study, there had been little to no research on translating metaphors in the specialized language of financial regulatory compliance and enforcement.
The present study begins to fill that gap in research by providing a synchronic X-ray view of the current language spoken in that field through a corpus-based translation analysis of anti-money laundering texts. We developed a bilingual English-to-Spanish unidirectional corpus, which we uploaded to Sketch Engine for analysis. We then analyze and discuss translation techniques from English to Spanish and terminological findings. We found instances of intensifying metaphors from the source to target texts and of adding or inserting metaphorical expressions in the target text where none were present in the source. We also found an ideological presence in translated expressions, consistent with other investigations involving security discourse. Finally, we found terminological inconsistencies in the metaphors for money laundering, tax haven, and shell company. We suggest practical implications for translators and stakeholders in the anti-money laundering discipline. We also provide pedagogical applications based on custom-building corpora and teaching the translation of metaphors in the specialized language of financial regulation and compliance. Developing specialized corpora and learning to use corpus-based translation analysis software will help translation students be better prepared and will improve the future of translation studies and their applications in specialized areas and beyond. Providing students with experience using linguistic analysis software will also help build critical technology skills that they will be able to apply across disciplines in the humanities and beyond, such as intelligence analysis and computer science.

    Multilingual sentiment analysis in social media.

    252 p. This thesis addresses the task of analysing sentiment in messages coming from social media. The ultimate goal was to develop a Sentiment Analysis system for Basque. However, because of the socio-linguistic reality of the Basque language, a tool providing only analysis for Basque would not be enough for a real-world application. Thus, we set out to develop a multilingual system, including Basque, English, French and Spanish. The thesis addresses the following challenges to build such a system:
    - Analysing methods for creating sentiment lexicons suitable for less-resourced languages.
    - Analysis of social media (specifically Twitter): tweets pose several challenges for understanding and extracting opinions from such messages. Language identification and microtext normalization are addressed.
    - Researching the state of the art in polarity classification, and developing a supervised classifier that is tested against well-known social media benchmarks.
    - Developing a social media monitor capable of analysing sentiment with respect to specific events, products or organizations.

    Using Comparable Corpora to Augment Statistical Machine Translation Models in Low Resource Settings

    Previously, statistical machine translation (SMT) models have been estimated from parallel corpora, or pairs of translated sentences. In this thesis, we directly incorporate comparable corpora into the estimation of end-to-end SMT models. In contrast to parallel corpora, comparable corpora are pairs of monolingual corpora that have some cross-lingual similarities, for example topic or publication date, but that do not necessarily contain any direct translations. Comparable corpora are more readily available in large quantities than parallel corpora, which require significant human effort to compile. We use comparable corpora to estimate machine translation model parameters and show that doing so improves performance in settings where a limited amount of parallel data is available for training. The major contributions of this thesis are the following:
    * We release ‘language packs’ for 151 human languages, which include bilingual dictionaries, comparable corpora of Wikipedia document pairs, comparable corpora of time-stamped news text that we harvested from the web, and, for non-Roman script languages, dictionaries of name pairs, which are likely to be transliterations.
    * We present a novel technique for using a small number of example word translations to learn a supervised model for bilingual lexicon induction which takes advantage of a wide variety of signals of translation equivalence that can be estimated over comparable corpora.
    * We show that using comparable corpora to induce new translations and estimate new phrase table feature functions improves end-to-end statistical machine translation performance for low resource language pairs as well as domains.
    * We present a novel algorithm for composing multiword phrase translations from multiple unigram translations and then use comparable corpora to prune the large space of hypothesis translations. We show that these induced phrase translations improve machine translation performance beyond that of component unigrams.
    This thesis focuses on critical low resource machine translation settings, where insufficient parallel corpora exist for training statistical models. We experiment with both low resource language pairs and low resource domains of text. We present results from our novel error analysis methodology, which show that most translation errors in low resource settings are due to unseen source language words and phrases and unseen target language translations. We also find room for fixing errors due to how different translations are weighted, or scored, in the models. We target both error types; we use comparable corpora to induce new word and phrase translations and estimate novel translation feature scores. Our experiments show that augmenting baseline SMT systems with new translations and features estimated over comparable corpora improves translation performance significantly. Additionally, our techniques expand the applicability of statistical machine translation to those language pairs for which zero parallel text is available.

    LextPT: A reliable and efficient vocabulary size test for L2 Portuguese proficiency

    Vocabulary size has been repeatedly shown to be a good indicator of second language (L2) proficiency. Among the many existing vocabulary tests, the LexTALE test and its equivalents are growing in popularity since they provide a rapid (within 5 minutes) and objective way to assess L2 proficiency in several languages (English, French, Spanish, Chinese, and Italian) in experimental research. In this study, expanding on the standard procedure of test construction in previous LexTALE tests, we develop a vocabulary size test for L2 Portuguese proficiency: LextPT. The selected lexical items fall in the same frequency interval in European and Brazilian Portuguese, so that LextPT accommodates both varieties. A large-scale validation study with 452 L2 learners of Portuguese shows that LextPT is not only a sound and effective instrument to measure L2 lexical knowledge and indicate proficiency in both European and Brazilian Portuguese, but is also appropriate for learners with different L1 backgrounds (e.g. Chinese, Germanic, Romance, Slavic). The construction of LextPT, apart from joining the effort to provide a standardised assessment of L2 proficiency across languages, shows that the LexTALE tests can be extended to cover different varieties of a language, and that they are applicable to bilinguals with different linguistic experience.