4,522 research outputs found

    Reanalyzing language expectations: Native language knowledge modulates the sensitivity to intervening cues during anticipatory processing

    Issue Online: 21 September 2018
    We investigated how native language experience shapes anticipatory language processing. Two groups of bilinguals (either Spanish or Basque natives) performed a word matching task (WordMT) and a picture matching task (PictureMT). They indicated whether the stimuli they visually perceived matched the noun they heard. Spanish noun endings were either diagnostic of the gender (transparent) or ambiguous (opaque). ERPs were time-locked to an intervening gender-marked determiner preceding the predicted noun. The determiner always agreed in gender with the following noun but could also introduce a mismatching noun, so that it was not fully task diagnostic. Evoked brain activity time-locked to the determiner was considered as reflecting updating/reanalysis of the task-relevant preactivated representation. We focused on the timing of this effect by estimating the comparison between a gender-congruent and a gender-incongruent determiner. In the WordMT, both groups showed a late N400 effect. Crucially, only Basque natives displayed an earlier P200 effect for determiners preceding transparent nouns. In the PictureMT, both groups showed an early P200 effect for determiners preceding opaque nouns. The determiners of transparent nouns triggered a negative effect at ~430 ms in Spanish natives, but at ~550 ms in Basque natives. This pattern of results supports a "retracing hypothesis" according to which the neurocognitive system navigates through the intermediate (sublexical and lexical) linguistic representations available from previous processing to evaluate the need for an update in the linguistic expectation concerning a target lexical item.
    Spanish Ministry of Economy and Competitiveness (MINECO), Agencia Estatal de Investigación (AEI), Fondo Europeo de Desarrollo Regional (FEDER) (grant PSI2015‐65694‐P to N. M.), Spanish Ministry of Economy and Competitiveness “Severo Ochoa” Programme for Centres/Units of Excellence in R&D (grant SEV‐2015‐490

    Mimicking Word Embeddings using Subword RNNs

    Word embeddings improve generalization over lexical features by placing each word in a lower-dimensional space, using distributional information obtained from unlabeled data. However, the effectiveness of word embeddings for downstream NLP tasks is limited by out-of-vocabulary (OOV) words, for which embeddings do not exist. In this paper, we present MIMICK, an approach to generating OOV word embeddings compositionally, by learning a function from spellings to distributional embeddings. Unlike prior work, MIMICK does not require re-training on the original word embedding corpus; instead, learning is performed at the type level. Intrinsic and extrinsic evaluations demonstrate the power of this simple approach. On 23 languages, MIMICK improves performance over a word-based baseline for tagging part-of-speech and morphosyntactic attributes. It is competitive with (and complementary to) a supervised character-based model in low-resource settings.
    Comment: EMNLP 201

    Hybrid data-driven models of machine translation

    Corpus-based approaches to Machine Translation (MT) dominate the MT research field today, with Example-Based MT (EBMT) and Statistical MT (SMT) representing two different frameworks within the data-driven paradigm. EBMT has always made use of both phrasal and lexical correspondences to produce high-quality translations. Early SMT models, on the other hand, were based on word-level correspondences, but with the advent of more sophisticated phrase-based approaches, the line between EBMT and SMT has become increasingly blurred. In this thesis we carry out a number of translation experiments comparing the performance of the state-of-the-art marker-based EBMT system of Gough and Way (2004a, 2004b), Way and Gough (2005) and Gough (2005) against a phrase-based SMT (PBSMT) system built using the state-of-the-art PHARAOH phrase-based decoder (Koehn, 2004a) and employing standard phrasal extraction heuristics (Koehn et al., 2003). In addition, we describe experiments investigating the possibility of combining elements of EBMT and SMT in order to create a hybrid data-driven model of MT capable of outperforming either approach from which it is derived. Making use of training and testing data taken from a French-English translation memory of Sun Microsystems computer documentation, we find that while better results are seen when the PBSMT system is seeded with GIZA++ word- and phrase-based data compared to EBMT marker-based sub-sentential alignments, in general improvements are obtained when combinations of this 'hybrid' data are used to construct the translation and probability models. While for the most part the baseline marker-based EBMT system outperforms any flavour of the PBSMT systems constructed in these experiments, combining the data sets automatically induced by both GIZA++ and the EBMT system leads to a hybrid system which improves on the EBMT system per se for French-English.
    On a different data set, taken from the Europarl corpus (Koehn, 2005), we perform a number of experiments making use of incremental training data sizes of 78K, 156K and 322K sentence pairs. On this data set, we show that similar gains are to be had from constructing a hybrid 'statistical EBMT' system capable of outperforming the baseline EBMT system. This time around, although all 'hybrid' variants of the EBMT system fall short of the quality achieved by the baseline PBSMT system, merging elements of the marker-based and SMT data, as in the Sun Microsystems experiments, to create a hybrid 'example-based SMT' system, outperforms the baseline SMT and EBMT systems from which it is derived. Furthermore, we provide further evidence in favour of hybrid data-driven approaches by adding an SMT target language model to all EBMT system variants and demonstrate that this too has a positive effect on translation quality. Following on from these findings we present a new hybrid data-driven MT architecture, together with a novel marker-based decoder which improves upon the performance of the marker-based EBMT system of Gough and Way (2004a, 2004b), Way and Gough (2005) and Gough (2005), and compares favourably with the state-of-the-art PHARAOH SMT decoder (Koehn, 2004a)

    Linguistic knowledge-based vocabularies for Neural Machine Translation

    This article has been published in a revised form in Natural Language Engineering https://doi.org/10.1017/S1351324920000364. This version is free to view and download for private research and study only. Not for re-distribution, re-sale or use in derivative works. © Cambridge University Press
    Neural Networks applied to Machine Translation need a finite vocabulary to express textual information as a sequence of discrete tokens. The currently dominant subword vocabularies exploit statistically-discovered common parts of words to achieve the flexibility of character-based vocabularies without delegating the whole learning of word formation to the neural network. However, they trade this for the inability to apply word-level token associations, which limits their use in semantically-rich areas, prevents some transfer learning approaches (e.g. cross-lingual pretrained embeddings), and reduces their interpretability. In this work, we propose new hybrid linguistically-grounded vocabulary definition strategies that keep both the advantages of subword vocabularies and the word-level associations, enabling neural networks to profit from the derived benefits. We test the proposed approaches in both morphologically rich and poor languages, showing that, for the former, the quality in the translation of out-of-domain texts is improved with respect to a strong subword baseline.
    This work is partially supported by Lucy Software / United Language Group (ULG) and the Catalan Agency for Management of University and Research Grants (AGAUR) through an Industrial PhD Grant. This work is also supported in part by the Spanish Ministerio de Economía y Competitividad, the European Regional Development Fund and the Agencia Estatal de Investigación, through the postdoctoral senior grant Ramón y Cajal, contract TEC2015-69266-P (MINECO/FEDER,EU) and contract PCIN-2017-079 (AEI/MINECO).
    Peer Reviewed
    Postprint (author's final draft

    Multilingual audio information management system based on semantic knowledge in complex environments

    This paper proposes a multilingual audio information management system based on semantic knowledge in complex environments. The complex environment is defined by the limited resources (financial, material, human, and audio resources); the poor quality of the audio signal taken from an internet radio channel; the multilingual context (Spanish, French, and Basque, which is under-resourced in some areas); and the regular appearance of cross-lingual elements between the three languages. In addition, the system is constrained by the requirements of the local multilingual industrial sector. We present the first evolutionary system based on a scalable architecture that is able to fulfill these specifications with automatic adaptation based on automatic semantic speech recognition, folksonomies, automatic configuration selection, machine learning, neural computing methodologies, and collaborative networks. As a result, the initial goals have been accomplished and the usability of the final application has been tested successfully, even with non-experienced users.
    This work is being funded by Grants: TEC201677791-C4 from Plan Nacional de I+D+i, Ministry of Economic Affairs and Competitiveness of Spain and from the DomusVi Foundation Kms para recorder, the Basque Government (ELKARTEK KK-2018/00114, GEJ IT1189-19), the Government of Gipuzkoa (DG18/14 DG17/16), UPV/EHU (GIU19/090), COST ACTION (CA18106, CA15225)

    Lexical access in bimodal bilinguals

    175 p.
    This thesis investigates the impact of language modality (auditory-oral in spoken languages, visual-gestural in signed languages) through the role that sub-lexical units play in lexical access in Spanish (observing the coactivation of the initial syllable and of the rhyme of words) and in Spanish Sign Language (LSE) (studying the coactivation of the handshape and of the location of signs). Several visual world paradigm experiments were carried out while recording eye movements. Two groups of hearing bilinguals in Spanish and LSE (28 native signers and 28 signers who learned LSE in adulthood) completed two within-language experiments in Spanish and LSE (coactivation of a language from stimuli in that same language) and two cross-language experiments (parallel activation of Spanish from LSE and vice versa). A group of Spanish-Basque bilinguals also completed two cross-language experiments. The results of this study help to identify, on the one hand, the aspects of language processing that are conditioned by the presence of the linguistic signal (words that are heard or signs that are seen) and, on the other, the aspects related to the intrinsic properties of each language.
    Basque Center on Cognition, Brain and Languag

    Semisupervised Speech Data Extraction from Basque Parliament Sessions and Validation on Fully Bilingual Basque–Spanish ASR

    In this paper, a semisupervised speech data extraction method is presented and applied to create a new dataset designed for the development of fully bilingual Automatic Speech Recognition (ASR) systems for Basque and Spanish. The dataset is drawn from an extensive collection of Basque Parliament plenary sessions containing frequent code-switching. Since session minutes are not exact, only the most reliable speech segments are kept for training. To that end, we use phonetic similarity scores between nominal and recognized phone sequences. The process starts with baseline acoustic models trained on generic out-of-domain data, then iteratively updates the models with the extracted data and applies the updated models to refine the training dataset until the observed improvement between two iterations becomes small enough. A development dataset, involving five plenary sessions not used for training, has been manually audited for tuning and evaluation purposes. Cross-validation experiments (with 20 random partitions) have been carried out on the development dataset, using the baseline and the iteratively updated models. On average, Word Error Rate (WER) drops from 16.57% (baseline) to 4.41% (first iteration) and further to 4.02% (second iteration), which corresponds to relative WER reductions of 73.4% and 8.8%, respectively. When considering only Basque segments, WER drops on average from 16.57% (baseline) to 5.51% (first iteration) and further to 5.13% (second iteration), which corresponds to relative WER reductions of 66.7% and 6.9%, respectively.
    As a result of this work, a new bilingual Basque–Spanish resource has been produced based on Basque Parliament sessions, including 998 h of training data (audio segments + transcriptions), a development set (17 h long) designed for tuning and evaluation under a cross-validation scheme, and a fully bilingual trigram language model.
    This work was partially funded by the Spanish Ministry of Science and Innovation (OPEN-SPEECH project, PID2019-106424RB-I00) and by the Basque Government under the general support program to research groups (IT-1704-22)
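
The relative WER reductions quoted in this abstract follow from the absolute WER figures. The short sketch below reproduces that arithmetic; the function name `relative_reduction` and the dictionary layout are illustrative choices, not part of the paper, and only the WER values come from the abstract.

```python
def relative_reduction(before: float, after: float) -> float:
    """Relative reduction, in percent, when an error rate drops from `before` to `after`."""
    return 100.0 * (before - after) / before


# WER values (in %) as reported in the abstract: baseline, first iteration, second iteration.
wer = {
    "all segments": (16.57, 4.41, 4.02),
    "Basque only": (16.57, 5.51, 5.13),
}

for name, (baseline, first, second) in wer.items():
    # Each step is measured relative to the WER of the preceding model.
    step1 = relative_reduction(baseline, first)
    step2 = relative_reduction(first, second)
    print(f"{name}: {step1:.1f}% (baseline -> 1st), {step2:.1f}% (1st -> 2nd)")
```

Each reduction is computed against the preceding model's WER, which is why the second-iteration gain looks small in relative terms even though the model keeps improving.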

    The Role of Orthotactics in Language Switching: An ERP Investigation Using Masked Language Priming

    It is commonly accepted that bilinguals access lexical representations from their two languages during language comprehension, even when they operate in a single language context. Language detection mechanisms are, thus, hypothesized to operate after the stage of lexical access during visual word recognition. However, recent studies showed reduced cross-language activation when sub-lexical properties of words are specific to one of the bilingual’s two languages, hinting at the fact that language selection may start before the stage of lexical access. Here, we tested highly fluent Spanish–Basque and Spanish–English bilinguals in a masked language priming paradigm in which first language (L1) target words are primed by unconsciously perceived L1 or second language (L2) words. Critically, L2 primes were either orthotactically legal or illegal in L1. Results showed automatic language detection effects only for orthotactically marked L2 primes and within the timeframe of the N250, an index of sub-lexical-to-lexical integration. Marked L2 primes also affected the processing of L1 targets at the stage of conceptual processing, but only in bilinguals whose languages are transparent. We conclude that automatic and unconscious language detection mechanisms can operate at sub-lexical levels of processing. In the absence of sub-lexical language cues, unconsciously perceived primes in the irrelevant language might not automatically trigger post-lexical language identification, thereby resulting in the lack of observable language switching effects

    Multilingual sentiment analysis in social media.

    252 p.
    This thesis addresses the task of analysing sentiment in messages coming from social media. The ultimate goal was to develop a Sentiment Analysis system for Basque. However, because of the socio-linguistic reality of the Basque language, a tool providing analysis only for Basque would not be enough for a real-world application. Thus, we set out to develop a multilingual system, including Basque, English, French and Spanish.
    The thesis addresses the following challenges to build such a system:
    - Analysing methods for creating sentiment lexicons suitable for less-resourced languages.
    - Analysis of social media (specifically Twitter): tweets pose several challenges for understanding and extracting opinions from such messages. Language identification and microtext normalization are addressed.
    - Researching the state of the art in polarity classification, and developing a supervised classifier that is tested against well-known social media benchmarks.
    - Developing a social media monitor capable of analysing sentiment with respect to specific events, products or organizations

    EUSMT: incorporating linguistic information to SMT for a morphologically rich language. Its use in SMT-RBMT-EBMT hybridation

    148 p.: graf.
    This thesis is defined in the framework of machine translation for Basque. Having developed a Rule-Based Machine Translation (RBMT) system for Basque in the IXA group (Mayor, 2007), we decided to tackle the Statistical Machine Translation (SMT) approach and experiment on how we could adapt it to the peculiarities of the Basque language. First, we analyzed the impact of the agglutinative nature of Basque and the best way to deal with it. In order to deal with the problems presented above, we have split up Basque words into the lemma and some tags which represent the morphological information expressed by the inflection. By dividing each Basque word in this way, we aim to reduce the sparseness produced by the agglutinative nature of Basque and the small amount of training data. Similarly, we also studied the differences in word order between Spanish and Basque, examining different techniques for dealing with them. We confirm the weakness of basic SMT in dealing with great word order differences between the source and target languages. Distance-based reordering, which is the technique used by the baseline system, does not have enough information to properly handle great word order differences, so any of the techniques tested in this work (based on both statistics and manually generated rules) outperforms the baseline. Once we had obtained a more accurate SMT system, we started the first attempts to combine different MT systems into a hybrid one that would allow us to get the best of the different paradigms. The hybridization attempts carried out in this PhD dissertation are preliminary, but, even so, this work can help us to determine the next steps.
    Carried out with a Basque Government researcher-training grant (BFI05.326)