146 research outputs found

    LUX-ASR: Building an ASR system for the Luxembourgish language

    We present a first system for automatic speech recognition (ASR) for the low-resource language Luxembourgish. By applying transfer learning, we were able to fine-tune Meta's wav2vec2-xls-r-300m checkpoint with 35 hours of labeled Luxembourgish speech data. The best word error rate achieved is 14.47%.
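Word error rate (WER), the metric reported above, is the word-level edit distance between a reference transcript and the system's hypothesis, divided by the reference length. A minimal reference implementation for illustration (not the LUX-ASR evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: Levenshtein distance over words / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitute = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            delete = d[i - 1][j] + 1
            insert = d[i][j - 1] + 1
            d[i][j] = min(substitute, delete, insert)
    return d[len(ref)][len(hyp)] / len(ref)
```

A WER of 14.47% thus means roughly one word-level error (substitution, deletion, or insertion) per seven reference words.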

    Transfer to a Low-Resource Language via Close Relatives: The Case Study on Faroese

    Multilingual language models have pushed the state of the art in cross-lingual NLP transfer. Most zero-shot cross-lingual transfer work, however, uses one and the same massively multilingual transformer (e.g., mBERT or XLM-R) to transfer to all target languages, irrespective of their typological, etymological, and phylogenetic relations to other languages. In particular, readily available data and models of resource-rich sibling languages are often ignored. In this work, we empirically show, in a case study for Faroese -- a low-resource language from a high-resource language family -- that by leveraging phylogenetic information and departing from the 'one-size-fits-all' paradigm, one can improve cross-lingual transfer to low-resource languages. In particular, we leverage the abundant resources of other Scandinavian languages (i.e., Danish, Norwegian, Swedish, and Icelandic) for the benefit of Faroese. Our evaluation results show that we can substantially improve transfer performance to Faroese by exploiting data and models of closely related high-resource languages. Further, we release a new web corpus of Faroese, Faroese datasets for named entity recognition (NER) and semantic text similarity (STS), and new language models trained on all Scandinavian languages.

    SwissBERT: The Multilingual Language Model for Switzerland

    We present SwissBERT, a masked language model created specifically for processing Switzerland-related text. SwissBERT is a pre-trained model that we adapted to news articles written in the national languages of Switzerland -- German, French, Italian, and Romansh. We evaluate SwissBERT on natural language understanding tasks related to Switzerland and find that it tends to outperform previous models on these tasks, especially when processing contemporary news and/or Romansh Grischun. Since SwissBERT uses language adapters, it may be extended to Swiss German dialects in future work. The model and our open-source code are publicly released at https://github.com/ZurichNLP/swissbert.
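The language adapters mentioned above are small per-language bottleneck modules inserted into an otherwise shared transformer, so adding a new language only means training a new adapter. A toy sketch of the routing idea in plain Python, with made-up dimensions, random weights, and hypothetical language codes rather than SwissBERT's actual modules:

```python
import random

def matvec(w, x):
    """Multiply matrix w (list of rows) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

class LanguageAdapter:
    """Bottleneck adapter: down-project, ReLU, up-project, residual add."""
    def __init__(self, dim, bottleneck, seed):
        rng = random.Random(seed)
        self.down = [[rng.gauss(0, 0.02) for _ in range(dim)] for _ in range(bottleneck)]
        self.up = [[rng.gauss(0, 0.02) for _ in range(bottleneck)] for _ in range(dim)]

    def __call__(self, hidden):
        z = [max(0.0, v) for v in matvec(self.down, hidden)]  # ReLU
        return [h + u for h, u in zip(hidden, matvec(self.up, z))]  # residual

# One adapter per language; the transformer body (not shown) is shared.
adapters = {lang: LanguageAdapter(dim=8, bottleneck=2, seed=i)
            for i, lang in enumerate(["de_CH", "fr_CH", "it_CH", "rm_CH"])}

hidden = [0.1 * i for i in range(8)]       # stand-in for a hidden state
routed = adapters["rm_CH"](hidden)         # route through the Romansh adapter
```

Because of the residual connection, an untrained (zero-weight) adapter is an identity function, which is what makes adding adapters for new languages, such as Swiss German dialects, a non-destructive extension.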

    X-SNS: Cross-Lingual Transfer Prediction through Sub-Network Similarity

    Cross-lingual transfer (XLT) is an emergent ability of multilingual language models that preserves their performance on a task to a significant extent when evaluated in languages that were not included in the fine-tuning process. While English, due to its widespread usage, is typically regarded as the primary language for model adaptation in various tasks, recent studies have revealed that the efficacy of XLT can be amplified by selecting the most appropriate source languages based on specific conditions. In this work, we propose the utilization of sub-network similarity between two languages as a proxy for predicting the compatibility of the languages in the context of XLT. Our approach is model-oriented, better reflecting the inner workings of foundation models. In addition, it requires only a moderate amount of raw text from candidate languages, distinguishing it from the majority of previous methods that rely on external resources. In experiments, we demonstrate that our method is more effective than baselines across diverse tasks. Specifically, it shows proficiency in ranking candidates for zero-shot XLT, achieving an improvement of 4.6% on average in terms of NDCG@3. We also provide extensive analyses that confirm the utility of sub-networks for XLT prediction. (Accepted to EMNLP 2023, Findings.)
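The core idea, scoring each candidate source language by how much its task-relevant sub-network overlaps with the target's, can be sketched in a few lines. This toy version uses hypothetical per-parameter importance scores, top-k masks, and Jaccard overlap; the paper's actual scoring and importance estimation may differ:

```python
def subnetwork(importance: dict, k: int) -> frozenset:
    """Top-k parameters by importance = the language's sub-network mask."""
    return frozenset(sorted(importance, key=importance.get, reverse=True)[:k])

def jaccard(a: frozenset, b: frozenset) -> float:
    """Overlap of two parameter sets, in [0, 1]."""
    return len(a & b) / len(a | b)

# Hypothetical per-parameter importance scores (e.g. squared gradients
# on raw text) for a target language and three candidate sources.
target = {"p1": 0.9, "p2": 0.8, "p3": 0.1, "p4": 0.05}
candidates = {
    "da": {"p1": 0.7, "p2": 0.6, "p3": 0.2, "p4": 0.1},
    "is": {"p1": 0.8, "p3": 0.7, "p2": 0.1, "p4": 0.1},
    "tr": {"p3": 0.9, "p4": 0.8, "p1": 0.1, "p2": 0.1},
}

k = 2
t_mask = subnetwork(target, k)
# Rank candidate source languages by sub-network overlap with the target.
ranking = sorted(candidates,
                 key=lambda lang: jaccard(t_mask, subnetwork(candidates[lang], k)),
                 reverse=True)
```

A ranking produced this way is what a metric like NDCG@3 would then score against the empirically best source languages.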

    Second language learners’ self-initiated topic changes during book-related activities in preschool and their impact on Luxembourgish proficiency

    The present research traces the second-language learning process in Luxembourgish during book-related activities by 4- to 5-year-old preschoolers of Portuguese, Cape Verdean, and Brazilian origin. With 47.2% of the preschool population being of foreign origin, the Lusophone community forms the largest group at 24.1%. This salient, fast-growing multilingual and multicultural population learns Luxembourgish for integration and everyday interaction and hence challenges public education with its diverse and changing demands. The present study extends second-language research in the Luxembourgish context and connects to previous investigations of topic management, but takes a pragmatic stance towards topics. By foregrounding local topic management and its impact on activities that are less teacher-controlled, the study pictures second-language learning as a product of co-constructed interaction. The focus lies on the negotiation of story meaning through self-initiated topic changes during three book-related activities: joint reading, storytelling, and play. The data consist of video-recorded lessons and stimulated-recall interviews with the teachers. A multi-method framework is used to investigate pupils' interaction and language-learning processes. From a quantitative point of view, the study analyses how pupils' utterance length varies with the openness of the lesson to self-initiated topic changes and with the design of the book activity as led (1) by teachers or (2) by the pupils. From a qualitative stance, a sequence-by-sequence analysis of the jointly constructed narrative identifies the interactional dynamics of the collaborative storytelling activities and the self-initiated topic changes which children draw upon to express themselves more freely. The results show that children's utterances vary according to the activity type.
Pupils produce longer utterances when they can self-initiate a topic, thereby boosting their second-language proficiency, either because the teacher is withdrawing or because the participation framework is open enough for them to make creative use of the language. The children also show that they can successfully manage topic changes without the presence of the teacher while co-constructing the meaning of the story and paying attention to lexical details. The interviews reveal the teachers' astonishment at the degree of pupil participation as well as at their own pedagogical practices. Implications from the analysis are gathered in a theoretical model that links opportunities for self-initiated topic changes to language proficiency. Recommendations for more active pupil participation during book-related activities point to sense-making, joint topic negotiation, and story enactment.

    AudioPaLM: A Large Language Model That Can Speak and Listen

    We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech, with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits the capability to preserve paralinguistic information such as speaker identity and intonation from AudioLM, and the linguistic knowledge present only in text large language models such as PaLM-2. We demonstrate that initializing AudioPaLM with the weights of a text-only large language model improves speech processing, successfully leveraging the larger quantity of text training data used in pretraining to assist with the speech tasks. The resulting model significantly outperforms existing systems for speech translation tasks and has the ability to perform zero-shot speech-to-text translation for many languages for which input/target language combinations were not seen in training. AudioPaLM also demonstrates features of audio language models, such as transferring a voice across languages based on a short spoken prompt. We release examples of our method at https://google-research.github.io/seanet/audiopalm/examples. (Technical report.)

    Acoustic Modelling for Under-Resourced Languages

    Automatic speech recognition systems have so far been developed for only very few of the 4,000-7,000 existing languages. In this thesis we examine methods to rapidly create acoustic models for new, possibly under-resourced languages in a time- and cost-effective manner. To this end we examine the use of multilingual models, the application of articulatory features across languages, and the automatic discovery of word-like units in unwritten languages.