LUX-ASR: Building an ASR system for the Luxembourgish language
We present a first system for automatic speech recognition
(ASR) for the low-resource language Luxembourgish. By
applying transfer learning, we were able to fine-tune Meta's
wav2vec2-xls-r-300m checkpoint with 35 hours of labeled
Luxembourgish speech data. The best word error rate achieved is 14.47%.
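The word error rate (WER) reported above is the word-level edit distance between the system's transcript and a reference transcript, divided by the reference length. A minimal sketch, independent of the paper's evaluation code:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deleting i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # inserting j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution over four reference words:
print(wer("moien wei geet et", "moien wei get et"))  # -> 0.25
```

A WER of 14.47% thus means roughly one word in seven is substituted, deleted, or inserted relative to the reference.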
Transfer to a Low-Resource Language via Close Relatives: The Case Study on Faroese
Multilingual language models have pushed state-of-the-art in cross-lingual
NLP transfer. Most zero-shot cross-lingual transfer approaches, however, use
one and the same massively multilingual transformer (e.g., mBERT or XLM-R) to
transfer to all target languages, irrespective of their typological,
etymological, and phylogenetic relations to other languages. In particular,
readily available data and models of resource-rich sibling languages are often
ignored. In this work, we empirically show, in a case study for Faroese -- a
low-resource language from a high-resource language family -- that by
leveraging the phylogenetic information and departing from the
'one-size-fits-all' paradigm, one can improve cross-lingual transfer to
low-resource languages. In particular, we leverage abundant resources of other
Scandinavian languages (i.e., Danish, Norwegian, Swedish, and Icelandic) for
the benefit of Faroese. Our evaluation results show that we can substantially
improve the transfer performance to Faroese by exploiting data and models of
closely-related high-resource languages. Further, we release a new web corpus
of Faroese and Faroese datasets for named entity recognition (NER) and semantic
textual similarity (STS), as well as new language models trained on all
Scandinavian languages.
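The idea of preferring close relatives over a one-size-fits-all source can be illustrated with a toy relatedness proxy: rank candidate source languages by surface vocabulary overlap with the target. The word lists and the Jaccard heuristic below are illustrative assumptions, not the paper's actual method:

```python
def lexical_overlap(target_vocab: set, source_vocab: set) -> float:
    """Jaccard overlap between two vocabularies: a crude relatedness proxy."""
    return len(target_vocab & source_vocab) / len(target_vocab | source_vocab)

# Toy vocabularies (illustrative, not real corpus statistics):
faroese = {"hundur", "hus", "bok", "vatn", "dagur"}
candidates = {
    "icelandic": {"hundur", "hus", "bok", "vatn", "dagur"},  # closest sibling
    "danish":    {"hund", "hus", "bog", "vand", "dag"},
    "english":   {"dog", "house", "book", "water", "day"},
}

# Rank candidate source languages by overlap with the target:
ranking = sorted(candidates,
                 key=lambda lang: lexical_overlap(faroese, candidates[lang]),
                 reverse=True)
print(ranking)  # -> ['icelandic', 'danish', 'english']
```

In this toy setting the phylogenetically closest language (Icelandic) ranks first, mirroring the paper's finding that close relatives make better transfer sources.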
SwissBERT: The Multilingual Language Model for Switzerland
We present SwissBERT, a masked language model created specifically for processing Switzerland-related text. SwissBERT is a pre-trained model that we adapted to news articles written in the national languages of Switzerland -- German, French, Italian, and Romansh. We evaluate SwissBERT on natural language understanding tasks related to Switzerland and find that it tends to outperform previous models on these tasks, especially when processing contemporary news and/or Romansh Grischun. Since SwissBERT uses language adapters, it may be extended to Swiss German dialects in future work. The model and our open-source code are publicly released at https://github.com/ZurichNLP/swissbert
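The language adapters mentioned above are typically small bottleneck modules inserted per language into each transformer layer. A minimal NumPy sketch of one such module; the dimensions and initialization are illustrative, not SwissBERT's actual configuration:

```python
import numpy as np

def language_adapter(hidden, W_down, W_up):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual add.
    One such (W_down, W_up) pair would be trained per language while the
    backbone stays frozen; shapes here are toy-sized."""
    z = np.maximum(hidden @ W_down, 0.0)  # ReLU bottleneck
    return hidden + z @ W_up              # residual connection

rng = np.random.default_rng(0)
d_model, d_bottleneck = 8, 2
hidden = rng.normal(size=(1, d_model))
# Near-zero init so the adapter starts close to the identity function:
W_down = rng.normal(scale=0.01, size=(d_model, d_bottleneck))
W_up = rng.normal(scale=0.01, size=(d_bottleneck, d_model))

out = language_adapter(hidden, W_down, W_up)
print(out.shape)  # -> (1, 8)
```

Because only the small adapter weights are language-specific, adding a new language (e.g. Swiss German) means training just these modules, not the whole model.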
X-SNS: Cross-Lingual Transfer Prediction through Sub-Network Similarity
Cross-lingual transfer (XLT) is an emergent ability of multilingual language
models that preserves their performance on a task to a significant extent when
evaluated in languages that were not included in the fine-tuning process. While
English, due to its widespread usage, is typically regarded as the primary
language for model adaptation in various tasks, recent studies have revealed that
the efficacy of XLT can be amplified by selecting the most appropriate source
languages based on specific conditions. In this work, we propose the
utilization of sub-network similarity between two languages as a proxy for
predicting the compatibility of the languages in the context of XLT. Our
approach is model-oriented, better reflecting the inner workings of foundation
models. In addition, it requires only a moderate amount of raw text from
candidate languages, distinguishing it from the majority of previous methods
that rely on external resources. In experiments, we demonstrate that our method
is more effective than baselines across diverse tasks. Specifically, it shows
proficiency in ranking candidates for zero-shot XLT, achieving an improvement
of 4.6% on average in terms of NDCG@3. We also provide extensive analyses that
confirm the utility of sub-networks for XLT prediction. (Comment: Accepted to
EMNLP 2023, Findings.)
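The core idea of scoring language pairs by sub-network similarity can be sketched with toy binary masks over parameter indices. The masks and the Jaccard measure below are illustrative assumptions, not the paper's actual sub-network extraction procedure:

```python
def subnetwork_similarity(mask_a: set, mask_b: set) -> float:
    """Jaccard similarity between the sets of parameter indices that each
    language's sub-network retains. Higher similarity is taken as a proxy
    for better cross-lingual transfer compatibility."""
    return len(mask_a & mask_b) / len(mask_a | mask_b)

# Hypothetical per-language masks (indices of retained weights):
masks = {
    "de": {0, 1, 2, 5, 7},
    "nl": {0, 1, 2, 5, 9},
    "zh": {3, 4, 6, 8, 9},
}

# Rank candidate source languages for the target "nl":
target = masks["nl"]
ranking = sorted((lang for lang in masks if lang != "nl"),
                 key=lambda lang: subnetwork_similarity(target, masks[lang]),
                 reverse=True)
print(ranking)  # -> ['de', 'zh']
```

In this toy example the typologically closer language shares more of the target's sub-network and is ranked first; the paper evaluates such rankings with NDCG@3.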
Second language learners’ self-initiated topic changes during book-related activities in preschool and their impact on Luxembourgish proficiency
The present research traces the second-language learning process in Luxembourgish during book-related activities by 4- to 5-year-old preschoolers of Portuguese, Cape Verdean and Brazilian origin. With 47.2% of the preschool population being of foreign origin, the Lusophone community forms the largest group, at 24.1%. This salient, fast-growing multilingual and multicultural population learns Luxembourgish for integration and everyday interaction and hence challenges public education with its diverse and changing demands.
The present study extends second-language research in the Luxembourgish context and connects to previous investigations of topic, while taking a pragmatic stance towards topics. By foregrounding local topic management and its impact on activities that are less teacher-controlled, the study depicts second-language learning as a product of co-constructed interaction. The focus lies on the negotiation of story meaning through self-initiated topic changes during three book-related activities: joint reading, storytelling and play. The data consist of video-recorded lessons and stimulated-recall interviews with the teachers. A multi-method framework is used to investigate pupils' interaction and language-learning processes. From a quantitative point of view, the study analyses how pupils' utterance length varies with the openness of the lesson (whether self-initiated topic changes are allowed) and with the design of the book activity, led either (1) by teachers or (2) by the pupils. From a qualitative stance, a sequence-by-sequence analysis of the jointly constructed narrative identifies the interactional dynamics of the collaborative storytelling activities and the self-initiated topic changes which children draw upon to express themselves more freely.
The results show that children's utterances vary according to the activity type. Pupils produce longer utterances when they can self-initiate a topic, thereby boosting their second-language proficiency – either because the teacher withdraws or because the participation framework is open enough for them to make creative use of the language. The children also show their capability of successfully managing topic changes without the presence of the teacher while at the same time co-constructing the meaning of the story and paying attention to lexical details. The interviews reveal the teachers' astonishment at the degree of pupil participation as well as at their own pedagogical practices. Implications from the analysis are gathered in a theoretical model that links opportunities for self-initiated topic changes to language proficiency. Recommendations for more active pupil participation during book-related activities point to sense-making, joint topic negotiation and story enactment.
AudioPaLM: A Large Language Model That Can Speak and Listen
We introduce AudioPaLM, a large language model for speech understanding and
generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2
[Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified
multimodal architecture that can process and generate text and speech with
applications including speech recognition and speech-to-speech translation.
AudioPaLM inherits the capability to preserve paralinguistic information such
as speaker identity and intonation from AudioLM and the linguistic knowledge
present only in text large language models such as PaLM-2. We demonstrate that
initializing AudioPaLM with the weights of a text-only large language model
improves speech processing, successfully leveraging the larger quantity of text
training data used in pretraining to assist with the speech tasks. The
resulting model significantly outperforms existing systems for speech
translation tasks and has the ability to perform zero-shot speech-to-text
translation for many languages for which input/target language combinations
were not seen in training. AudioPaLM also demonstrates features of audio
language models, such as transferring a voice across languages based on a short
spoken prompt. We release examples of our method at
https://google-research.github.io/seanet/audiopalm/examples (Comment: Technical report.)
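The initialization described above, starting from text-only weights and extending the vocabulary with audio tokens, can be sketched as follows. Names and shapes are illustrative assumptions, not the actual AudioPaLM code:

```python
import numpy as np

def extend_embeddings(text_emb: np.ndarray, n_audio_tokens: int,
                      seed: int = 0) -> np.ndarray:
    """Copy the pretrained text-token embedding matrix unchanged and append
    freshly initialized rows for the new audio tokens, so the model starts
    from the text LLM's linguistic knowledge (illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    d_model = text_emb.shape[1]
    audio_emb = rng.normal(scale=0.02, size=(n_audio_tokens, d_model))
    return np.concatenate([text_emb, audio_emb], axis=0)

# Stand-in for a pretrained text embedding matrix (100 tokens, width 16):
text_emb = np.zeros((100, 16))
full_emb = extend_embeddings(text_emb, n_audio_tokens=32)
print(full_emb.shape)  # -> (132, 16)
```

The rest of the transformer keeps the text model's weights; only the rows for the new audio vocabulary start from scratch.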
Acoustic Modelling for Under-Resourced Languages
Automatic speech recognition systems have so far been developed for only a very few of the roughly 4,000-7,000 existing languages.
In this thesis we examine methods to rapidly create acoustic models for new, possibly under-resourced languages in a time- and cost-effective manner. To this end, we study the use of multilingual models, the application of articulatory features across languages, and the automatic discovery of word-like units in unwritten languages.
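The cross-language use of articulatory features can be illustrated with a toy phone inventory: a phone missing from a new language's trained models is mapped to the closest available phone by articulatory-feature distance. The feature vectors below are simplified assumptions, not a real phonological feature set:

```python
# Toy articulatory feature vectors: (voiced, nasal, labial).
# Real systems use much richer feature inventories.
FEATURES = {
    "p": (0, 0, 1),
    "b": (1, 0, 1),
    "m": (1, 1, 1),
    "t": (0, 0, 0),
    "d": (1, 0, 0),
}

def nearest_phone(target_features, inventory):
    """Map a phone absent from a new language onto the closest phone in
    that language's inventory, by L1 distance in feature space
    (ties broken by inventory order)."""
    def dist(phone):
        return sum(abs(a - b) for a, b in zip(FEATURES[phone], target_features))
    return min(inventory, key=dist)

# A language lacking a /d/ model borrows the closest available one:
print(nearest_phone(FEATURES["d"], ["t", "m"]))  # -> 't'
```

Because the features are language-independent, models trained in resource-rich languages can bootstrap acoustic models for phones never seen in the target language's data.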