13 research outputs found

    Meeting the Needs of Low-Resource Languages: The Value of Automatic Alignments via Pretrained Models

    Large multilingual models have inspired a new class of word alignment methods, which work well for the model's pretraining languages. However, the languages most in need of automatic alignment are low-resource and, thus, not typically included in the pretraining data. In this work, we ask: How do modern aligners perform on unseen languages, and are they better than traditional methods? We contribute gold-standard alignments for Bribri--Spanish, Guarani--Spanish, Quechua--Spanish, and Shipibo-Konibo--Spanish. With these, we evaluate state-of-the-art aligners with and without model adaptation to the target language. Finally, we also evaluate the resulting alignments extrinsically through two downstream tasks: named entity recognition and part-of-speech tagging. We find that although transformer-based methods generally outperform traditional models, the two classes of approach remain competitive with each other. Comment: EACL 202

    AmericasNLI: Machine translation and natural language inference systems for Indigenous languages of the Americas

    Little attention has been paid to the development of human language technology for truly low-resource languages—i.e., languages with limited amounts of digitally available text data, such as Indigenous languages. However, it has been shown that pretrained multilingual models are able to perform crosslingual transfer in a zero-shot setting even for low-resource languages which are unseen during pretraining. Yet, prior work evaluating performance on unseen languages has largely been limited to shallow token-level tasks. It remains unclear if zero-shot learning of deeper semantic tasks is possible for unseen languages. To explore this question, we present AmericasNLI, a natural language inference dataset covering 10 Indigenous languages of the Americas. We conduct experiments with pretrained models, exploring zero-shot learning in combination with model adaptation. Furthermore, as AmericasNLI is a multiway parallel dataset, we use it to benchmark the performance of different machine translation models for those languages. Finally, using a standard transformer model, we explore translation-based approaches for natural language inference. We find that the zero-shot performance of pretrained models without adaptation is poor for all languages in AmericasNLI, but model adaptation via continued pretraining results in improvements. All machine translation models are rather weak, but, surprisingly, translation-based approaches to natural language inference outperform all other models on that task

    Tercera Conferencia de Creative Commons en América Latina

    This work and all of its contents are licensed under the Creative Commons Attribution 3.0 Unported license. This compilation brings together the histories and current status of the Creative Commons chapters in Latin America. Organized by Bienes Comunes A.C., the Third Creative Commons Conference in Latin America (Buenos Aires 2010) was an excellent opportunity to invite the leaders of the local chapters and their respective institutions to collaboratively write our shared regional history. The generous response of each of them and the funding received (Catalyst Grant) made this work possible. The compilation consists of ten chapters that, ordered alphabetically, describe the history of each chapter, its ways of working, relationships with communities, projects, and next steps. It includes the experiences of Argentina, Brazil, Chile, Colombia, Costa Rica, Ecuador, Guatemala, Mexico, Peru, and Puerto Rico. Bienes Comunes A.C., Fundación Sociedades Digitales, Creative Commons, Universidad de Costa Rica, others... UCR::Vicerrectoría de Investigació

    Characteristics Shared by the Scientific Electronic Journals of Latin America and the Caribbean

    Our objective is to analyze the use that Latin American peer-reviewed journals make of the tools and opportunities provided by electronic publishing, particularly of those that would make them evolve to be more than “mere photocopies” of their printed counterparts. While doing this, we also set out to discover whether any Latin American journals use these technologies in an effective way, comparable to the most innovative journals in existence. We extracted a sample of 125 journals from the electronic resources index of LATINDEX (Regional System of Scientific Journals of Latin America, the Caribbean, Spain and Portugal) and compared them along five dimensions: (1) non-linearity, (2) use of multimedia, (3) linking to external resources (“multiple use”), (4) interactivity, and (5) use of metadata, search engines, and other added resources. We found that very few articles in these journals (14%) used non-linear links to navigate between different sections of the article. Almost no journals (3%) featured multimedia contents. About one in every four articles (26%) published in the journals analyzed had their references or bibliographic items enriched by links that connected to the original documents quoted by the author. The most common form of interaction was user↔journal, in the form of question forms (17% of journals) and new issue warnings (17% of journals). Some, however (5%), had user↔user interaction, offering forums and responses to published articles by the readership. About 35% of the journals have metadata within their pages, and 50% offer search engines to their users. One of the most pressing problems for these journals is the incorrect use of rather simple technologies such as linking: 49% of the external resource links were mismarked in some way, with a full 24% being mismarked by spelling or layout mistakes.
Latin American journals still present a number of serious limitations when using electronic resources and techniques, with text being overwhelmingly linear and underlinked, e-mail to the editors being the main means of contact, and multimedia being a scarce commodity. We selected a small sample of journals from other regions of the world, and found that they offer significantly more nonlinearity (p = 0.005 < 0.1), interactive features (p = 0.005 < 0.1), use of multimedia (p = 0.04 < 0.1) and linking to external documents (p = 0.007 < 0.1). While these are the current characteristics of Latin American journals, a number of very notable exceptions speak volumes about the potential of these technologies to improve the quality of Latin American scholarly publishing.

    Quantifying change in morphological complexity as a tool for language revitalization

    Lupyan and Dale’s (2010) method to calculate morphological complexity can be used to track changes in the morphological complexity of endangered languages. This method of quantifying change can provide information about language loss, but also serve as a tool for measuring the progress of language teaching and reclamation

    A Reanalysis of Bribri Relative Clauses as a Case of Linearization within Minimalist Theory

    Using Kayne’s linearization axioms (1994), I propose a reanalysis of the head-internal relative clauses of the Bribri language (Chibchan Macrofamily). In this reinterpretation, the determiner e' (or any of the determiners used in this language) would function as the head of a DP. This DP would have an NP complement, whose head would be a phonetically empty copy of the relativized element. This element would be fully realized within the relative clause, while the copy would be annulled during the linearization process, resulting in a head-internal structure. The annulled copy can be fully realized if the resulting structure presents ambiguities. UCR::Vicerrectoría de Docencia::Artes y Letras::Facultad de Letras::Escuela de Filología, Lingüística y Literatur

    Tonal reduction and literacy in Me’phaa Vátháá

    This study examines the relationship between tonal phonetics, tonal reduction and orthographic patterns produced by Me'phaa Vátháá speaking teachers. It discusses these patterns in the context of Indigenous education in Mexico and of the language ideologies held by the teachers, which have parallels to those held by speakers of Spanish and practitioners of language revitalization. Its main finding is that tones undergo phonetic changes which reduce their relative psychoacoustic distances, and this combines with the writing practices of the teachers (in which they repeat the words to themselves at varying speeds) to produce hesitation when writing the tonal markers. This is framed in an ideological process of privileging writing as the ideal form of language revitalization, and of rejection of variants and spelling 'mistakes', which results in further linguistic insecurity among the teachers. This has repercussions for the revitalization of the language, in that teachers sometimes choose not to write in Me'phaa Vátháá, particularly in contexts involving technology such as social media, out of fear of making 'mistakes'. In studying these phenomena, this study also describes the processes of tonal reduction in Me'phaa Vátháá and describes its similarities and differences with the reduction described for other tonal languages such as Mandarin, Thai and Triqui. Tonal reduction processes in Me'phaa Vátháá are not an exact match to those of any of these languages, which suggests that, while reduction is universal, it has language-specific expressions, and that reduction typologies should be studied further. In addition to this, the study offers a report on the process of tonal spelling learning by adults who didn't receive this training as children. This is relevant to both educational and language planners, as well as to practitioners of language revitalization

    Buenas prácticas en las revistas electrónicas latinoamericanas

    Book title: Calidad e Impacto de la revista Iberoamericana. We present an analysis of the features that should be present in electronic journals, both in the world and specifically in Latin America. According to research from the last two years, these features are not very frequent in Latin American journals. We selected a sample of journals from the Latindex Catalog that passed the electronic evaluation criteria of this database, and examined the prevalence of features such as hyperlinks, multimedia, user-journal interaction and presence of non-automatic metadata. Only 15% of the journals used hyperlinks for the users to navigate between the sections of a paper. Only 12% of the journals had multimedia contents, only 4% had forums for interaction among readers, and only 62% had any non-automatic metadata. Even so, several journals are very visibly trying to improve their quality and have become examples of good practice in the region. We stress the need for more editor training as the means to achieve better usage of the electronic resources available to journals. Universidad de Costa Rica, INASP, CONACYT, UNAM. UCR::Vicerrectoría de Docencia::Ciencias Sociales::Facultad de Educación::Escuela de Bibliotecología y Ciencias de la Informació

    Metadata Usage Tendencies in Latin American Electronic Journals

    The present study investigates the extent to which metadata tags are used in Latin American electronic journals, and whether these journals in fact provide basic information (abstracts, keywords, etc.) that could be tagged as metadata. The authors also studied multilingualism in the marked-up information and in the basic information, particularly the use of English (which can help bring the scientific production of Latin America to a wider audience). In total, 45% of the journals had metadata; the metatags keywords and description were the most commonly used. The inclusion of structured metadata from the Dublin Core Metadata Element Set in the journals was found to be very low, only 13%, and primarily existed in journals from Argentina, Costa Rica, and Brazil. The articles examined did not always include abstracts and keywords (84% and 77% respectively), but in the articles that did have them, English was frequently used (85% in abstracts and 91% in keywords). The <title> element was found to be used deficiently: only 42% of full text OA articles had their actual title in the tag, which can potentially affect visibility in search engine results. In sum, the road to marked-up metadata in all journals is still long, and there are great inconsistencies in how metadata are employed and in their content. The authors conclude that there are signs that support and efforts to increase awareness of how metadata can easily be included in a journal’s web site may result in improved metadata and greater visibility. MICIT Costa Rica: Solicitud de Fondo de Incentivos FI-121-2009. UCR::Vicerrectoría de Investigación. UCR::Vicerrectoría de Docencia::Ciencias Sociales::Facultad de Educación::Escuela de Bibliotecología y Ciencias de la Informació