756 research outputs found

    Cross-lingual Word Clusters for Direct Transfer of Linguistic Structure

    It has been established that incorporating word cluster features derived from large unlabeled corpora can significantly improve prediction of linguistic structure. While previous work has focused primarily on English, we extend these results to other languages along two dimensions. First, we show that these results hold true for a number of languages across families. Second, and more interestingly, we provide an algorithm for inducing cross-lingual clusters and we show that features derived from these clusters significantly improve the accuracy of cross-lingual structure prediction. Specifically, we show that by augmenting direct-transfer systems with cross-lingual cluster features, the relative error of delexicalized dependency parsers, trained on English treebanks and transferred to foreign languages, can be reduced by up to 13%. When applying the same method to direct transfer of named-entity recognizers, we observe relative improvements of up to 26%.

    Arabic-English Text Translation Leveraging Hybrid NER

    Corpora worth creating: A pilot study on telephone interpreting

    This paper reports on the development and use of a corpus of interpreter-mediated phone calls to study features of telephone interpreting (TI) in healthcare settings. After a short introduction on TI and corpus-based studies of remote and on-site community interpreting (CI), the paper discusses ways of exploiting the corpus to analyse interpreters’ translation and coordination activities over the phone. It first shows that, notwithstanding some limitations due to data originally collected for non-linguistic purposes, even a small and raw resource can contribute to exploratory analyses of TI, using a qualitative (Conversation Analysis) approach. It then illustrates how opportunities for more systematic research are opened up by corpus annotation. The paper finally reports on some preliminary insights about linguistic and interactional aspects characterizing this type of remote interpreting and makes a tentative comparison with two on-site CI corpora, thereby paving the way to more refined and quantitative investigations.

    Beyond Stemming and Lemmatization: Ultra-stemming to Improve Automatic Text Summarization

    In Automatic Text Summarization, preprocessing is an important phase to reduce the space of textual representation. Classically, stemming and lemmatization have been widely used for normalizing words. However, even when normalization is applied to large texts, the curse of dimensionality can degrade the performance of summarizers. This paper describes a new method for normalizing words to further reduce the space of representation. We propose to reduce each word to its initial letters, as a form of Ultra-stemming. The results show that Ultra-stemming not only preserves the content of the summaries produced with this representation, but often dramatically improves the performance of the systems. Summaries on trilingual corpora were evaluated automatically with Fresa. Results confirm an increase in performance, regardless of the summarizer system used.
    Comment: 22 pages, 12 figures, 9 tables
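The idea of reducing each word to its initial letters can be illustrated with a minimal Python sketch. This is not the paper's implementation: the function names, the prefix length parameter `n`, the lowercasing step, and the regex tokenizer are all assumptions made for illustration.

```python
import re


def ultra_stem(word: str, n: int = 1) -> str:
    """Keep only the first n letters of a word (hypothetical helper)."""
    return word.lower()[:n]


def ultra_stem_text(text: str, n: int = 1) -> list[str]:
    """Tokenize on runs of letters and ultra-stem each token.

    The tokenizer is an assumption; any word tokenizer could be used.
    """
    tokens = re.findall(r"[^\W\d_]+", text)
    return [ultra_stem(token, n) for token in tokens]


def vocabulary_size(tokens: list[str]) -> int:
    """Number of distinct terms, i.e. the dimension of the representation."""
    return len(set(tokens))


if __name__ == "__main__":
    sample = "Stemming and lemmatization normalize words; summarizers summarize."
    full = [t.lower() for t in re.findall(r"[^\W\d_]+", sample)]
    reduced = ultra_stem_text(sample, n=1)
    print(vocabulary_size(full), vocabulary_size(reduced))
```

Comparing the two vocabulary sizes shows how aggressively the truncation shrinks the representation space; how small `n` can be before summary quality suffers is the empirical question the abstract addresses.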

    A Predictive Factor Analysis of Social Biases and Task-Performance in Pretrained Masked Language Models

    Various types of social biases have been reported with pretrained Masked Language Models (MLMs) in prior work. However, multiple underlying factors are associated with an MLM such as its model size, size of the training data, training objectives, the domain from which pretraining data is sampled, tokenization, and languages present in the pretrained corpora, to name a few. It remains unclear as to which of those factors influence social biases that are learned by MLMs. To study the relationship between model factors and the social biases learned by an MLM, as well as the downstream task performance of the model, we conduct a comprehensive study over 39 pretrained MLMs covering different model sizes, training objectives, tokenization methods, training data domains and languages. Our results shed light on important factors often neglected in prior literature, such as tokenization or model objectives.
    Comment: Accepted to EMNLP 2023 main conference

    Proceedings of the COLING 2004 Post Conference Workshop on Multilingual Linguistic Ressources MLR2004

    In an ever expanding information society, most information systems are now facing the "multilingual challenge". Multilingual language resources play an essential role in modern information systems. Such resources need to provide information on many languages in a common framework and should be (re)usable in many applications (for automatic or human use). Many centres have been involved in national and international projects dedicated to building harmonised language resources and creating expertise in the maintenance and further development of standardised linguistic data. These resources include dictionaries, lexicons, thesauri, word-nets, and annotated corpora developed along the lines of best practices and recommendations. However, since the late 1990s, most efforts in scaling up these resources have remained the responsibility of local authorities, usually with very low funding (if any) and few opportunities for academic recognition of this work. Hence, it is not surprising that many of the resource holders and developers have become reluctant to give free access to the latest versions of their resources, and the actual status of these resources is therefore currently rather unclear. The goal of this workshop is to study problems involved in the development, management and reuse of lexical resources in a multilingual context. Moreover, this workshop provides a forum for reviewing the present state of language resources. The workshop is meant to bring to the international community qualitative and quantitative information about the most recent developments in the area of linguistic resources and their use in applications. The impressive number of submissions (38) to this workshop and to other workshops and conferences dedicated to similar topics shows that dealing with multilingual linguistic resources has become a very active topic in the Natural Language Processing community. To cope with the number of submissions, the workshop organising committee decided to accept 16 papers from 10 countries based on the reviewers' recommendations. Six of these papers will be presented in a poster session. The papers constitute a representative selection of current trends in research on Multilingual Language Resources, such as multilingual aligned corpora, bilingual and multilingual lexicons, and multilingual speech resources. The papers also represent a characteristic set of approaches to the development of multilingual language resources, such as automatic extraction of information from corpora, combination and re-use of existing resources, online collaborative development of multilingual lexicons, and use of the Web as a multilingual language resource. The development and management of multilingual language resources is a long-term activity in which collaboration among researchers is essential. We hope that this workshop will gather many researchers involved in such developments and will give them the opportunity to discuss, exchange and compare their approaches and strengthen their collaborations in the field. The organisation of this workshop would have been impossible without the hard work of the program committee, who managed to provide accurate reviews on time, on a rather tight schedule. We would also like to thank the Coling 2004 organising committee that made this workshop possible. Finally, we hope that this workshop will yield fruitful results for all participants.