9 research outputs found

    A Comparison of Architectures and Pretraining Methods for Contextualized Multilingual Word Embeddings

    The lack of annotated data in many languages is a well-known challenge within the field of multilingual natural language processing (NLP). Therefore, many recent studies focus on zero-shot transfer learning and joint training across languages to overcome data scarcity for low-resource languages. In this work we (i) perform a comprehensive comparison of state-of-the-art multilingual word and sentence encoders on the tasks of named entity recognition (NER) and part-of-speech (POS) tagging; and (ii) propose a new method for creating multilingual contextualized word embeddings, compare it to multiple baselines, and show that it performs at or above state-of-the-art level in zero-shot transfer settings. Finally, we show that our method allows for better knowledge sharing across languages in a joint training setting.
    Comment: 7 pages, 6 figures

    Language identification for German-Turkish code-switching speech

    The importance of computers in our daily lives has risen in recent years. The average person undoubtedly interacts with computers multiple times a day. This wide usage has led researchers to look for ways of communicating with computers through a minimum number of interactions. Speech is the main communication instrument for humans, so researchers have also used speech as an interaction method between humans and computers. However, speech has limits of its own: language varies across societies, especially multicultural ones, where people tend to communicate in a mixed language known as code-switching. For example, Germany is a multicultural country in which foreigners, especially bilingual Turkish people, use both German and Turkish when they speak to each other. At the same time, computers have become more powerful and can now handle complex tasks such as NLP tasks, which require a lot of processing power. In this thesis we address the language identification task in German-Turkish code-switching speech with two popular machine learning methods, Support Vector Machines and Deep Neural Networks, and compare the performance of the two methods.

    Modeling Language Variation and Universals: A Survey on Typological Linguistics for Natural Language Processing

    Linguistic typology aims to capture structural and semantic variation across the world's languages. A large-scale typology could provide excellent guidance for multilingual Natural Language Processing (NLP), particularly for languages that suffer from a lack of human-labeled resources. We present an extensive literature survey on the use of typological information in the development of NLP techniques. Our survey demonstrates that, to date, the use of information in existing typological databases has resulted in consistent but modest improvements in system performance. We show that this is due to both intrinsic limitations of the databases (in terms of coverage and feature granularity) and under-employment of the typological features included in them. We advocate for a new approach that adapts the broad and discrete nature of typological categories to the contextual and continuous nature of machine learning algorithms used in contemporary NLP. In particular, we suggest that such an approach could be facilitated by recent developments in data-driven induction of typological knowledge.

    Computational Sociolinguistics: A Survey

    Language is a social phenomenon and variation is inherent to its social nature. Recently, there has been a surge of interest within the computational linguistics (CL) community in the social dimension of language. In this article we present a survey of the emerging field of "Computational Sociolinguistics" that reflects this increased interest. We aim to provide a comprehensive overview of CL research on sociolinguistic themes, featuring topics such as the relation between language and social identity, language use in social interaction, and multilingual communication. Moreover, we demonstrate the potential for synergy between the research communities involved by showing how the large-scale data-driven methods that are widely used in CL can complement existing sociolinguistic studies, and how sociolinguistics can inform and challenge the methods and assumptions employed in CL studies. We hope to convey the possible benefits of a closer collaboration between the two communities and conclude with a discussion of open challenges.
    Comment: To appear in Computational Linguistics. Accepted for publication: 18th February, 201