A Comparison of Architectures and Pretraining Methods for Contextualized Multilingual Word Embeddings
The lack of annotated data in many languages is a well-known challenge within
the field of multilingual natural language processing (NLP). Therefore, many
recent studies focus on zero-shot transfer learning and joint training across
languages to overcome data scarcity for low-resource languages. In this work we
(i) perform a comprehensive comparison of state-of-the-art multilingual word and
sentence encoders on the tasks of named entity recognition (NER) and part of
speech (POS) tagging; and (ii) propose a new method for creating multilingual
contextualized word embeddings, compare it to multiple baselines and show that
it performs at or above state-of-the-art level in zero-shot transfer settings.
Finally, we show that our method allows for better knowledge sharing across
languages in a joint training setting. Comment: 7 pages, 6 figures
Language identification for German-Turkish code-switching speech
Computers have become increasingly important in daily life in recent years, and the average person interacts with them many times a day. This widespread use has led researchers to look for ways of communicating with computers through a minimum number of interactions. Because speech is the main instrument of human communication, researchers have also adopted it as an interaction method between humans and computers. Speech has limits of its own, however: language varies between societies, especially in multicultural societies where people tend to communicate in a mixed language known as code-switching. Germany, for example, is a multicultural country in which foreigners, especially bilingual Turkish speakers, use both German and Turkish when they speak to each other. At the same time, computers have become more powerful and can now handle complex tasks such as NLP, which require a lot of processing power. In this thesis we address the task of language identification in German-Turkish code-switching speech with two popular machine learning methods, Support Vector Machines and Deep Neural Networks, and compare the performance of the two.
Modeling Language Variation and Universals: A Survey on Typological Linguistics for Natural Language Processing
Linguistic typology aims to capture structural and semantic variation across
the world's languages. A large-scale typology could provide excellent guidance
for multilingual Natural Language Processing (NLP), particularly for languages
that suffer from the lack of human labeled resources. We present an extensive
literature survey on the use of typological information in the development of
NLP techniques. Our survey demonstrates that to date, the use of information in
existing typological databases has resulted in consistent but modest
improvements in system performance. We show that this is due to both intrinsic
limitations of databases (in terms of coverage and feature granularity) and
under-employment of the typological features included in them. We advocate for
a new approach that adapts the broad and discrete nature of typological
categories to the contextual and continuous nature of machine learning
algorithms used in contemporary NLP. In particular, we suggest that such an
approach could be facilitated by recent developments in data-driven induction
of typological knowledge.
Computational Sociolinguistics: A Survey
Language is a social phenomenon and variation is inherent to its social
nature. Recently, there has been a surge of interest within the computational
linguistics (CL) community in the social dimension of language. In this article
we present a survey of the emerging field of "Computational Sociolinguistics"
that reflects this increased interest. We aim to provide a comprehensive
overview of CL research on sociolinguistic themes, featuring topics such as the
relation between language and social identity, language use in social
interaction and multilingual communication. Moreover, we demonstrate the
potential for synergy between the research communities involved, by showing how
the large-scale data-driven methods that are widely used in CL can complement
existing sociolinguistic studies, and how sociolinguistics can inform and
challenge the methods and assumptions employed in CL studies. We hope to convey
the possible benefits of a closer collaboration between the two communities and
conclude with a discussion of open challenges. Comment: To appear in Computational Linguistics. Accepted for publication:
18th February, 201