11,972 research outputs found
Modeling Language Variation and Universals: A Survey on Typological Linguistics for Natural Language Processing
Linguistic typology aims to capture structural and semantic variation across
the world's languages. A large-scale typology could provide excellent guidance
for multilingual Natural Language Processing (NLP), particularly for languages
that suffer from a lack of human-labeled resources. We present an extensive
literature survey on the use of typological information in the development of
NLP techniques. Our survey demonstrates that to date, the use of information in
existing typological databases has resulted in consistent but modest
improvements in system performance. We show that this is due to both intrinsic
limitations of databases (in terms of coverage and feature granularity) and
under-employment of the typological features included in them. We advocate for
a new approach that adapts the broad and discrete nature of typological
categories to the contextual and continuous nature of machine learning
algorithms used in contemporary NLP. In particular, we suggest that such an
approach could be facilitated by recent developments in data-driven induction
of typological knowledge.
Learning Language Representations for Typology Prediction
One central mystery of neural NLP is what neural models "know" about their
subject matter. When a neural machine translation system learns to translate
from one language to another, does it learn the syntax or semantics of the
languages? Can this knowledge be extracted from the system to fill holes in
human scientific knowledge? Existing typological databases contain relatively
full feature specifications for only a few hundred languages. Exploiting the
existence of parallel texts in more than a thousand languages, we build a
massive many-to-one neural machine translation (NMT) system from 1017 languages
into English, and use this to predict information missing from typological
databases. Experiments show that the proposed method is able to infer not only
syntactic, but also phonological and phonetic inventory features, and improves
over a baseline that has access to information about the languages' geographic
and phylogenetic neighbors. Comment: EMNLP 201
Problems of methodology and explanation in word order universals research
Ever since the publication of Greenberg 1963, word order typologists have attempted to formulate and refine implicational universals of word order so as to characterize the restricted distribution of certain word order patterns, and in some cases have also attempted to develop general principles to explain the existence of those universals.
Lexical typology through similarity semantics: Toward a semantic map of motion verbs
This paper discusses a multidimensional probabilistic semantic map of lexical motion verb stems based on data collected from parallel texts (viz. translations of the Gospel according to Mark) for 100 languages from all continents. The crosslinguistic diversity of lexical semantics in motion verbs is illustrated in detail for the domain of 'go', 'come', and 'arrive' type contexts. It is argued that the theoretical bases underlying probabilistic semantic maps from exemplar data are the isomorphism hypothesis (given any two meanings and their corresponding forms in any particular language, more similar meanings are more likely to be expressed by the same form in any language), similarity semantics (similarity is more basic than identity), and exemplar semantics (exemplar meaning is more fundamental than abstract concepts).