518 research outputs found
On the use of probabilistic grammars in speech annotation and segmentation tasks
The present paper explores the issue of corpus prosodic parsing in terms of prosodic words. This question is of importance in both speech processing and corpus annotation studies. We propose a method grounded in both statistical and symbolic (phonological) representations of tonal phenomena, and we make use of probabilistic grammars, within which we implement a minimal prosodic hierarchical structure. Both stages, building the probabilistic grammar and testing it in prediction, are explored and evaluated quantitatively and qualitatively.
Modeling Language Variation and Universals: A Survey on Typological Linguistics for Natural Language Processing
Linguistic typology aims to capture structural and semantic variation across
the world's languages. A large-scale typology could provide excellent guidance
for multilingual Natural Language Processing (NLP), particularly for languages
that suffer from the lack of human labeled resources. We present an extensive
literature survey on the use of typological information in the development of
NLP techniques. Our survey demonstrates that to date, the use of information in
existing typological databases has resulted in consistent but modest
improvements in system performance. We show that this is due to both intrinsic
limitations of databases (in terms of coverage and feature granularity) and
under-employment of the typological features included in them. We advocate for
a new approach that adapts the broad and discrete nature of typological
categories to the contextual and continuous nature of machine learning
algorithms used in contemporary NLP. In particular, we suggest that such an
approach could be facilitated by recent developments in data-driven induction
of typological knowledge.
Introduction to the special issue on cross-language algorithms and applications
With the increasingly global nature of our everyday interactions, the need for multilingual technologies to support efficient and effective information access and communication cannot be overemphasized. Computational modeling of language has been the focus of Natural Language Processing, a subdiscipline of Artificial Intelligence. One of the current challenges for this discipline is to design methodologies and algorithms that are cross-language in order to create multilingual technologies rapidly. The goal of this JAIR special issue on Cross-Language Algorithms and Applications (CLAA) is to present leading research in this area, with emphasis on developing unifying themes that could lead to the development of the science of multi- and cross-lingualism. In this introduction, we provide the reader with the motivation for this special issue and summarize the contributions of the papers that have been included. The selected papers cover a broad range of cross-lingual technologies including machine translation, domain and language adaptation for sentiment analysis, cross-language lexical resources, dependency parsing, information retrieval and knowledge representation. We anticipate that this special issue will serve as an invaluable resource for researchers interested in topics of cross-lingual natural language processing.
The pervasiveness of language contact: Evidence from negative existentials in Romeyka/Turkish code-switching
This paper investigates the morpho-syntactic features of language contact between the endangered Greek dialect Romeyka and Turkish. We analyze the use of the borrowed negative existential jok to (a) determine its role in Romeyka’s negation patterns, (b) examine the effects of contact in Romeyka through cross-linguistic comparisons of jok with Turkish and with forms of the dialect as spoken in Greece, and (c) apply the identified grammatical patterns of jok to Myers-Scotton’s linguistic explanations for code-switching phenomena in the Matrix Language Turnover Hypothesis. The analysis demonstrates the pervasive influence of Turkish on the morpho-syntax of Romeyka through the incorporation of Turkish grammatical structures. We observe changes in the fundamental predicate grammar that are aligned with Turkish and that are inconsistent with Pontic’s existential constructions, where the verb indicating existence is used. The patterns of contact confirm the Matrix Language hypothesis and provide evidence indicating that Romeyka may be undergoing language turnover. Our findings are relevant to further understanding code switching among speakers of minority languages and to assessing the vitality of Romeyka in Turkey.
DIRECTIONAL HARMONIC SERIALISM
This dissertation proposes a novel phonological framework, directional Harmonic Serialism, that synthesizes constraint-based, rule-based, and formal language theoretic approaches to phonology. I illustrate its advantages in the domains of feature spreading, quantity-insensitive footing, and autosegmental phonology. Specifically, I demonstrate that across these disparate domains, directional Harmonic Serialism makes empirical predictions that more tightly model natural language phonology than alternative theories and that it does so using fewer theoretical mechanisms. At a high level, the theory outperforms alternatives using a simpler, more restricted toolkit.
Normalization of Dutch user-generated content
This paper describes a phrase-based machine translation approach to normalize Dutch user-generated content (UGC). We compiled a corpus of three different social media genres (text messages, message board posts and tweets) to have a sample of this recent domain. We describe the various characteristics of this noisy text material and explain how it has been manually normalized using newly developed guidelines. For the automatic normalization task we focus on text messages, and find that a cascaded SMT system where a token-based module is followed by a translation at the character level gives the best word error rate reduction. After these initial experiments, we investigate the system's robustness on the complete domain of UGC by testing it on the other two social media genres, and find that the cascaded approach performs best on these genres as well. To our knowledge, we deliver the first proof-of-concept system for Dutch UGC normalization, which can serve as a baseline for future work.