
    Token-based typology and word order entropy: A study based on universal dependencies

    The present paper discusses the benefits and challenges of token-based typology, which takes into account the frequencies of words and constructions in language use. This approach makes it possible to introduce new criteria for language classification that would be difficult or impossible to achieve with the traditional, type-based approach. This point is illustrated by several quantitative studies of word order variation, which can be measured as entropy at different levels of granularity. I argue that this variation can be explained by general functional mechanisms and pressures that manifest themselves in language use, such as optimization of processing (including avoidance of ambiguity) and grammaticalization of predictable units occurring in chunks. The case studies are based on multilingual corpora parsed with the Universal Dependencies annotation scheme.
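    As an illustration of the kind of measure involved, the sketch below (my own illustration, not the paper’s code; the file name in the usage comment is a placeholder) computes the entropy of head-dependent order for a single dependency relation from a CoNLL-U treebank.

    import math
    from collections import Counter

    def order_entropy(conllu_path, relation="obj"):
        """Entropy (in bits) of head-dependent order for one dependency relation."""
        counts = Counter()
        with open(conllu_path, encoding="utf-8") as f:
            for line in f:
                line = line.rstrip("\n")
                if not line or line.startswith("#"):
                    continue                              # sentence breaks and comment lines
                cols = line.split("\t")
                if "-" in cols[0] or "." in cols[0] or cols[6] == "_":
                    continue                              # multiword tokens, empty nodes, unattached tokens
                dep_id, head_id, deprel = int(cols[0]), int(cols[6]), cols[7]
                if deprel == relation and head_id > 0:
                    counts["dependent first" if dep_id < head_id else "head first"] += 1
        total = sum(counts.values())
        if total == 0:
            return 0.0
        return -sum(n / total * math.log2(n / total) for n in counts.values())

    # Hypothetical usage with a UD treebank file:
    # print(order_entropy("en_ewt-ud-train.conllu", relation="obj"))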

    A Multivariate Study of T/V Forms in European Languages Based on a Parallel Corpus of Film Subtitles

    The present study investigates cross-linguistic differences in the use of so-called T/V forms (e.g. French tu and vous, German du and Sie, Russian ty and vy) in ten European languages from different language families and genera. The constraints on their use represent an elusive object of investigation because they depend on a large number of subtle contextual features and social distinctions, which have to be matched cross-linguistically. Film subtitles in different languages offer a convenient solution because the situations of communication between film characters can serve as comparative concepts. I selected more than two hundred contexts that contain the pronouns you and yourself in the original English versions, which were then coded for fifteen contextual variables describing the Speaker and the Hearer, their relationship and various situational properties. The creators of subtitles in the other languages have to choose between T and V when translating from English, where the T/V distinction is not expressed grammatically. On the basis of these situations translated into ten languages, I performed multivariate analyses using conditional inference trees in order to identify the most relevant contextual variables that constrain the T/V variation in each language.
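    The multivariate analysis relies on conditional inference trees (commonly the R partykit implementation); as a rough Python stand-in, the sketch below fits an ordinary CART decision tree to a handful of invented coded contexts, just to show the shape of the data and the kind of partitioning involved. All variable names and values are hypothetical.

    # Stand-in sketch: an ordinary decision tree instead of conditional inference trees.
    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier, export_text

    contexts = pd.DataFrame({
        "speaker_age":   ["adult", "adult", "young", "young", "adult", "young"],
        "hearer_status": ["higher", "equal", "equal", "higher", "lower", "equal"],
        "relationship":  ["stranger", "friend", "friend", "stranger", "colleague", "stranger"],
        "form":          ["V", "T", "T", "V", "T", "V"],   # translator's choice in one language
    })

    X = pd.get_dummies(contexts.drop(columns="form"))      # one-hot encode the predictors
    y = contexts["form"]

    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
    print(export_text(tree, feature_names=list(X.columns)))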

    Frequency, informativity and word length: Insights from typologically diverse corpora

    Zipf’s law of abbreviation, which posits a negative correlation between word frequency and length, is one of the most famous and robust cross-linguistic generalizations. At the same time, it has been shown that contextual informativity (average surprisal given the previous context) is more strongly correlated with word length, although this tendency is not observed consistently and depends on several methodological choices. The present study examines a more diverse sample of languages than previous studies (Arabic, Finnish, Hungarian, Indonesian, Russian, Spanish and Turkish). I use large web-based corpora from the Leipzig Corpora Collection to estimate word length in UTF-8 characters and in phonemes (for some of the languages), as well as word frequency, informativity given the previous word and informativity given the next word, applying different methods of bigram processing. The results show different correlations between word length and the corpus-based measures across languages. I argue that these differences can be explained by the properties of noun phrases in a language, most importantly by the order of heads and modifiers and their relative morphological complexity, as well as by orthographic conventions.
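    A minimal sketch of the corpus-based measures, assuming a plain token list rather than the Leipzig corpora: word frequency, length in characters, and average surprisal given the previous word, each correlated with length.

    import math
    from collections import Counter, defaultdict
    from scipy.stats import spearmanr

    def word_measures(tokens):
        """Return (word, frequency, length, informativity) tuples for each word type."""
        unigrams = Counter(tokens)
        with_prev = Counter(tokens[1:])                      # occurrences that have a preceding word
        bigrams = Counter(zip(tokens, tokens[1:]))
        surprisal_sum = defaultdict(float)
        for (prev, w), n in bigrams.items():
            surprisal_sum[w] += n * -math.log2(n / unigrams[prev])   # -log2 P(w | prev)
        return [(w, f, len(w), surprisal_sum[w] / with_prev[w] if with_prev[w] else 0.0)
                for w, f in unigrams.items()]

    tokens = "the cat sat on the mat and the dog sat on the cat".split()
    rows = word_measures(tokens)
    lengths = [r[2] for r in rows]
    print("length ~ frequency:    ", spearmanr(lengths, [r[1] for r in rows]))
    print("length ~ informativity:", spearmanr(lengths, [r[3] for r in rows]))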

    Corpus-based typology: Applications, challenges and some solutions

    Over the last few years, the number of corpora that can be used for language comparison has increased dramatically. The corpora are so diverse in their structure, size and annotation style that a novice might not know where to start. The present paper charts this new and changing territory, providing a few landmarks, warning signs and safe paths. Although no corpus at present can replace the traditional type of typological data based on language descriptions in reference grammars, corpora can help with diverse tasks, being particularly well suited for investigating probabilistic and gradient properties of languages and for discovering and interpreting cross-linguistic generalizations based on processing and communicative mechanisms. At the same time, the use of corpora for typological purposes brings not only advantages and opportunities, but also numerous challenges. The paper also contains an empirical case study addressing two pertinent problems: the role of text types in language comparison and the problem of the word as a comparative concept.

    Semantic maps of causation: New hybrid approaches based on corpora and grammar descriptions

    The present paper discusses connectivity and proximity maps of causative constructions and combines them with different types of typological data. In the first case study, I show how a connectivity map can be created on the basis of a parallel corpus. This allows us to solve several problems, such as incomplete descriptions, inconsistent terminology and the difficulty of determining the semantic nodes. The second part focuses on proximity maps based on Multidimensional Scaling and compares the most important semantic distinctions inferred from a parallel corpus of film subtitles with those inferred from grammar descriptions. The results suggest that corpus-based maps of tokens are more sensitive to cultural and genre-related differences in the prominence of specific causation scenarios than maps based on constructional types described in reference grammars. The grammar-based maps also reveal a less clear structure, which may be due to incomplete semantic descriptions in grammars. Each approach therefore has shortcomings that researchers need to be aware of.
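    A minimal sketch of the proximity-map step, assuming a small hypothetical dissimilarity matrix between causative constructions in place of the corpus- and grammar-based distances used in the study.

    import numpy as np
    from sklearn.manifold import MDS

    constructions = ["make", "have", "get", "cause", "let"]
    # Hypothetical pairwise dissimilarities (symmetric, zero diagonal), e.g.
    # 1 minus the proportion of shared contexts in a parallel corpus.
    D = np.array([
        [0.0, 0.3, 0.4, 0.5, 0.7],
        [0.3, 0.0, 0.2, 0.6, 0.6],
        [0.4, 0.2, 0.0, 0.6, 0.5],
        [0.5, 0.6, 0.6, 0.0, 0.8],
        [0.7, 0.6, 0.5, 0.8, 0.0],
    ])

    mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
    coords = mds.fit_transform(D)                 # 2-D map approximating the distances
    for label, (x, y) in zip(constructions, coords):
        print(f"{label:>6}: ({x:+.2f}, {y:+.2f})")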

    How tight is your language? A semantic typology based on Mutual Information

    Languages differ in the degree of semantic flexibility of their syntactic roles. For example, English and Indonesian are considered more flexible with regard to the semantics of subjects, whereas German and Japanese are less flexible. In Hawkins’ classification, more flexible languages are said to have a loose fit, and less flexible ones a tight fit. This classification has been based on manual inspection of example sentences. The present paper proposes a new, quantitative approach that derives measures of looseness and tightness from corpora. We use corpora of online news from the Leipzig Corpora Collection in thirty typologically and genealogically diverse languages and parse them syntactically with software based on the Universal Dependencies annotation scheme. Next, we compute Mutual Information scores for each language using the matrices of lexical lemmas and four syntactic dependencies (intransitive subjects, transitive subjects, objects and obliques). The new approach allows us not only to reproduce the results of previous investigations, but also to extend the typology to new languages. We also demonstrate that verb-final languages tend to have a tighter relationship between lexemes and syntactic roles, which helps language users to recognize thematic roles early during comprehension.
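    A minimal sketch of the mutual-information measure, assuming a small hypothetical table of (noun lemma, syntactic role) counts in place of the parsed corpora.

    import math
    from collections import Counter

    # Hypothetical (lemma, role) counts; the role labels follow Universal Dependencies.
    counts = Counter({
        ("dog", "nsubj"): 40, ("dog", "obj"): 5,
        ("cake", "obj"): 30, ("cake", "nsubj"): 2,
        ("knife", "obl"): 20, ("knife", "obj"): 3,
    })

    total = sum(counts.values())
    p_lemma, p_role = Counter(), Counter()
    for (lemma, role), n in counts.items():
        p_lemma[lemma] += n / total
        p_role[role] += n / total

    mi = sum((n / total) * math.log2((n / total) / (p_lemma[l] * p_role[r]))
             for (l, r), n in counts.items())
    print(f"MI(lemma; role) = {mi:.3f} bits")      # higher MI = tighter fit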

    Why we need a token-based typology: A case study of analytic and lexical causatives in fifteen European languages

    This paper investigates the variation between lexical and analytic causatives in fifteen European languages from the Germanic, Romance and Slavic genera, based on a multilingual parallel corpus of film subtitles. Using typological parameters of causative variation from the literature, the study tests which parameters are relevant for the choice between analytic and lexical causatives in the sample of languages. The main research question is whether the variation is constrained by one semantic dimension, namely the conceptual integration of the causing and caused events, as suggested by previous research on iconicity in language, or whether several different semantic and syntactic factors are at play. To answer this question, I use an exploratory multivariate technique for categorical data (Multiple Correspondence Analysis with supplementary points) and conditional random forests, a nonparametric regression and classification method. The study demonstrates the importance of corpus data in testing typological hypotheses.
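    The study combines Multiple Correspondence Analysis with conditional random forests (commonly the R party/partykit implementation); as a rough Python stand-in, the sketch below ranks invented semantic parameters for the lexical/analytic choice with an ordinary scikit-learn random forest.

    # Stand-in sketch: an ordinary random forest instead of conditional random forests;
    # the coded data and parameter names are invented.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    data = pd.DataFrame({
        "event_integration": ["direct", "indirect", "direct", "indirect", "direct", "indirect"],
        "causee_control":    ["low", "high", "low", "high", "high", "low"],
        "causation_type":    ["physical", "verbal", "physical", "verbal", "physical", "verbal"],
        "causative":         ["lexical", "analytic", "lexical", "analytic", "analytic", "lexical"],
    })

    X = pd.get_dummies(data.drop(columns="causative"))
    y = data["causative"]

    forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
    for name, importance in sorted(zip(X.columns, forest.feature_importances_),
                                   key=lambda pair: pair[1], reverse=True):
        print(f"{name:<28} {importance:.3f}")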

    Cross-linguistic trade-offs and causal relationships between cues to grammatical subject and object, and the problem of efficiency-related explanations

    Cross-linguistic studies focus on inverse correlations (trade-offs) between linguistic variables that reflect different cues to linguistic meanings. For example, if a language has no case marking, it is likely to rely on word order as a cue for the identification of grammatical roles. Such inverse correlations are interpreted as manifestations of language users’ tendency to use language efficiently. The present study argues that this interpretation is problematic. Linguistic variables, such as the presence of case marking or the flexibility of word order, are aggregate properties that do not directly represent the use of linguistic cues in context. Still, such variables can be useful for circumscribing the potential role of communicative efficiency in language evolution if we move from cross-linguistic trade-offs to multivariate causal networks. This idea is illustrated by a case study of linguistic variables related to four types of Subject and Object cues: case marking, rigid word order of Subject and Object, tight semantics and verb-medial order. The variables are obtained from online language corpora in thirty languages, annotated with Universal Dependencies. The causal model suggests that the relationships between the variables can be explained predominantly by sociolinguistic factors, leaving little space for a potential impact of efficient linguistic behavior.
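    The causal modelling itself goes beyond a short example, but the kind of aggregate trade-off the study starts from can be sketched with invented per-language values (not the study’s data).

    from scipy.stats import spearmanr

    # Hypothetical aggregate variables for six languages: the share of core arguments
    # that are case-marked, and how rigid the order of Subject and Object is.
    case_marking_rate = [0.90, 0.75, 0.60, 0.40, 0.10, 0.05]
    order_rigidity    = [0.55, 0.60, 0.70, 0.80, 0.95, 0.97]

    rho, p = spearmanr(case_marking_rate, order_rigidity)
    print(f"Spearman rho = {rho:.2f}, p = {p:.4f}")   # a negative rho would indicate a trade-off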

    Mechanisms of the perception of the actor and of representation (Mécanismes de perception de l’acteur et de la représentation)


    Token-based typology and word order entropy: A study based on Universal Dependencies (a draft)

    The present paper discusses the benefits and challenges of token-based typology, which takes into account the frequencies of words and constructions in language use. This approach makes it possible to introduce new criteria for language classification that would be difficult or impossible to achieve with the traditional, type-based approach. This point is illustrated by a case study of word order variation, which can be measured as entropy at different levels of granularity. I argue that this variation can be explained by general functional mechanisms and pressures that manifest themselves in language use, such as conventionalization of frequent patterns, optimization of processing and avoidance of ambiguity. The case studies are based on data from the Universal Dependencies corpora and the Leipzig Corpora Collection.