111,017 research outputs found
A cognitive approach to event structures in Middle Mongolian based on the corpus "The secret history of the Mongols"
This dissertation, entitled "A Cognitive Approach to Event Structures in Middle Mongolian Based on the Corpus 'The Secret History of the Mongols'", considers itself an empirical study. It deals with the question of how the underlying language knowledge, which manifests in the language usage of the one or more (here: anonymous) authors, can be derived from a historically transmitted text corpus such as the "Secret History of the Mongols". It intends to analyze methodically the totality of verb-relevant phenomena in Middle Mongolian as they appear in the text corpus of 29,396 lexemes from a cognitive-typological point of view.
The aim was also to provide a basic typology of verb formation by systematically examining the data for usage-based patterns in terms of their frequency and the associated markedness, because this provides evidence for the degree of language usage. It can be assumed that no linguistic element is independent in its meaning from its textual environment, which is especially evident in the phrase units; thus each linguistic element has a meaning/function that is relevant to a larger embedding construction. The structure of a text can be compared to the structure of an "onion" in which all layers are related to each other as part of a whole, with each layer expressing a linguistic sign. In this process, patterns are revealed in all layers, which in turn are analyzed for their semanticity
Text authorship identified using the dynamics of word co-occurrence networks
The identification of authorship in disputed documents still requires human
expertise, which is now unfeasible for many tasks owing to the large volumes of
text and authors in practical applications. In this study, we introduce a
methodology based on the dynamics of word co-occurrence networks representing
written texts to classify a corpus of 80 texts by 8 authors. The texts were
divided into sections with an equal number of linguistic tokens, from which time
series were created for 12 topological metrics. The series were proven to be
stationary (p-value>0.05), which permits the use of distribution moments as
learning attributes. With an optimized supervised learning procedure using a
Radial Basis Function Network, 68 out of 80 texts were correctly classified,
i.e. a remarkable 85% author matching success rate. Therefore, fluctuations in
purely dynamic network metrics were found to characterize authorship, thus
opening the way for the description of texts in terms of small evolving
networks. Moreover, the approach introduced allows for comparison of texts with
diverse characteristics in a simple, fast fashion
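
The pipeline described in this abstract (equal-size sections, one network per section, a time series per metric, distribution moments as features) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the adjacency-based co-occurrence rule, the choice of mean degree as the single metric, and all function names are assumptions made here, and the classifier step is omitted.

```python
from statistics import mean, pvariance

def cooccurrence_edges(tokens):
    """Undirected edges between adjacent, distinct tokens (assumed linking rule)."""
    return {frozenset((a, b)) for a, b in zip(tokens, tokens[1:]) if a != b}

def mean_degree(tokens):
    """One illustrative topological metric: average node degree of the section network."""
    nodes = set(tokens)
    edges = cooccurrence_edges(tokens)
    return 2 * len(edges) / len(nodes) if nodes else 0.0

def metric_series(tokens, section_size):
    """Split the text into consecutive sections of equal token count and
    evaluate the metric on each section, giving a time series."""
    starts = range(0, len(tokens) - section_size + 1, section_size)
    return [mean_degree(tokens[i:i + section_size]) for i in starts]

def moment_features(series):
    """Summarize the series by its first two distribution moments."""
    return {"mean": mean(series), "variance": pvariance(series)}

tokens = "a b c a b c".split()
series = metric_series(tokens, section_size=3)
features = moment_features(series)
```

In a fuller version, each of the 12 metrics would contribute its own moments to the feature vector passed to a supervised classifier.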
Building a Corpus of 2L English for Automatic Assessment: the CLEC Corpus
In this paper we describe the CLEC corpus, an ongoing project set up at the University of Cádiz with the purpose of building up a large corpus of English as a 2L classified according to CEFR proficiency levels and formed to train statistical models for automatic proficiency assessment. The goal of this corpus is twofold: on the one hand it will be used as a data resource for the development of automatic text classification systems and, on the other, it has been used as a means of teaching innovation techniques
Holistic corpus-based dialectology
This paper is concerned with sketching future directions for corpus-based dialectology. We advocate a holistic approach to the study of geographically conditioned linguistic variability, and we present a suitable methodology, 'corpus-based dialectometry', in exactly this spirit. Specifically, we argue that in order to live up to the potential of the corpus-based method, practitioners need to (i) abandon their exclusive focus on individual linguistic features in favor of the study of feature aggregates, (ii) draw on computationally advanced multivariate analysis techniques (such as multidimensional scaling, cluster analysis, and principal component analysis), and (iii) aid interpretation of empirical results by marshalling state-of-the-art data visualization techniques. To exemplify this line of analysis, we present a case study which explores joint frequency variability of 57 morphosyntax features in 34 dialects all over Great Britain
Token-based typology and word order entropy: A study based on universal dependencies
The present paper discusses the benefits and challenges of token-based typology, which takes into account the frequencies of words and constructions in language use. This approach makes it possible to introduce new criteria for language classification, which would be difficult or impossible to achieve with the traditional, type-based approach. This point is illustrated by several quantitative studies of word order variation, which can be measured as entropy at different levels of granularity. I argue that this variation can be explained by general functional mechanisms and pressures, which manifest themselves in language use, such as optimization of processing (including avoidance of ambiguity) and grammaticalization of predictable units occurring in chunks. The case studies are based on multilingual corpora, which have been parsed using the Universal Dependencies annotation scheme
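
As a rough illustration of measuring word order variation as entropy, the sketch below computes the Shannon entropy of the observed orderings of one construction. The labels "VO" and "OV" and the function name are hypothetical; in practice the order labels would be extracted from a dependency-parsed corpus, which is not shown here.

```python
from collections import Counter
from math import log2

def order_entropy(orders):
    """Shannon entropy (in bits) of a list of observed word order labels.

    0 bits means the order is fixed; 1 bit means two orders are equally likely.
    """
    counts = Counter(orders)
    total = sum(counts.values())
    return -sum((n / total) * log2(n / total) for n in counts.values())

# Token-level observations for one construction (invented counts)
rigid = ["VO"] * 10
flexible = ["VO"] * 5 + ["OV"] * 5
```

Because the measure is computed over tokens rather than types, a language with a rare alternative order scores close to zero instead of being classified as simply "having both orders".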
Learning Language from a Large (Unannotated) Corpus
A novel approach to the fully automated, unsupervised extraction of
dependency grammars and associated syntax-to-semantic-relationship mappings
from large text corpora is described. The suggested approach builds on the
authors' prior work with the Link Grammar, RelEx and OpenCog systems, as well
as on a number of prior papers and approaches from the statistical language
learning literature. If successful, this approach would enable the mining of
all the information needed to power a natural language comprehension and
generation system, directly from a large, unannotated corpus.
Comment: 29 pages, 5 figures, research proposa
Les stratégies linguistiques utilisées pour renvoyer à des politiciens dans la presse écrite en espagnol: une analyse des constructions appositives
This article proposes an empirical study on the naming strategies when referring to politicians. More precisely, our analysis is based on a corpus of journalistic articles in contemporary Spanish and gathers a homogeneous set of so-called loose appositive constructions referring to politicians. The aim of the study is to provide a structured analysis of the linguistic features of the appositive constructions collected in our database. We will argue that the 1084 appositive constructions examined present a predicational relationship. We distinguish between three types of predicational appositive constructions: specificational, attributional, and identificational constructions. Together with a description of the syntactic and semantic characteristics of these constructions, we also examine their informative role in journalistic articles. More precisely, we consider how syntax reflects a different semantic content and a different discourse function