111,017 research outputs found

    A Corpus-Based Approach to Linguistic Function


    A cognitive approach to event structures in Middle Mongolian based on the corpus "The secret history of the Mongols"

    This dissertation, entitled “A Cognitive Approach to Event Structures in Middle Mongolian Based on the corpus ‘The Secret History of the Mongols’”, is an empirical study. It asks how the underlying linguistic knowledge, which manifests itself in the language use of one or more (here anonymous) authors, can be derived from a historically transmitted text corpus such as the “Secret History of the Mongols”. It methodically analyzes the totality of verb-related phenomena in Middle Mongolian as they appear in the text corpus of 29,396 lexemes, from a cognitive-typological point of view. A further aim is to provide a basic typology of verb formation by systematically examining the data for usage-based patterns in terms of their frequency and associated markedness, because these provide evidence for the degree of language usage. It can be assumed that no linguistic element is independent in its meaning from its textual environment, which is especially evident in phrasal units; each linguistic element thus has a meaning or function that is relevant to a larger embedding construction. The structure of a text can be compared to that of an “onion” in which all layers are related to each other as parts of a whole, with each layer expressing a linguistic sign. In this process, patterns are revealed in all layers, which in turn are analyzed for their semanticity.

    Text authorship identified using the dynamics of word co-occurrence networks

    The identification of authorship in disputed documents still requires human expertise, which is now unfeasible for many tasks owing to the large volumes of text and numbers of authors in practical applications. In this study, we introduce a methodology based on the dynamics of word co-occurrence networks representing written texts to classify a corpus of 80 texts by 8 authors. The texts were divided into sections with an equal number of linguistic tokens, from which time series were created for 12 topological metrics. The series were shown to be stationary (p-value > 0.05), which permits the use of distribution moments as learning attributes. With an optimized supervised learning procedure using a Radial Basis Function Network, 68 out of 80 texts were correctly classified, i.e. a remarkable 85% author matching success rate. Fluctuations in purely dynamic network metrics were therefore found to characterize authorship, opening the way for the description of texts in terms of small evolving networks. Moreover, the approach allows texts with diverse characteristics to be compared in a simple, fast fashion.
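The pipeline in this abstract (equal-sized sections, one network per section, a metric series, then distribution moments) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that adjacent tokens define co-occurrence, it tracks a single metric (average degree) rather than the 12 metrics mentioned, and it omits the classifier entirely. All function names are hypothetical.

```python
from statistics import mean, pvariance


def cooccurrence_graph(tokens):
    """Undirected graph linking each pair of adjacent, distinct tokens (an assumption)."""
    graph = {}
    for a, b in zip(tokens, tokens[1:]):
        graph.setdefault(a, set())
        graph.setdefault(b, set())
        if a != b:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def average_degree(graph):
    """Mean number of neighbours per node; one example of a topological metric."""
    return sum(len(neighbours) for neighbours in graph.values()) / len(graph)


def metric_series(tokens, section_size):
    """Average degree of the network built from each full, equal-sized section."""
    if section_size < 2:
        raise ValueError("section_size must be at least 2")
    starts = range(0, len(tokens) - section_size + 1, section_size)
    return [average_degree(cooccurrence_graph(tokens[i:i + section_size]))
            for i in starts]


def moments(series):
    """First two distribution moments, usable as learning attributes."""
    return mean(series), pvariance(series)


# Toy text: the first section cycles through three words, the second through two.
tokens = "a b c a b c a b a b a b".split()
series = metric_series(tokens, section_size=6)
print(series)           # [2.0, 1.0]
print(moments(series))  # (1.5, 0.25)
```

In a fuller version, each text would yield one such pair of moments per metric, and those values would form the feature vector passed to a supervised classifier.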

    Building a Corpus of 2L English for Automatic Assessment: the CLEC Corpus

    In this paper we describe the CLEC corpus, an ongoing project set up at the University of Cádiz with the purpose of building up a large corpus of English as a 2L, classified according to CEFR proficiency levels and designed to train statistical models for automatic proficiency assessment. The goal of this corpus is twofold: on the one hand, it will be used as a data resource for the development of automatic text classification systems and, on the other, it has been used as a means of teaching innovation techniques.

    Holistic corpus-based dialectology

    This paper is concerned with sketching future directions for corpus-based dialectology. We advocate a holistic approach to the study of geographically conditioned linguistic variability, and we present a suitable methodology, 'corpus-based dialectometry', in exactly this spirit. Specifically, we argue that in order to live up to the potential of the corpus-based method, practitioners need to (i) abandon their exclusive focus on individual linguistic features in favor of the study of feature aggregates, (ii) draw on computationally advanced multivariate analysis techniques (such as multidimensional scaling, cluster analysis, and principal component analysis), and (iii) aid interpretation of empirical results by marshalling state-of-the-art data visualization techniques. To exemplify this line of analysis, we present a case study which explores joint frequency variability of 57 morphosyntax features in 34 dialects all over Great Britain.
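The move from individual features to feature aggregates can be sketched as a pairwise distance computed over whole frequency profiles. The example below is only an illustration under stated assumptions: it uses Euclidean distance, which the abstract does not specify, and the dialect names and frequencies are invented placeholders, not data from the case study.

```python
import math


def aggregate_distance(freqs_a, freqs_b):
    """Euclidean distance between two feature-frequency profiles (assumed measure)."""
    if len(freqs_a) != len(freqs_b):
        raise ValueError("profiles must cover the same features")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(freqs_a, freqs_b)))


def distance_matrix(profiles):
    """Distance for every ordered pair of dialects, keyed by (name, name)."""
    names = sorted(profiles)
    return {(x, y): aggregate_distance(profiles[x], profiles[y])
            for x in names for y in names}


# Hypothetical frequencies of three features in three dialects.
profiles = {
    "dialect_A": [3.0, 0.0, 4.0],
    "dialect_B": [0.0, 0.0, 0.0],
    "dialect_C": [3.0, 0.0, 4.0],
}
matrix = distance_matrix(profiles)
print(matrix[("dialect_A", "dialect_B")])  # 5.0
print(matrix[("dialect_A", "dialect_C")])  # 0.0
```

A matrix of this kind is the usual input to techniques such as multidimensional scaling or cluster analysis, which then summarize the aggregate differences between dialects.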

    Token-based typology and word order entropy: A study based on universal dependencies

    The present paper discusses the benefits and challenges of token-based typology, which takes into account the frequencies of words and constructions in language use. This approach makes it possible to introduce new criteria for language classification, which would be difficult or impossible to achieve with the traditional, type-based approach. This point is illustrated by several quantitative studies of word order variation, which can be measured as entropy at different levels of granularity. I argue that this variation can be explained by general functional mechanisms and pressures, which manifest themselves in language use, such as optimization of processing (including avoidance of ambiguity) and grammaticalization of predictable units occurring in chunks. The case studies are based on multilingual corpora, which have been parsed using the Universal Dependencies annotation scheme.
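Measuring word order variation as entropy can be made concrete with a short sketch. It assumes the simplest setting: each token of a construction in a parsed corpus is labelled with its observed order, and Shannon entropy is computed over those labels. The labels and counts are hypothetical, and the paper's own granularity levels may differ.

```python
import math
from collections import Counter


def word_order_entropy(orders):
    """Shannon entropy, in bits, of a sequence of observed word-order labels."""
    counts = Counter(orders)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one observation is required")
    # Sum of p * log2(1/p) over the observed orders.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


# Hypothetical counts of verb-object (VO) and object-verb (OV) tokens.
rigid = ["VO"] * 100                   # one order only
flexible = ["VO"] * 50 + ["OV"] * 50   # both orders equally frequent

print(word_order_entropy(rigid))     # 0.0
print(word_order_entropy(flexible))  # 1.0
```

An entropy of 0 bits corresponds to fully rigid order, while 1 bit is the maximum for two competing orders, so a language can be placed on a continuous scale instead of being assigned to a single type.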

    Learning Language from a Large (Unannotated) Corpus

    A novel approach to the fully automated, unsupervised extraction of dependency grammars and associated syntax-to-semantic-relationship mappings from large text corpora is described. The suggested approach builds on the authors' prior work with the Link Grammar, RelEx and OpenCog systems, as well as on a number of prior papers and approaches from the statistical language learning literature. If successful, this approach would enable the mining of all the information needed to power a natural language comprehension and generation system, directly from a large, unannotated corpus. Comment: 29 pages, 5 figures, research proposal

    Les stratégies linguistiques utilisées pour renvoyer à des politiciens dans la presse écrite en espagnol: une analyse des constructions appositives

    This article presents an empirical study of the naming strategies used to refer to politicians. More precisely, our analysis is based on a corpus of journalistic articles in contemporary Spanish and gathers a homogeneous set of so-called loose appositive constructions referring to politicians. The aim of the study is to provide a structured analysis of the linguistic features of the appositive constructions collected in our database. We argue that the 1,084 appositive constructions examined present a predicational relationship. We distinguish between three types of predicational appositive constructions: specificational, attributional, and identificational. Together with a description of the syntactic and semantic characteristics of these constructions, we also examine their informative role in journalistic articles. More precisely, we consider how syntax reflects a different semantic content and a different discourse function.