Search CORE

111,017 research outputs found

A Corpus-Based Approach to Linguistic Function

Author: Webster Jonathan
Yan Hengbin
Publication venue: Department of English, National Chengchi University
Publication date: 01/01/2013

A cognitive approach to event structures in Middle Mongolian based on the corpus "The secret history of the Mongols"

Author: Narmandakh Enkhmaa
Publication venue: Ludwig-Maximilians-Universität München
Publication date: 09/07/2018

This dissertation entitled “A Cognitive Approach to Event Structures in Middle Mongolian Based on the corpus ‘The Secret History of the Mongols’” considers itself an empirical study. It deals with the question of how the underlying language knowledge, which manifests itself in the language usage of the one or more (here: anonymous) authors, can be derived from a historically transmitted text corpus such as the “Secret History of the Mongols”. It intends to analyze methodically the totality of verb-related phenomena in Middle Mongolian as they appear in the text corpus of 29,396 lexemes from a cognitive-typological point of view. The aim was also to provide a basic typology of verb formation by systematically examining the data for usage-based patterns in terms of their frequency as well as the associated markedness, because this provides evidence for the degree of language usage. It can be assumed that no linguistic element is independent in its meaning from its textual environment, which is especially evident in the phrase units; thus each linguistic element has a meaning/function that is relevant to a larger embedding construction. The structure of a text can be compared to the structure of an “onion” in which all layers are related to each other as part of a whole, with each layer expressing a linguistic sign. In this process, patterns are revealed in all layers, which in turn are analyzed for their semanticity

Digitale Hochschulschriften der LMU

Text authorship identified using the dynamics of word co-occurrence networks

Author: Akimushkin Camilo
Amancio Diego R.
Oliveira Jr Osvaldo N.
Publication venue: Public Library of Science (PLoS)
Publication date: 29/07/2016

The identification of authorship in disputed documents still requires human expertise, which is now unfeasible for many tasks owing to the large volumes of text and authors in practical applications. In this study, we introduce a methodology based on the dynamics of word co-occurrence networks representing written texts to classify a corpus of 80 texts by 8 authors. The texts were divided into sections with equal numbers of linguistic tokens, from which time series were created for 12 topological metrics. The series were proven to be stationary (p-value>0.05), which permits the use of distribution moments as learning attributes. With an optimized supervised learning procedure using a Radial Basis Function Network, 68 out of 80 texts were correctly classified, i.e. a remarkable 85% author matching success rate. Therefore, fluctuations in purely dynamic network metrics were found to characterize authorship, thus opening the way for the description of texts in terms of small evolving networks. Moreover, the approach introduced allows for comparison of texts with diverse characteristics in a simple, fast fashion
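
The pipeline described in this abstract (equal-sized sections, one network per section, a time series per metric, distribution moments as attributes) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: linking adjacent tokens, using mean degree as the only metric, and the toy sentence are all assumptions made here, whereas the abstract reports 12 topological metrics and a Radial Basis Function Network classifier, neither of which is reproduced.

```python
from collections import defaultdict
from statistics import mean, pvariance


def cooccurrence_graph(tokens):
    """Build an undirected graph linking adjacent tokens (assumed window of 2)."""
    graph = defaultdict(set)
    for left, right in zip(tokens, tokens[1:]):
        if left != right:
            graph[left].add(right)
            graph[right].add(left)
    return graph


def mean_degree(graph):
    """One illustrative topological metric: average number of neighbours."""
    return mean(len(neighbours) for neighbours in graph.values()) if graph else 0.0


def metric_series(tokens, section_size):
    """Split the text into equal-sized sections and measure each section's network.

    A trailing section shorter than section_size is dropped.
    """
    starts = range(0, len(tokens) - section_size + 1, section_size)
    return [mean_degree(cooccurrence_graph(tokens[s:s + section_size])) for s in starts]


def moment_features(series):
    """Summarise a metric's time series by its first two distribution moments."""
    return {"mean": mean(series), "variance": pvariance(series)}


# Toy text, invented for illustration only.
tokens = "the cat sat on the mat and the dog sat on the rug".split()
series = metric_series(tokens, section_size=6)
features = moment_features(series)

# The success rate quoted in the abstract: 68 of 80 texts.
accuracy = 68 / 80

print(series, features, accuracy)
```

In a fuller version, each text would yield one such feature dictionary per metric, and those attributes would be passed to a supervised classifier.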

arXiv.org e-Print Archive

Public Library of Science (PLOS)

Directory of Open Access Journals

PubMed Central

FigShare

Building a Corpus of 2L English for Automatic Assessment: the CLEC Corpus

Author: Calderón López María Isabel
Merino Ferradá María del Carmen
Noya Gallardo María del Carmen
Zarco Tejada María Ángeles
Publication venue: Elsevier BV
Publication date: 01/01/2015

In this paper we describe the CLEC corpus, an ongoing project set up at the University of Cádiz with the purpose of building up a large corpus of English as a 2L classified according to CEFR proficiency levels and formed to train statistical models for automatic proficiency assessment. The goal of this corpus is twofold: on the one hand it will be used as a data resource for the development of automatic text classification systems and, on the other, it has been used as a means of teaching innovation techniques

Elsevier - Publisher Connector

Repositorio de Objetos de Docencia e Investigación de la Universidad de Cádiz

Holistic corpus-based dialectology

This paper is concerned with sketching future directions for corpus-based dialectology. We advocate a holistic approach to the study of geographically conditioned linguistic variability, and we present a suitable methodology, 'corpus-based dialectometry', in exactly this spirit. Specifically, we argue that in order to live up to the potential of the corpus-based method, practitioners need to (i) abandon their exclusive focus on individual linguistic features in favor of the study of feature aggregates, (ii) draw on computationally advanced multivariate analysis techniques (such as multidimensional scaling, cluster analysis, and principal component analysis), and (iii) aid interpretation of empirical results by marshalling state-of-the-art data visualization techniques. To exemplify this line of analysis, we present a case study which explores joint frequency variability of 57 morphosyntax features in 34 dialects all over Great Britain
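
Step (i) above, moving from individual features to feature aggregates, can be illustrated by collapsing each dialect's feature frequencies into a single pairwise distance. The sketch below is only an assumption-laden illustration: the dialect names, the three frequencies per dialect, and the choice of Euclidean distance are hypothetical and are not taken from the paper, which works with 57 features and 34 dialects.

```python
from itertools import combinations
from math import dist

# Hypothetical frequencies of three features in three dialects.
frequencies = {
    "dialect_a": [12.0, 3.0, 7.0],
    "dialect_b": [11.0, 4.0, 6.0],
    "dialect_c": [2.0, 9.0, 1.0],
}


def distance_matrix(profiles):
    """Aggregate all features into one Euclidean distance per dialect pair."""
    return {
        (a, b): dist(profiles[a], profiles[b])
        for a, b in combinations(sorted(profiles), 2)
    }


distances = distance_matrix(frequencies)
closest_pair = min(distances, key=distances.get)
print(closest_pair)
```

A matrix of this kind is the usual input to the multivariate techniques named in step (ii), such as multidimensional scaling or cluster analysis.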

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Crossref

Directory of Open Access Journals

The University of Manchester - Institutional Repository

Token-based typology and word order entropy: A study based on universal dependencies

Author: Levshina N.
Publication venue: Walter de Gruyter GmbH
Publication date: 01/01/2019

The present paper discusses the benefits and challenges of token-based typology, which takes into account the frequencies of words and constructions in language use. This approach makes it possible to introduce new criteria for language classification, which would be difficult or impossible to achieve with the traditional, type-based approach. This point is illustrated by several quantitative studies of word order variation, which can be measured as entropy at different levels of granularity. I argue that this variation can be explained by general functional mechanisms and pressures, which manifest themselves in language use, such as optimization of processing (including avoidance of ambiguity) and grammaticalization of predictable units occurring in chunks. The case studies are based on multilingual corpora, which have been parsed using the Universal Dependencies annotation scheme
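
Measuring word order variation as entropy can be made concrete with Shannon entropy over the observed orderings of a construction, H = sum of p * log2(1/p). The sketch below assumes a two-way contrast between verb-object ("VO") and object-verb ("OV") tokens; the labels and counts are invented for illustration and do not come from the paper or from any Universal Dependencies treebank.

```python
from collections import Counter
from math import log2


def order_entropy(orders):
    """Shannon entropy (in bits) of the observed word order tokens."""
    counts = Counter(orders)
    total = sum(counts.values())
    return sum((c / total) * log2(total / c) for c in counts.values())


# Hypothetical token counts for two imaginary languages.
rigid = ["VO"] * 100                   # only one order is attested
flexible = ["VO"] * 50 + ["OV"] * 50   # both orders are equally frequent

print(order_entropy(rigid), order_entropy(flexible))
```

Under this definition a language with a single attested order has zero entropy, and an even split between two orders reaches the maximum of one bit, which is what makes the measure token-based: it depends on frequencies in use, not on a categorical type label.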

MPG.PuRe

Learning Language from a Large (Unannotated) Corpus

Author: Goertzel Ben
Vepstas Linas
Publication date: 14/01/2014

A novel approach to the fully automated, unsupervised extraction of dependency grammars and associated syntax-to-semantic-relationship mappings from large text corpora is described. The suggested approach builds on the authors' prior work with the Link Grammar, RelEx and OpenCog systems, as well as on a number of prior papers and approaches from the statistical language learning literature. If successful, this approach would enable the mining of all the information needed to power a natural language comprehension and generation system, directly from a large, unannotated corpus. Comment: 29 pages, 5 figures, research proposal

arXiv.org e-Print Archive

CiteSeerX

Les stratégies linguistiques utilisées pour renvoyer à des politiciens dans la presse écrite en espagnol: une analyse des constructions appositives

Author: Vande Casteele An
Publication venue: Cádiz: Universidad de Cádiz, Servicio de Publicaciones
Publication date: 01/01/2010

This article proposes an empirical study of the naming strategies used when referring to politicians. More precisely, our analysis is based on a corpus of journalistic articles in contemporary Spanish and gathers a homogeneous set of so-called loose appositive constructions referring to politicians. The aim of the study is to provide a structured analysis of the linguistic features of the appositive constructions collected in our database. We will argue that the 1084 appositive constructions examined present a predicational relationship. We distinguish between three types of predicational appositive constructions: specificational, attributional, and identificational constructions. Together with a description of the syntactic and semantic characteristics of these constructions, we also examine their informative role in journalistic articles. More precisely, we consider how syntax reflects a different semantic content and a different discourse function

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Repositorio de Objetos de Docencia e Investigación de la Universidad de Cádiz