1,873 research outputs found
Anti dependency distance minimization in short sequences: A graph theoretic approach
Dependency distance minimization (DDm) is a word order principle favouring the placement of syntactically related words close to each other in sentences. Massive evidence of the principle has been reported for more than a decade with the help of syntactic dependency treebanks where long sentences abound. However, it has been predicted theoretically that the principle is more likely to be beaten in short sequences by the principle of surprisal minimization (predictability maximization). Here we introduce a simple binomial test to verify such a hypothesis. In short sentences, we find anti-DDm for some languages from different families. Our analysis of the syntactic dependency structures suggests that anti-DDm is produced by star trees.Peer ReviewedPostprint (author's final draft
The optimality of syntactic dependency distances
It is often stated that human languages, as other biological systems, are
shaped by cost-cutting pressures but, to what extent? Attempts to quantify the
degree of optimality of languages by means of an optimality score have been
scarce and focused mostly on English. Here we recast the problem of the
optimality of the word order of a sentence as an optimization problem on a
spatial network where the vertices are words, arcs indicate syntactic
dependencies and the space is defined by the linear order of the words in the
sentence. We introduce a new score to quantify the cognitive pressure to reduce
the distance between linked words in a sentence. The analysis of sentences from
93 languages representing 19 linguistic families reveals that half of languages
are optimized to a 70% or more. The score indicates that distances are not
significantly reduced in a few languages and confirms two theoretical
predictions, i.e. that longer sentences are more optimized and that distances
are more likely to be longer than expected by chance in short sentences. We
present a new hierarchical ranking of languages by their degree of
optimization. The statistical advantages of the new score call for a
reevaluation of the evolution of dependency distance over time in languages as
well as the relationship between dependency distance and linguistic competence.
Finally, the principles behind the design of the score can be extended to
develop more powerful normalizations of topological distances or physical
distances in more dimensions
Optimality of syntactic dependency distances
It is often stated that human languages, as other biological systems, are shaped by cost-cutting pressures but, to what extent? Attempts to quantify the degree of optimality of languages by means of an optimality score have been scarce and focused mostly on English. Here we recast the problem of the optimality of the word order of a sentence as an optimization problem on a spatial network where the vertices are words, arcs indicate syntactic dependencies, and the space is defined by the linear order of the words in the sentence. We introduce a score to quantify the cognitive pressure to reduce the distance between linked words in a sentence. The analysis of sentences from 93 languages representing 19 linguistic families reveals that half of languages are optimized to a 70% or more. The score indicates that distances are not significantly reduced in a few languages and confirms two theoretical predictions: that longer sentences are more optimized and that distances are more likely to be longer than expected by chance in short sentences. We present a hierarchical ranking of languages by their degree of optimization. The score has implications for various fields of language research (dependency linguistics, typology, historical linguistics, clinical linguistics, and cognitive science). Finally, the principles behind the design of the score have implications for network science.Peer ReviewedPostprint (published version
Memory limitations are hidden in grammar
[Abstract] The ability to produce and understand an unlimited number of different sentences is a hallmark of human language. Linguists have sought to define the essence of this generative capacity using formal grammars that describe the syntactic dependencies between constituents, independent of the computational limitations of the human brain. Here, we evaluate this independence assumption by sampling sentences uniformly from the space of possible syntactic structures. We find that the average dependency distance between syntactically related words, a proxy for memory limitations, is less than expected by chance in a collection of state-of-the-art classes of dependency grammars. Our findings indicate that memory limitations have permeated grammatical descriptions, suggesting that it may be impossible to build a parsimonious theory of human linguistic productivity independent
of non-linguistic cognitive constraints
Semantic Sort: A Supervised Approach to Personalized Semantic Relatedness
We propose and study a novel supervised approach to learning statistical
semantic relatedness models from subjectively annotated training examples. The
proposed semantic model consists of parameterized co-occurrence statistics
associated with textual units of a large background knowledge corpus. We
present an efficient algorithm for learning such semantic models from a
training sample of relatedness preferences. Our method is corpus independent
and can essentially rely on any sufficiently large (unstructured) collection of
coherent texts. Moreover, the approach facilitates the fitting of semantic
models for specific users or groups of users. We present the results of
extensive range of experiments from small to large scale, indicating that the
proposed method is effective and competitive with the state-of-the-art.Comment: 37 pages, 8 figures A short version of this paper was already
published at ECML/PKDD 201
The linear arrangement library: A new tool for research on syntactic dependency structures
The new and growing field of Quantitative Dependency Syntax has emerged at the crossroads between Dependency Syntax and Quantitative Linguistics. One of the main concerns in this field is the statistical patterns of syntactic dependency structures. These structures, grouped in treebanks, are the source for statistical analyses in these and related areas; dozens of scores devised over the years are the tools of a new industry to search for patterns and perform other sorts of analyses. The plethora of such metrics and their increasing complexity require sharing the source code of the programs used to perform such analyses. However, such code is not often shared with the scientific community or is tested following unknown standards. Here we present a new open-source tool, the Linear Arrangement Library (LAL), which caters to the needs of, especially, inexperienced programmers. This tool enables the calculation of these metrics on single syntactic dependency structures, treebanks, and collection of treebanks, grounded on ease of use and yet with great flexibility. LAL has been designed to be efficient, easy to use (while satisfying the needs of all levels of programming expertise), reliable (thanks to thorough testing), and to unite research from different traditions, geographic areas, and research fields.LAP is supported by Secretaria d’Universitats i Recerca de la Generalitat de Catalunya and the Social European Fund. RFC and LAP are supported by the grant TIN2017-89244-R from MINECO (Ministerio de Economía, Industria y Competitividad). RFC is also supported by the recognition 2017SGR-856 (MACDA) from AGAUR (Generalitat de Catalunya). JLE is funded by the grant PID2019-109137GB-C22 from MINECO.Peer ReviewedPostprint (published version
Bounds of the sum of edge lengths in linear arrangements of trees
A fundamental problem in network science is the normalization of the
topological or physical distance between vertices, that requires understanding
the range of variation of the unnormalized distances. Here we investigate the
limits of the variation of the physical distance in linear arrangements of the
vertices of trees. In particular, we investigate various problems on the sum of
edge lengths in trees of a fixed size: the minimum and the maximum value of the
sum for specific trees, the minimum and the maximum in classes of trees (bistar
trees and caterpillar trees) and finally the minimum and the maximum for any
tree. We establish some foundations for research on optimality scores for
spatial networks in one dimension.Comment: Title changed at proof stag
- …