10,140 research outputs found
The "handedness" of language: Directional symmetry breaking of sign usage in words
Language, which allows complex ideas to be communicated through symbolic
sequences, is a characteristic feature of our species and manifested in a
multitude of forms. Using large written corpora for many different languages
and scripts, we show that the occurrence probability distributions of signs at
the left and right ends of words have a distinct heterogeneous nature.
Characterizing this asymmetry using quantitative inequality measures, viz.
information entropy and the Gini index, we show that the beginning of a word is
less restrictive in sign usage than the end. This property is not simply
attributable to the use of common affixes as it is seen even when only word
roots are considered. We use the existence of this asymmetry to infer the
direction of writing in undeciphered inscriptions that agrees with the
archaeological evidence. Unlike traditional investigations of phonotactic
constraints which focus on language-specific patterns, our study reveals a
property valid across languages and writing systems. As both language and
writing are unique aspects of our species, this universal signature may reflect
an innate feature of the human cognitive phenomenon.
Comment: 10 pages, 4 figures + Supplementary Information (15 pages, 8
figures), final corrected version
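The two inequality measures named in this abstract can be illustrated on a toy word list. The following is a minimal Python sketch, not the authors' code: the function names (`entropy`, `gini`, `end_statistics`) and the sample words are assumptions made for illustration, and counts are taken over the full set of signs seen in the list so that unused signs contribute zeros.

```python
from collections import Counter
from math import log2


def entropy(counts):
    """Shannon entropy (in bits) of a list of non-negative counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


def gini(counts):
    """Gini index of a list of non-negative counts (0 = perfectly even)."""
    xs = sorted(counts)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard rank-weighted formula with values in ascending order.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def end_statistics(words):
    """Entropy and Gini index of sign usage at the first and last positions."""
    words = [w for w in words if w]
    alphabet = sorted({sign for w in words for sign in w})
    first = Counter(w[0] for w in words)
    last = Counter(w[-1] for w in words)
    result = {}
    for name, counter in (("first", first), ("last", last)):
        counts = [counter.get(sign, 0) for sign in alphabet]
        result[name] = {"entropy": entropy(counts), "gini": gini(counts)}
    return result


if __name__ == "__main__":
    # Hypothetical toy data: each word is a string of single-character signs.
    print(end_statistics(["ab", "cb", "db"]))
```

On the toy list, the first position is spread over three signs while the last is always `b`, so the start shows higher entropy and a lower Gini index than the end, which is the direction of asymmetry the abstract reports.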
Network analysis of a corpus of undeciphered Indus civilization inscriptions indicates syntactic organization
Archaeological excavations in the sites of the Indus Valley civilization
(2500-1900 BCE) in Pakistan and northwestern India have unearthed a large
number of artifacts with inscriptions made up of hundreds of distinct signs. To
date there is no generally accepted decipherment of these sign sequences and
there have been suggestions that the signs could be non-linguistic. Here we
apply complex network analysis techniques to a database of available Indus
inscriptions, with the aim of detecting patterns indicative of syntactic
organization. Our results show the presence of patterns, e.g., recursive
structures in the segmentation trees of the sequences, that suggest the
existence of a grammar underlying these inscriptions.
Comment: 17 pages (includes 4 page appendix containing Indus sign list), 14
figures
Language and Dialect Identification of Cuneiform Texts
This article introduces a corpus of cuneiform texts from which the dataset
for the use of the Cuneiform Language Identification (CLI) 2019 shared task was
derived as well as some preliminary language identification experiments
conducted using that corpus. We also describe the CLI dataset and how it was
derived from the corpus. In addition, we provide some baseline language
identification results using the CLI dataset. To the best of our knowledge, the
experiments detailed here are the first time automatic language identification
methods have been used on cuneiform data.
Statistical analysis of the tables in Mahadevan’s Concordance of the Indus Valley Script
The Indus Script originates from the culture known as the Indus Valley Civilization, which flourished from approximately 2600 to 1900 BC. Several thousand objects bearing these signs have been found over a wide area of Northern India and Pakistan. In 1977 Iravatham Mahadevan published a concordance of all of the scripts that had been discovered so far. Accompanying the concordance is a set of 9 tables showing the distribution of individual signs by position, archaeological site, object type, field symbol (accompanying image), and direction of writing. Analysis of the frequencies of the signs found so far using Large Numbers of Rare Events (LNRE) models enabled the total vocabulary of the language, including signs not yet found, to be estimated at about 857. All the tables were analysed using Pearson’s residuals, and it was found that the signs were not randomly distributed: some showed statistically significant associations with position, object, field symbol or direction of writing. A more detailed analysis of the relation between signs and field symbols was made using correspondence analysis, which showed that certain signs were associated with the unicorn symbol, while others were associated with the gharial and dotted circle symbols.
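Pearson residuals for a contingency table compare each observed count with the count expected under independence of rows and columns. The sketch below is an illustration only, with a hypothetical function name and made-up counts (they are not taken from the concordance tables); it assumes every row and column total is positive.

```python
from math import sqrt


def pearson_residuals(table):
    """Return (observed - expected) / sqrt(expected) for each cell.

    `table` is a list of rows of counts, e.g. signs (rows) by
    field symbols (columns). Expected counts assume independence:
    expected = row_total * column_total / grand_total.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    residuals = []
    for i, row in enumerate(table):
        out_row = []
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            out_row.append((observed - expected) / sqrt(expected))
        residuals.append(out_row)
    return residuals


if __name__ == "__main__":
    # Hypothetical counts: two signs (rows) against two field symbols (columns).
    for row in pearson_residuals([[10, 20], [20, 10]]):
        print([round(r, 3) for r in row])
```

Cells whose residual is large in absolute value (a common rule of thumb is beyond about 2) are the ones that depart most from a random distribution of signs across categories.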
Statistical analysis of the Indus script using n-grams
The Indus script is one of the major undeciphered scripts of the ancient
world. The small size of the corpus, the absence of bilingual texts, and the
lack of definite knowledge of the underlying language have frustrated efforts at
decipherment since the discovery of the remains of the Indus civilisation.
Recently, some researchers have questioned the premise that the Indus script
encodes spoken language. Building on previous statistical approaches, we apply
the tools of statistical language processing, specifically n-gram Markov
chains, to analyse the Indus script for syntax. Our main results are that the
script has well-defined signs which begin and end texts, that there is
directionality and strong correlations in the sign order, and that there are
groups of signs which appear to have identical syntactic function. All these
require no a priori suppositions regarding the syntactic or semantic
content of the signs, but follow directly from the statistical analysis. Using
information theoretic measures, we find the information in the script to be
intermediate between that of a completely random and a completely fixed
ordering of signs. Our study reveals that the Indus script is a structured sign
system showing features of a formal language, but, at present, cannot
conclusively establish that it encodes natural language. Our n-gram
Markov model is useful for predicting signs which are missing or illegible in a
corpus of Indus texts. This work forms the basis for the development of a
stochastic grammar which can be used to explore the syntax of the Indus script
in greater detail.
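The idea of using an n-gram Markov model to restore a missing or illegible sign can be sketched for the simplest case, n = 2. This is an illustrative Python sketch under stated assumptions, not the authors' model: the sign labels, the function names, and the add-one smoothing are all hypothetical choices made here.

```python
from collections import Counter, defaultdict

START, END = "<s>", "</s>"  # markers for the beginning and end of a text


def train_bigrams(texts):
    """Count sign-to-sign transitions, including text-initial and text-final ones."""
    counts = defaultdict(Counter)
    vocab = set()
    for text in texts:
        vocab.update(text)
        seq = [START, *text, END]
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1
    return counts, vocab


def bigram_prob(counts, vocab, prev, cur, alpha=1.0):
    """Smoothed estimate of P(cur | prev); alpha=1 is add-one smoothing (an assumption)."""
    following = counts.get(prev, Counter())
    outcomes = len(vocab) + 1  # any sign, or the end of the text
    return (following[cur] + alpha) / (sum(following.values()) + alpha * outcomes)


def predict_missing(counts, vocab, left, right):
    """Pick the sign s that maximises P(s | left) * P(right | s)."""
    return max(
        sorted(vocab),
        key=lambda s: bigram_prob(counts, vocab, left, s)
        * bigram_prob(counts, vocab, s, right),
    )


if __name__ == "__main__":
    # Hypothetical corpus: each text is a list of placeholder sign labels.
    corpus = [["A", "B", "C"], ["A", "B", "D"], ["E", "B", "C"], ["A", "F", "C"]]
    counts, vocab = train_bigrams(corpus)
    print(predict_missing(counts, vocab, "A", "C"))
```

The `START` and `END` markers let the same counts capture which signs tend to begin and end texts, and passing them as `left` or `right` handles a damaged sign at either edge of an inscription.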
Iravatham Mahadevan’s Reading of Indus Script: A Critical Review
This paper comprehensively summarizes, analyses, and reviews Iravatham Mahadevan’s attempts to decipher the Indus script. Spanning a period of over thirty-five years, Iravatham Mahadevan made continuous attempts to interpret and decipher the Indus script. Mahadevan claimed to have adapted the method of parallels between the symbolic representation and the text, between the written object and its designation, between the written symbol itself and its meaning, and the similarity throughout the ancient East of certain portions of the inscriptions, with the assumption that the underlying language of the script is Dravidian. Mahadevan was very flexible in changing his views and finding new interpretations, and gradually he shifted his interpretation of Indus signs from being phonetic/logographic/word to ideographic, leaving unshaken his core personal hypothesis and belief in the Veḷier clan and Tamil cultural settings. While Mahadevan did not succeed in making a self-consistent system of readings applicable to a large number of discovered pieces of writings, he did make a determined, persistent effort to develop a Dravidian framework for deciphering the Indus script. This study seeks to find weaknesses in the methodology and assumptions of Mahadevan and searches for possible alternatives within that framework.
Data Mining Ancient Script Image Data Using Convolutional Neural Networks
The recent surge in ancient scripts has resulted in huge image libraries of ancient texts. Data mining of the collected images enables the study of the evolution of these ancient scripts. In particular, the origin of the Indus Valley script is highly debated. We use convolutional neural networks to test which Phoenician alphabet letters and Brahmi symbols are closest to the Indus Valley script symbols. Surprisingly, our analysis shows that overall the Phoenician alphabet is much closer than the Brahmi script to the Indus Valley script symbols.
A method of identifying allographs in undeciphered scripts and its application to the Indus Valley Script
This work describes a general method of testing for redundancies in the sign lists of ancient scripts by data mining the positions of the signs within the inscriptions. The redundant signs are allographs of the same grapheme. The method is applied to the undeciphered Indus Valley Script, which stands out from other ancient scripts by having a large proposed sign list that contains dozens of asymmetric signs that have mirrored pairs. By a statistical analysis of mirrored asymmetric signs, this paper shows that the Indus Valley Script was multi-directional and that the mirroring of signs often denotes only the direction of writing without any difference in meaning. For this and five other specific reasons listed in the paper, 50 pairs of signs (23 mirrored and 27 non-mirrored) can be grouped together because each pair consists of only insignificant variations of the same original sign. The reduced sign list may make decipherment easier in the future.