Word segmentation for Akkadian cuneiform
We present experiments on word segmentation for Akkadian cuneiform, an ancient writing system and a language used for about three millennia in the ancient Near East. To the best of our knowledge, this is the first study of this kind applied to either the Akkadian language or the cuneiform writing system. As a logosyllabic writing system, cuneiform structurally resembles East Asian writing systems, so we employ word segmentation algorithms originally developed for Chinese and Japanese. We describe results of rule-based algorithms, dictionary-based algorithms, and statistical and machine learning approaches. Our results indicate possible promising steps in cuneiform word segmentation that can create and improve natural language processing in this area.
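As an illustration of the dictionary-based family mentioned in this abstract, the following is a minimal sketch of forward maximum matching, a greedy longest-match technique commonly used for Chinese word segmentation. The abstract does not say which dictionary-based algorithms the authors used, so this is an assumed example rather than their method. The function name, the `max_len` parameter, and the toy lexicon of placeholder signs are all hypothetical.

```python
def forward_max_match(signs, lexicon, max_len=4):
    """Greedy longest-match segmentation over a sequence of signs.

    `lexicon` is a set of tuples of signs; unknown signs become
    single-sign words. This is an illustrative sketch only.
    """
    words = []
    i = 0
    while i < len(signs):
        # Try the longest candidate first; fall back to a single sign.
        for size in range(min(max_len, len(signs) - i), 0, -1):
            candidate = tuple(signs[i:i + size])
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy lexicon: letters stand in for cuneiform signs.
TOY_LEXICON = {("A", "B"), ("B", "C"), ("C", "D", "E")}
signs = ["A", "B", "C", "D", "E", "F"]
segmented = forward_max_match(signs, TOY_LEXICON)
print(segmented)
```

Because the match is greedy and left to right, the sequence is split as A-B, C-D-E, F even though B-C is also in the lexicon. That ambiguity is one reason statistical approaches are often compared against dictionary lookups.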
Language and Dialect Identification of Cuneiform Texts
This article introduces a corpus of cuneiform texts from which the dataset
for the Cuneiform Language Identification (CLI) 2019 shared task was
derived, as well as some preliminary language identification experiments
conducted using that corpus. We also describe the CLI dataset and how it was
derived from the corpus. In addition, we provide some baseline language
identification results using the CLI dataset. To the best of our knowledge, the
experiments detailed here are the first in which automatic language identification
methods have been applied to cuneiform data.
Machine learning for ancient languages: a survey
Ancient languages preserve the cultures and histories of the past. However, their study is fraught with difficulties, and experts must tackle a range of challenging text-based tasks, from deciphering lost languages to restoring damaged inscriptions, to determining the authorship of works of literature. Technological aids have long supported the study of ancient texts, but in recent years advances in artificial intelligence and machine learning have enabled analyses at a scale and level of detail that are reshaping the field of humanities, similarly to how microscopes and telescopes have contributed to the realm of science. This article aims to provide a comprehensive survey of published research using machine learning for the study of ancient texts written in any language, script, and medium, spanning over three and a half millennia of civilizations around the ancient world. To analyze the relevant literature, we introduce a taxonomy of tasks inspired by the steps involved in the study of ancient documents: digitization, restoration, attribution, linguistic analysis, textual criticism, translation, and decipherment. This work offers three major contributions: first, mapping the interdisciplinary field carved out by the synergy between the humanities and machine learning; second, highlighting how active collaboration between specialists from both fields is key to producing impactful and compelling scholarship; third, highlighting promising directions for future work in this field. Thus, this work promotes and supports the continued collaborative impetus between the humanities and machine learning.
Restoration of Fragmentary Babylonian Texts Using Recurrent Neural Networks
The main source of information regarding ancient Mesopotamian history and
culture is clay cuneiform tablets. Despite being an invaluable resource, many
tablets are fragmented, leading to missing information. Currently, these missing
parts are manually completed by experts. In this work we investigate the
possibility of assisting scholars and even automatically completing the breaks
in ancient Akkadian texts from Achaemenid period Babylonia by modelling the
language using recurrent neural networks.
Text segmentation for analysing different languages
Over the past several years, researchers have applied different methods of text segmentation. Text segmentation is defined as a method of splitting a document into smaller segments, each assumed to carry its own relevant meaning. Those segments can be tags, words, sentences, topics, phrases, or any other information unit. This study first reviews the different types of text segmentation methods used in different types of documents, and then discusses the various reasons for using text segmentation in opinion mining. The main contribution of this study is a summary of research papers from the past 10 years that applied text segmentation as their main approach to text analysis. Results show that word segmentation was successfully and widely used for processing different languages.
Chapter 26 Language technology approach to “seeing” in Akkadian
One of the ways meanings of words can be understood is based on their distributional properties. Such methodology offers an interesting quantitative viewpoint on the study of the lexicography of long-extinct languages. This chapter explores the use of Pointwise Mutual Information (PMI), a well-known statistical word association measure used in collocation analysis. PMI is applied to the data in order to gain insights on the semantic nuances of Akkadian verbs of seeing (amāru, naṭālu, palāsu, dagālu, ḫiātu, barû, and subbû). To evaluate the data-driven results, the findings are compared to previous philological work by Ainsley Dicks. The analysis of the top-ranked PMI-extracted collocates provides a good overview of the typical semantic differences between the seven verbs of interest.
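For readers unfamiliar with PMI, the standard definition is PMI(x, y) = log2( p(x, y) / (p(x) p(y)) ), which is positive when two items co-occur more often than chance would predict. The sketch below computes it from a list of (word, collocate) observations. The chapter's exact PMI variant, window size, and preprocessing are not given in the abstract, so the function name `pmi_scores` and the toy data (placeholder labels, not Akkadian) are assumptions for illustration only.

```python
import math
from collections import Counter


def pmi_scores(pairs):
    """Return base-2 PMI for each observed (word, collocate) pair.

    Probabilities are plain relative frequencies over `pairs`;
    no smoothing or frequency threshold is applied in this sketch.
    """
    pair_counts = Counter(pairs)
    word_counts = Counter(word for word, _ in pairs)
    coll_counts = Counter(coll for _, coll in pairs)
    total = len(pairs)
    scores = {}
    for (word, coll), n in pair_counts.items():
        p_joint = n / total
        p_word = word_counts[word] / total
        p_coll = coll_counts[coll] / total
        scores[(word, coll)] = math.log2(p_joint / (p_word * p_coll))
    return scores


# Hypothetical toy observations: "v1"/"v2" stand in for verbs,
# "x"/"y" for collocates.
toy_pairs = [("v1", "x"), ("v1", "x"), ("v1", "y"), ("v2", "y")]
scores = pmi_scores(toy_pairs)
for pair, value in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(value, 3))
```

Ranking collocates by this score for each verb gives the kind of top-ranked list the abstract refers to. In practice PMI tends to favour rare pairs, so a minimum-frequency cutoff is often added.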
Investigating Machine Learning Methods for Language and Dialect Identification of Cuneiform Texts
Identification of the languages written using cuneiform symbols is a
difficult task due to the lack of resources and the problem of tokenization.
The Cuneiform Language Identification task in VarDial 2019 addresses the
problem of identifying seven languages and dialects written in cuneiform,
namely Sumerian and six dialects of the Akkadian language: Old Babylonian, Middle
Babylonian Peripheral, Standard Babylonian, Neo-Babylonian, Late Babylonian,
and Neo-Assyrian. This paper describes the approaches taken by the SharifCL team to
this problem in VarDial 2019. The best result belongs to an ensemble of Support
Vector Machines and a naive Bayes classifier, both working on character-level
features, with a macro-averaged F1-score of 72.10%.
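To make "a naive Bayes classifier working on character-level features" concrete, here is a minimal sketch of a multinomial naive Bayes classifier over character bigrams with add-one smoothing. It covers only the naive Bayes half of the idea and is not the SharifCL system: the abstract does not specify the n-gram order, smoothing, or ensembling scheme, so the class `CharNgramNB`, its parameters, and the toy training strings are all assumptions.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=2):
    """Return the overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class CharNgramNB:
    """Multinomial naive Bayes over character n-grams (add-one smoothing)."""

    def __init__(self, n=2):
        self.n = n
        self.ngram_counts = defaultdict(Counter)  # label -> n-gram counts
        self.doc_counts = Counter()               # label -> number of texts
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.ngram_counts[label].update(grams)
            self.doc_counts[label] += 1
            self.vocab.update(grams)
        return self

    def predict(self, text):
        grams = char_ngrams(text, self.n)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.ngram_counts.items():
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus smoothed log likelihood of each n-gram.
            score = math.log(self.doc_counts[label] / total_docs)
            for gram in grams:
                score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy data: Latin letters stand in for cuneiform signs.
train_texts = ["ababab", "abab", "cdcdcd", "dcdc"]
train_labels = ["LANG_A", "LANG_A", "LANG_B", "LANG_B"]
model = CharNgramNB(n=2).fit(train_texts, train_labels)
print(model.predict("baba"), model.predict("cdc"))
```

Character n-grams sidestep the tokenization problem the abstract mentions, since no word boundaries are needed to extract features.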