34 research outputs found

Finite-State Locality in Semitic Root-and-Pattern Morphology

Author: Dolatian Hossep
Rawski Jonathan
Publication venue: ScholarlyCommons
Publication date: 01/10/2020

This paper discusses the generative capacity required for Semitic root-and-pattern morphology. Finite-state methods effectively compute concatenative morpho-phonology, and can be restricted to Strictly Local functions. We extend these methods to consider non-concatenative morphology. We show that over such multi-input functions, Strict Locality is necessary and sufficient. We discuss some consequences of this generalization for linguistic theories of the morphological template.

ScholarlyCommons@Penn

Multi-Input Strictly Local Functions for Templatic Morphology

Author: Dolatian Hossep
Rawski Jonathan
Publication venue: ScholarWorks@UMass Amherst
Publication date: 01/01/2020

This paper presents an automata-theoretic characterization of templatic morphology. We generalize the Input Strictly Local class of functions, which characterize a majority of concatenative morphology, to consider multiple lexical inputs. We show that strictly local asynchronous multi-tape transducers successfully capture this typology of nonconcatenative template filling. This characterization and restriction uniquely opens up representational issues in morphological computation.

ScholarWorks@UMass Amherst

Strong Generative Capacity of Morphological Processes

Author: Dolatian Hossep
Heinz Jeffrey
Rawski Jonathan
Publication venue: ScholarWorks@UMass Amherst
Publication date: 01/01/2021

Morphological processes are generally computable with 1-way finite-state transducers. However, we show that 1-way transducers do not capture the strong generative capacity of certain morphological analyses for more complex processes, including mobile affixation, infixation, and partial reduplication. As diagnostics for strong generative capacity, we use origin semantics and order-preservation. These analyze the input-output correspondences generated by finite-state transducers and their corresponding logical transductions. For some linguistic analyses of these complex processes, their strong generative capacity is matched by more expressive grammars, such as non-order-preserving transductions and their corresponding 2-way finite-state transducers

ScholarWorks@UMass Amherst

Finite-State Technology as a Programming Environment

Author: C.D. Johnson
G. Noord van
J. Daciuk
J.W. Amtrup
K.R. Beesley
M. Holzer
M. Mohri
M. Silberztein
R.C. Carrasco
R.M. Kaplan
Y. Cohen-Sygal
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2007

Crossref

Finite-State Description of Vietnamese Reduplication

Author: Le Hong Phuong
Nguyen Thi Minh Huyen
Roussanaly Azim
Publication venue: HAL CCSD
Publication date: 01/01/2009

We present for the first time a computational model of reduplication in Vietnamese. Reduplication is a common phenomenon in Vietnamese in which reduplicative words are formed by combining multiple syllables with similar sounds. We first give a systematic study of Vietnamese reduplicative words, bringing into focus clear principles for the formation of a large class of bi-syllabic reduplicative words. We then use optimal finite-state devices, in particular minimal sequential string-to-string transducers, to build a computational model for very efficient recognition and production of those words. Finally, we discuss several applications of this computational model.

Crossref

INRIA a CCSD electronic archive server

RedTyp: A Database of Reduplication with Computational Models

Author: Dolatian Hossep
Heinz Jeffrey
Publication venue: ScholarWorks@UMass Amherst
Publication date: 01/01/2019

Reduplication is a theoretically and typologically well-studied phenomenon, but there is no database of reduplication patterns which include explicit computational models. This paper introduces RedTyp, an SQL database which provides a computational resource that can be used by both theoretical and computational linguists who work on reduplication. It catalogs 138 reduplicative morphemes across 91 languages, which are modeled with 57 distinct finite-state machines. The finite-state machines are 2-way transducers, which provide an explicit, compact, and convenient representation for reduplication patterns, and which arguably capture the linguistic generalizations more directly than the more commonly used 1-way transducers for modeling natural language morphophonology

ScholarWorks@UMass Amherst

Computational morphology and Bantu language learning:an implementation for Runyakitara

Author: Katushemererwe Fridah
Publication venue: s.n.
Publication date: 01/01/2013

ARTS repository - University of Groningen

Probabilistic Modelling of Morphologically Rich Languages

Author: Botha Jan A.
Publication venue
Publication date: 01/01/2014

This thesis investigates how the sub-structure of words can be accounted for in probabilistic models of language. Such models play an important role in natural language processing tasks such as translation or speech recognition, but often rely on the simplistic assumption that words are opaque symbols. This assumption does not fit morphologically complex languages well, where words can have rich internal structure and sub-word elements are shared across distinct word forms. Our approach is to encode basic notions of morphology into the assumptions of three different types of language models, with the intention that leveraging shared sub-word structure can improve model performance and help overcome data sparsity that arises from morphological processes. In the context of n-gram language modelling, we formulate a new Bayesian model that relies on the decomposition of compound words to attain better smoothing, and we develop a new distributed language model that learns vector representations of morphemes and leverages them to link together morphologically related words. In both cases, we show that accounting for word sub-structure improves the models' intrinsic performance and provides benefits when applied to other tasks, including machine translation. We then shift the focus beyond the modelling of word sequences and consider models that automatically learn what the sub-word elements of a given language are, given an unannotated list of words. We formulate a novel model that can learn discontiguous morphemes in addition to the more conventional contiguous morphemes that most previous models are limited to. This approach is demonstrated on Semitic languages, and we find that modelling discontiguous sub-word structures leads to improvements in the task of segmenting words into their contiguous morphemes. Comment: DPhil thesis, University of Oxford, submitted and accepted 2014. http://ora.ox.ac.uk/objects/uuid:8df7324f-d3b8-47a1-8b0b-3a6feb5f45c

arXiv.org e-Print Archive

Oxford University Research Archive

On regular copying languages

Author: Tim Hunter
Yang Wang
Publication venue: Institute of Computer Science, Polish Academy of Sciences
Publication date: 01/07/2023

This paper proposes a formal model of regular languages enriched with unbounded copying. We augment finite-state machinery with the ability to recognize copied strings by adding an unbounded memory buffer with a restricted form of first-in-first-out storage. The newly introduced computational device, finite-state buffered machines (FS-BMs), characterizes the class of regular languages and languages derived from them through a primitive copying operation. We name this language class regular copying languages (RCLs). We prove a pumping lemma and examine the closure properties of this language class. As suggested by previous literature (Gazdar and Pullum 1985, p. 278), regular copying languages should approach the correct characterization of natural language word sets.

Directory of Open Access Journals

A computational model of modern standard arabic verbal morphology based on generation

Author: González Martínez Alicia
Publication venue
Publication date: 01/01/2013

Unpublished doctoral thesis defended at the Universidad Autónoma de Madrid, Facultad de Filosofía y Letras, Departamento de Lingüística, Lenguas Modernas, Lógica y Fª de la Ciencia y Tª de la Literatura y Literatura Comparada. Defense date: 29-01-2013.
The computational handling of non-concatenative morphologies is still a challenge in the field of natural language processing. Amongst the various areas of research, Arabic morphology stands out due to its highly complex structure. We propose a model for Arabic verbal morphology based on a root-and-pattern approach, which satisfies both computational consistency and an elegant formalization. Our model defines an abstract representation of prosodic templates and a set of intertwined morphemes that operate at different phonological levels, as well as a separate module of rewrite rules to deal with morphophonological and orthographic alterations. Our verbal system model asserts that Arabic exhibits two conjugational classes. The computational system, named Jabalín, is focused on generation: the program generates a fully annotated lexicon of verbal forms, which is subsequently used to develop a morphological analyzer and generator. The input of the system consists of a lexicon of 15,452 verb lemmas of both Classical Arabic and Modern Standard Arabic, taken from El-Dahdah (1991), comprising a total of 3,706 roots. The output of the system is a lexicon of 1,684,268 inflected verbal forms. We carried out an evaluation against a lexicon of inflected verbs provided by the analyzer ElixirFM (Smrž, 2007a; 2007b), which we treated as a gold standard, achieving a precision of 99.52%. Additionally, we compared our lexicon with a list of the most frequent verb lemmas, including the most frequent verbs from each conjugation, taken from Buckwalter and Parkinson (2010). The list includes 825 verbs, which are all included in our lexicon and passed an evaluation test with an accuracy of 99.27%.
Jabalín is available under a GNU license, and can be accessed and tested through an online interface, at http://elvira.lllf.uam.es/jabalin/, hosted at the LLI-UAM lab. The Jabalín interface provides different functionalities: analyze a form, generate the inflectional paradigm of a verb lemma, derive a root, show quantitative data, and explore the database, which includes data from the evaluation. Key words: Computational Linguistics, Natural Language Processing, Arabic Computational Morphology, Root-and-Pattern Morphology, Non-concatenative Morphology, Templatic Morphology, Root-and-Prosody Morphology, Computational Prosodic Morphology.

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Biblos-e Archivo