27 research outputs found

    A new System for offline Printed Arabic Recognition for Large Vocabulary : SPARLV

    This paper presents a contribution to printed Arabic recognition, focusing on the recognition of decomposable printed Arabic words. The proposed system takes an analytical approach: it segments words into characters to generate letter hypotheses and then word hypotheses, which are checked by lexical verification against a pre-established dictionary of the language. Thanks to this lexical verification, the proposed system, SPARLV, is able to propose valid word hypotheses.

    A computational model of modern standard arabic verbal morphology based on generation

    Unpublished doctoral thesis defended at the Universidad Autónoma de Madrid, Facultad de Filosofía y Letras, Departamento de Lingüística, Lenguas Modernas, Lógica y Filosofía de la Ciencia y Teoría de la Literatura y Literatura Comparada. Date of defense: 29-01-2013.
The computational handling of non-concatenative morphologies is still a challenge in the field of natural language processing. Among the various areas of research, Arabic morphology stands out due to its highly complex structure. We propose a model for Arabic verbal morphology based on a root-and-pattern approach, which is both computationally consistent and elegantly formalized. Our model defines an abstract representation of prosodic templates and a set of intertwined morphemes that operate at different phonological levels, as well as a separate module of rewrite rules to deal with morphophonological and orthographic alterations. Our verbal system model asserts that Arabic exhibits only two conjugational classes. The computational system, named Jabalín, is focused on generation: the program generates a fully annotated lexicon of verbal forms, which is subsequently used to develop a morphological analyzer and generator. The input of the system consists of a lexicon of 15,452 verb lemmas of both Classical Arabic and Modern Standard Arabic, taken from El-Dahdah (1991), comprising a total of 3,706 roots. The output of the system is a lexicon of 1,684,268 inflected verbal forms. We carried out an evaluation against a lexicon of inflected verbs provided by the analyzer ElixirFM (Smrž, 2007a; 2007b), which we treated as a gold standard, achieving a precision of 99.52%. Additionally, we compared our lexicon with a list of the most frequent verb lemmas, including the most frequent verbs from each conjugation, taken from Buckwalter and Parkinson (2010). The list includes 825 verbs, all of which are included in our lexicon and passed an evaluation test with an accuracy of 99.27%.
Jabalín is available under a GNU license, and can be accessed and tested through an online interface at http://elvira.lllf.uam.es/jabalin/, hosted at the LLI-UAM lab. The Jabalín interface provides several functionalities: analyze a form, generate the inflectional paradigm of a verb lemma, derive a root, show quantitative data, and explore the database, which includes data from the evaluation. Key words: Computational Linguistics, Natural Language Processing, Arabic Computational Morphology, Root-and-Pattern Morphology, Non-concatenative Morphology, Templatic Morphology, Root-and-Prosody Morphology, Computational Prosodic Morphology.

    Lexical Structure and the Nature of Linguistic Representations

    This dissertation addresses a foundational debate regarding the role of structure and abstraction in linguistic representation, focusing on representations at the lexical level. Under one set of views, positing abstract morphologically-structured representations, words are decomposable into morpheme-level basic units; however, alternative views now challenge the need for abstract structured representation in lexical representation, claiming non-morphological whole-word storage and processing either across-the-board or depending on factors like transparency/productivity/surface form. Our cross-method/cross-linguistic results regarding morphological-level decomposition argue for initial, automatic decomposition, regardless of factors like semantic transparency, surface formal overlap, word frequency, and productivity, contrary to alternative views of the lexicon positing non-decomposition for some or all complex words. Using simultaneous lexical decision and time-sensitive brain activity measurements from magnetoencephalography (MEG), we demonstrate effects of initial, automatic access to morphemic constituents of compounds, regardless of whole-word frequency, lexicalization and length, both in the psychophysical measure (response time) and in the MEG component indexing initial lexical activation (M350), which we also utilize to test distinctions in lexical representation among ambiguous words in a further experiment. Two masked priming studies further demonstrate automatic decomposition of compounds into morphemic constituents, showing equivalent facilitation regardless of semantic transparency. 
A fragment-priming study with spoken Japanese compounds argues that compounds indeed activate morphemic candidates, even when the surface form of a spoken compound fragment segmentally mismatches its potential underlying morpheme completion due to a morpho-phonological alternation (rendaku), whereas simplex words do not facilitate segment-mismatching continuations, supporting morphological structure-based prediction regardless of surface-form overlap. A masked priming study on productive and non-productive Japanese de-adjectival nominal derivations shows priming of constituents regardless of productivity, and provides evidence that affixes have independent morphological-level representations. Together, the results argue that the morpheme, not the word, is the basic unit of lexical processing, supporting a view of lexical representations in which there are abstract morphemes, and revealing immediate, automatic decomposition regardless of semantic transparency, morphological productivity, and surface formal overlap, counter to views in which some or all complex words are treated as unanalyzed wholes. We conclude instead that morphologically complex words are decomposed into abstract morphemic units immediately and automatically, by rule and not by exception.

    Word Knowledge and Word Usage

    Word storage and processing define a multifactorial domain of scientific inquiry whose thorough investigation goes well beyond the boundaries of traditional disciplinary taxonomies and requires the synergistic integration of a wide range of methods, techniques, and empirical and experimental findings. The present book approaches a few central issues concerning the organization, structure and functioning of the Mental Lexicon by asking domain experts to look at common, central topics from complementary standpoints and to discuss the advantages of developing converging perspectives. The book explores the connections between computational and algorithmic models of the mental lexicon, word frequency distributions and information-theoretical measures of word families, statistical correlations across psycholinguistic and cognitive evidence, principles of machine learning, and integrative brain models of word storage and processing. The main goal of the book is to map out the landscape of future research in this area, to foster the development of interdisciplinary curricula, and to help single-domain specialists understand and address issues and questions as they are raised in other disciplines.

    Affixal Approach versus Analytical Approach for Off-Line Arabic Decomposable Vocabulary Recognition
