    Lexical database enrichment through semi-automated morphological analysis

    Derivational morphology proposes meaningful connections between words and is largely unrepresented in lexical databases. This thesis presents a project to enrich a lexical database with morphological links and to evaluate their contribution to disambiguation. A lexical database with sense distinctions was required. WordNet was chosen because of its free availability and widespread use. Its suitability was assessed through critical evaluation with respect to specifications and criticisms, using a transparent, extensible model. The identification of serious shortcomings suggested a portable enrichment methodology, applicable to alternative resources. Although 40% of the most frequent words are prepositions, they have been largely ignored by computational linguists, so the addition of prepositions was also required. The preferred approach to morphological enrichment was to infer relations from phenomena discovered algorithmically. Both existing databases and existing algorithms can capture regular morphological relations, but cannot capture exceptions correctly; neither provides any semantic information. Some morphological analysis algorithms are subject to the fallacy that morphological analysis can be performed simply by segmentation. Morphological rules, grounded in observation and etymology, govern the associations between suffixes and their attachment, and contribute to defining the meaning of morphological relationships. Specifying character substitutions circumvents the segmentation fallacy. Morphological rules are prone to undergeneration, minimised through a variable lexical validity requirement, and to overgeneration, minimised by rule reformulation and by restricting monosyllabic output. Rules take into account the morphology of ancestor languages through co-occurrences of morphological patterns. Multiple rules applicable to an input suffix need their precedence established. 
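The idea of rules that specify character substitutions, checked against the lexicon and tried in order of precedence, can be illustrated with a minimal sketch. It is not the thesis implementation: the class `SuffixRule`, the function `derive_root`, the three rules and the relation label are all hypothetical, chosen only to show why plain segmentation ("application" minus "-ation" gives the non-word "applic") is not enough.

```python
from typing import NamedTuple, Optional, Set, Tuple


class SuffixRule(NamedTuple):
    """Hypothetical rule: replace the suffix `source` with `target`."""
    source: str
    target: str
    relation: str  # illustrative label for the meaning of the link


# Listed in precedence order: the first applicable rule whose output is a
# word of the lexicon wins. These rules are examples, not the thesis rules.
RULES = [
    SuffixRule("ication", "y", "nominalisation"),  # application -> apply
    SuffixRule("ation", "e", "nominalisation"),    # exploration -> explore
    SuffixRule("ation", "", "nominalisation"),     # formation -> form
]


def derive_root(word: str, lexicon: Set[str],
                rules=RULES) -> Optional[Tuple[str, SuffixRule]]:
    """Return (root, rule) for the first rule that yields a lexically
    valid root, or None when no rule produces a word in the lexicon."""
    for rule in rules:
        if word.endswith(rule.source):
            # Character substitution rather than bare segmentation.
            candidate = word[: len(word) - len(rule.source)] + rule.target
            if candidate in lexicon:  # lexical validity requirement
                return candidate, rule
    return None
```

With a toy lexicon containing "apply", "explore" and "form", the sketch links "application" to "apply", while "nation" yields no link because neither "ne" nor "n" is a word; this is the kind of overgeneration that a lexical validity check suppresses.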
The resistance of prefixations to segmentation has been addressed by identifying linking vowel exceptions and irregular prefixes. The automatic affix discovery algorithm applies heuristics to identify meaningful affixes and is combined with morphological rules into a hybrid model, fed only with empirical data, collected without supervision. Further algorithms apply the rules optimally to automatically pre-identified suffixes and break words into their component morphemes. To handle exceptions, stoplists were created in response to initial errors and fed back into the model through iterative development, leading to 100% precision, contestable only on lexicographic criteria. Stoplist length is minimised by special treatment of monosyllables and reformulation of rules. 96% of words and phrases are analysed. 218,802 directed derivational links have been encoded in the lexicon rather than the wordnet component of the model because the lexicon provides the optimal clustering of word senses. Both links and analyser are portable to an alternative lexicon. The evaluation uses the extended gloss overlaps disambiguation algorithm. The enriched model outperformed WordNet in terms of recall without loss of precision. The failure of all experiments to outperform disambiguation by frequency reflects on WordNet sense distinctions.
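As a rough illustration of how additional links can feed a gloss-overlap disambiguator, the sketch below scores each candidate sense by the words its gloss, extended with the glosses of related senses, shares with the extended glosses of the context. It is a deliberate simplification: it counts shared word types only, whereas the published extended gloss overlaps measure also weights multi-word overlaps more heavily. The sense identifiers, glosses and the single derivational link are invented toy data, not WordNet content.

```python
def extended_gloss(sense, glosses, related):
    """Words of a sense's gloss plus the glosses of its related senses."""
    words = set(glosses[sense].lower().split())
    for other in related.get(sense, ()):
        words |= set(glosses[other].lower().split())
    return words


def disambiguate(candidates, context_senses, glosses, related):
    """Pick the candidate whose extended gloss shares most words with
    the extended glosses of the context senses (ties: first candidate)."""
    context = set()
    for sense in context_senses:
        context |= extended_gloss(sense, glosses, related)
    return max(
        candidates,
        key=lambda s: len(extended_gloss(s, glosses, related) & context),
    )


# Toy data, invented for illustration only.
GLOSSES = {
    "bank#1": "a financial institution that accepts deposits",
    "bank#2": "sloping land beside a river",
    "banker#1": "a person who manages money in a bank",
    "deposit#1": "money placed in a financial institution",
}
# A hypothetical derivational link of the kind the enrichment adds.
RELATED = {"bank#1": ["banker#1"]}

best = disambiguate(["bank#1", "bank#2"], ["deposit#1"], GLOSSES, RELATED)
```

In this toy setting the derivational link contributes the gloss of "banker#1" to the first sense of "bank", which increases its overlap with the context; a real evaluation would also need stopword handling and the phrase-based scoring omitted here.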

    Towards a rule-based Spanish to Spanish sign language translation: from written forms to phonological representations

    Unpublished doctoral thesis defended at the Universidad Autónoma de Madrid, Escuela Politécnica Superior, Departamento de Tecnología Electrónica y de las Comunicaciones. Date of defence: November 2014. This thesis addresses several aspects of automatic translation from Castilian Spanish to Spanish Sign Language (LSE), two typologically distant languages with too few linguistic resources to support statistical approaches to translation. For this reason, a rule-based approach grounded in contrastive grammatical studies of both languages is used. An architecture following the analysis, transfer and generation model has been chosen. Transfer is performed at the level of grammatical functions, which are supplied by a Spanish dependency parser, avoiding the complexities of a deeper analysis. The bilingual base lexicon is obtained from the Diccionario normativo de la lengua de signos española (DILSE-III), which contains the correspondences between Spanish lemmas and the SEA (Sistema de escritura alfabética) representations of signs. The lexicon is extended in two ways: by taking advantage of the difference in flexibility between the part-of-speech systems of Spanish and LSE, and by exploiting several lexical semantic relations, such as synonymy, hyponymy and meronymy. During the structural transfer phase, some nodes of the dependency analysis are transformed, others are removed and new nodes are inserted. Some classifier predicates are generated in this phase. The surface order of signs is generated by topologically ordering the graph of precedence relations between signs. Pairs of signs that stand in a head-dependent relation or share the same head are examined to determine whether their relative ordering is marked. The system is evaluated at this point and the results are compared with those obtained with statistical models. 
Best results are obtained with the rule-based approach, with a BLEU (Bilingual Evaluation Understudy) score of 0.30 and a TER (Translation Error Rate) of 42%. A linguistically oriented analysis of errors is provided. Finally, in the morphological generation phase, glosses with morphological annotations are replaced by the HamNoSys (Hamburg Sign Language Notation System) phonological representations produced by a computational morphology. These representations are used for animation synthesis with avatars. The computational morphology that has been implemented uses inflection, introflection and suppletion to model a significant fragment of LSE morphology. The phenomena implemented include deictics, nominal plural, aspect marking, verbal agreement, adjectival modification and degree.

    This thesis addresses several aspects of automatic translation from Spanish to Spanish Sign Language (LSE), two languages that are typologically distant and have insufficient linguistic resources to make statistical approaches to translation possible. For that reason, a strategy is proposed based on linguistic rules grounded in the existing contrastive grammatical studies of the two languages. A translation architecture following the analysis, transfer and generation model has been adopted, in which transfer is performed at the level of the grammatical functions supplied by a dependency parser, thereby avoiding the complexities associated with a deeper linguistic analysis of Spanish. The base bilingual lexicon for lexical transfer has been obtained from the entries of the Diccionario normativo de la lengua de signos española (DILSE-III), which contains the correspondences between Spanish lemmas and the SEA (Sistema de escritura alfabética) representation of the signs. This lexicon has been extended in two ways: by taking advantage of the differences in flexibility between the word classes of Spanish and LSE, and by exploiting semantic relations such as synonymy, hypernymy and meronymy. During structural transfer, some nodes of the dependency tree are transformed, others are deleted and new nodes are inserted. Some classifier predicates are generated in this phase. The surface order of the signs is obtained through the topological ordering of the graph of precedence relations between signs. Pairs of signs in nodes that stand in a head-dependent relation, or that depend on the same sign, are examined to determine whether their relative order is marked. The translation system is evaluated at this point using a corpus and compared with the results obtained with various statistical translation models. On a control corpus of glosses, the rule-based system obtains better results, with a BLEU (Bilingual Evaluation Understudy) score of 0.30 and a TER (Translation Error Rate) of 42%. An error analysis of the results has been carried out. Finally, for morphological generation, the glosses together with their morphological annotations are replaced by the HamNoSys phonological representations produced by a computational morphology, which can be used for animation synthesis with avatars. The implemented morphology uses inflection, introflection and suppletion to model a fairly large fragment of LSE. The phenomena covered include deixis, the realisation of the different types of nominal plural, aspect, argument agreement of the verb, adjectival modification and degree.
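The surface-ordering step, a topological ordering of the graph of precedence relations between signs, can be sketched with Python's standard `graphlib` module (Python 3.9 or later). This is an illustration under stated assumptions, not the system described in the thesis: the function name `surface_order` is hypothetical, and the glosses and precedence pairs are invented placeholders that make no claim about LSE grammar.

```python
from graphlib import TopologicalSorter


def surface_order(precedence_pairs):
    """Return one sign order consistent with all (before, after) pairs.

    Raises graphlib.CycleError if the precedence relations are
    contradictory, i.e. the graph contains a cycle.
    """
    sorter = TopologicalSorter()
    for before, after in precedence_pairs:
        # `after` may only be emitted once `before` has been emitted.
        sorter.add(after, before)
    return list(sorter.static_order())


# Invented precedence relations between placeholder glosses.
pairs = [
    ("YESTERDAY", "HOUSE"),
    ("HOUSE", "BIG"),
    ("BIG", "BUY"),
]
order = surface_order(pairs)
```

Because the example relations form a single chain, only one order satisfies them. When the relations leave several orders open, `static_order` returns one valid order, so a fuller implementation would need an explicit tie-breaking policy.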