
    Arabic named entity recognition

    This doctoral thesis describes the research carried out with the aim of determining the best techniques for building a Named Entity Recognizer for Arabic. Such a system would be able to identify and classify the named entities found in open-domain Arabic text. The Named Entity Recognition (NER) task helps other Natural Language Processing tasks (for example, Information Retrieval, Question Answering, Machine Translation, etc.) achieve better results thanks to the enrichment it adds to the text. The literature contains a number of works that investigate the NER task for a specific language or from a language-independent perspective, but very few studies of this task for Arabic have been published so far. Arabic has a distinctive orthography and a complex morphology, which bring new challenges to NER research. A thorough investigation of Arabic NER would not only provide the techniques needed to reach high performance, but would also offer an error analysis and a discussion of the results that benefit the NER research community. The main goal of this thesis is to meet that need. To that end we have: 1. produced a study of the aspects of Arabic relevant to this task; 2. analyzed the state of the art in NER; 3. carried out a comparison of the results obtained by different machine learning techniques; 4. developed a method based on the combination of different classifiers, where each classifier handles a single class of named entities and employs the feature set and machine learning technique best suited to that class (see the sketch below). Our experiments were evaluated on nine test sets. Benajiba, Y. (2009). Arabic named entity recognition [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/8318
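    The classifier-combination idea in point 4 can be illustrated with a minimal sketch: one binary classifier per entity class, each with its own feature view, combined at prediction time. The feature function, entity classes, and scikit-learn components below are assumptions made for the example, not the thesis implementation.

```python
# Minimal sketch of a per-class classifier combination for NER.
# Assumptions: toy feature function, generic entity classes, scikit-learn models.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def token_features(tokens, i):
    """Tiny per-token feature set; a real system would tune features per entity class."""
    tok = tokens[i]
    return {
        "word": tok,
        "prefix2": tok[:2],
        "suffix2": tok[-2:],
        "prev": tokens[i - 1] if i > 0 else "<BOS>",
        "next": tokens[i + 1] if i < len(tokens) - 1 else "<EOS>",
    }


class PerClassNER:
    """One independent binary classifier per entity class (e.g. PER, LOC, ORG)."""

    def __init__(self, classes):
        self.models = {
            c: make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
            for c in classes
        }

    def fit(self, sentences, labels):
        feats = [token_features(s, i) for s in sentences for i in range(len(s))]
        flat = [lab for sent in labels for lab in sent]
        for c, model in self.models.items():
            # Each classifier sees a one-vs-rest view of the same tokens
            # (assumes every class occurs at least once in the training data).
            model.fit(feats, [1 if lab == c else 0 for lab in flat])

    def predict(self, sentence):
        tags = []
        for i in range(len(sentence)):
            f = token_features(sentence, i)
            # Pick the class whose classifier is most confident; "O" if none fires.
            scores = {c: m.predict_proba([f])[0][1] for c, m in self.models.items()}
            best = max(scores, key=scores.get)
            tags.append(best if scores[best] >= 0.5 else "O")
        return tags
```

    In the method described above, each per-class classifier would additionally use the feature set and learning algorithm tuned to its own entity class; the sketch keeps a single feature function for brevity.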

    First Steps Toward Developing a System for Terminology Extraction

    The aim of this paper is to describe the first steps in developing a system for terminology extraction. First, a data sample is built from synopses of doctoral theses accepted at the Faculty of Humanities and Social Sciences, University of Zagreb, between 2004 and 2009, written mostly in Croatian. The data sample consists of 420 documents and 338,706 tokens. A small sample was manually tagged for terminology to be used in an initial experiment. The approach to terminology extraction is knowledge-driven and consists of a differential analysis of reference and domain-specific corpora; the specific method used is the log-likelihood ratio test (sketched below). The experiment compares different reference corpora and linguistic pre-processing. First results are promising, and further research guidelines are discussed.
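    The differential-corpus scoring mentioned above can be sketched with Dunning's log-likelihood ratio, comparing a candidate term's frequency in the domain corpus against a reference corpus. The corpus variables and tokenization below are placeholders, not the paper's data or pipeline.

```python
# Sketch of term ranking by Dunning's log-likelihood ratio (G2) between a
# domain-specific corpus and a reference corpus; corpora here are plain token lists.
import math
from collections import Counter


def log_likelihood_ratio(a, b, c, d):
    """G2 for a term seen a times in a domain corpus of size c and b times
    in a reference corpus of size d."""
    def ll(k, n, p):
        # Guard against log(0) when a term is absent from one corpus.
        return k * math.log(p) + (n - k) * math.log(1 - p) if 0 < p < 1 else 0.0

    p = (a + b) / (c + d)
    return 2 * (ll(a, c, a / c) + ll(b, d, b / d) - ll(a, c, p) - ll(b, d, p))


def rank_terms(domain_tokens, reference_tokens, top_n=20):
    """Rank candidate terms by how strongly they are over-represented in the domain."""
    dom, ref = Counter(domain_tokens), Counter(reference_tokens)
    c, d = sum(dom.values()), sum(ref.values())
    scored = {
        term: log_likelihood_ratio(dom[term], ref.get(term, 0), c, d)
        for term in dom
    }
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
```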

    Use of Verb Inflections in the Oral Expression of Agrammatic Spanish-Speaking Aphasics

    Studies on agrammatic verb errors have basically addressed the production of verb forms as whole lexical units without looking at their inflectional affixes. There has been limited research assessing the possible role of the variables encapsulated in verbal inflections in verb access and retrieval. The purpose of this investigation was, first, to address the possible factors causing a hierarchy of sparing in Spanish verb inflections, and, second, to extend the explanatory factors proposed by earlier cross-linguistic investigations on verb inflectional performance by agrammatic speakers. This investigation studied the production of verb inflections by agrammatic Spanish speakers in a sentence repetition task. Twelve native Venezuelan Spanish-speaking subjects, six agrammatics and six controls, participated in this study. The variables predicted to have a critical role in simple and compound verb repetition were: verb form structure, daily usage frequency, theme vowel frequency, paradigmatic frequency, stress, syllabic length, and number. Two separate analyses of the subjects' responses were conducted. The first analysis assessed the number of correct responses per variable feature for all the presented experimental stimuli, namely, simple and compound verb forms. The second analysis, which involved only the variables that were significant in the first analysis and paired each variable with every other, was conducted only on the correct responses for simple verb forms. Overall findings showed a hierarchy of importance of variables in verb repetition by agrammatic Spanish-speaking subjects. First, three variables consistently emerged as primary factors in successful verb repetition by the agrammatic subjects in both analyses: syllabic length, number, and daily usage frequency. Second, stress, which had a crucial facilitating role in the first analysis, did not show such a strong effect in the second analysis. Third, paradigmatic frequency did not have any impact in the second analysis. Finally, conjugation class did not have a significant effect in the first analysis (and so was not used in the second analysis). These results imply that short, singular, frequently used, and, possibly, unstressed verb inflections are the most likely to be repeated correctly by Spanish-speaking agrammatics.

    Discovering words and rules from speech input: an investigation into early morphosyntactic acquisition mechanisms

    To acquire language proficiently, learners have to segment fluent speech into units (that is, words) and to discover the structural regularities underlying word structure. Yet these problems are not independent: in varying degrees, all natural languages express syntax as relations between nonadjacent word subparts. This thesis explores how developing infants come to solve both tasks successfully. The experimental work in the thesis approaches this issue from two complementary directions: investigating the computational abilities of infants, and assessing the distributional properties of the linguistic input directed to children. To study the nature of the computational mechanisms infants use to segment the speech stream into words, and to discover the structural regularities underlying words, I conducted seventeen artificial grammar studies. Across these experiments, I test the hypothesis that infants may use different mechanisms to learn words and word-internal rules. These mechanisms are supposed to be triggered by different signal properties, and possibly become available at different stages of development. One mechanism is assumed to compute the distributional properties of the speech input (a toy sketch of such a computation appears after this abstract). The other mechanism is hypothesized to be non-statistical in nature, and to project structural regularities without relying on the distributional properties of the speech input. Infants at different ages (namely 7, 12 and 18 months) are tested on their ability to detect statistically defined patterns, and to generalize structural regularities appearing inside word-like units. Results show that 18-month-old infants can both extract statistically defined sequences from a continuous stream (Experiment 12), and find word-internal rules only if the familiarization stream is segmented (Experiments 13 and 14). Twelve-month-olds can also segment words from a continuous stream (Experiment 5), but they cannot detect word-straddling sequences even if these are statistically informative (Experiments 15 and 16). In contrast, they readily generalize word-internal regularities to novel instances after exposure to a segmented stream (Experiments 1-3 and 17), but not after exposure to a continuous stream (Experiment 4). Seven-month-olds, in turn, compute neither statistics (Experiments 10 and 11) nor within-word relations (Experiments 6 and 7), regardless of input properties. Overall, the results suggest that word segmentation and structural generalization rely on distinct mechanisms requiring different signal properties to be activated: the presence of segmentation cues is mandatory for the discovery of structural properties, while a continuous stream supports the extraction of statistically recurring patterns. Importantly, the two mechanisms have different developmental trajectories: generalizations become readily available from 12 months, while statistical computations remain rather limited throughout the first year. To understand how the computational selectivities and the limits of these mechanisms match up with the limitations and properties of natural language, I evaluate the distributional properties of speech directed to children. These analyses aim at assessing, with quantitative and qualitative measures, whether the input children listen to offers a reliable basis for the acquisition of morphosyntactic rules.
I examine Italian, a language with a rich and complex morphology, evaluating whether the word forms used in speech directed to children provide sufficient evidence of the morphosyntactic rules of this language. Results show that the speech directed to children is highly systematic and consistent. The most frequently used word forms are also morphologically well-formed words in Italian: thus, frequency information correlates with structural information, such as the morphological structure of words. While a statistical analysis of the speech input may provide a small set of words occurring with high frequency, how learners come to extract structural properties from them is another problem. In accord with the results of the infant studies, I propose that structural generalizations are projected on a different basis than statistical computations. Overall, the results of both the artificial grammar studies and the corpus analysis are compatible with the hypothesis that the task of segmenting words from fluent speech and that of learning the structural regularities underlying word structure rely on statistical and non-statistical cues respectively, placing constraints on computational mechanisms with different natures and selectivities in early development.
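    The statistical mechanism referred to above is commonly modelled as transitional probabilities between adjacent syllables, with word boundaries hypothesized at probability dips. The following toy sketch illustrates that general idea on an invented syllable stream; it is not the familiarization material or procedure used in the experiments.

```python
# Toy illustration of statistical word segmentation via transitional probabilities.
# The syllable inventory and threshold are invented for the example.
import random
from collections import Counter


def transitional_probabilities(syllables):
    """P(next | current), estimated from bigram and unigram counts."""
    bigrams = Counter(zip(syllables, syllables[1:]))
    unigrams = Counter(syllables[:-1])
    return {pair: count / unigrams[pair[0]] for pair, count in bigrams.items()}


def segment(syllables, threshold=0.5):
    """Insert a boundary wherever the forward transitional probability dips
    below the threshold: a crude stand-in for 'boundary at TP minima'."""
    tps = transitional_probabilities(syllables)
    words, current = [], [syllables[0]]
    for prev, nxt in zip(syllables, syllables[1:]):
        if tps[(prev, nxt)] < threshold:
            words.append("".join(current))
            current = []
        current.append(nxt)
    words.append("".join(current))
    return words


# Build a continuous stream from three trisyllabic "words" in random order:
# within-word TPs stay near 1.0, between-word TPs near 1/3.
word_inventory = [["tu", "pi", "ro"], ["go", "la", "bu"], ["bi", "da", "ku"]]
stream = [syl for _ in range(100) for syl in random.choice(word_inventory)]
print(segment(stream)[:10])  # mostly 'tupiro', 'golabu', 'bidaku'
```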

    Ditransitives in Germanic languages. Synchronic and diachronic aspects

    This volume brings together twelve empirical studies on ditransitive constructions in Germanic languages and their varieties, past and present. Specifically, the volume includes contributions on a wide range of Germanic languages, including English, Dutch, and German, but also Danish, Swedish, and Norwegian, as well as lesser-studied ones such as Faroese. While the first part of the volume focuses on diachronic aspects, the second part showcases a variety of synchronic aspects relating to ditransitive patterns. Methodologically, the volume covers both experimental and corpus-based studies. Questions addressed by the papers include, among others, the cross-linguistic pervasiveness and cognitive reality of factors involved in the choice between different ditransitive constructions, and differences and similarities in the diachronic development of ditransitives. The volume’s broad scope and comparative perspective offer comprehensive insights into well-known phenomena and further our understanding of variation across languages of the same family.

    Knowledge Expansion of a Statistical Machine Translation System using Morphological Resources

    The translation capability of a Phrase-Based Statistical Machine Translation (PBSMT) system depends largely on parallel data, and phrases that are not present in the training data are not correctly translated. This paper describes a method that efficiently expands the existing knowledge of a PBSMT system without adding more parallel data, using external morphological resources instead. A set of new phrase associations is added to the translation and reordering models; each of them corresponds to a morphological variation of the source phrase, the target phrase, or both phrases of an existing association. New associations are generated using a string similarity score based on morphosyntactic information (a rough sketch follows below). We tested our approach on En-Fr and Fr-En translations, and the results showed improvements in performance in terms of automatic scores (BLEU and Meteor) and a reduction of out-of-vocabulary (OOV) words. We believe that our knowledge expansion framework is generic and could be used to add different types of information to the model.
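    A rough, hedged illustration of the expansion step: each existing phrase pair spawns new pairs built from morphological variants of its source and/or target side, with the new score discounted by a string similarity. The variant lexicon, scores, and tuple-based table are invented for the sketch and are not the paper's actual resources or phrase-table representation.

```python
# Sketch of phrase-table expansion via morphological variants (illustrative data only).
from difflib import SequenceMatcher

# Toy morphological lexicon: surface form -> known inflectional variants.
VARIANTS = {
    "translation": ["translations"],
    "traduction": ["traductions"],
}


def similarity(a, b):
    """Character-level similarity used to discount the new phrase pair's score."""
    return SequenceMatcher(None, a, b).ratio()


def expand_phrase_table(phrase_table):
    """phrase_table: list of (source, target, score) tuples."""
    new_entries = []
    for src, tgt, score in phrase_table:
        src_candidates = [src] + VARIANTS.get(src, [])
        tgt_candidates = [tgt] + VARIANTS.get(tgt, [])
        for src_var in src_candidates:
            for tgt_var in tgt_candidates:
                if (src_var, tgt_var) == (src, tgt):
                    continue  # the original association is already in the table
                # Discount the original score by how far each side drifted.
                penalty = similarity(src, src_var) * similarity(tgt, tgt_var)
                new_entries.append((src_var, tgt_var, score * penalty))
    return new_entries


table = [("translation", "traduction", 0.8)]
print(expand_phrase_table(table))
# Three new associations (source-only, target-only, both-sides variants),
# each with a slightly discounted score.
```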

    Towards Multilingual Coreference Resolution

    The current work investigates the problems that occur when coreference resolution is considered as a multilingual task. We assess the issues that arise when a framework using the mention-pair coreference resolution model and memory-based learning for the resolution process is used (a toy sketch follows below). Along the way, we revise three essential subtasks of coreference resolution: mention detection, mention head detection and feature selection. For each of these aspects we propose various multilingual solutions, including heuristic, rule-based and machine learning methods. We carry out a detailed analysis that includes eight different languages (Arabic, Catalan, Chinese, Dutch, English, German, Italian and Spanish) for which datasets were provided by the only two multilingual shared tasks on coreference resolution held so far: SemEval-2 and CoNLL-2012. Our investigation shows that, although complex, the coreference resolution task can be targeted in a multilingual and even language-independent way. We propose machine learning methods for each of the subtasks affected by the transition, and evaluate and compare them against rule-based and heuristic approaches. Our results confirm that machine learning provides the needed flexibility for the multilingual task and that the minimal requirement for a language-independent system is a part-of-speech annotation layer provided for each of the approached languages. We also show that the performance of the system can be improved by introducing other layers of linguistic annotation, such as syntactic parses (in the form of either constituency or dependency parses), named entity information, predicate argument structure, etc. Additionally, we discuss the problems occurring in the proposed approaches and suggest possibilities for their improvement.
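    The mention-pair setup with memory-based learning can be sketched as pairwise feature vectors fed to a k-nearest-neighbour classifier (the family to which TiMBL-style memory-based learners belong). The mentions, features, and scikit-learn components below are illustrative assumptions, not the system's actual feature set or toolkit.

```python
# Sketch of a mention-pair coreference classifier with a memory-based (k-NN) learner.
from sklearn.feature_extraction import DictVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline


def pair_features(antecedent, anaphor):
    """Cheap string-level features for a candidate mention pair."""
    return {
        "head_match": antecedent["head"].lower() == anaphor["head"].lower(),
        "exact_match": antecedent["text"].lower() == anaphor["text"].lower(),
        "distance": anaphor["index"] - antecedent["index"],
        "anaphor_is_pronoun": anaphor["pos"] == "PRON",
    }


# Tiny hand-built sample: mentions plus (antecedent, anaphor, coreferent?) pairs.
mentions = [
    {"text": "Barack Obama", "head": "Obama", "pos": "PROPN", "index": 0},
    {"text": "the president", "head": "president", "pos": "NOUN", "index": 1},
    {"text": "he", "head": "he", "pos": "PRON", "index": 2},
    {"text": "Paris", "head": "Paris", "pos": "PROPN", "index": 3},
]
pairs = [(0, 1, 1), (0, 2, 1), (1, 2, 1), (0, 3, 0), (1, 3, 0), (2, 3, 0)]

X = [pair_features(mentions[i], mentions[j]) for i, j, _ in pairs]
y = [label for _, _, label in pairs]

# Memory-based learning stores the training instances and classifies new pairs
# by their nearest neighbours, much like TiMBL-style learners.
model = make_pipeline(DictVectorizer(), KNeighborsClassifier(n_neighbors=3))
model.fit(X, y)

query = pair_features(mentions[1], mentions[2])  # "the president" ... "he"
print(model.predict([query]))  # 1 = predicted coreferent
```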