34 research outputs found

    La expresión oral en español como lengua extranjera: interlengua y análisis de errores basado en corpus

    PhD thesis written by Leonardo Campillos Llanos under the supervision of Dr. Antonio Moreno Sandoval and Dr. Paula Gozalo Gómez (Universidad Autónoma de Madrid). The thesis was defended on December 17, 2012, at the Facultad de Filosofía y Letras (Universidad Autónoma de Madrid) before a committee consisting of Dr. Francisco Marcos Marín (University of Texas at San Antonio), Dr. Joaquín Garrido (Universidad Complutense de Madrid), Dr. Sonsoles Fernández López (Escuela Oficial de Idiomas), Dr. Isabel García Parejo (Universidad Complutense de Madrid), and Dr. Ana Serradilla (Universidad Autónoma de Madrid). The thesis was awarded Summa cum laude (International Doctorate).

    La expresión oral en español lengua extranjera: interlengua y análisis de errores basado en corpus

    Unpublished doctoral thesis defended at the Universidad Autónoma de Madrid, Facultad de Filosofía y Letras, Departamento de Lingüística, Lenguas Modernas, Lógica y Fª de la Ciencia y Tª de la Literatura y Literatura Comparada. Date of defense: 17-12-201

    A quantitative study of disfluencies in formal, informal and media spontaneous speech in Spanish

    Proceedings of IberSpeech 2012 (Madrid, Spain). A descriptive study of the prevalence of different types of disfluencies (fragmented words, restarts and vocalic supports) in spontaneous Spanish is presented, based on a hand-annotated corpus. A quantitative account of the differences among three registers (formal, informal and media), and among several text subtypes within each register, is provided to analyze the importance of each disfluency class for a given register.

    Biomedical Term Extraction: NLP Techniques in Computational Medicine

    Artificial Intelligence (AI), and its branch Natural Language Processing (NLP) in particular, is a main contributor to recent advances in classifying documentation and extracting information across assorted fields. Medicine has attracted particular attention because of the amount of information generated in public professional journals and other means of communication within the medical profession. Information extraction from technical texts is typically performed with an automatic term extractor. Automatic Term Recognition (ATR) from technical texts is applied to identify key concepts for information retrieval and, secondarily, for machine translation. Term recognition depends on the subject domain and the lexical patterns of a given language, in our case Spanish, Arabic and Japanese. In this article, we present the methods and techniques for creating a biomedical corpus of validated terms, together with several tools for exploiting the information the corpus contains. The paper also shows how these techniques and tools have been used in a prototype.

    Construcción de un corpus comparable y un recurso de referencia para la simplificación de textos médicos en español

    We report the collection of the CLARA-MeD comparable corpus, which is made up of 24 298 pairs of professional and simplified texts in the medical domain for the Spanish language (>96M tokens). Text types comprise drug leaflets and summaries of product characteristics (10 211 pairs of texts, >82M words), abstracts of systematic reviews (8138 pairs of texts, >9M words), cancer-related information summaries (201 pairs of texts, >3M tokens) and clinical trial announcements (5748 pairs of texts, 451 690 words). We also report the alignment of professional and simplified sentences, conducted manually by pairs of annotators. A subset of 3800 sentence pairs (149 862 tokens) has been aligned, each pair by 2 experts, with an average inter-annotator agreement kappa score of 0.839 (0.076). The data are available to the community and provide a new benchmark for developing and evaluating automatic medical text simplification systems.
    Project CLARA-MED (PID2020-116001RA-C33) funded by MCIN/AEI/10.13039/501100011033/, in project call: “Proyectos I+D+i Retos Investigación”.
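    The abstract reports a kappa score but does not say which variant was used or what exactly each annotator labelled. As a minimal sketch, assuming Cohen's kappa over two annotators who each assign one categorical label per item (the function name and the labels below are hypothetical, not taken from the CLARA-MeD project), agreement could be computed like this:

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Assumption: one categorical label per item and per annotator.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical labels for four candidate sentence pairs.
    annotator_1 = ["aligned", "aligned", "not_aligned", "not_aligned"]
    annotator_2 = ["aligned", "not_aligned", "not_aligned", "not_aligned"]
    print(cohen_kappa(annotator_1, annotator_2))
```

    An average score such as the one reported would then be the mean of this value over the annotator pairs, again under the assumption that Cohen's kappa is the measure in question.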

    Report on reusable documents as language resources in Spain, under the Government Plan for Language Technologies

    This report was carried out within the Spanish administration-driven Language Technologies Plan (Plan TL), funded by the Secretaría de Estado para el Avance Digital (SEAD) and Red.es. The main goals are to compile a listing of the resources and open data held by Spanish public administrations that can be transformed into language resources, and to propose an action plan to process and distribute them. We designed a specific methodology for listing the data and evaluating their degree of maturity. We created two listings: a preliminary collection of 101 resources, and 24 resources and data repositories selected from the first list for detailed analysis and evaluation. The report also features a comparative analysis of similar initiatives and studies conducted abroad. We conclude with generic recommendations and detailed strategies for the selected resources. The report and listings are publicly available at Red.es and the Plan TL website.

    Lexical errors in non-native oral Spanish: a corpus-based error analysis

    This study analyses the lexical errors in the oral production of forty learners of Spanish. The data belong to a learner corpus of oral interviews with university learners from over nine language backgrounds at intermediate level: A2 (N=20) and B1 (N=20) (Common European Framework of Reference). Our methodology is Learner Corpus Research, specifically Computer-aided Error Analysis. The results in our corpus show that formal errors are more frequent at A2 level, whereas semantic errors, although less abundant, persist and increase slightly at B1. This work was funded by the Comunidad de Madrid and the European Social Fund through a predoctoral research contract.

    Medical Lexicon for Spanish (MedLexSp)

    MedLexSp is a unified medical lexicon for Medical Natural Language Processing in Spanish. It includes terms and inflected word forms with part-of-speech information and Unified Medical Language System (UMLS) semantic types, groups and Concept Unique Identifiers (CUIs): 100 887 lemmas, 302 543 inflected forms (conjugated verbs, and number/gender variants), and 42 958 UMLS CUIs.

    To create it, we used Natural Language Processing techniques and domain corpora (e.g. MedlinePlus). We also collected terms from the Dictionary of Medical Terms from the Spanish Royal Academy of Medicine, the Medical Subject Headings (MeSH), the Systematized Nomenclature of Medicine – Clinical Terms (SNOMED-CT), the Medical Dictionary for Regulatory Activities Terminology (MedDRA), the International Classification of Diseases version 10, the Anatomical Therapeutic Chemical (ATC) Classification, the National Cancer Institute (NCI) Dictionary, the Online Mendelian Inheritance in Man (OMIM) and OrphaData. Terms related to COVID-19 were assembled by applying a similarity-based approach with word embeddings trained on a large corpus.

    File list:
    - MedLexSp.dsv: a delimiter-separated value file with the following data fields: field 1, the UMLS CUI of the entity; field 2, the lemma; field 3, the variant forms; field 4, the part of speech; field 5, the semantic type(s); and field 6, the semantic group.
    - MedLexSp.xml: an XML-encoded version using the Lexical Markup Framework (LMF), which includes the morphological data (number, gender, verb tense and person, and information about affix/abbreviation data). The Document Type Definition file is also provided (lmf.dtd).
    - Lexical Record files, in subfolder "LR/":
      · LR_abr.dsv: list of equivalences between acronyms/abbreviations and full forms.
      · LR_affix.dsv: equivalences between affixes/roots and their meanings.
      · LR_n_v.dsv: list of deverbal nouns.
      · LR_adj_n.dsv: list of adjectives derived from nouns.
    - spaCy lemmatizer (in subfolder "spacy_lemmatizer/"): lemmatizer.py
    - Stanza lemmatizer (in subfolder "stanza_lemmatizer/"): ancora-medlexsp.pt

    Companion code and files can be found in the GitHub repository: https://github.com/lcampillos/MedLexSp

    Geographic coverage: Spain, Latin America and United States of America (data from MedlinePlus Spanish and the Spanish version of the National Cancer Institute Dictionary of Medical Terms).

    This dataset was collected with the goal of creating a lexical resource for medical text processing in the Spanish language. It was gathered in the NLPMedTerm project, funded by the European Union’s Horizon 2020 research programme under the Marie Skłodowska-Curie grant agreement nº. 713366 (InterTalentum UAM), and the CLARA-MeD project (PID2020-116001RA-C33), funded by MCIN/AEI/10.13039/501100011033/, in project call: "Proyectos I+D+i Retos Investigación".

    Peer reviewed
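    The six-field layout described for MedLexSp.dsv can be read with a few lines of Python. The following is only a sketch under stated assumptions: the record does not name the field delimiter or say how multiple variant forms and semantic types are separated within a field, so both are parameters here ("|" and ";" are guesses), and the `LexEntry` class, the `parse_dsv` function and the sample line are hypothetical illustrations rather than part of the dataset or its companion code.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class LexEntry:
    cui: str                   # field 1: UMLS CUI of the entity
    lemma: str                 # field 2: lemma
    variants: List[str]        # field 3: variant forms
    pos: str                   # field 4: part of speech
    semantic_types: List[str]  # field 5: semantic type(s)
    semantic_group: str        # field 6: semantic group


def parse_dsv(lines: Iterable[str], delimiter: str = "|",
              subdelimiter: str = ";") -> Iterator[LexEntry]:
    """Yield one LexEntry per non-empty line.

    Assumption: `delimiter` separates the six fields and `subdelimiter`
    separates multiple values inside fields 3 and 5. Check both against
    the actual file before relying on this.
    """
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split(delimiter)
        if len(fields) != 6:
            raise ValueError(f"line {number}: expected 6 fields, got {len(fields)}")
        cui, lemma, variants, pos, types, group = fields
        yield LexEntry(
            cui=cui,
            lemma=lemma,
            variants=[v for v in variants.split(subdelimiter) if v],
            pos=pos,
            semantic_types=[t for t in types.split(subdelimiter) if t],
            semantic_group=group,
        )


if __name__ == "__main__":
    # Made-up sample line with a placeholder CUI, for illustration only.
    sample = ["C0000000|ejemplo|ejemplo;ejemplos|N|dsyn|DISO\n"]
    for entry in parse_dsv(sample):
        print(entry.lemma, entry.variants, entry.semantic_group)
```

    To read the real file, one would pass an open file handle instead of the sample list, for example `parse_dsv(open("MedLexSp.dsv", encoding="utf-8"))`, after confirming the delimiters.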