Search CORE

34 research outputs found

La expresión oral en español como lengua extranjera: interlengua y análisis de errores basado en corpus

Author: Campillos Llanos Leonardo
Publication venue: Sociedad Española para el Procesamiento del Lenguaje Natural
Publication date: 01/01/2014

PhD thesis written by Leonardo Campillos Llanos under the supervision of Dr. Antonio Moreno Sandoval and Dr. Paula Gozalo Gómez (Universidad Autónoma de Madrid). The thesis was defended on December 17th, 2012, at the Facultad de Filosofía y Letras (Universidad Autónoma de Madrid). The committee consisted of Dr. Francisco Marcos Marín (University of Texas at San Antonio), Dr. Joaquín Garrido (Universidad Complutense de Madrid), Dr. Sonsoles Fernández López (Escuela Oficial de Idiomas), Dr. Isabel García Parejo (Universidad Complutense de Madrid), and Dr. Ana Serradilla (Universidad Autónoma de Madrid). The thesis was awarded summa cum laude, with the International Doctorate mention.

Repositorio Institucional de la Universidad de Alicante

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

La expresión oral en español lengua extranjera: interlengua y análisis de errores basado en corpus

Author: Campillos Llanos Leonardo
Publication date: 01/01/2012

Unpublished doctoral thesis defended at the Universidad Autónoma de Madrid, Facultad de Filosofía y Letras, Departamento de Lingüística, Lenguas Modernas, Lógica y Filosofía de la Ciencia y Teoría de la Literatura y Literatura Comparada. Defense date: 17-12-201

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Biblos-e Archivo

A quantitative study of disfluencies in formal, informal and media spontaneous speech in Spanish

Author: Campillos Llanos Leonardo
Moreno Sandoval Antonio
Toledano Doroteo T.
Publication date: 22/11/2012

Proceedings of IberSpeech 2012 (Madrid, Spain). A descriptive study of the prevalence of different types of disfluencies (fragmented words, restarts and vocalic supports) in spontaneous Spanish is presented, based on a hand-annotated corpus. A quantitative account of differences among three types of registers (formal, informal and media) and several subtypes of text for each register is provided to analyze the importance of each disfluency class for a given register.

Biblos-e Archivo

Biomedical Term Extraction: NLP Techniques in Computational Medicine

Author: Campillos Llanos Leonardo
Díaz Julia
Moreno Sandoval Antonio
Redondo Teófilo
Publication venue: 'Universidad Internacional de La Rioja'
Publication date: 14/02/2022

Artificial Intelligence (AI), and its branch Natural Language Processing (NLP) in particular, is a main contributor to recent advances in classifying documentation and extracting information from assorted fields. Medicine is one field that has gathered a lot of attention because of the amount of information generated in public professional journals and other means of communication within the medical profession. The typical information extraction task on technical texts is performed by an automatic term recognition extractor. Automatic Term Recognition (ATR) from technical texts is applied to identify key concepts for information retrieval and, secondarily, for machine translation. Term recognition depends on the subject domain and the lexical patterns of a given language, in our case Spanish, Arabic and Japanese. In this article, we present the methods and techniques for creating a biomedical corpus of validated terms, with several tools for optimal exploitation of the information contained in that corpus. This paper also shows how these techniques and tools have been used in a prototype

Re-UNIR

Construcción de un corpus comparable y un recurso de referencia para la simplificación de textos médicos en español

Author: Campillos Llanos Leonardo
Capllonch-Carrión Adrián
Terroba Reinares Ana R.
Valverde Ana
Zakhir Puig Sofía
Publication venue: Sociedad Española para el Procesamiento del Lenguaje Natural
Publication date: 01/09/2022

We report the collection of the CLARA-MeD comparable corpus, which is made up of 24 298 pairs of professional and simplified texts in the medical domain for the Spanish language (>96M tokens). Text types include drug leaflets and summaries of product characteristics (10 211 pairs of texts, >82M words), abstracts of systematic reviews (8138 pairs of texts, >9M words), cancer-related information summaries (201 pairs of texts, >3M tokens) and clinical trial announcements (5748 pairs of texts, 451 690 words). We also report the alignment of professional and simplified sentences, conducted manually by pairs of annotators. A subset of 3800 sentence pairs (149 862 tokens) has been aligned, each pair by 2 experts, with an average inter-annotator agreement kappa score of 0.839 (0.076). The data are available to the community and contribute a new benchmark for developing and evaluating automatic medical text simplification systems.

Project CLARA-MED (PID2020-116001RA-C33), funded by MCIN/AEI/10.13039/501100011033/ in the project call “Proyectos I+D+i Retos Investigación”.
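The abstract reports an inter-annotator agreement kappa score but does not say which kappa variant was used. As a rough illustration only, the sketch below assumes Cohen's kappa for two annotators who each label the same candidate sentence pairs as aligned (1) or not aligned (0). The function name `cohen_kappa` and the label sequences are made up for this example and are not taken from the CLARA-MeD data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Assumption: the corpus description only says "kappa score";
    Cohen's kappa is used here purely for illustration.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each annotator's label distribution.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical decisions (1 = aligned, 0 = not aligned); not real corpus data.
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 1, 0, 1]
annotator_2 = [1, 1, 0, 1, 1, 0, 1, 0, 0, 1]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is usually reported instead of a plain percentage when annotators align sentences manually.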

Repositorio Institucional de la Universidad de Alicante

Report on reusable documents as language resources in Spain, under the Government Plan for Language Technologies

Author: Campillos Llanos Leonardo
Moreno Sandoval Antonio
Torre Toledano Doroteo
Valverde Ana
Publication venue: Sociedad Española para el Procesamiento del Lenguaje Natural
Publication date: 01/01/2019

This report was carried out within the Spanish administration-driven initiative Language Technologies Plan (Plan TL), funded by the Secretaría de Estado para el Avance Digital and Red.es. The main goals are to collect from Spanish public administrations a listing of resources and open data that can be transformed into language resources, and to propose an action plan to process and distribute them. We designed a specific methodology for listing the data and evaluating their degree of maturity. We created two listings: a preliminary collection of 101 resources, and 24 resources and data repositories selected from the first list for detailed analysis and evaluation. This report also features a comparative analysis of similar initiatives and studies conducted abroad. We conclude with generic recommendations and detailed strategies for the selected resources. The report and listings are publicly available at Red.es and the Plan TL website.

This report was funded by the Secretaría de Estado para el Avance Digital (SEAD) and Red.es.

Repositorio Institucional de la Universidad de Alicante

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Biblos-e Archivo

Jornada de Grandes infraestructuras europeas de Ciencias Sociales y Humanidades en el CSIC: DARIAH y CLARÍN en el horizonte (11 de mayo de 2023)

Author: Armada Xosé-Lois
Baratech Soriano Covadonga
Berenguer Sánchez José Antonio
Calahorra Bartolomé Alfredo
Campillos-Llanos Leonardo
Castaño Javier
Corsini Alberto
Crespo Solana Ana
Delgado Gómez-Escalonilla Lorenzo
Farré Vidal Judith
Garcia Bueno Carmen
García Moreno Aitor
Giménez Toledo Elea
Guerrero Enterría Arturo
Gómez Rabal Ana
Jular Pérez-Alfaro Cristina
Madrid Álvarez-Piñer Teresa
Martín-Rodilla Patricia
Molas Gallart Jordi
Molina Martos Manuel
Murga Castro Idoia
Naranjo Orovio Consuelo
Pérez Martín Inmaculada
Ramiro Fariñas Diego
Riaño Rufilanchas Daniel
Robles Pérez María
Rollet Nádege
Ros-Fábregas Emilio
Sabater-Mir Jordi
Sanz-Cañada Javier
Smid Katja
Sánchez García Patricia
Thiele Jan
Torre Sainz Ignacio de la
Torrens Álvarez Mª Jesús
Valenzuela-Lamas Silvia
Publication date: 07/06/2023

Peer reviewed

Digital.CSIC

Lexical errors in non-native oral Spanish: a corpus-based error analysis

Author: Campillos Llanos Leonardo
Publication venue: 'Universidad de Alicante Servicio de Publicaciones'
Publication date: 01/01/2014

This study analyses the lexical errors in the oral production of forty learners of Spanish. The data belong to a learner corpus of oral interviews with university learners from over nine language backgrounds at intermediate level: A2 (N=20) and B1 (N=20) (Common European Framework of Reference). Our methodology is Learner Corpus Research, specifically Computer-aided Error Analysis. The results in our corpus show that formal errors are more frequent at A2 level, whereas semantic errors, although less abundant, persist and slightly increase at B1.

Work funded by the Comunidad de Madrid and the European Social Fund through a predoctoral research contract.

Repositorio Institucional de la Universidad de Alicante

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Medical Lexicon for Spanish (MedLexSp)

Author: Campillos-Llanos Leonardo
Publication venue: DIGITAL.CSIC
Publication date: 25/05/2022

File list:
- MedLexSp.dsv: a delimiter-separated value file with the following data fields: field 1 is the UMLS CUI of the entity; field 2, the lemma; field 3, the variant forms; field 4, the part-of-speech; field 5, the semantic type(s); and field 6, the semantic group.
- MedLexSp.xml: an XML-encoded version using the Lexical Markup Framework (LMF), which includes the morphological data (number, gender, verb tense and person, and information about affix/abbreviation data). The Document Type Definition file is also provided (lmf.dtd).
- Lexical Record files, in subfolder "LR/":
  · LR_abr.dsv: list of equivalences between acronyms/abbreviations and full forms.
  · LR_affix.dsv: provides the equivalence between affixes/roots and their meanings.
  · LR_n_v.dsv: list of deverbal nouns.
  · LR_adj_n.dsv: list of adjectives derived from nouns.
- spaCy lemmatizer (in subfolder "spacy_lemmatizer/"): lemmatizer.py
- Stanza lemmatizer (in subfolder "stanza_lemmatizer/"): ancora-medlexsp.pt

Companion code and files can be found in the GitHub repository: https://github.com/lcampillos/MedLexSp

MedLexSp is a unified medical lexicon for Medical Natural Language Processing in Spanish. It includes terms and inflected word forms with part-of-speech information and Unified Medical Language System (UMLS) semantic types, groups and Concept Unique Identifiers (CUIs). To create it, we used Natural Language Processing techniques and domain corpora (e.g. MedlinePlus). We also collected terms from the Dictionary of Medical Terms from the Spanish Royal Academy of Medicine, the Medical Subject Headings (MeSH), the Systematized Nomenclature of Medicine – Clinical Terms (SNOMED-CT), the Medical Dictionary for Regulatory Activities Terminology (MedDRA), the International Classification of Diseases vs 10, the Anatomical Therapeutical Classification, the National Cancer Institute (NCI) Dictionary, the Online Mendelian Inheritance in Man (OMIM) and OrphaData. Terms related to COVID-19 were assembled by applying a similarity-based approach with word embeddings trained on a large corpus.

The lexicon includes 100 887 lemmas, 302 543 inflected forms (conjugated verbs, and number/gender variants), and 42 958 Unified Medical Language System (UMLS) Concept Unique Identifiers (CUIs).

Spain, Latin America and United States of America (data from MedlinePlus Spanish and the Spanish version of the National Cancer Institute Dictionary of Medical Terms).

This dataset was collected in the NLPMedTerm project, funded by the European Union’s Horizon 2020 research programme under the Marie Skłodowska-Curie grant agreement nº. 713366 (InterTalentum UAM), and the CLARA-MeD project (PID2020-116001RA-C33), funded by MCIN/AEI/10.13039/501100011033/, in project call: "Proyectos I+D+i Retos Investigación", with the goal of creating a lexical resource for medical text processing in the Spanish language.

Peer reviewed
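The six-field layout described for MedLexSp.dsv could be read along the following lines. This is a minimal sketch under stated assumptions: the description does not name the delimiter or how multiple variant forms and semantic types are separated within a field, so the `|` and `;` separators, the `LexiconEntry` class, the `parse_dsv` function and the sample rows are all hypothetical placeholders rather than real lexicon content.

```python
import csv
import io
from dataclasses import dataclass
from typing import List


@dataclass
class LexiconEntry:
    cui: str                   # field 1: UMLS CUI of the entity
    lemma: str                 # field 2: lemma
    variants: List[str]        # field 3: variant forms
    pos: str                   # field 4: part-of-speech
    semantic_types: List[str]  # field 5: semantic type(s)
    semantic_group: str        # field 6: semantic group


def parse_dsv(text, delimiter="|", subdelimiter=";"):
    """Parse six-field rows into LexiconEntry objects.

    ASSUMPTION: both separators are guesses for illustration; check the
    actual file before relying on them.
    """
    entries = []
    reader = csv.reader(io.StringIO(text), delimiter=delimiter,
                        quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 6:
            continue  # skip blank or malformed rows
        cui, lemma, variants, pos, sem_types, sem_group = row
        entries.append(LexiconEntry(
            cui=cui,
            lemma=lemma,
            variants=[v for v in variants.split(subdelimiter) if v],
            pos=pos,
            semantic_types=[t for t in sem_types.split(subdelimiter) if t],
            semantic_group=sem_group,
        ))
    return entries


# Placeholder rows with invented identifiers; NOT real MedLexSp entries.
SAMPLE = (
    "C0000001|ejemplo|ejemplo;ejemplos|N|T000|GRP\n"
    "C0000002|probar|probar;pruebo;probamos|V|T000;T001|GRP\n"
)

entries = parse_dsv(SAMPLE)
for entry in entries:
    print(entry.cui, entry.lemma, len(entry.variants))
```

Keeping the separators as parameters means the same function can be pointed at the real file once its actual delimiter has been checked.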

Digital.CSIC