9 research outputs found

    A survey of automatic term extraction for Brazilian Portuguese

    Background: Term extraction is highly relevant as it is the basis for several tasks, such as the building of dictionaries, taxonomies, and ontologies, as well as the translation and organization of text data.
    Methods and Results: In this paper, we present a survey of the state of the art in automatic term extraction (ATE) for the Brazilian Portuguese language. The main contributions and projects related to this task have been classified according to the knowledge they use: statistical, linguistic, and hybrid (statistical and linguistic). We also present a review of the corpora used for term extraction in Brazilian Portuguese, as well as a geographic mapping of Brazil regarding such contributions, projects, and corpora, considering their origins.
    Conclusions: In spite of the importance of ATE, there are still several gaps to be filled, for instance, the lack of consensus regarding the formal definition of 'term'. Such gaps are larger for Brazilian Portuguese than for other languages, such as English, Spanish, and French. Examples of gaps for Brazilian Portuguese include the lack of a baseline ATE system and of the use of more sophisticated linguistic information, such as the WordNet and Wikipedia knowledge bases. Nevertheless, there is an increase in the number of contributions related to ATE and an interesting tendency to use contrasting corpora and domain stoplists, even though most contributions only use frequency, noun phrases, and morphosyntactic patterns.
    Funding: Sao Paulo Research Foundation (FAPESP) (Grants 2009/16142-3, 2011/19850-9, 2012/03071-3, and 2012/09375-4); National Council of Technological and Scientific Development (CNPq

    Evaluation of cutoff policies for term extraction


    Validation of domain terms by means of a fuzzy lexical-semantic base (Validação de termos de domínio por meio de uma base lexical-semântica difusa)

    Term extraction or recognition searches a given corpus to provide a list of domain-specific terms for further use in more advanced tasks such as terminology and ontology building. Several statistical measures and Natural Language Processing techniques have been researched to improve the precision of the retrieved lists. However, to keep recall high, the lists contain a number of false positives. To validate candidates as true positives in the domain, terms have to be manually evaluated or automatically checked against external resources such as specialized glossaries.
    Starting with a baseline of 50 candidate terms with 52% precision, we perform a series of experiments to show that a lexical knowledge base can significantly improve glossary performance. Furthermore, using a fuzzy lexical base, in which words are clustered by a semantic association value, we search for cutoff points that reach 100% for either precision or recall on the baseline list while keeping the F-measure above 80%, with 90% as the best result. We conclude that, although further research on thresholds and different scenarios is needed, a fuzzy lexical base can improve current state-of-the-art approaches to automatic term extraction.
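    The cutoff search described in the abstract above can be illustrated with a minimal sketch. It assumes each candidate term carries a numeric score and that a gold-standard set of true domain terms is available; the function name `evaluate_cutoff`, the scores, and the toy terms are hypothetical and are not taken from the paper.

```python
def evaluate_cutoff(scored, gold, threshold):
    """Return (precision, recall, f_measure) for candidates scoring >= threshold.

    scored: dict mapping candidate term -> score (hypothetical association value)
    gold:   set of terms accepted as true domain terms
    """
    selected = {term for term, score in scored.items() if score >= threshold}
    true_positives = len(selected & gold)
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Toy data invented for illustration only.
scored = {"ontology": 0.9, "corpus": 0.8, "term": 0.7, "page": 0.3, "thing": 0.1}
gold = {"ontology", "corpus", "term"}

for threshold in (0.0, 0.5, 0.85):
    p, r, f = evaluate_cutoff(scored, gold, threshold)
    print(f"cutoff={threshold:.2f} precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

    Raising the cutoff trades recall for precision, so scanning several thresholds shows where the F-measure stays acceptable.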

    The underpinnings of a composite measure for automatic term extraction: The case of SRC

    The corpus-based identification of those lexical units which serve to describe a given specialized domain usually becomes a complex task, where an analysis oriented to the frequency of words and the likelihood of lexical associations is often ineffective. The goal of this article is to demonstrate that a user-adjustable composite metric such as SRC can accommodate the diversity of domain-specific glossaries to be constructed from small- and medium-sized specialized corpora of non-structured texts. Unlike most research in automatic term extraction, where single metrics are usually combined indiscriminately to produce the best results, SRC is grounded on the theoretical principles of salience, relevance and cohesion, which have been rationally implemented in the three components of this metric.
    Financial support for this research has been provided by the DGI, Spanish Ministry of Education and Science, grants FFI2011-29798-C02-01 and FFI2014-53788-C3-1-P.
    Periñán Pascual, JC. (2015). The underpinnings of a composite measure for automatic term extraction: The case of SRC. Terminology. 21(2):151-179. doi:10.1075/term.21.2.02per

    The main challenge of semi-automatic term extraction methods

    Term extraction is the basis for many tasks, such as the building of taxonomies, ontologies and dictionaries, and the translation, organization and retrieval of textual data. This paper studies the main challenge of semi-automatic term extraction methods, which is the difficulty of analyzing the ranking of candidates produced by these methods. The experimental evaluation performed in this work makes it possible to fairly compare a wide set of semi-automatic term extraction methods, which supports future investigations. Additionally, we discovered which level of knowledge and which threshold should be adopted for these methods in order to obtain good precision or F-measure. The results show that no single method is the best for all three corpora used.
    Funding: Sao Paulo Research Foundation (FAPESP) (Grants 2009/16142-3

    DEXTER: A workbench for automatic term extraction with specialized corpora

    Automatic term extraction has become a priority area of research within corpus processing. Despite the extensive literature in this field, there are still some outstanding issues that should be dealt with during the construction of term extractors, particularly those oriented to support research in terminology and terminography. In this regard, this article describes the design and development of DEXTER, an online workbench for the extraction of simple and complex terms from domain-specific corpora in English, French, Italian and Spanish. In this framework, three issues contribute to placing the most important terms in the foreground. First, unlike the elaborate morphosyntactic patterns proposed by most previous research, shallow lexical filters have been constructed to discard term candidates. Second, a large number of common stopwords are automatically detected by means of a method that relies on the IATE database together with the frequency distribution of the domain-specific corpus and a general corpus. Third, the term-ranking metric, which is grounded on the notions of salience, relevance and cohesion, is guided by the IATE database to display an adequate distribution of terms.
    Financial support for this research has been provided by the DGI, Spanish Ministry of Education and Science, grant FFI2014-53788-C3-1-P.
    Periñán-Pascual, C. (2018). DEXTER: A workbench for automatic term extraction with specialized corpora. Natural Language Engineering. 24(2):163-198. https://doi.org/10.1017/S1351324917000365
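    The idea of contrasting a domain-specific corpus with a general corpus can be sketched as a simple relative-frequency ratio. This is only an illustrative assumption, not DEXTER's actual method, which also relies on the IATE database; the function names, the smoothing constant, the `max_ratio` threshold, and the toy token lists are all hypothetical.

```python
from collections import Counter


def relative_frequencies(tokens):
    """Map each word to its share of all tokens in the corpus."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def domain_specificity(domain_tokens, general_tokens, smoothing=1e-6):
    """Ratio of domain to general relative frequency for each domain word.

    The smoothing constant (an assumption) avoids division by zero for
    words that never occur in the general corpus.
    """
    domain = relative_frequencies(domain_tokens)
    general = relative_frequencies(general_tokens)
    return {w: f / (general.get(w, 0.0) + smoothing) for w, f in domain.items()}


def likely_stopwords(domain_tokens, general_tokens, max_ratio=1.5):
    """Words about as frequent in general text as in the domain corpus."""
    scores = domain_specificity(domain_tokens, general_tokens)
    return sorted(w for w, s in scores.items() if s <= max_ratio)


# Toy corpora invented for illustration only.
domain = "the term extraction of the corpus term".split()
general = "the cat sat on the mat of the house".split()
print(likely_stopwords(domain, general))
```

    Words with a low ratio behave like general-language vocabulary and are candidates for a stoplist, while words with a high ratio are candidates for domain terms.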

    Formal representation of IoT best practices based on the elements of the SEMAT Essence kernel (Representación formal de mejores prácticas de IoT con base en los elementos del núcleo de la Esencia SEMAT)

    Internet of Things (IoT) is a technology that consists of a series of interconnected entities (intelligent physical objects, services and software systems) that work in a coordinated manner. They are intended to simplify processes and improve their efficiency in pursuit of a better quality of life for people. The specialized literature shows that there are practices for developing IoT systems that use monolithic Software Engineering models and are not easy to implement. It is necessary to establish a common base through an explicit representation that covers the problems that may arise when trying to implement these practices. The objective of this project is to formalize some of the best practices of IoT using terminological extraction, taking as the basis of representation the kernel of the SEMAT (Software Engineering Method and Theory) Essence, which makes it possible to describe a common base that frees the practices from the limitations of monolithic methods. This will allow IoT system implementation teams to visualize the progress of activities regardless of their working methods. It will also allow practices to be shared, adapted, connected and reproduced to create new ways of working, helping developers to systematically reuse their knowledge and executives to direct IoT programs and projects with better quality and reduced costs.
    Master's thesis (Magíster en Ingeniería de Sistemas y Computación).
    Table of contents:
    Abstract
    Introduction
    Chapter I: Theoretical Framework
        1.1. Internet of Things (IoT)
        1.2. Software Engineering
        1.3. Good Practices
        1.4. Natural Language Processing (NLP)
        1.5. Systematic Literature Review (SLR)
        1.6. Systematic Literature Mapping (SLM)
        1.7. Focus Groups
    Chapter II: State of the Art
    Chapter III: Problem Statement and Objectives
        3.1. Problem Description
        3.2. Problem Formulation
        3.3. Justification
        3.4. Objectives
    Chapter IV: Methodology
        4.1. Systematic Literature Review (SLR)
        4.2. Relating the Components of IoT Best Practices to the Elements of the Essence Kernel
        4.3. Modeling IoT Best Practices with the Essence Kernel
        4.4. Validation of the IoT Best Practice Models
    Chapter V: Development of the Thesis
        5.1. Systematic Literature Review (SLR) in IoT
        5.2. Relating the Components of IoT Best Practices to the Elements of the Essence Kernel
        5.3. Modeling IoT Best Practices with the Essence Kernel
        5.4. Validation of the IoT Best Practice Models
    Chapter VI: Conclusions and Future Work
        6.1. Conclusions
        6.2. Fulfillment of Objectives
        6.3. Future Work
    References
    Appendices