13 research outputs found

    Opinion Mining on Small and Noisy Samples of Health-Related Texts

    People’s health has always attracted the attention of public and private institutions, of patients themselves and, therefore, of researchers. Social networks provide an immense amount of data for analysing health-related issues; however, researchers do not always have enough data to build sophisticated models. In this paper, we artificially create this limitation to test the performance and stability of several popular algorithms on small samples of texts. Apart from the sample size, the research has two specific features: (a) instead of the usual 5-star classification, we use combined classes reflecting a more practical view of medicines and treatments; (b) we consider both original and noisy data. The experiments were carried out using data extracted from the popular forum AskaPatient. GridSearchCV was used to tune parameters. The results show that, for small and noisy data samples, GMDH Shell is superior to the other methods. The work has a practical orientation.

    RealText-cs - Corpus based domain independent Content Selection model

    Content selection is a highly domain-dependent task responsible for retrieving relevant information from a knowledge source for a given communicative goal. This paper presents a domain-independent content selection model that uses keywords as the communicative goal. We employ the DBpedia triple store as our knowledge source, and triples are selected based on the weight assigned to each triple. The weights are calculated from the log-likelihood distance between a domain corpus and a general reference corpus. The method was evaluated using keywords extracted from the QALD dataset, and its performance was compared with cross-entropy-based statistical content selection. The evaluation results showed that the proposed method can perform 32% better than cross-entropy-based statistical content selection.
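
    As a rough illustration of weighting terms by log-likelihood against a reference corpus, the sketch below uses the common two-corpus G2 statistic. It is not the paper's implementation: the function names, the token-list inputs and the filter that keeps only terms over-represented in the domain corpus are assumptions made for this example.

```python
import math
from collections import Counter


def log_likelihood(a: int, b: int, c: int, d: int) -> float:
    """G2 score for a term seen `a` times in a domain corpus of `c` tokens
    and `b` times in a reference corpus of `d` tokens."""
    e1 = c * (a + b) / (c + d)  # expected count in the domain corpus
    e2 = d * (a + b) / (c + d)  # expected count in the reference corpus
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2.0 * g2


def term_weights(domain_tokens, reference_tokens):
    """Weight every term that is relatively more frequent in the domain corpus."""
    dom, ref = Counter(domain_tokens), Counter(reference_tokens)
    c, d = len(domain_tokens), len(reference_tokens)
    return {
        term: log_likelihood(a, ref[term], c, d)
        for term, a in dom.items()
        if a / c > ref[term] / d  # assumption: ignore under-represented terms
    }
```

    A term with the same relative frequency in both corpora scores 0; the more strongly a term is concentrated in the domain corpus, the higher its weight.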

    Identificación de términos a partir de enumeraciones sintagmáticas nominales: una aplicación al dominio médico

    Starting from the hypothesis that the nominal syntagmatic enumerations (enumeraciones sintagmáticas nominales, ESN) found in medical texts are composed of domain-specific terms, we present a method for recognising such enumerations with the aim of contributing to automatic extraction. The methodology consists of three stages: (i) recognition of nominal syntagmatic enumerations, which uses exclusively linguistic information from which syntactic analysis rules are derived; (ii) automatic extraction of term candidates corresponding to unigrams and bigrams; and (iii) evaluation of the extracted candidates with the advice of experts from the medical field. The experiments were carried out on the IULA corpus, which consists of medical texts in Spanish. The results were encouraging: a precision of 67% and 68% was achieved on the detected enumerations for unigrams and bigrams respectively. Sociedad Argentina de Informática e Investigación Operativ

    Comparative analysis of TF-IDF and loglikelihood method for keywords extraction of twitter data

    Twitter has become a leading social media platform in today’s world. Over 335 million users are online monthly, and about 80% access it from their mobile devices. Twitter now supports 35+ languages, which further broadens its usage by serving people who speak different languages. About 21% of all users are from the US and 79% are from outside the US. A tweet is restricted to 140 characters; hence the information it contains is concise and valuable. It is estimated that five hundred million tweets are sent per day by different categories of people, including teachers, students, celebrities, officers and musicians. This huge and daily growing amount of data needs to be categorized. A key step is to find the keywords in this data that help identify a tweet for classification. For this purpose, the Term Frequency-Inverse Document Frequency (TF-IDF) and log-likelihood methods are applied to extract keywords from music-related tweets, and a comparative analysis is performed on the two sets of results. Finally, the relevance of the results is judged by 5 users so that a decision, based on the experiments, can be made about which method is best. This analysis is valuable because it gives a more accurate estimate of which method’s results are more reliable.
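
    For readers unfamiliar with TF-IDF keyword extraction, a minimal sketch follows. It assumes tweets are already tokenised into lists of words and uses the plain variant (relative term frequency times the natural log of inverse document frequency); the abstract does not say which variant the authors used, and the function names are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per tokenised document."""
    n = len(docs)
    # document frequency: in how many documents each term occurs
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append(
            {t: (f / len(doc)) * math.log(n / df[t]) for t, f in tf.items()}
        )
    return scores


def top_keywords(docs, k=3):
    """Top-k terms per document, ties broken alphabetically."""
    return [sorted(s, key=lambda t: (-s[t], t))[:k] for s in tf_idf(docs)]
```

    A term that appears in every document receives a score of 0, which is why very common words drop out of the keyword list.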

    Corpora creation in contrastive linguistics (Cтворення корпусів у дослідженнях з зіставного мовознавства)

    Universal and specific features of language usage can become more evident if tested against non-elicited language data on a large scale. This requirement can be met by using corpora that provide ample data to test research hypotheses in contrastive language studies in an objective and falsifiable manner. However, criteria for corpus creation and comparability measures for evaluating available corpora present a separate problem in contrastive linguistics. The article presents an overview of the types of corpora used in Contrastive Linguistics research and describes their characteristic features. The study proceeds to look into the sources of data used in corpus creation, both in (commercially) available corpora and in data collections compiled to answer a particular research question. The article describes the techniques used in creating comparable corpora for contrastive studies and presents the comparability measures used to evaluate the corpora. The study examines the case of building a topic-specific comparable corpus in English and Ukrainian. The corpus focuses on education-related vocabulary in the languages under analysis. Corpus comparability is measured using translation equivalence and word frequency similarity. The article used the procedures outlined above to collect a quasi-comparable (non-aligned) corpus on the topic of education, contrasting English and Ukrainian. Using the frequency comparability measure, it was established that both components of the corpus (English and Ukrainian) contain keywords related to the topic of education. (The article analyses the types of corpora used in contrastive linguistics research to identify universal and specific features of languages. It establishes the main sources of material for compiling corpora, the criteria for selecting texts, the stages of corpus compilation, and the evaluation models and characteristics of corpora for contrastive studies. The article considers the methods used in creating corpora for contrastive research and describes the experience of compiling such corpora for English and Ukrainian. The criteria for selecting material, the stages of corpus construction and the prospects for their use are examined using the example of corpora of education-related vocabulary in the languages analysed.)

    Automatic Term Extraction Using Log-Likelihood Based Comparison with General Reference Corpus

    No full text

    Representación formal de mejores prácticas de IoT con base en los elementos del núcleo de la Esencia SEMAT

    Internet of Things (IoT) is a technology that consists of a series of interconnected entities (smart physical objects, services and software systems) that work in a coordinated manner. They aim to simplify processes and improve their efficiency in pursuit of a better quality of life for people. The specialized literature shows that existing practices for developing IoT systems use monolithic Software Engineering models and are not easy to implement. A common base is needed, in the form of an explicit representation that covers the problems that may arise when trying to implement these practices. The objective of this project is to formalize some of the best practices of IoT using terminology extraction, with the kernel of the SEMAT (Software Engineering Method and Theory) Essence as the basis of representation; the kernel makes it possible to describe a common base that frees the practices from the limitations of monolithic methods. This will allow IoT system implementation teams to visualize the progress of activities regardless of their working methods. It will also allow practices to be shared, adapted, connected and reproduced to create new ways of working, helping developers to reuse their knowledge systematically and executives to direct IoT programs and projects with better quality and reduced costs.

    Degree: Master's (Magíster en Ingeniería de Sistemas y Computación)

    Table of contents:
    Resumen
    Abstract
    Introduction
    Chapter I: Theoretical Framework
        1.1. Internet of Things (IoT)
            1.1.1. IoT Architecture
                1.1.1.1. Perception layer
                1.1.1.2. Network layer
                1.1.1.3. Application layer
            1.1.2. IoT Applications
        1.2. Software Engineering
            1.2.1. SEMAT Essence Kernel
                1.2.1.1. Elements of the SEMAT Essence Kernel
        1.3. Good Practices
            1.3.1. Correct naming of good practices
        1.4. Natural Language Processing (NLP)
            1.4.1. Terminology Extraction
        1.5. Systematic Literature Review (SLR)
        1.6. Systematic Literature Mapping (SLM)
        1.7. Focus groups
    Chapter II: State of the Art
    Chapter III: Problem Statement and Objectives
        3.1. Problem Description
        3.2. Problem Formulation
        3.3. Justification
        3.4. Objectives
            3.4.1. General Objective
            3.4.2. Specific Objectives
    Chapter IV: Methodology
        4.1. Systematic Literature Review (SLR)
            4.1.1. Planning
                4.1.1.1. Definition of the Research Questions
            4.1.2. Primary Search
                4.1.2.1. Specification of the Search Type
                4.1.2.2. Selection of Information Sources
                4.1.2.3. Definition of Search Strings
            4.1.3. Preliminary Selection
                4.1.3.1. Removal of Irrelevant Documents
                4.1.3.2. Removal of Duplicate Documents
            4.1.4. Selection
                4.1.4.1. Definition of inclusion criteria
                4.1.4.2. Definition of exclusion criteria
            4.1.5. Data Extraction
                4.1.5.1. Definition of Quality Criteria
                4.1.5.2. Data Extraction from each Document
            4.1.6. Analysis
        4.2. Relating the Components of IoT Best Practices to the Elements of the Essence Kernel
            4.2.1. Selection of Some IoT Best Practices
            4.2.2. Construction of the IoT Term Vocabulary
                4.2.2.1. Systematic Literature Mapping (SLM)
                4.2.2.2. Construction of the Automatic Term Extractor
                4.2.2.3. Validation of the Automatic Term Extractor
                4.2.2.4. Vocabulary Extraction with the Automatic Term Extractor
            4.2.3. Selection of Names for IoT Best Practices
            4.2.4. Tabulation of IoT Practice Components against Essence Kernel Elements
        4.3. Modelling IoT Best Practices with the Essence Kernel
        4.4. Validation of the IoT Best Practice Models
            4.4.1. Focus Group Planning
            4.4.2. Focus Group Execution
            4.4.3. Data Analysis and Reporting of Results
    Chapter V: Thesis Development
        5.1. Systematic Literature Review (SLR) in IoT
            5.1.1. Conclusions of the Systematic Literature Review
        5.2. Relating the Components of IoT Best Practices to the Elements of the Essence Kernel
            5.2.1. Selection of Some IoT Best Practices
            5.2.2. Construction of the IoT Term Vocabulary
                5.2.2.1. Systematic Literature Mapping (SLM)
                5.2.2.2. Construction of the Automatic Term Extractor
                5.2.2.3. Validation of the Automatic Term Extractor
                5.2.2.4. Vocabulary Extraction with the Automatic Term Extractor
            5.2.3. Selection of Names for IoT Best Practices
            5.2.4. Tabulation of IoT Practice Components against the Essence Kernel
        5.3. Modelling IoT Best Practices with the Essence Kernel
        5.4. Validation of the IoT Best Practice Models
            5.4.1. Focus Group Planning
                5.4.1.1. Definition of the Objective
                5.4.1.2. Identification of Participants
                5.4.1.3. Scheduling the Meeting
                5.4.1.4. Preparation of Focus Group Materials
                5.4.1.5. Sending a Reminder to Participants
            5.4.2. Focus Group Execution
                5.4.2.1. Introduction of Participants
                5.4.2.2. Recording the Meeting
                5.4.2.3. Distribution of Materials
                5.4.2.4. Presentation of the Focus Group
                5.4.2.5. Discussion and Evaluation of the Models
                5.4.2.6. Closing the Meeting
            5.4.3. Data Analysis and Reporting of Results
                5.4.3.1 to 5.4.3.10. Validation Results for Practices 1 to 10
                5.4.3.11. Conclusions of the Model Validation
    Chapter VI: Conclusions and Future Work
        6.1. Conclusions
        6.2. Fulfilment of Objectives
        6.3. Future Work
    References
    Appendices

    Automated Patent Categorization and Guided Patent Search using IPC as Inspired by MeSH and PubMed

    The patent domain is a very important source of scientific information that is currently not used to its full potential. Searching for relevant patents is a complex task because the number of existing patents is very high and grows quickly, patent text is extremely complicated, and standard vocabulary is not used consistently or doesn’t even exist. As a consequence, pure keyword searches often fail to return satisfying results in the patent domain. Major companies employ patent professionals who are able to search patents effectively, but even they have to invest a lot of time and effort into their search. Academic scientists on the other hand do not have access to such resources and therefore often do not search patents at all, but they risk missing up-to-date information that will not be published in scientific publications until much later, if it is published at all. Document search on PubMed, the pre-eminent database for biomedical literature, relies on the annotation of its documents with relevant terms from the Medical Subject Headings ontology (MeSH) for improving recall through query expansion. Similarly, professional patent searches expand beyond keywords by including class codes from various patent classification systems. However, classification-based searches can only be performed effectively if the user has very detailed knowledge of the system, which is usually not the case for academic scientists. Consequently, we investigated methods to automatically identify relevant classes that can then be suggested to the user to expand their query. Since every patent is assigned at least one class code, it should be possible for these assignments to be used in a similar way as the MeSH annotations in PubMed. In order to develop a system for this task, it is necessary to have a good understanding of the properties of both classification systems. 
In order to gain such knowledge, we perform an in-depth comparative analysis of MeSH and the main patent classification system, the International Patent Classification (IPC). We investigate the hierarchical structures as well as the properties of the terms and classes respectively, and we compare the assignment of IPC codes to patents with the annotation of PubMed documents with MeSH terms. Our analysis shows that the hierarchies are structurally similar, but terms and annotations differ significantly. The most important differences concern the considerably higher complexity of the IPC class definitions compared to MeSH terms and the far lower number of class assignments to the average patent compared to the number of MeSH terms assigned to PubMed documents. These differences cause problems for both inexperienced patent searchers and professionals. On the one hand, the complex term system makes it very difficult for members of the former group to find any IPC classes that are relevant for their search task. On the other hand, the low number of IPC classes per patent points to incomplete class assignments by the patent office, therefore limiting the recall of the classification-based searches that are frequently performed by the latter group. We approach these problems from two directions: first, by automatically assigning additional patent classes to make up for the missing assignments, and second, by automatically retrieving relevant keywords and classes that are proposed to the user so they can expand their initial search. For the automated assignment of additional patent classes, we adapt an approach to the patent domain that was successfully used for the assignment of MeSH terms to PubMed abstracts. Each document is assigned a set of IPC classes by a large set of binary Maximum-Entropy classifiers.
Our evaluation shows good performance by individual classifiers (precision/recall between 0.84 and 0.90), making the retrieval of additional relevant documents for specific IPC classes feasible. The assignment of additional classes to specific documents is more problematic, since the precision of our classifiers is not high enough to avoid false positives. However, we propose filtering methods that can help solve this problem. For the guided patent search, we demonstrate various methods to expand a user’s initial query. Our methods use both keywords and class codes that the user enters to retrieve additional relevant keywords and classes that are then suggested to the user. These additional query components are extracted from different sources such as patent text, IPC definitions, external vocabularies and co-occurrence data. The suggested expansions can help inexperienced users refine their queries with relevant IPC classes, and professionals can compose their complete query faster and more easily. We also present GoPatents, a patent retrieval prototype that incorporates some of our proposals and makes faceted browsing of a patent corpus possible.
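
    One of the expansion sources named above is co-occurrence data. The sketch below shows one simple way such a suggestion step could look: class codes are ranked by how often they appear on patents that match a query keyword. It is an assumption-laden toy, not the GoPatents method; the function name, the data layout and the ranking rule are all hypothetical.

```python
from collections import Counter


def suggest_classes(patents, query_terms, k=3):
    """Suggest up to k class codes for a keyword query.

    patents: iterable of (keywords, class_codes) pairs, one per patent.
    A patent matches if it shares at least one keyword with the query;
    codes are ranked by the number of matching patents that carry them,
    with ties broken alphabetically.
    """
    query = set(query_terms)
    counts = Counter()
    for keywords, codes in patents:
        if query & set(keywords):
            counts.update(set(codes))
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [code for code, _ in ranked[:k]]
```

    With a toy corpus in which two "insulin" patents both carry the code A61K, a query for "insulin" would rank A61K first; a query with no matching patent returns an empty list.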

    Provision of better VLE learner support with a Question Answering System

    This research focuses on providing user support to students who use electronic means of communication to aid their learning. The digital age has brought students anytime, anywhere access to learning resources. Most academic institutions, and many companies, use Virtual Learning Environments (VLEs) to provide their learners with learning material. All learners using a VLE have access to the same material and help, regardless of their existing knowledge and interests. This work uses the information in the learning materials of a VLE to answer questions and provide student help through a Question Answering System. The aim of this investigation is to establish whether a satisfactory combination of Question Answering, Information Retrieval and Automatic Summarisation techniques within a VLE supports the student better than existing systems (full-text search engines).