43 research outputs found

    Creación de datos multilingües para diversos enfoques basados en corpus en el ámbito de la traducción y la interpretación / Creating multilingual data for various corpus-based approaches in translation and interpreting

    Accordingly, this research work aims at exploiting and developing new technologies and methods to better ascertain the needs not only of translators and interpreters, but also of other professionals and ordinary people in their daily tasks, such as corpus and terminology compilation and management. The main topics covered by this work relate to Computational Linguistics (CL), Natural Language Processing (NLP), Machine Translation (MT), Comparable Corpora, Distributional Similarity Measures (DSM), Terminology Extraction Tools (TET) and Terminology Management Tools (TMT). In particular, this work examines three main questions: 1) Is it possible to create a simpler, user-friendly comparable-corpora compilation tool? 2) How can the most suitable TMT and TET be identified for a given translation or interpreting task? 3) How can the internal degree of relatedness in comparable corpora be assessed and measured automatically? This work comprises thirteen peer-reviewed scientific publications, included in Appendix A, while the methodology used and the results obtained in these studies are summarised in the main body of this document. Doctoral thesis defence date: 22 November 2019.
    Corpora are playing an increasingly important role in our multilingual society. High-quality parallel corpora are a preferred resource in the language-engineering and linguistics communities. Nevertheless, the lack of sufficient and up-to-date parallel corpora, especially for narrow domains and poorly resourced languages, is currently one of the major obstacles to further advancement in areas such as translation, language learning, and automatic and assisted translation. An alternative is the use of comparable corpora, which are easier and faster to compile. Corpora in general are extremely important for tasks like translation, extraction, interlinguistic comparison and discovery, and even for building lexicographical resources. Their objectivity, reusability, multiplicity and applicability of uses, easy handling and quick access to large volumes of data are just some of their advantages over more limited resources such as thesauri or dictionaries. By way of example, new terms are coined on a daily basis, and dictionaries cannot keep up with the rate at which they emerge.
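    The third question above, automatically measuring the internal degree of relatedness of a comparable corpus, is commonly approached with distributional similarity measures. The following is a minimal, hypothetical sketch (not the thesis's actual method): it scores two texts by the cosine similarity of their word-frequency vectors, which could serve as a crude document-level relatedness signal.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-frequency vectors of two texts:
    1.0 for identical distributions, 0.0 for no shared vocabulary."""
    freq_a = Counter(text_a.lower().split())
    freq_b = Counter(text_b.lower().split())
    shared = set(freq_a) & set(freq_b)
    dot = sum(freq_a[w] * freq_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    return dot / (norm_a * norm_b)

# Toy pair of "comparable" sentences sharing part of their vocabulary
score = cosine_similarity("parallel corpora help translators",
                          "comparable corpora help interpreters")
```

    A corpus-level measure would aggregate such pairwise scores across documents; real systems typically weight terms (e.g. tf-idf) rather than using raw counts.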

    Anotación Automática de Imágenes Médicas Usando la Representación de Bolsa de Características / Automatic Annotation of Medical Images Using the Bag-of-Features Representation

    The automatic annotation of medical images has become a necessary process for the management, search and exploration of growing medical image databases, supporting diagnosis and image analysis in biomedical research. Automatic annotation assigns high-level concepts to images based on their low-level visual features. This requires an image representation that characterises the visual content and a learning model trained on examples of annotated images. This work explores the Bag of Features (BoF) representation for histology images and Kernel Methods (KM) as machine-learning models for automatic annotation. Additionally, a methodology for image-collection analysis was explored in order to find visual patterns and their relationships with semantic concepts, using Mutual Information Analysis, feature selection with Max-Relevance and Min-Redundancy (mRMR), and Biclustering Analysis. The proposed methodology was evaluated on two image databases: one with images annotated with the four fundamental tissues, and another with images of a type of skin cancer known as basal-cell carcinoma. The image-analysis results show that it is possible to find implicit patterns in image collections from the BoF representation, by selecting the relevant visual words in the collection and associating them with semantic concepts, while biclustering analysis made it possible to find groups of similar images that share visual words associated with the type of stain or with concepts. Automatic annotation was evaluated under different configurations of the BoF approach. The best results achieve a precision of 91% and a recall of 88% on the histology images, and a precision of 59% and a recall of 23% on the histopathology images. The best-performing BoF configuration on both collections used DCT-based visual words with a dictionary of size 1,000 and a Gaussian kernel.
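    The core of the Bag-of-Features pipeline described above can be sketched in a few lines: each image becomes a histogram of "visual word" assignments, and a Gaussian kernel compares histograms for a kernel-method classifier. This is a generic, minimal sketch under assumed toy data, not the thesis's implementation (the real system used DCT-based descriptors and a 1,000-word dictionary).

```python
import math

def nearest_word(descriptor, dictionary):
    """Index of the visual word (dictionary entry) closest to a descriptor."""
    return min(range(len(dictionary)),
               key=lambda i: math.dist(descriptor, dictionary[i]))

def bof_histogram(descriptors, dictionary):
    """Normalised visual-word count histogram: the BoF vector of one image."""
    counts = [0] * len(dictionary)
    for d in descriptors:
        counts[nearest_word(d, dictionary)] += 1
    total = sum(counts)
    return [c / total for c in counts]

def gaussian_kernel(h1, h2, sigma=0.5):
    """Gaussian (RBF) similarity between two BoF histograms, as a
    kernel-method classifier such as an SVM would consume."""
    sq = sum((a - b) ** 2 for a, b in zip(h1, h2))
    return math.exp(-sq / (2 * sigma ** 2))

# Toy example: a 2-word dictionary and one image with 3 local descriptors
dictionary = [(0.0, 0.0), (1.0, 1.0)]
descriptors = [(0.1, 0.0), (0.9, 1.0), (1.0, 0.9)]
h = bof_histogram(descriptors, dictionary)
```

    In practice the dictionary itself is learned (typically by k-means over descriptors sampled from the training collection) rather than fixed by hand as here.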

    Extracção de informação médica em português europeu / Extraction of medical information in European Portuguese

    Doctoral programme in Informatics Engineering. The electronic storage of medical patient data is becoming a daily experience in most practices and hospitals worldwide. However, much of the available data is in free-form text: a convenient way of expressing concepts and events, but especially challenging if one wants to perform automatic searches, summarisation or statistical analysis. Information Extraction can relieve some of these problems by offering a semantically informed interpretation and abstraction of the texts. MedInX, the Medical Information eXtraction system presented in this document, is the first information extraction system developed to process textual clinical discharge records written in Portuguese. The main goal of the system is to improve access to the information locked up in unstructured text and, consequently, the efficiency of the health-care process, by allowing faster and more reliable access to quality health information for both patients and health professionals. MedInX components are based on Natural Language Processing principles and provide several mechanisms to read, process and utilise external resources, such as terminologies and ontologies, in the process of automatically mapping free-text reports onto a structured representation. The flexible and scalable architecture of the system also allowed its application to the task of Named Entity Recognition in a shared evaluation contest focused on Portuguese general-domain free-form texts. The evaluation of the system on a set of authentic hospital discharge letters indicates that it achieves a 95% F-measure on the task of entity recognition and 95% precision on the task of relation extraction. Example applications demonstrating the use of MedInX in real hospital settings are also presented in this document. These applications were designed to answer common clinical problems related to the automatic coding of diagnoses and other health-related conditions described in the documents, according to the international classification systems ICD-9-CM and ICF. The automatic review of the content and completeness of the documents is an example of another developed application, the MedInX Clinical Audit system.
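    The F-measure and precision figures reported for MedInX follow the standard evaluation metrics for extraction systems, which are straightforward to compute from entity-level counts. The counts below are purely illustrative, not the system's actual evaluation data.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard extraction metrics: precision (correct among extracted),
    recall (correct among expected) and their harmonic mean, the F-measure."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# e.g. 95 correctly extracted entities, 5 spurious, 5 missed
p, r, f = precision_recall_f1(tp=95, fp=5, fn=5)
```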

    Situation inference and context recognition for intelligent mobile sensing applications

    The usage of smart devices is an integral element of our daily life. With the richness of data streaming from sensors embedded in these smart devices, the applications of ubiquitous computing are limitless for future intelligent systems. Situation inference is a non-trivial issue in ubiquitous computing research due to the challenges of mobile sensing in unrestricted environments. There are various advantages to having robust and intelligent situation inference from data streamed by mobile sensors. For instance, we would be able to gain a deeper understanding of human behaviours in certain situations via a mobile sensing paradigm. This can then be used to recommend resources or actions for enhanced cognitive augmentation, such as improved productivity and better human decision making. Sensor data can be streamed continuously from heterogeneous sources with different frequencies in a pervasive sensing environment (e.g., a smart home). It is difficult and time-consuming to build a model that is capable of recognising multiple activities, which can be performed simultaneously and at different granularities. We investigate the separability of multiple activities in time-series data and develop OPTWIN, a technique to determine the optimal time-window size to be used in a segmentation process. This novel technique reduces the need for sensitivity analysis, an inherently time-consuming task. To achieve an effective outcome, OPTWIN leverages multi-objective optimisation, minimising the impurity (the number of windows that overlap more than one human-activity label over the time series) while maximising class separability. The next issue is to effectively model and recognise multiple activities based on the user's contexts. Hence, an intelligent system should address the problem of multi-activity and context recognition prior to the situation inference process in mobile sensing applications.
    The performance of simultaneous recognition of human activities and contexts can easily be affected by the choice of modelling approach. We investigate the associations between these activities and contexts at multiple levels of mobile sensing to reveal the dependency property in the multi-context recognition problem. We design a Mobile Context Recognition System, which incorporates a Context-based Activity Recognition (CBAR) modelling approach to produce effective outcomes from both multi-stage and multi-target inference processes, recognising human activities and their contexts simultaneously. In our empirical evaluation on real-world datasets, the CBAR modelling approach significantly improved the overall accuracy of simultaneous inference of transportation mode and human activity of mobile users. The accuracy of activity and context recognition is also progressively influenced by how reliable user annotations are. Essentially, reliable user annotations are required for activity and context recognition, and they are usually acquired during data capture in the wild. We investigate how to reduce the user burden during mobile sensor data collection through experience sampling of these annotations in the wild. To this end, we design CoAct-nnotate, a technique that aims to improve the sampling of human activities and contexts by providing accurate annotation prediction and facilitating interactive user-feedback acquisition for ubiquitous sensing. CoAct-nnotate incorporates a novel multi-view multi-instance learning mechanism to perform more accurate annotation prediction. It also includes a progressive learning process (i.e., model retraining based on co-training and active learning) to improve its predictive performance over time. Moving beyond context recognition of mobile users, human activities can be related to the essential tasks that users perform in daily life.
    However, the boundaries between types of tasks are inherently difficult to establish, as they can be defined differently from individual perspectives. Consequently, we investigate the implications of contextual signals for user tasks in mobile sensing applications. To define the boundaries of tasks, and hence recognise them, we incorporate this situation inference process (i.e., task recognition) into the proposed Intelligent Task Recognition (ITR) framework to learn users' Cyber-Physical-Social activities from their mobile sensing data. By accurately recognising the engaged tasks at a given time via mobile sensing, an intelligent system can offer proactive support to its users as they progress through and complete their tasks. Finally, for robust and effective learning from mobile sensing data originating from heterogeneous sources (e.g., the Internet of Things in a mobile crowdsensing scenario), we investigate the utility of sensor data in provisioning its storage and design QDaS, an application-agnostic framework for quality-driven data summarisation. QDaS performs effective data summarisation through density-based clustering of multivariate time-series data from a selected source (i.e., data provider), where the source selection process is determined by a measure of data quality. This framework allows intelligent systems to retain comparable predictive results by learning effectively on compact representations of mobile sensing data, while achieving a higher space-saving ratio. This thesis contains novel contributions in terms of techniques for mobile situation inference and context recognition, especially in the domain of ubiquitous computing and intelligent assistive technologies. This research implements and extends the capabilities of machine-learning techniques to solve real-world problems in multi-context recognition, mobile data summarisation and situation inference from mobile sensing.
    We firmly believe that the contributions of this research will help future studies move forward in building more intelligent systems and applications.
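    The impurity objective that OPTWIN minimises (windows overlapping more than one activity label) can be illustrated with a small sketch. This is a single-objective simplification of the multi-objective search described above, with hypothetical names and toy data; the real technique also maximises class separability.

```python
def impurity(labels, window):
    """Fraction of fixed-size, non-overlapping windows that span more
    than one activity label ("impure" windows) for a given window size."""
    n_windows = 0
    impure = 0
    for start in range(0, len(labels) - window + 1, window):
        segment = labels[start:start + window]
        n_windows += 1
        if len(set(segment)) > 1:
            impure += 1
    return impure / n_windows

def best_window(labels, candidates):
    """Pick the candidate window size with the lowest label impurity
    (single-objective stand-in for the multi-objective optimisation)."""
    return min(candidates, key=lambda w: impurity(labels, w))

# Per-sample activity labels of a toy time series with two activities
labels = ["walk"] * 6 + ["sit"] * 6
```

    For these labels, a window of 6 samples segments the series cleanly, while a window of 4 straddles the activity boundary and incurs impurity.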

    Metodología para la selección de la métrica de distancia en Neighborhood Kernels para clasificación semi-supervisada de secuencias proteicas / A methodology for distance-metric selection in Neighborhood Kernels for semi-supervised classification of protein sequences

    This project presents a methodology for selecting between geometric and bio-inspired distance metrics in a semi-supervised Support Vector Machine (SVM) classifier for protein sequences from land plants (the Embryophyta dataset). First, a kernel matrix was built through a feature extraction and selection process; in parallel, distance matrices were built for the Euclidean, Mahalanobis, Mismatch and Gappy metrics. Both matrices were used in the Neighborhood Kernel algorithm to obtain a semi-supervised matrix for an SVM classifier optimised with PSO and W-SVM. The prediction model was evaluated by computing a confusion matrix between training and test data, with partitions obtained by cross-validation; the geometric mean of sensitivity and specificity was then calculated. The results show that the presented methodology is efficient at selecting the most appropriate distance metric for each molecular function. The Euclidean metric was selected as the best-performing one for seven functions, with scores from 49.94% to 74.3%; Mismatch was selected for three functions, with scores from 51.63% to 80.78%; and Gappy was selected for four functions, with scores from 43.11% to 68.5%. Finally, it is worth noting that this research project led to the creation of a research line in bioinformatics algorithms at the ITM, and gave rise to four undergraduate degree projects and two new students in the Maestría en Automatización y Control Industrial.
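    The evaluation criterion used above, the geometric mean of sensitivity and specificity, is a standard summary for imbalanced classification and follows directly from the confusion-matrix counts. A minimal sketch, with illustrative counts rather than the thesis's actual data:

```python
import math

def geometric_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity (true-positive rate) and
    specificity (true-negative rate) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)

# Illustrative counts: 80/100 positives and 90/100 negatives correct
g = geometric_mean(tp=80, fn=20, tn=90, fp=10)
```

    Unlike plain accuracy, this score collapses to zero if the classifier ignores either class entirely, which is why it suits skewed protein-function datasets.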

    INForum 2017: Atas do Nono Simpósio de Informática / INForum 2017: Proceedings of the Ninth Symposium on Informatics

    This volume contains the proceedings of the 9th edition of the Symposium on Informatics, INForum 2017, which took place at the Pavilhão de Exposições de Aveiro, in Aveiro, jointly with TechDays 2017, on 12 and 13 October 2017. (...)

    Kodikologie und Paläographie im digitalen Zeitalter 3 / Codicology and Palaeography in the Digital Age 3

    The growing availability of digital reproductions, qualitative improvements in reproduction techniques and the development, in recent years, of new methods for analysing scripts and writing supports have fostered a turn of the historically oriented humanities towards the materiality of written transmission. Following on from the previous volumes in the series, this volume presents current computer-assisted research on written cultural heritage. Its thematic range extends from the presentation of new reproduction techniques, through the application of image manipulation to make barely decipherable manuscripts legible and lexicostatistical studies, to the presentation of databases of writing-support materials. Contents: Oliver Duntze: Einleitung (IX); Tal Hassner, Malte Rehbein, Peter A. Stokes, Lior Wolf: Computation and Palaeography: Potentials and Limits (1). Digital imaging as a palaeographic tool: Fabian Hollaus, Melanie Gau, Robert Sablatnig, William A. Christens-Barry, Heinz Miklas: Readability Enhancement and Palimpsest Decipherment of Historical Manuscripts (31); Christine Voth: What lies beneath: The application of digital technology to uncover writing obscured by a chemical reagent (47). Organizing descriptive information: Rombert Stapel: The development of a medieval scribe (67); Matthieu Bonicel, Dominique Stutzmann: Une application iPad pour l'annotation collaborative des manuscrits médiévaux avec le protocole SharedCanvas: «Formes à toucher» (87); Erwin Frauenknecht, Maria Stieglecker: WZIS – Wasserzeichen-Informationssystem: Verwaltung und Präsentation von Wasserzeichen und ihrer Metadaten (105); Elisa Pallottini: Un corpus di iscrizioni medievali della provincia di Viterbo: Metodologia d'analisi e alcune riflessioni sulla sua informatizzazione (123). Appendices: Kurzbiographien – Biographical Notes (137)

    Use and Evaluation of Controlled Languages in Industrial Environments and Feasibility Study for the Implementation of Machine Translation

    This research is part of the doctoral studies programme "La traducción y la sociedad del conocimiento" at the University of Valencia, specifically within the research line of translation technologies, terminology and localisation. The dissertation arises from the need to establish a research methodology and to provide empirical results on the development, implementation and evaluation of controlled languages in technical documentation, and their effect on both the original texts and their translations. The aim has thus been to develop a methodology for assessing the impact of controlled languages on the production of technical documentation in industrial contexts, and more specifically on technical documentation for vehicles. The impact has materialised as improved automatic translatability, a concept discussed at length in Chapter 4, as well as improved quality of the target texts.