7 research outputs found

    Named Entity Recognition and Text Compression

    In recent years, social networks have become very popular, and it is easy for users to share their data through them. Because social network data are idiomatic, irregular, and brief, and contain acronyms and spelling errors, they are more challenging to process than news or other formal texts. Given the huge volume of posts each day, effective extraction and processing of these data would greatly benefit information extraction applications. This thesis proposes a method to normalize Vietnamese informal text from social networks. The method identifies and normalizes informal text using the structure of Vietnamese words, Vietnamese syllable rules, and a trigram model. After normalization, the data are processed by a named entity recognition (NER) model that identifies and classifies named entities; the NER model uses six different types of features and three predefined classes: Person (PER), Location (LOC), and Organization (ORG). Social network data are very large and grow daily, which raises the challenge of reducing their size; the trigram dictionary used for normalization is also quite large and likewise needs to be reduced. To address this challenge, the thesis proposes three methods for compressing text files, especially Vietnamese text. The first is a syllable-based method relying on the structure of Vietnamese morphosyllables, consonants, syllables, and vowels. The second is trigram-based Vietnamese text compression built on a trigram dictionary. The third is based on an n-gram sliding window and uses five dictionaries, for unigrams, bigrams, trigrams, four-grams, and five-grams; it achieves a promising compression ratio of around 90% and can be used on text files of any size.
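    The dictionary-based idea behind these compression methods can be illustrated with a small sketch. This is not the thesis's actual algorithm (which exploits Vietnamese syllable structure and five separate n-gram dictionaries); it only shows, under simplified assumptions, how replacing frequent tokens with short fixed-width codes shrinks text while staying losslessly reversible:

```python
from collections import Counter

def build_dictionary(corpus_tokens, max_entries=65535):
    """Map the most frequent tokens to short integer codes (0..65534)."""
    freq = Counter(corpus_tokens)
    return {tok: i for i, (tok, _) in enumerate(freq.most_common(max_entries))}

def compress(tokens, dictionary):
    """Replace known tokens with 2-byte codes; escape unknown tokens."""
    out = bytearray()
    for tok in tokens:
        code = dictionary.get(tok)
        if code is not None:
            out += code.to_bytes(2, "big")
        else:
            raw = tok.encode("utf-8")
            # 0xFFFF marks an escaped literal: marker, length byte, raw bytes
            out += b"\xff\xff" + len(raw).to_bytes(1, "big") + raw
    return bytes(out)

def decompress(data, dictionary):
    inverse = {v: k for k, v in dictionary.items()}
    tokens, i = [], 0
    while i < len(data):
        code = int.from_bytes(data[i:i + 2], "big")
        if code == 0xFFFF:                       # escaped literal token
            n = data[i + 2]
            tokens.append(data[i + 3:i + 3 + n].decode("utf-8"))
            i += 3 + n
        else:                                    # dictionary-coded token
            tokens.append(inverse[code])
            i += 2
    return tokens
```

A real n-gram variant would key the dictionary on multi-token sequences selected by a sliding window, trading dictionary size against compression ratio.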

    Analysis and Modular Approach for Text Extraction from Scientific Figures on Limited Data

    Scientific figures are widely used as compact, comprehensible representations of important information. Their re-usability is limited, however, as one can rarely search for figures directly, since they are mostly indexed by their surrounding text (e.g., the publication or website), which often does not convey the full message of the figure. This thesis focuses on making the content of scientific figures accessible by extracting the text from these figures. Based on a thorough analysis of the literature, a modular pipeline for unsupervised text extraction from scientific figures was built to address the problem. This modular pipeline was used to build several unsupervised approaches and to evaluate different methods from the literature as well as new methods and method combinations; some supervised approaches were built as well for comparison. One challenge while evaluating the approaches was the lack of annotated data, which especially had to be considered when building the supervised approaches. Three existing datasets were used for evaluation, together with two newly created, manually annotated datasets totalling 241 scientific figures. Additionally, two existing datasets for text extraction from other types of images were used for pretraining the supervised approach. Several experiments showed the superiority of the unsupervised pipeline over common Optical Character Recognition engines and identified the best unsupervised approach. This unsupervised approach was compared with the best supervised approach, which, despite the limited amount of training data available, clearly outperformed the unsupervised approach.
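    The modular-pipeline idea can be sketched in a few lines: each stage is a swappable function, so methods from the literature can be compared by exchanging one stage at a time. The stages below are toy placeholders of my own, not the thesis's actual modules (which operate on figure images, e.g. region detection and character recognition):

```python
class Pipeline:
    """Chain of interchangeable processing stages applied in order."""

    def __init__(self, *stages):
        self.stages = stages              # ordered, swappable steps

    def run(self, data):
        for stage in self.stages:         # each stage's output feeds the next
            data = stage(data)
        return data

# Toy usage on a string instead of a figure image: normalise, then tokenize.
pipe = Pipeline(str.lower, str.split)
```

Swapping `str.lower` for another normalisation function changes one stage without touching the rest, which is the property that makes systematic method comparison feasible.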

    “How cavemen did social media”: A comparative case study of social movement organisations using Twitter to mobilise on climate change

    In the face of widespread public disillusionment with traditional politics, the internet is emerging as a popular tool for increasing public participation in social and political activism. Little research has been done, however, on how social movement organisations use the internet, and in particular increasingly popular social networking services, to mobilise individuals. Accordingly, this thesis presents a comparative case study of three climate change campaigns' Twitter accounts, aiming to identify and analyse the ways they use Twitter as part of their mobilisation efforts. Use of Twitter varied across all three, reflecting campaign design. However, each case displayed efforts to establish and use online ties and networks to facilitate and sustain participation in low-risk, moderate, and symbolic forms of online and offline action. These findings offer inspiration for movement activists seeking to use the internet to mobilise on climate change, and open the role of social networking services in movement mobilisation to greater academic attention.

    Formalising Human Mental Workload as a Defeasible Computational Concept

    Human mental workload has gained importance in the last few decades as a fundamental design concept in human-computer interaction. It can be intuitively defined as the amount of mental work necessary for a person to complete a task over a given period of time. For people interacting with interfaces, computers and technological devices in general, the construct plays an important role. At low levels of workload, people processing information often feel annoyed and frustrated; at higher levels, mental workload becomes critical and dangerous, as it leads to confusion, decreases the performance of information processing and increases the chances of errors and mistakes. It is extensively documented that both mental overload and underload negatively affect performance. Hence, designers and practitioners who are ultimately interested in system or human performance need answers about operator workload at all stages of system design and operation. At an early system design phase, designers require an explicit model to predict the mental workload imposed by their technologies on end users, so that alternative system designs can be evaluated. However, human mental workload is a multifaceted and complex construct, mainly applied in the cognitive sciences, and a plethora of ad hoc definitions can be found in the literature. Generally, it is not an elementary property; rather, it emerges from the interaction between the requirements of a task, the circumstances under which it is performed and the skills, behaviours and perceptions of the operator. Although measuring mental workload has advantages in interaction and interface design, its formalisation as an operational and computational construct has not been sufficiently addressed.
Many researchers agree that too many ad hoc models are present in the literature and that they are applied subjectively by mental workload designers, limiting their application in different contexts and making comparison across models difficult. This thesis introduces a novel computational framework for representing and assessing human mental workload based on defeasible reasoning. The starting point is the investigation of the nature of human mental workload, which appears to be a defeasible phenomenon. A defeasible concept is a concept built upon a set of arguments that can be defeated by adding additional arguments. The word 'defeasible' is inherited from defeasible reasoning, a form of reasoning built upon reasons that can be defeated. It is also known as non-monotonic reasoning because of the technical property (non-monotonicity) of the logical formalisms aimed at modelling defeasible reasoning activity: a conclusion or claim, derived from the application of previous knowledge, can be retracted in the light of new evidence. Formally, state-of-the-art defeasible reasoning models are implemented using argumentation theory, a multi-disciplinary paradigm that incorporates elements of philosophy, psychology and sociology. It systematically studies how arguments can be built, sustained or discarded in a reasoning process, and it investigates the validity of their conclusions. Since mental workload can be seen as a defeasible phenomenon, formal defeasible argumentation theory may have a positive impact on its representation and assessment: mental workload can be captured, analysed and measured in ways that increase its understanding and allow its use in practical activities. The research question investigated here is whether defeasible argumentation theory can enhance the representation of the construct of mental workload and improve the quality of its assessment in the field of human-computer interaction.
To answer this question, the recurrent knowledge and evidence employed in state-of-the-art mental workload measurement techniques were first reviewed, along with their defeasible and non-monotonic properties. Secondly, an investigation of state-of-the-art computational techniques for implementing defeasible reasoning was carried out. This allowed the design of a modular framework for mental workload representation and assessment. The proposed solution was evaluated by comparing the sensitivity, diagnosticity and validity of the assessments produced by two instances of the framework against those produced by two well-known subjective mental workload assessment techniques (the NASA Task Load Index and the Workload Profile) in the context of human-web interaction. In detail, through an empirical user study, it was first demonstrated how these two state-of-the-art techniques can be translated into two particular instances of the framework while maintaining the same validity: the indexes of mental workload inferred by the two original instruments and those generated by their corresponding translations (instances of the framework) showed a positive and nearly perfect statistical correlation. Additionally, a new defeasible instance built with the framework showed better sensitivity and a higher diagnosticity capacity than the two selected state-of-the-art techniques. It showed a higher convergent validity with those techniques, but a better concurrent validity with performance measures: the new defeasible instance generated indexes of mental workload that correlated better with the objective time for task completion than the two selected instruments did.
These findings support the research question, demonstrating how defeasible argumentation theory can be successfully adopted to support the representation of mental workload and to enhance the quality of its assessments. The main contribution of this thesis is a methodology, developed as a formal modular framework, to represent mental workload as a defeasible computational concept and to assess it as a usable numerical index. This research contributes to the body of knowledge by providing a modular framework, built upon defeasible reasoning and formalised through argumentation theory, in which workload can be optimally measured, analysed, explained and applied in different contexts.
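The defeasible-reasoning machinery described above can be made concrete with a toy Dung-style abstract argumentation example. The sketch below is illustrative only and is not the thesis's framework: it computes the grounded extension (the set of arguments that ultimately survive all attacks) by iterating the characteristic function until a fixed point is reached:

```python
def grounded_extension(arguments, attacks):
    """Grounded extension of an abstract argumentation framework.

    arguments: a set of argument labels.
    attacks:   a set of (attacker, target) pairs.
    """
    def acceptable(arg, defenders):
        # arg is acceptable w.r.t. a set S if S attacks every attacker of arg
        attackers = {a for (a, t) in attacks if t == arg}
        return all(any((d, a) in attacks for d in defenders) for a in attackers)

    extension = set()
    while True:
        # characteristic function: collect all arguments defended by `extension`
        new = {a for a in arguments if acceptable(a, extension)}
        if new == extension:
            return extension
        extension = new
```

Here an argument such as "workload is high" can be defeated by a counter-argument and later reinstated when that counter-argument is itself defeated, which is exactly the non-monotonic behaviour the abstract describes.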

    Contributions to the privacy provisioning for federated identity management platforms

    Identity information, personal data and users' profiles are key assets for organizations and companies, and the use of identity management (IdM) infrastructures has become a prerequisite for most of them, since IdM systems allow companies to perform their business transactions by sharing information and customizing services more efficiently and effectively. Owing to the importance of the identity management paradigm, a great deal of work has been done so far, resulting in a set of standards and specifications. According to these, under the umbrella of the IdM paradigm a person's digital identity can be shared, linked and reused across different domains, giving users simple session management, among other benefits. In this way, users' information is widely collected and distributed to offer new added-value services and to enhance availability. While these new services have a positive impact on users' lives, they also bring privacy problems. IdM systems are the ideal place to deploy privacy solutions for managing users' personal data while protecting their privacy, since they handle users' attribute exchange. Nevertheless, current IdM models and specifications do not sufficiently address comprehensive privacy mechanisms or guidelines that give users better control over the use, disclosure and revocation of their online identities. These aspects are essential, especially in sensitive environments where incorrect or unsecured management of users' data may lead to attacks, privacy breaches, identity misuse or fraud. Several approaches to IdM exist today, each with benefits and shortcomings from the privacy perspective. The main goal of this thesis is to contribute to privacy provisioning for federated identity management platforms. For this purpose, we propose a generic architecture that extends current federated IdM systems. We have mainly focused our contributions on health care environments, given their particularly sensitive nature.
The two main pillars of the proposed architecture are a selective privacy-enhanced user profile management model and flexible consent revocation, achieved by incorporating an event-based hybrid IdM approach that replaces time constraints and explicit revocation with the activation and deactivation of authorization rights according to events. The combination of both models handles both online and offline scenarios and empowers the user, letting her bring together identity information from different sources. Regarding consent revocation, we propose an implicit, event-based revocation mechanism built around a new concept, the sleepyhead credential, which is issued only once and can be used at any time. Moreover, we integrate this concept into IdM systems supporting a delegation protocol, contribute a mathematical model of how event arrivals at the IdM system are determined and routed to the corresponding entities, and integrate the mechanism with the most widely deployed specification, the Security Assertion Markup Language (SAML). Regarding user profile management, we define a privacy-aware user profile management model that provides efficient selective information disclosure. With this contribution, a service provider can access specific personal information without being able to inspect any other details, while the user keeps control of her data by deciding who can access it. The structure we consider for user profile storage is based on extensions of Merkle trees that allow hash combining, minimizing the need to verify individual elements along a path. We also provide an algorithm for sorting the tree so that frequently accessed attributes sit closer to the root, minimizing access time.
Formal validation of the above ideas has been carried out through simulations and the development of prototypes. In addition, dissemination activities were performed in projects, journals and conferences. Programa Oficial de Doctorado en Ingeniería Telemática. Committee: María Celeste Campo Vázquez (chair), María Francisca Hinarejos Campos (secretary), Óscar Esparza Martí (member).
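The selective-disclosure idea behind the profile store can be sketched with a plain Merkle tree: a provider verifies one disclosed attribute against the tree root without seeing the others. This sketch is an assumption for illustration only; the thesis's structure additionally uses hash combining and frequency-based ordering of attributes, which are not reproduced here:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    """Root hash over a list of attribute strings."""
    level = [h(x.encode()) for x in leaves]
    while len(level) > 1:
        if len(level) % 2:                    # duplicate last node on odd levels
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def prove(leaves, index):
    """Sibling hashes needed to recompute the root for one leaf."""
    level = [h(x.encode()) for x in leaves]
    proof = []
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        proof.append((level[index ^ 1], index % 2))   # (sibling, leaf-is-right?)
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return proof

def verify(leaf, proof, root):
    """Check a single disclosed attribute against the published root."""
    node = h(leaf.encode())
    for sibling, is_right in proof:
        node = h(sibling + node) if is_right else h(node + sibling)
    return node == root
```

The user publishes only the root; disclosing one attribute plus its proof reveals nothing about sibling attributes beyond their hashes, which is the selective-disclosure property the abstract describes.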

    Design of a Multi-Domain Conceptual Model for Recommendations through Semantic Information Filtering in Social Media

    Users today increasingly demand the ability to search the many kinds of content stored on the Web. On the one hand, the Web and social media hold a vast amount of information about products, content and services, which can overwhelm a user trying to decide which product, content or service meets their needs. On the other hand, Recommender Systems are increasingly common across application areas, since they help evaluate and filter the large amount of information available on the Web from different paradigms. The need to make recommendation processes ever clearer, and to satisfy and meet users' expectations, has made the study of the formal semantic models used by Recommender Systems applied to social media highly important, all the more so because users use the Web to publish, edit and share their content. Formal semantic models for recommendation in social media therefore both make information easier to use and add value by generating a knowledge representation of different domains; this information in turn serves as the basis for generating recommendations to Web users through Recommender Systems. The Semantic Web enables convergence between how people use and interact with social media, making a wide variety of content accessible to semantic Web technologies, learning techniques and information filtering.

Social communication platforms on the Web have also emerged from the need to offer more diverse information and deliver personalized content to different types of users. Semantic models for Knowledge-Based Systems can be applied in many multidisciplinary fields, such as natural language processing, virtual reality, neural networks, massive games, expert systems, robotics, planning systems, image recognition, machine translation, problem solving, evolutionary systems and machine learning, among others. However, models based on semantic knowledge in Recommender Systems for social media environments have not yet been fully exploited, and they remain an open research area in the search for solutions to information needs across domains. This research therefore proposes the design of a new semantic multi-domain conceptual model for representing knowledge about products, brands, their characteristics and the services offered on social networks. The conceptual model can also model and manage knowledge about different profiles of users, products and social media characterized for different domains, within a context of services and products, so that the model can be applied to knowledge representation in different domains without changing its core concepts. Beyond the hypotheses that set the direction of the work and the stated objectives, this thesis contributes the design of the proposed model. The methodology followed in this thesis consisted of the following steps: 1) Study of the state of the art, to establish the originality of the work and the existing resources in the target area. 2) Definition of a new multi-domain conceptual model based on semantic knowledge, carried out in parallel with the state-of-the-art study, which informs the understanding of the problem and thus the definition of the model; the model is developed with a modelling tool that manages the concepts represented in it, supported by an expert who assists in interpreting the data. 3) Extraction of semantic data based on structured content, with the information taken from sources stored on the Web. 4) Preliminary solution, a stage that yields the first results and a first view of the model's behaviour on the extracted data. 5) Design of a computational framework based on the proposed model, integrating a Knowledge-Based System, a Recommender System, the semantic data based on structured content, and the information extracted from the Web. 6) Validation and experimentation, in which the research hypotheses are tested and it is verified that the developed model can represent the knowledge relevant to the problem, applying it to knowledge representation for different domains through the developed computational framework, itself based on semantic knowledge and structured content. 7) Verification and analysis of results: after validation, the results are studied to confirm the validity of the model and tools proposed in this research; the aim is to generate knowledge for different domains from the conceptual model, with the information stored in the system used to generate recommendations through a Recommender System. 8) Documentation: throughout the process, the documentation that constitutes this doctoral thesis was produced. The multi-domain conceptual model opens new possibilities in the areas of the Semantic Web, Knowledge-Based Systems and the formal semantic models of Artificial Intelligence, specifically in the conception and development of a new multi-domain conceptual model. Modelling with multi-domain techniques also facilitates finding solutions in information, decision making and the use of specialized knowledge in different application domains of structured, semantic content, in turn generating relevant information about users' tastes, needs and preferences that allows recommendations to be generated by a Recommender System. Thanks to the Consejo Nacional de Ciencia y Tecnología (CONACYT) and the Secretaría de Educación Pública (SEP), through PROMEP, for funding this research, and to the project "FLORA: Financial Linked Open Data-based Reasoning and Management for Web Science" (TIN2011-27405). Programa Oficial de Doctorado en Ciencia y Tecnología Informática. Committee: Antonio Bibiloni Coll (chair), María Belén Ruiz Mezcua (secretary), Giner Alor Hernánde (member).
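The recommendation step in the framework above can be sketched with a minimal content-based filter: items are described by domain attributes, a user profile holds attribute weights, and items are ranked by weighted overlap. All names here are hypothetical; the thesis's model uses a full semantic multi-domain knowledge representation rather than flat attribute sets:

```python
def recommend(profile, items, top_n=3):
    """Rank items by how well their attributes match a user's weighted profile.

    profile: {attribute: weight}
    items:   {item_name: set of attributes}
    """
    def score(attrs):
        # sum the profile weights of every attribute the item carries
        return sum(profile.get(a, 0.0) for a in attrs)

    ranked = sorted(items, key=lambda name: score(items[name]), reverse=True)
    return ranked[:top_n]
```

A semantic variant would expand each attribute through an ontology before scoring, so that, for example, an interest in "smartphones" also matches items tagged with a narrower concept from the same domain.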

    A Brokering Approach to Federating Spatial Data in a Semantic Web Environment

    This thesis proposes a broker approach to federating Australia and New Zealand's spatial data using Semantic Web technologies and ontologies. The proposed approach improves on current methods of integrating and accessing spatial data in Australia and New Zealand by enabling on-demand access to current data, removing the need for a data warehouse to maintain and store the integrated data, and allowing the semantic reconciliation of heterogeneous spatial datasets.
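The broker idea can be sketched abstractly: instead of querying a pre-built warehouse, the broker fans a query out to the source endpoints on demand and merges the semantically aligned results. Endpoints and the alignment step here are plain callables standing in for real spatial data services; this is an assumption for illustration, not the thesis's architecture:

```python
def broker_query(query, endpoints, align=lambda record: record):
    """Answer a query on demand by fanning out to sources and merging results.

    endpoints: callables that take a query and return an iterable of records.
    align:     maps each source record into a common (reconciled) schema.
    """
    merged = []
    for endpoint in endpoints:
        for record in endpoint(query):
            merged.append(align(record))
    return merged
```

Because no result is materialised ahead of time, every answer reflects the sources' current state, which is the warehouse-free, on-demand property the abstract highlights.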