
    A Hybrid Framework for Text Analysis

    2015 - 2016. In Computational Linguistics there is an essential dichotomy between linguists and computer scientists. The former, with a strong knowledge of language structures, lack engineering skills; the latter, expert in computing and mathematics, attach little value to the basic mechanisms and structures of language. This discrepancy has widened in recent decades with the growth of computational resources and the gradual computerization of the world: Machine Learning technologies, which allow machines to learn from manually generated examples, have been used more and more often in Computational Linguistics to overcome the obstacle represented by language structures and their formal representation. The dichotomy has resulted in two main approaches to Computational Linguistics: rule-based methods, which try to imitate the way humans use and understand language by reproducing the syntactic structures on which the understanding process is based and by building lexical resources such as electronic dictionaries, taxonomies or ontologies; and statistics-based methods, which treat language as a set of elements, quantify words mathematically and try to extract information without identifying syntactic structures or, in some algorithms, try to make the machine learn these structures. One of the main problems is the lack of communication between these two approaches, due to the substantial differences that characterize them: on the one hand a strong focus on how language works and on its characteristics, with a tendency toward analytical and manual work; on the other hand an engineering perspective that sees language as an obstacle and regards algorithms as the fastest way to overcome it. However, this lack of communication is not an irreducible incompatibility: following Harris, the best way to approach natural language may be to take the best of both. At the moment there is a large number of open-source tools for text analysis and Natural Language Processing. A great part of these tools are based on statistical models and consist of separate modules that can be combined into a text-processing pipeline. Many of these resources are code packages without a GUI (Graphical User Interface) and are effectively unusable for users without programming skills. Furthermore, the vast majority of these open-source tools support only English and, when Italian is included, their performance decreases significantly; open-source tools for Italian are very few. In this work we want to fill this gap by presenting a new hybrid framework for the analysis of Italian texts. It is not intended as a commercial tool; its purpose is to help linguists and other scholars to perform rapid text analysis and to produce linguistic data. The framework, which performs both statistical and rule-based analysis, is called LG-Starship. The idea is to build modular software that includes, from the outset, the basic algorithms needed to perform different kinds of analysis. The modules perform the following tasks. Preprocessing Module: a module with which it is possible to load a text, normalize it and delete stop-words.
As output, the module presents the list of tokens and letters that compose the text, with their occurrence counts, together with the processed text. Mr. Ling Module: a module that performs POS tagging and lemmatization; it also returns the table of lemmas with occurrence counts and a table quantifying the grammatical tags. Statistic Module: with which it is possible to calculate Term Frequency and TF-IDF of tokens or lemmas, extract bigram and trigram units and export the results as tables. Semantic Module: which uses the Hyperspace Analogue to Language algorithm to calculate semantic similarity between words; the module returns word-by-word similarity matrices which can be exported and analyzed. Syntactic Module: which analyzes the syntactic structure of a selected sentence and tags the verb and its arguments with semantic labels. The objective of the framework is to build an all-in-one NLP platform that allows any kind of user to perform basic and advanced text analysis. To make the framework accessible to users without specific computer science and programming skills, the modules have been provided with an intuitive GUI. The framework can be considered hybrid in a double sense: as explained above, it uses both statistical and rule-based methods, relying on standard statistical algorithms and techniques and, at the same time, on Lexicon-Grammar syntactic theory; in addition, it has been written in both the Java and Python programming languages. The LG-Starship framework has a simple Graphical User Interface, but it will also be released as separate modules which can be included independently in any NLP pipeline. There are many resources of this kind, but the large majority work for English; free resources for Italian are very few, and this work tries to cover this need by proposing a tool which can be used both by linguists or other scientists interested in language and text analysis who have no programming background, and by computer scientists, who can use the free modules in their own code or in combination with different NLP algorithms. The framework starts from a text or corpus written directly by the user or loaded from an external resource. The LG-Starship workflow is described in the flowchart shown in fig. 1. The pipeline shows that the Pre-Processing Module is applied to the original imported or generated text in order to produce a clean and normalized preprocessed text. This module includes a function for text splitting, a stop-word list and a tokenization method. The Statistic Module or the Mr. Ling Module can then be applied to the preprocessed text. The first, which includes basic statistical algorithms such as Term Frequency, TF-IDF and n-gram extraction, produces as output databases of lexical and numerical data which can be used to produce charts or to perform further external analysis. The second is divided into two main tasks: a POS tagger, based on the Averaged Perceptron Tagger [?] and trained on the Paisà Corpus [Lyding et al., 2014], performs Part-Of-Speech tagging and produces an annotated text; a lemmatization method, which relies on a set of electronic dictionaries developed at the University of Salerno [Elia, 1995, Elia et al., 2010], takes the POS-tagged text as input and produces a new lemmatized version of the original text with information about syntactic and semantic properties.
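By way of illustration of the statistics just listed, the following is a minimal sketch in plain Python of term frequency, a log-scaled TF-IDF and bigram counts over a toy Italian corpus. It is not LG-Starship code: the tokenizer, the toy sentences and the particular IDF variant are assumptions made for the example.

```python
import math
from collections import Counter

def tokenize(text):
    # crude whitespace/punctuation tokenizer; the framework's own
    # Preprocessing Module would normalize and filter stop-words first
    return [t.strip(".,;:!?\"'").lower() for t in text.split() if t.strip(".,;:!?\"'")]

docs = [
    "il gatto dorme sul divano",
    "il cane dorme in giardino",
    "il gatto e il cane giocano in giardino",
]
tokenized = [tokenize(d) for d in docs]

# term frequency per document
tf = [Counter(doc) for doc in tokenized]

# document frequency and log-scaled inverse document frequency
df = Counter(t for doc in tokenized for t in set(doc))
N = len(docs)
idf = {t: math.log(N / df[t]) for t in df}

# TF-IDF weights for the first document, highest first
tfidf_doc0 = {t: tf[0][t] * idf[t] for t in tf[0]}
print(sorted(tfidf_doc0.items(), key=lambda kv: -kv[1]))

# bigram extraction (contiguous token pairs), as in the n-gram export
bigrams = Counter(zip(tokenized[0], tokenized[0][1:]))
print(bigrams.most_common(3))
```

A real pipeline would of course take these counts after the Preprocessing Module's own tokenization and stop-word handling.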
This lemmatized text, which can also be processed with the Statistic Module, serves as input for two deeper levels of text analysis, carried out by the Syntactic Module and the Semantic Module. The first relies on Lexicon-Grammar theory [Gross, 1971, 1975] and uses a database of predicate structures under development at the Department of Political, Social and Communication Science; its objective is to produce a dependency graph of the sentences that compose the text. The Semantic Module uses the Hyperspace Analogue to Language distributional semantics algorithm [Lund and Burgess, 1996], trained on the Paisà Corpus, to produce a semantic network of the words of the text. This workflow has been applied in two different experiments involving two user-generated corpora. The first experiment is a statistical study of the language of rap music in Italy through the analysis of a large corpus of rap song lyrics downloaded from online databases of user-generated lyrics. The second experiment is a feature-based Sentiment Analysis project performed on user product reviews; for this project we integrated a large domain database of linguistic resources for Sentiment Analysis, developed in past years by the Department of Political, Social and Communication Science of the University of Salerno, which consists of polarized dictionaries of verbs, adjectives, adverbs and nouns. These two experiments underline how the linguistic framework can be applied to different levels of analysis and can produce both qualitative and quantitative data. As for the results obtained, the framework, which is only at a beta version, achieves reasonable results both in terms of processing time and in terms of precision. Nevertheless, the work is far from complete: more algorithms will be added to the Statistic Module, the Syntactic Module will be completed, the GUI will be improved and made more attractive and modern and, in addition, an open-source online version of the modules will be published. [edited by author] XV n.s.
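The Hyperspace Analogue to Language model referenced above derives word vectors from distance-weighted co-occurrences inside a sliding window and compares them with a similarity measure. The sketch below re-creates that idea schematically on a toy sentence; the window size, weighting scheme and the use of cosine similarity are illustrative choices, not the Semantic Module's actual implementation.

```python
import math
from collections import defaultdict

def hal_vectors(tokens, window=5):
    # distance-weighted co-occurrence: closer neighbours get higher weight
    vecs = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), i):
            weight = window - (i - j) + 1
            vecs[w][tokens[j]] += weight
            vecs[tokens[j]][w] += weight
    return vecs

def cosine(u, v):
    shared = set(u) & set(v)
    num = sum(u[k] * v[k] for k in shared)
    den = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return num / den if den else 0.0

tokens = "il gatto dorme sul divano mentre il cane dorme in giardino".split()
vecs = hal_vectors(tokens)
print(cosine(vecs["gatto"], vecs["cane"]))   # similarity between two words
```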

    Using Natural Language Processing to Mine Multiple Perspectives from Social Media and Scientific Literature.

    This thesis studies how Natural Language Processing techniques can be used to mine perspectives from textual data. The first part of the thesis focuses on analyzing the text exchanged by people who participate in discussions on social media sites. We particularly focus on threaded discussions about ideological and political topics. The goal is to identify the different viewpoints that the discussants hold with respect to the discussion topic. We use subjectivity and sentiment analysis techniques to identify the attitudes that the participants carry toward one another and toward the different aspects of the discussion topic. This involves identifying opinion expressions and their polarities, and identifying the targets of opinion. We use this information to represent discussions in one of two representations: discussant attitude vectors or signed attitude networks. We use data mining and network analysis techniques to analyze these representations to detect rifts in discussion groups and study how the discussants split into subgroups with contrasting opinions. In the second part of the thesis, we use linguistic analysis to mine scholars' perspectives from scientific literature through the lens of citations. We analyze the text adjacent to reference anchors in scientific articles as a means to identify researchers' viewpoints toward previously published work. We propose methods for identifying, extracting, and cleaning citation text. We analyze this text to identify the purpose (author's intention) and polarity (author's sentiment) of a citation. Finally, we present several applications that can benefit from this analysis, such as generating multi-perspective summaries of scientific articles and predicting the future prominence of publications. PhD, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/99934/1/amjbara_1.pd
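As a rough illustration of the signed attitude network representation described above, the sketch below builds a small signed graph from hypothetical aggregated attitude scores between discussants and searches for the two-group split that best separates antagonists. The discussant names, scores and the brute-force structural-balance heuristic are assumptions for the example, not the thesis's actual method.

```python
# hypothetical aggregated attitude scores between discussants:
# positive = supportive exchanges, negative = antagonistic ones
edges = {
    ("alice", "bob"): +3, ("alice", "carol"): -2, ("bob", "carol"): -4,
    ("carol", "dave"): +5, ("dave", "alice"): -1, ("bob", "dave"): -2,
}

nodes = sorted({n for pair in edges for n in pair})

def score(assignment):
    # a partition is good when positive edges stay inside a group
    # and negative edges cross between groups
    s = 0
    for (u, v), w in edges.items():
        same = assignment[u] == assignment[v]
        s += w if same else -w
    return s

best, best_s = None, float("-inf")
for mask in range(2 ** len(nodes)):            # brute force: fine for toy sizes
    assignment = {n: (mask >> i) & 1 for i, n in enumerate(nodes)}
    s = score(assignment)
    if s > best_s:
        best, best_s = assignment, s

groups = [[n for n in nodes if best[n] == g] for g in (0, 1)]
print(groups, best_s)
```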

    Predicate Matrix: an interoperable lexical knowledge base for predicates

    183 p. The Predicate Matrix is a new lexical-semantic resource resulting from the integration of multiple knowledge sources, among them FrameNet, VerbNet, PropBank and WordNet. The Predicate Matrix provides an extensive and robust lexicon that improves the interoperability between the semantic resources mentioned above. The creation of the Predicate Matrix is based on the integration of SemLink and of new mappings obtained with automatic methods that link semantic knowledge at the lexical and role levels. We have also extended the Predicate Matrix to cover nominal predicates (English, Spanish) and predicates in other languages (Spanish, Catalan and Basque). As a result, the Predicate Matrix provides a multilingual lexicon that enables interoperable semantic analysis in multiple languages.
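To make the kind of interoperability the Predicate Matrix offers more concrete, the sketch below represents a few simplified rows linking a verb across VerbNet, FrameNet, PropBank and WordNet and queries them. The identifiers shown are illustrative placeholders, and the real resource also aligns roles, senses and several languages.

```python
from dataclasses import dataclass

@dataclass
class PredicateRow:
    # one (simplified) row of a predicate matrix: the same verb sense
    # expressed in the vocabulary of several resources
    lemma: str
    verbnet_class: str
    framenet_frame: str
    propbank_roleset: str
    wordnet_sense: str

rows = [
    # values below are illustrative placeholders, not actual mappings
    PredicateRow("buy", "get-13.5.1", "Commerce_buy", "buy.01", "buy%2:40:00::"),
    PredicateRow("sell", "give-13.1", "Commerce_sell", "sell.01", "sell%2:40:00::"),
]

def frame_for(lemma, rows):
    """Return the FrameNet frame(s) reachable from a lemma via the matrix."""
    return [r.framenet_frame for r in rows if r.lemma == lemma]

print(frame_for("buy", rows))
```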

    Named Entities Recognition for Machine Translation: A Case Study on the Importance of Named Entities for Customer Support

    The last two decades have seen significant change in the international panorama at all levels. The advent of the internet and the availability of content have propelled us into a new era: the Information Age. The staggering growth of new digital content, whether in the form of ebooks, on-demand TV shows, blogs or e-commerce websites, has led to an increase in the need for translated material, driven by people's demand for quick access to this shared knowledge in their native languages and dialects. Fortunately, machine translation (MT) technologies, which in many cases provide human-like translations, are now more widely available, enabling quicker translations into multiple languages at more affordable prices. This work describes the Natural Language Processing (NLP) sub-task known as Named Entity Recognition (NER) as performed by Unbabel, a Portuguese machine translation start-up that combines MT with human post-editing and focuses strictly on customer service content, in order to improve the quality of translation outputs. The main objective of this study is to contribute to furthering MT quality and good practice by showing the importance of having a robust, continuously developed Named Entity Recognition system for generic and client-specific content in an MT pipeline and for General Data Protection Regulation (GDPR) compliance; moreover, with future applications in mind, we have tested strategies that support the creation of multilingual Named Entity Recognition systems. In the following work, we will first define the meaning of Named Entity, highlighting its importance in a Machine Translation scenario, followed by a brief historical overview of the subject. We will also provide a description of the most recent data-driven Machine Translation technologies. Concerning the main topic of this work, we will describe three experiments carried out jointly with Unbabel's NLP team. The first experiment focuses on assisting the NLP team in the creation of a domain-specific Named Entity Recognition (NER) system. The second and third experiments explore the possibility of creating multilingual NER gold standards semi-automatically, by resorting to aligners able to project Named Entities across a parallel corpus.
All the experiments were based on real client data from a wide range of domains, with each corpus selected according to the final goals of the experiment in question. The first experiment, whose aim was to help Unbabel's AI team develop and test an automatic Named Entity Recognition system for the food delivery domain, also anticipated the future possibility of adapting this kind of system to any specific domain or client. With this experiment, the first steps were taken at Unbabel towards the creation of a domain-specific Named Entity Recognition system. We began by presenting and testing a methodology for identifying the types of Named Entities common to the domain mentioned above. To this end, an extensive corpus in the area was compiled and analysed, and it was possible to identify four types, i.e. categories, of Named Entities relevant to the domain: Restaurant Names, Restaurant Chains, Dish Names and Beverages. Annotation guidelines were then created for each new category, and these were eventually added to Unbabel's existing Named Entity annotation typology, which includes 27 more generic NE categories such as Locations, Currencies, Measures, Addresses, Products and Services, and Cities. In a second phase, an annotation task was carried out on a new corpus from the same domain, consisting of 14,426 sentences, with a view to building gold standards to be used to train the automatic Named Entity Recognition systems and to test their results.
This task allowed us to put the new guidelines to use and thereby test them. Two models were trained, one using only the domain-specific gold standard and the other using the domain-specific gold standard together with all the Named Entity annotations available, making it possible to determine which of the two obtained better results. Regarding the results, it was found that the domain-specific gold standard did not contain enough examples for training and testing the new Named Entity Recognition system. Even so, it was possible to obtain results for the Dish Names category, which showed that, of the two models, the one trained on the domain-specific gold standard performed better, correctly identifying more Dish Names in the test corpus. The second experiment focused on testing a strategy for the automatic creation of multilingual Named Entity gold standards for training automatic systems, by resorting to systems that align Named Entities in bitexts (bilingual parallel texts). For this experiment we used an English (EN) corpus translated into German (DE) in the tourism domain, containing 2,500 sentences, and four state-of-the-art word alignment systems. We began by submitting the translated (DE) corpus to a manual Named Entity annotation process, using Unbabel's Named Entity annotation guidelines; the new entities from the first experiment were not considered for this experiment. Once the translated corpus had been annotated, it could be sent for Named Entity alignment with its (EN) counterpart, which had previously been annotated by another annotator. The results of aligning the Named Entities in the bitext made it possible to evaluate the Named Entity inter-annotator agreement, that is, the level of agreement between annotators regarding the selection and categorisation of the different entities, in order to understand which entities are hardest to annotate. In addition, the alignment results made it possible to determine the best-performing alignment system among the four analysed (SimAlign, FastAlign, AwesomeAlign and eflomal). The annotation results showed a high percentage of inter-annotator agreement, reaching 87.97% for some categories. The alignment results also established SimAlign as the most effective and accurate alignment system, surpassing FastAlign, the system used by Unbabel. The third experiment replicated the process described above, this time using a bitext (EN and PT-BR) of 360 sentences in the technology domain. This new experiment aimed to verify whether the alignment results obtained for the EN/DE tourism corpus are replicable when the domain and the language pair change. Like the previous one, this experiment involved a Named Entity annotation task on the corpus in question (EN and PT-BR), using the same annotation guidelines as the previous experiment. The annotated bitext was then sent for alignment, using the same alignment systems as in the second experiment. Based on the results of the experiment, it was possible to determine, for each Named Entity category, which alignment systems obtained the best results.
From this analysis it was concluded that the automatic alignment system AwesomeAlign obtained the best results, followed by SimAlign, which showed lower alignment performance for the Organisations category. In conclusion, with this work we intend to show the complexity and importance of Named Entities in a machine translation pipeline, as well as the importance of robust and adaptable Named Entity Recognition systems. Named Entity Recognition systems trained with a focus on particular domains are expected to achieve better results than those trained on more generic data. We likewise highlight the possibility and applicability of using different Natural Language Processing resources, such as alignment systems, to assist Named Entity Recognition, as in the cases described above. From a more linguistic perspective, we address issues related to ambiguous Named Entities, establishing which entities show the greatest annotation variability between annotators, that is, those for which there was the greatest disagreement between annotators regarding their classification, and trying to find explanations and solutions for this problem.
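A schematic view of the projection strategy behind the second and third experiments: gold named-entity spans on the source side of a bitext are transferred to the target side through word alignments. The snippet below is a simplified, self-contained illustration with a toy sentence pair and a hand-written alignment; in the experiments the alignments come from systems such as SimAlign, FastAlign, AwesomeAlign or eflomal, and the labels follow Unbabel's own typology.

```python
def project_entities(src_tokens, tgt_tokens, alignment, src_entities):
    """Project (start, end, label) spans from source to target via word alignments.

    alignment: set of (src_index, tgt_index) pairs, as produced by a word aligner.
    """
    projected = []
    for start, end, label in src_entities:
        tgt_positions = sorted(j for i, j in alignment if start <= i < end)
        if tgt_positions:
            # take the contiguous hull of the aligned target positions
            projected.append((tgt_positions[0], tgt_positions[-1] + 1, label))
    return projected

src = "Contact Acme Support in Lisbon".split()
tgt = "Contacte o suporte da Acme em Lisboa".split()
alignment = {(0, 0), (1, 4), (2, 2), (3, 5), (4, 6)}   # hand-written toy alignment
src_entities = [(1, 2, "ORG"), (4, 5, "LOC")]          # Acme, Lisbon

for s, e, lab in project_entities(src, tgt, alignment, src_entities):
    print(lab, tgt[s:e])        # ORG ['Acme'], LOC ['Lisboa']
```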

    Detecting subjectivity through lexicon-grammar strategies: databases, rules and apps for the Italian language

    2014 - 2015. The present research deals with the detection of linguistic phenomena connected to subjectivity, emotions and opinions from a computational point of view. The need to quickly monitor huge quantities of semi-structured and unstructured data from the web poses several challenges to Natural Language Processing, which must provide strategies and tools to analyse their structures from a lexical, syntactic and semantic point of view. The general aim of Sentiment Analysis, shared with the broader fields of NLP, Data Mining, Information Extraction, etc., is the automatic extraction of value from chaos; its specific focus, instead, is on opinions rather than on factual information. This is the aspect that differentiates it from other computational linguistics subfields. The majority of sentiment lexicons have been manually or automatically created for the English language; therefore, existing Italian lexicons are mostly built through the translation and adaptation of English lexical databases, e.g. SentiWordNet and WordNet-Affect. Unlike many other Italian and English sentiment lexicons, our database SentIta, built on the interaction of electronic dictionaries and lexicon-dependent local grammars, is able to manage simple and multiword structures, which can take the shape of distributionally free structures, distributionally restricted structures and frozen structures. Moreover, differently from other lexicon-based Sentiment Analysis methods, our approach is grounded in the solidity of Lexicon-Grammar resources and classifications, which provide fine-grained semantic but also syntactic descriptions of the lexical entries. In line with the major contributions in the Sentiment Analysis literature, we did not consider polar words in isolation: we computed their elementary sentence contexts, with the allowed transformations, and then their interaction with contextual valence shifters, the linguistic devices that are able to modify the prior polarity of the words from SentIta when occurring with them in the same sentences. In order to do so, we took advantage of the computational power of finite-state technology, formalising a set of rules that model intensification, downtoning and negation, modality detection and the analysis of comparative forms. With regard to the applicative part of the research, we conducted, with satisfactory results, three experiments on three Sentiment Analysis subtasks: the sentiment classification of documents and sentences, feature-based Sentiment Analysis, and Semantic Role Labeling based on sentiments. [edited by author] XIV n.s.
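The interaction between prior polarities and contextual valence shifters described above can be pictured with a very reduced example. The sketch below applies invented lexicon entries and shifter weights to a sentence; the thesis itself encodes these rules as Lexicon-Grammar based finite-state resources rather than Python code, so this is only an approximation of the behaviour, not the actual SentIta resources.

```python
# toy prior-polarity lexicon and shifter inventory (values are illustrative)
prior = {"ottimo": 2, "buono": 1, "pessimo": -2, "deludente": -1}
intensifiers = {"molto": 1.5, "davvero": 1.5}
downtoners = {"abbastanza": 0.5, "poco": 0.5}
negations = {"non"}

def sentence_polarity(tokens):
    score, modifier, negated = 0.0, 1.0, False
    for tok in tokens:
        t = tok.lower()
        if t in negations:
            negated = True
        elif t in intensifiers:
            modifier *= intensifiers[t]
        elif t in downtoners:
            modifier *= downtoners[t]
        elif t in prior:
            value = prior[t] * modifier
            score += -value if negated else value
            modifier, negated = 1.0, False   # shifters apply to the next polar word only
    return score

print(sentence_polarity("il film non è davvero ottimo".split()))      # negated, intensified positive
print(sentence_polarity("un servizio abbastanza deludente".split()))  # downtoned negative
```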

    Semantic Systems. The Power of AI and Knowledge Graphs

    This open access book constitutes the refereed proceedings of the 15th International Conference on Semantic Systems, SEMANTiCS 2019, held in Karlsruhe, Germany, in September 2019. The 20 full papers and 8 short papers presented in this volume were carefully reviewed and selected from 88 submissions. They cover topics such as: web semantics and linked (open) data; machine learning and deep learning techniques; semantic information management and knowledge integration; terminology, thesaurus and ontology management; data mining and knowledge discovery; semantics in blockchain and distributed ledger technologies

    Compositional language processing for multilingual sentiment analysis

    Programa Oficial de Doutoramento en Computación. 5009V01. [Abstract] This dissertation presents new approaches in the field of sentiment analysis and polarity classification, oriented towards obtaining the sentiment of a phrase, sentence or document from a natural language processing point of view. It places special emphasis on methods for handling semantic compositionality, i.e. the ability to compose the sentiment of multiword phrases, where the global sentiment may be different from, or even opposite to, the one coming from each of their individual components, and on the application of these methods to multilingual scenarios. On the one hand, we introduce knowledge-based approaches to calculate the semantic orientation at the sentence level that can handle the relevant phenomena for the purpose at hand (e.g. negation, intensification or adversative subordinate clauses). On the other hand, we describe how to build machine learning models that perform polarity classification from a different perspective, combining linguistic (lexical, syntactic and semantic) knowledge, with an emphasis on noisy texts and micro-texts. Experiments on standard corpora and international evaluation campaigns show the competitiveness of the methods proposed here in monolingual, multilingual and code-switching scenarios. The contributions presented in the thesis have potential applications in the era of the Web 2.0 and social media, such as determining the view of society about products, celebrities or events, identifying their strengths and weaknesses, or monitoring how these opinions evolve over time. We also show how some of the proposed models can be useful for other data analysis tasks.
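A minimal sketch of the compositional idea discussed above, in which the polarity of a phrase is computed from its parts rather than read off word by word: sentiment is propagated bottom-up over a tiny hand-built constituent structure with explicit rules for negation, intensification and adversative coordination. The tree encoding, lexicon values and rule weights are assumptions for the example and do not reproduce the knowledge-based system evaluated in the dissertation.

```python
# each node is either a (word,) leaf or an (operator, child, ...) tuple
LEXICON = {"good": 1.0, "bad": -1.0, "great": 2.0}   # illustrative prior polarities

def polarity(node):
    if len(node) == 1:                       # leaf: look the word up
        return LEXICON.get(node[0], 0.0)
    op, *children = node
    values = [polarity(c) for c in children]
    if op == "NEG":                          # negation flips and dampens
        return -0.8 * values[0]
    if op == "INT":                          # intensification scales up
        return 1.5 * values[0]
    if op == "BUT":                          # adversative: the second conjunct dominates
        return 0.3 * values[0] + values[1]
    return sum(values)                       # default: additive composition

# "not really great, but the service was good"
tree = ("BUT", ("NEG", ("INT", ("great",))), ("good",))
print(polarity(tree))
```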

    Text mining and natural language processing for the early stages of space mission design

    Final thesis submitted December 2021; degree awarded in 2022. A considerable amount of data related to space mission design has been accumulated since artificial satellites started to venture into space in the 1950s. This data has today become an overwhelming volume of information, triggering a significant knowledge reuse bottleneck at the early stages of space mission design. Meanwhile, virtual assistants, text mining and Natural Language Processing techniques have become pervasive in our daily life. The work presented in this thesis is one of the first attempts to bridge the gap between the worlds of space systems engineering and text mining. Several novel models are thus developed and implemented here, targeting the structuring of accumulated data through an ontology, but also tasks commonly performed by systems engineers such as requirement management and heritage analysis. A first collection of documents related to space systems is gathered for the training of these methods. Eventually, this work aims to pave the way towards the development of a Design Engineering Assistant (DEA) for the early stages of space mission design. It is also hoped that this work will actively contribute to the integration of text mining and Natural Language Processing methods in the field of space mission design, enhancing current design processes.

    Knowledge Modelling and Learning through Cognitive Networks

    One of the most promising developments in modelling knowledge is cognitive network science, which aims to investigate cognitive phenomena driven by the networked, associative organization of knowledge. For example, investigating the structure of semantic memory via semantic networks has illuminated how memory recall patterns influence phenomena such as creativity, memory search, learning, and more generally, knowledge acquisition, exploration, and exploitation. In parallel, neural network models for artificial intelligence (AI) are also becoming more widespread as inferential models for understanding which features drive language-related phenomena such as meaning reconstruction, stance detection, and emotional profiling. Whereas cognitive networks map explicitly which entities engage in associative relationships, neural networks perform an implicit mapping of correlations in cognitive data as weights, obtained after training over labelled data and whose interpretation is not immediately evident to the experimenter. This book aims to bring together quantitative, innovative research that focuses on modelling knowledge through cognitive and neural networks to gain insight into mechanisms driving cognitive processes related to knowledge structuring, exploration, and learning. The book comprises a variety of publication types, including reviews and theoretical papers, empirical research, computational modelling, and big data analysis. All papers here share a commonality: they demonstrate how the application of network science and AI can extend and broaden cognitive science in ways that traditional approaches cannot
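One way to picture the contrast drawn above between explicit and implicit mappings of knowledge: a cognitive (semantic) network stores associations between concepts as inspectable edges. The toy example below builds such a network from invented free-association pairs with networkx and reads off the neighbourhood, path and clustering structure that network models of memory typically analyse; it illustrates the data structure only, not any particular study from the book.

```python
import networkx as nx

# hypothetical free-association pairs (cue, response)
associations = [
    ("dog", "cat"), ("dog", "bone"), ("cat", "mouse"),
    ("mouse", "cheese"), ("cheese", "milk"), ("milk", "cow"),
]

g = nx.Graph()
g.add_edges_from(associations)

# explicit, inspectable structure: neighbours, paths, local clustering
print(sorted(g.neighbors("dog")))            # directly associated concepts
print(nx.shortest_path(g, "dog", "cow"))     # an association chain through memory
print(nx.clustering(g, "cat"))               # how tightly knit a concept's neighbourhood is
```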

    Learning to represent, categorise and rank in community question answering

    The task of Question Answering (QA) is arguably one of the oldest tasks in Natural Language Processing, attracting high levels of interest from both industry and academia. However, most research has focused on factoid questions, e.g. Who is the president of Ireland? In contrast, research on answering non-factoid questions, such as manner, reason, difference and opinion questions, has been rather piecemeal, largely due to the absence of labelled data for the task. This is changing, however, with the growing popularity of Community Question Answering (CQA) websites, such as Quora, Yahoo! Answers and the Stack Exchange family of forums. These websites provide naturally labelled data, allowing us to apply machine learning techniques. Most previous state-of-the-art approaches to CQA-based question answering involved handcrafted features in combination with linear models. In this thesis we hypothesise that the use of handcrafted features can be avoided and that these tasks can be approached with representation learning techniques, specifically deep learning. In the first part of this thesis we give an overview of deep learning in natural language processing and empirically evaluate our hypothesis on the task of detecting semantically equivalent questions, i.e. predicting whether two questions can be answered by the same answer. In the second part of the thesis we address the task of answer ranking, i.e. determining how suitable an answer is for a given question. In order to determine the suitability of representation learning for the task of answer ranking, we provide a rigorous experimental evaluation of various neural architectures, based on feedforward, recurrent and convolutional neural networks, as well as their combinations. This thesis shows that deep learning is a very suitable approach to CQA-based QA, achieving state-of-the-art results on the two tasks we addressed.
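To make the representation-learning framing concrete, the sketch below scores candidate answers for a question by encoding each text as the average of its word vectors and ranking by cosine similarity. The random vectors merely stand in for learned representations, and the question, candidates and dimensionality are invented; the models evaluated in the thesis are trained feedforward, recurrent and convolutional networks rather than this bag-of-vectors baseline.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 16
vocab = {}

def embed(text):
    """Encode a text as the mean of its (here: randomly initialised) word vectors."""
    vecs = []
    for w in text.lower().split():
        if w not in vocab:
            vocab[w] = rng.normal(size=DIM)
        vecs.append(vocab[w])
    return np.mean(vecs, axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

question = "How do I reset my router password"
candidates = [
    "Hold the reset button for ten seconds and log in with the default password",
    "The president of Ireland is elected every seven years",
]

q = embed(question)
ranked = sorted(candidates, key=lambda c: cosine(q, embed(c)), reverse=True)
# with learned embeddings the on-topic answer should come out on top
print(ranked[0])
```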