
    Text mining and natural language processing for the early stages of space mission design

    Final thesis submitted December 2021; degree awarded in 2022. A considerable amount of data related to space mission design has been accumulated since artificial satellites started to venture into space in the 1950s. This data has today become an overwhelming volume of information, creating a significant knowledge-reuse bottleneck at the early stages of space mission design. Meanwhile, virtual assistants, text mining and Natural Language Processing techniques have become pervasive in our daily lives. The work presented in this thesis is one of the first attempts to bridge the gap between the worlds of space systems engineering and text mining. Several novel models are developed and implemented here, targeting the structuring of accumulated data through an ontology, but also tasks commonly performed by systems engineers such as requirement management and heritage analysis. A first collection of documents related to space systems is gathered for the training of these methods. Ultimately, this work aims to pave the way towards the development of a Design Engineering Assistant (DEA) for the early stages of space mission design. It is also hoped that this work will actively contribute to the integration of text mining and Natural Language Processing methods in the field of space mission design, enhancing current design processes.

    D6.2 Integrated Final Version of the Components for Lexical Acquisition

    The PANACEA project has addressed one of the most critical bottlenecks that threaten the development of technologies to support multilingualism in Europe, and to process the huge quantity of multilingual data produced annually. Any attempt at automated language processing, particularly Machine Translation (MT), depends on the availability of language-specific resources. Such Language Resources (LR) contain information about the language's lexicon, i.e. the words of the language and the characteristics of their use. In Natural Language Processing (NLP), LRs contribute information about the syntactic and semantic behaviour of words - i.e. their grammar and their meaning - which informs downstream applications such as MT. To date, many LRs have been generated by hand, requiring significant manual labour from linguistic experts. However, proceeding manually, it is impossible to supply LRs for every possible pair of European languages, textual domain, and genre needed by MT developers. Moreover, an LR for a given language can never be considered complete or final because of the characteristics of natural language, which continually undergoes change, especially spurred on by the emergence of new knowledge domains and new technologies. PANACEA has addressed this challenge by building a factory of LRs that progressively automates the stages involved in the acquisition, production, updating and maintenance of LRs required by MT systems. The existence of such a factory will significantly cut down the cost, time and human effort required to build LRs. WP6 has addressed the lexical acquisition component of the LR factory, that is, the techniques for automated extraction of key lexical information from texts, and the automatic collation of lexical information into LRs in a standardized format. The goal of WP6 has been to take existing techniques capable of acquiring syntactic and semantic information from corpus data, improve upon them, adapt and apply them to multiple languages, and turn them into powerful and flexible techniques capable of supporting massive applications. One focus for improving the scalability and portability of lexical acquisition techniques has been to extend existing techniques with more powerful, less "supervised" methods. In NLP, the amount of supervision refers to the amount of manual annotation which must be applied to a text corpus before machine learning or other techniques are applied to the data to compile a lexicon. More manual annotation means more accurate training data, and thus a more accurate LR. However, given that it is impractical from a cost and time perspective to manually annotate the vast amounts of data required for multilingual MT across domains, it is important to develop techniques which can learn from corpora with less supervision. Less supervised methods are capable of supporting both large-scale acquisition and efficient domain adaptation, even in domains where data is scarce. Another focus of lexical acquisition in PANACEA has been the need for LR users to tune the accuracy level of LRs. Some applications may require increased precision, where the application demands a high degree of confidence in the lexical information used. At other times a greater level of coverage may be required, with information about more words at the expense of some degree of accuracy. Lexical acquisition in PANACEA has therefore investigated confidence thresholds for lexical acquisition, to ensure that the ultimate users of LRs can generate lexical data from the PANACEA factory at the desired level of accuracy.
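The precision/coverage trade-off described above can be illustrated with a short sketch. The code below is not the actual PANACEA component; it simply assumes, as the abstract suggests, that each automatically acquired lexical entry carries a confidence score that users can threshold (all names are hypothetical).

```python
# Sketch: filtering automatically acquired lexical entries by a confidence
# threshold, trading precision against coverage. Hypothetical names, not
# the actual PANACEA components.
from dataclasses import dataclass

@dataclass
class LexicalEntry:
    lemma: str
    frame: str        # e.g. a subcategorization frame such as "NP V NP"
    confidence: float # score assigned by the acquisition step, in [0, 1]

def filter_lexicon(entries: list[LexicalEntry], threshold: float) -> list[LexicalEntry]:
    """Keep only entries whose confidence meets the threshold.

    A high threshold yields a smaller, more precise lexicon; a low
    threshold yields broader coverage at the expense of accuracy.
    """
    return [e for e in entries if e.confidence >= threshold]

acquired = [
    LexicalEntry("give", "NP V NP NP", 0.93),
    LexicalEntry("give", "NP V NP PP", 0.81),
    LexicalEntry("give", "NP V S", 0.22),  # likely a parsing artefact
]

high_precision = filter_lexicon(acquired, threshold=0.8)  # keeps 2 entries
high_coverage = filter_lexicon(acquired, threshold=0.1)   # keeps all 3
```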

    Extracting Negative Biomedical Relations from Literature

    Master's thesis in Bioinformatics and Computational Biology, Universidade de Lisboa, Faculdade de Ciências, 2021. The prevalent source for obtaining scientific knowledge remains the scientific literature. Considering that the focus of biomedical research has shifted from individual entities to whole biological systems, understanding the relations between those entities has become paramount for generating knowledge. Relations between entities can either be positive, if there is evidence of an association, or negative, if there is no evidence of an association. To date, most relation extraction systems focus on extracting positive relations, therefore few knowledge bases contain negative relations. Disregarding negative relations leads to the loss of valuable information that could be used to advance biomedical research. This work presents the Negative Phenotype-Disease Relations (NPDR) dataset, which describes a subset of negative disease-phenotype relations from a gold-standard knowledge base made available by the Human Phenotype Ontology (HPO), and an automatic extraction system developed to automatically annotate the entities and extract the relations from the NPDR dataset. The NPDR dataset was constructed by analysing 177 medical documents and consists of 347 relations manually annotated at the document level, of which 222 were inferred from the HPO gold-standard knowledge base and 125 were newly annotated relations. The main categories of the dataset are the characterization of the entities that participate in the negative relation; the characterization of the sentence that implies the negative relation; and the characterization of the location of the entities and sentences in the article. The automatic extraction system was created to evaluate the impact of the NPDR dataset on the Named-Entity Recognition (NER), Named-Entity Linking (NEL) and Relation Extraction (RE) text mining tasks. The NER task annotated on average 20.77% more entities when disease and phenotype synonym lexica generated from the NPDR dataset were used, compared to the annotations produced by the OMIM and HPO lexica alone. The increase in annotated entities also resulted in 15.11% more relations extracted. The RE task performed poorly, with the highest accuracy being 8.84%.

Free text remains, to this day, the main medium for producing and sharing knowledge. More concretely, the biomedical literature is the main source of clinical and biological knowledge for researchers and clinicians. However, as the information contained in free text grows at an exponential rate along with the number of published scientific articles, it becomes difficult for researchers to keep up with developments across the various scientific domains. Moreover, extracting relevant textual information is a laborious and time-consuming task for humans, since most of the information is locked in unstructured free text. Although this task may produce errors when performed by computers, it can only be accomplished at scale through automatic processes. Text mining methods are therefore an attractive alternative for reducing the time experts spend obtaining relevant information, while also covering the large volume of data in the biomedical literature. Text mining methods comprise several tasks, such as Named-Entity Recognition (NER), Named-Entity Linking (NEL) and Relation Extraction (RE). NER identifies the entities mentioned in the text, NEL maps the recognized entities to entries in a database, and RE identifies relations between the recognized entities. Since the focus of biomedical research has shifted from individual entities, such as genes, proteins or drugs, to whole biological systems, automatic RE methods have become fundamental to understanding relations between entities, such as protein-protein interactions, drug-drug interactions, or gene-disease relations. These relations can be classified as negative, if there is evidence of no association between the entities, or positive, if there is evidence of an association between the entities. RE can be performed through multiple approaches that differ in the methods they employ. These approaches can be divided into the following groups: co-occurrence, the simplest approach, since it only aims to identify the entities in the same sentence; rule-based, with rules defined manually or automatically; and machine learning, which uses annotated biomedical corpora to apply distant supervision. Distant-supervision methods can further be categorized into feature-based and kernel-based. To date, most RE systems do not differentiate between positive, negative or false relations, although some exceptions stand out, such as the Excerbt and BeFree systems. The former combines syntactic and semantic analyses with rule-based and machine-learning approaches, and was adapted to detect negated lexical representations of lexical items (such as verbs, nouns or adjectives) for the annotation of the Negatome, a database of proteins that do not interact with each other. The latter uses a combination of kernel-based methods, namely the Shallow Linguistic Kernel and the Dependency Kernel. For the annotation of the GAD corpus using this system, a classifier was also trained to distinguish between positive, negative and false relations between genes and diseases. An estimated 13.5% of sentences in biomedical literature abstracts contain negated expressions. Disregarding expressions that may potentially contain negative relations can lead to the loss of valuable information. Yet most biomedical relation extraction databases aim only to collect positive relations between biomedical entities, even though negative and positive examples are equally important for training, fine-tuning and evaluating relation extraction systems. Since negative examples are not as well documented as positive ones, few databases contain them. Furthermore, most biomedical relation extraction databases do not differentiate between false relations, in which two entities are unrelated, and negative relations, in which there is an assertion of no association between two entities. Additionally, some silver-standard datasets (composed of automatically generated data) also contain false negative relations that are unknown or undocumented. The exploration of these relations is therefore a good starting point for expanding biomedical relation databases and populating them with correct negative examples. This work produced a dataset of annotations of human phenotypes and diseases and their negative relations, the Negative Phenotype-Disease Relations (NPDR) dataset, together with a module for the automatic annotation of entities and relations. The first stage of creating the NPDR dataset required collecting the PubMed identifiers (PMIDs) associated with the negative relations described in a gold-standard database made available by the Human Phenotype Ontology (HPO). From these PMIDs it was possible to retrieve full-text articles, which were then analysed manually. This analysis consisted of describing the entities participating in the negative relation, comprising the phenotypes, diseases and their associated genes; describing the sentences suggesting the negative relation, including the characterization of the negation token used in the sentence and the co-occurrence of the entities; and describing the location of the entities and sentences in the article. The NPDR dataset contains a total of 347 relations annotated at the document level, of which 222 were obtained from the HPO gold-standard database and 125 are new relations. To evaluate the impact of the NPDR dataset on the automatic annotation and extraction of entities and their relations from the articles gathered for the dataset's construction, a pipeline performing NER and RE and extracting negation sentences was implemented. NER recognizes human phenotypes and diseases, and RE extracts and classifies the relation between the entities. To obtain the articles in a machine-readable format, two methods were employed. The first consisted of gathering the PMIDs from the NPDR dataset and converting them into their corresponding PubMed Central identifiers (PMCIDs), in order to retrieve the full-text articles through the PubMed API. The second consisted of converting the articles gathered for the construction of the NPDR dataset from PDF to text format using the PDFMiner text-extraction tool. The NER step was performed with the Minimal Named-Entity Recognizer (MER) tool to extract mentions of phenotypes, diseases and genes from the articles. Finally, using a distant-supervision approach, the HPO gold-standard database was used to label the relations derived from the occurrence of phenotypes in the sentences suggesting the negative relation and the occurrence of related diseases and genes in the article. Relations were marked as Known if described in the database, or Unknown otherwise. For the annotation of phenotypes, two lexica were used: one with official HPO terms, and another with synonyms obtained from the NPDR dataset. For the annotation of diseases and genes, the main lexicon was obtained from the Online Mendelian Inheritance in Man (OMIM) database, and the remaining lexica were built from disease synonyms and abbreviations present in the NPDR dataset. Adding the lexica derived from the NPDR dataset allowed, on average, 20.77% more entities to be annotated compared to annotation with the HPO and OMIM lexica alone. This larger number of entities was also reflected in a 15.11% increase in annotated relations. The RE task performed poorly, with a precision of 8.84% for detected negative relations.
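The distant-supervision step of this pipeline lends itself to a compact illustration. The sketch below is an assumption-laden stand-in, not the thesis code: a plain substring matcher replaces the MER tool, and toy lexica and a toy HPO knowledge base replace the real resources. Candidate phenotype-disease pairs co-occurring in negated sentences are labelled Known or Unknown against the knowledge base, as the abstract describes.

```python
# Toy stand-in for the NPDR pipeline: lexicon-based entity matching,
# negation-cue filtering, and distant-supervision labelling against a
# (here, miniature and invented) gold-standard knowledge base.
import re

phenotype_lexicon = {"seizures", "hearing impairment"}
disease_lexicon = {"pendred syndrome"}
negation_tokens = {"no", "not", "without", "absent"}
hpo_gold_kb = {("hearing impairment", "pendred syndrome")}  # known negative relations

def find_entities(sentence, lexicon):
    # Plain substring lookup stands in for the MER tool used in the thesis.
    return [term for term in lexicon if term in sentence.lower()]

def extract_negative_relations(document):
    relations = []
    for sentence in re.split(r"(?<=[.!?])\s+", document):
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        if not words & negation_tokens:
            continue  # keep only sentences carrying a negation cue
        for pheno in find_entities(sentence, phenotype_lexicon):
            for dis in find_entities(sentence, disease_lexicon):
                # Distant supervision: label the pair against the gold KB.
                label = "Known" if (pheno, dis) in hpo_gold_kb else "Unknown"
                relations.append((pheno, dis, label))
    return relations

doc = ("Patients with Pendred syndrome showed no seizures. "
       "Hearing impairment was not observed in Pendred syndrome.")
print(extract_negative_relations(doc))
# [('seizures', 'pendred syndrome', 'Unknown'),
#  ('hearing impairment', 'pendred syndrome', 'Known')]
```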

    A review of sentiment analysis research in Arabic language

    Sentiment analysis is a natural language processing task that has recently attracted increasing attention. However, sentiment analysis research has mainly been carried out for the English language. Although Arabic is ramping up as one of the most used languages on the Internet, only a few studies have focused on Arabic sentiment analysis so far. In this paper, we carry out an in-depth qualitative study of the most important research works in this context, presenting the limitations and strengths of existing approaches. In particular, we survey both approaches that leverage machine translation or transfer learning to adapt English resources to Arabic and approaches that stem directly from the Arabic language.

    SARE: A sentiment analysis research environment

    Sentiment analysis is an important learning problem with a broad scope of applications. The meteoric rise of online social media and the increasing significance of public opinion expressed therein have opened doors to many challenges as well as opportunities for this research. The challenges have been articulated in the literature through a growing list of sentiment analysis problems and tasks, while the opportunities are constantly being availed with the introduction of new algorithms and techniques for solving them. However, these approaches often remain out of the direct reach of other researchers, who have to either rely on benchmark datasets, which are not always available, or be inventive with their comparisons. This thesis presents the Sentiment Analysis Research Environment (SARE), an extendable and publicly-accessible system designed with the goal of integrating baseline and state-of-the-art approaches to solving sentiment analysis problems. Since covering the entire breadth of the field is beyond the scope of this work, the usefulness of this environment is demonstrated by integrating solutions for certain facets of the aspect-based sentiment analysis problem. Currently, the system provides a semi-automatic method to support building gold-standard lexica, an automatic baseline method for extracting aspect expressions, and a pre-existing baseline sentiment analysis engine. Users are assisted in creating gold-standard lexica by applying our proposed set cover approximation algorithm, which finds a significantly reduced set of documents needed to create a lexicon. We also suggest a baseline semi-supervised aspect expression extraction algorithm based on a Support Vector Machine (SVM) classifier to automatically extract aspect expressions.
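The abstract names a set cover approximation algorithm for reducing the number of documents a lexicon builder must read. The standard greedy heuristic is a natural fit for that task; the sketch below assumes that scheme (the thesis may differ in its details), with invented document ids and terms.

```python
# A minimal sketch of the standard greedy set-cover heuristic: repeatedly
# pick the document covering the most still-uncovered lexicon terms.
def greedy_document_cover(doc_terms: dict[str, set[str]]) -> list[str]:
    """Select a reduced set of documents that together contain every term.

    doc_terms maps a document id to the set of candidate lexicon terms it
    contains. Returns document ids in the order they were chosen.
    """
    uncovered = set().union(*doc_terms.values())
    chosen = []
    while uncovered:
        # The document that covers the largest number of remaining terms.
        best = max(doc_terms, key=lambda d: len(doc_terms[d] & uncovered))
        if not doc_terms[best] & uncovered:
            break  # remaining terms appear in no document
        chosen.append(best)
        uncovered -= doc_terms[best]
    return chosen

docs = {
    "d1": {"battery", "screen", "price"},
    "d2": {"screen", "camera"},
    "d3": {"price"},
}
print(greedy_document_cover(docs))  # ['d1', 'd2'] -- 'd3' is redundant
```

The greedy heuristic is the classic approximation for set cover (within a logarithmic factor of optimal), which is why annotators need to read only the chosen documents rather than the whole corpus.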

    User Interfaces to the Web of Data based on Natural Language Generation

    We explore how Virtual Research Environments based on Semantic Web technologies support research interactions with RDF data in various stages of corpus-based analysis, analyze the Web of Data in terms of human readability, derive labels from variables in SPARQL queries, apply Natural Language Generation to improve user interfaces to the Web of Data by verbalizing SPARQL queries and RDF graphs, and present a method to automatically induce RDF graph verbalization templates via distant supervision.
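As a toy illustration of one of these steps, deriving labels from variables in SPARQL queries can be approximated by splitting camelCase and snake_case identifiers into words. The actual method in this work is more involved; the query and splitting rules below are invented for the example.

```python
# Toy sketch: derive human-readable labels from SPARQL variable names by
# splitting snake_case and camelCase identifiers into lowercase words.
import re

def label_from_variable(var: str) -> str:
    name = var.lstrip("?$")
    name = name.replace("_", " ")
    # Insert a space at each lowercase-to-uppercase boundary:
    # "birthPlace" -> "birth Place"
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)
    return name.lower()

query = "SELECT ?birthPlace ?death_date WHERE { ?person dbo:birthPlace ?birthPlace . }"
for var in sorted(set(re.findall(r"[?$]\w+", query))):
    print(f"{var} -> {label_from_variable(var)}")
# ?birthPlace -> birth place
# ?death_date -> death date
# ?person -> person
```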

    Addressing informality in processing Chinese microtext

    Ph.D. (Doctor of Philosophy)

    Sentiment Analysis for micro-blogging platforms in Arabic

    Sentiment Analysis (SA) concerns the automatic extraction and classification of sentiments conveyed in a given text, i.e. labelling a text instance as positive, negative or neutral. SA research has attracted increasing interest in the past few years due to its numerous real-world applications. The recent interest in SA is also fuelled by the growing popularity of social media platforms (e.g. Twitter), as they provide large amounts of freely available and highly subjective content that can be readily crawled. Most previous SA work has focused on English with considerable success. In this work, we focus on studying SA in Arabic, as a less-resourced language. This work reports on a wide set of investigations for SA in Arabic tweets, systematically comparing three existing approaches that have been shown successful in English. Specifically, we report experiments evaluating fully-supervised (SL), distant-supervision-based (DS), and machine-translation-based (MT) approaches for SA. The investigations cover training SA models on manually-labelled (i.e. in SL methods) and automatically-labelled (i.e. in DS methods) data-sets. In addition, we explore an MT-based approach that utilises existing off-the-shelf SA systems for English with no need for training data, assessing the impact of translation errors on the performance of SA models, which has not been previously addressed for Arabic tweets. Unlike previous work, we benchmark the trained models against an independent test-set of >3.5k instances collected at different points in time to account for topic-shift issues in the Twitter stream. Despite the challenging noisy medium of Twitter and the mixed use of Dialectal and Standard forms of Arabic, we show that our SA systems are able to attain performance scores on Arabic tweets that are comparable to those of state-of-the-art SA systems for English tweets. The thesis also investigates the role of a wide set of features, including syntactic, semantic, morphological, language-style and Twitter-specific features. We introduce a set of affective-cues/social-signals features that capture information about the presence of contextual cues (e.g. prayers, laughter, etc.) and correlate them with the sentiment conveyed in an instance. Our investigations reveal a generally positive impact of utilising these features for SA in Arabic. Specifically, we show that a rich set of morphological features, which has not been previously used, extracted using a publicly-available morphological analyser for Arabic, can significantly improve the performance of SA classifiers. We also demonstrate the usefulness of language-independent features (e.g. Twitter-specific ones) for SA. Our feature-sets outperform results reported in previous work on a previously built data-set.
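A brief sketch may help with the DS idea mentioned above. A common scheme for distant supervision on tweets, assumed here since the abstract does not spell out the exact cues, uses emoticons as noisy sentiment labels and strips them from the text before training a classifier.

```python
# Sketch of distant supervision for tweet sentiment, assuming the common
# emoticon-labelling scheme (the cue sets below are illustrative): tweets
# are auto-labelled by the emoticons they contain, producing noisy
# training data for a supervised classifier.
POSITIVE_CUES = {":)", ":-)", ":D"}
NEGATIVE_CUES = {":(", ":-("}

def distant_label(tweet):
    """Noisy label from emoticon cues; None if ambiguous or cue-free."""
    has_pos = any(cue in tweet for cue in POSITIVE_CUES)
    has_neg = any(cue in tweet for cue in NEGATIVE_CUES)
    if has_pos and not has_neg:
        return "positive"
    if has_neg and not has_pos:
        return "negative"
    return None

def strip_cues(tweet):
    # Remove the cues so a downstream classifier cannot simply memorise them.
    for cue in POSITIVE_CUES | NEGATIVE_CUES:
        tweet = tweet.replace(cue, "")
    return tweet.strip()

stream = ["احب هذا الفيلم :)", "خدمة سيئة :(", "ok"]
training = [(strip_cues(t), distant_label(t)) for t in stream if distant_label(t)]
print(training)  # [('احب هذا الفيلم', 'positive'), ('خدمة سيئة', 'negative')]
```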

    A Corpus-Based Approach for the Induction of Ontology Lexica
