Text mining and natural language processing for the early stages of space mission design
Final thesis submitted December 2021 - degree awarded in 2022
A considerable amount of data related to space mission design has been accumulated
since artificial satellites started to venture into space in the 1950s. This data has today
become an overwhelming volume of information, triggering a significant knowledge
reuse bottleneck at the early stages of space mission design. Meanwhile, virtual assistants,
text mining and Natural Language Processing techniques have become pervasive
in our daily lives.
The work presented in this thesis is one of the first attempts to bridge the gap
between the worlds of space systems engineering and text mining. Several novel models
are thus developed and implemented here, targeting the structuring of accumulated
data through an ontology, but also tasks commonly performed by systems engineers
such as requirement management and heritage analysis. A first collection of documents
related to space systems is gathered for the training of these methods. Ultimately, this
work aims to pave the way towards the development of a Design Engineering Assistant
(DEA) for the early stages of space mission design. It is also hoped that this work will
actively contribute to the integration of text mining and Natural Language Processing
methods in the field of space mission design, enhancing current design processes.
D6.2 Integrated Final Version of the Components for Lexical Acquisition
The PANACEA project has addressed one of the most critical bottlenecks that threaten the development of technologies to support multilingualism in Europe, and to process the huge quantity of multilingual data produced annually. Any attempt at automated language processing, particularly Machine Translation (MT), depends on the availability of language-specific resources. Such Language Resources (LR) contain information about the language's lexicon, i.e. the words of the language and the characteristics of their use. In Natural Language Processing (NLP), LRs contribute information about the syntactic and semantic behaviour of words - i.e. their grammar and their meaning - which inform downstream applications such as MT. To date, many LRs have been generated by hand, requiring significant manual labour from linguistic experts. However, proceeding manually, it is impossible to supply LRs for every possible pair of European languages, textual domain, and genre, which are needed by MT developers. Moreover, an LR for a given language can never be considered complete nor final because of the characteristics of natural language, which continually undergoes changes, especially spurred on by the emergence of new knowledge domains and new technologies. PANACEA has addressed this challenge by building a factory of LRs that progressively automates the stages involved in the acquisition, production, updating and maintenance of LRs required by MT systems. The existence of such a factory will significantly cut down the cost, time and human effort required to build LRs. WP6 has addressed the lexical acquisition component of the LR factory, that is, the techniques for automated extraction of key lexical information from texts, and the automatic collation of lexical information into LRs in a standardized format.
The goal of WP6 has been to take existing techniques capable of acquiring syntactic and semantic information from corpus data, improve upon them, adapt and apply them to multiple languages, and turn them into powerful and flexible techniques capable of supporting massive applications. One focus for improving the scalability and portability of lexical acquisition techniques has been to extend existing techniques with more powerful, less "supervised" methods. In NLP, the amount of supervision refers to the amount of manual annotation which must be applied to a text corpus before machine learning or other techniques are applied to the data to compile a lexicon. More manual annotation means more accurate training data, and thus a more accurate LR. However, given that it is impractical from a cost and time perspective to manually annotate the vast amounts of data required for multilingual MT across domains, it is important to develop techniques which can learn from corpora with less supervision. Less supervised methods are capable of supporting both large-scale acquisition and efficient domain adaptation, even in domains where data is scarce. Another focus of lexical acquisition in PANACEA has been the need of LR users to tune the accuracy level of LRs. Some applications may require increased precision, or accuracy, where the application requires a high degree of confidence in the lexical information used. At other times a greater level of coverage may be required, with information about more words at the expense of some degree of accuracy. Lexical acquisition in PANACEA has investigated confidence thresholds for lexical acquisition to ensure that the ultimate users of LRs can generate lexical data from the PANACEA factory at the desired level of accuracy.
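The trade-off between precision and coverage described above can be illustrated with a minimal Python sketch. It is not the PANACEA implementation: the `LexEntry` type, the `filter_by_confidence` function, the example entries and their confidence scores are all hypothetical, and the sketch only assumes that each acquired lexical entry carries a numeric confidence score.

```python
from typing import List, NamedTuple


class LexEntry(NamedTuple):
    """Hypothetical acquired lexical entry with an acquisition confidence."""
    lemma: str
    feature: str       # e.g. a subcategorization frame label (illustrative)
    confidence: float  # assumed to lie in [0, 1]


def filter_by_confidence(entries: List[LexEntry], threshold: float) -> List[LexEntry]:
    """Keep only the entries whose confidence reaches the threshold."""
    return [e for e in entries if e.confidence >= threshold]


# Invented example data, for illustration only.
entries = [
    LexEntry("give", "ditransitive", 0.92),
    LexEntry("sleep", "intransitive", 0.81),
    LexEntry("devour", "intransitive", 0.35),
]

# A high threshold favours precision: fewer entries, each more trustworthy.
high_precision = filter_by_confidence(entries, 0.8)

# A low threshold favours coverage: more entries, some possibly wrong.
high_coverage = filter_by_confidence(entries, 0.3)
```

Raising the threshold shrinks the lexicon to the entries the acquisition step is most sure about, while lowering it admits more words at the cost of some accuracy.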
Extracting Negative Biomedical Relations from Literature
Master's thesis, Bioinformatics and Computational Biology, Universidade de Lisboa, Faculdade de Ciências, 2021
The prevalent source for obtaining scientific knowledge remains the scientific literature. Considering that the focus of biomedical research has shifted from individual entities to whole biological systems, understanding the relations between those entities has become paramount for generating knowledge. Relations between entities can either be positive, if there is evidence of an association, or negative, if there is no evidence of an association. To date, most relation extraction systems focus on extracting positive relations, therefore few knowledge bases contain negative relations. Disregarding negative relations leads to the loss of valuable information that could be used to advance biomedical research. This work presents the Negative Phenotype-Disease Relations (NPDR) dataset, which describes a subset of negative disease-phenotype relations from a gold-standard knowledge base made available by the Human Phenotype Ontology (HPO), and an automatic extraction system developed to annotate the entities and extract the relations from the NPDR dataset. The NPDR dataset was constructed by analysing 177 medical documents and consists of 347 relations manually annotated at the document level, of which 222 were inferred from the HPO gold-standard knowledge base and 125 were newly annotated relations. The main categories of the dataset are the characterization of the entities that participate in the negative relation; the characterization of the sentence that implies the negative relation; and the characterization of the location of the entities and sentences in the article. The automatic extraction system was created to evaluate the impact of the NPDR dataset on the Named-Entity Recognition (NER), Named-Entity Linking (NEL) and Relation Extraction (RE) text mining tasks.
In the NER task, using disease and phenotype synonym lexica generated from the NPDR dataset yielded on average 20.77% more annotated entities than the OMIM and HPO lexica alone. The increase in annotated entities also resulted in 15.11% more relations extracted. The RE task performed poorly, with the highest accuracy being 8.84%.
Free text remains, to this day, the main means of producing and sharing knowledge. More specifically, the biomedical literature is the main source of clinical and biological knowledge for researchers and clinicians. However, as the information contained in free text, reflected in the number of published scientific articles, grows at an exponential rate, it becomes difficult for researchers to keep up with developments across the various scientific domains. Moreover, extracting relevant textual information is a laborious and time-consuming task for humans, since most of the information is locked in unstructured free text. Although this task may result in errors when performed by computers, it can only be achieved through automatic processes. In this sense, text mining methods are an interesting alternative for reducing the time experts spend obtaining relevant information, while also covering a large volume of data from the biomedical literature. Text mining methods include several tasks, such as Named-Entity Recognition (NER), Named-Entity Linking (NEL) and Relation Extraction (RE). NER identifies the entities mentioned in the text, NEL maps the recognized entities to entries in a database, and RE identifies relations between the recognized entities.
Since the focus of biomedical research has shifted from individual entities, such as genes, proteins or drugs, to biological systems as a whole, automatic RE methods have become fundamental for understanding relations between entities, such as protein-protein interactions, drug-drug interactions, or gene-disease relations. These relations can be classified as negative, if there is evidence of no association between the entities, or positive, if there is evidence of an association between the entities. RE can be carried out through multiple approaches that differ in the methods they employ. These approaches can be divided into the following groups: co-occurrence, the simplest approach, since it only aims to identify entities in the same sentence; rule-based, where the rules are defined manually or automatically; and machine learning, which uses annotated biomedical corpora to apply distant supervision. Distant supervision methods can be further categorized as feature-based and kernel-based. To date, most RE systems do not differentiate between positive, negative and false relations, although there are some exceptions, such as the Excerbt and BeFree systems. The former combines syntactic and semantic analyses with rule-based and machine learning approaches, and was adapted to detect negated lexical representations of lexical items (such as verbs, nouns or adjectives) for the annotation of the Negatome, a database of proteins that do not interact with each other. The latter uses a combination of kernel-based methods, namely the Shallow Linguistic Kernel and the Dependency Kernel. For the annotation of the GAD corpus using this system, a classifier was also trained to distinguish between positive, negative and false relations between genes and diseases. An estimated 13.5% of sentences in biomedical literature abstracts contain negated expressions.
Disregarding expressions that could potentially contain negative relations can lead to the loss of valuable information. However, most biomedical relation extraction databases aim only to collect positive relations between biomedical entities. Yet negative and positive examples are equally important for training, tuning and evaluating relation extraction systems. Since negative examples are not as well documented as positive ones, few databases contain them. Moreover, most biomedical relation extraction databases do not differentiate between false relations, in which two entities are not related, and negative relations, in which there is a statement of no association between two entities. Additionally, some silver-standard datasets (composed of automatically generated data) also contain false negative relations that are unknown or undocumented. Therefore, exploring these relations is a good starting point for expanding biomedical relation databases and populating them with correct negative examples. This work produced a dataset of annotations of human phenotypes and diseases and their negative relations, the Negative Phenotype-Disease Relations (NPDR) dataset, and a module for the automatic annotation of entities and relations. The first step in creating the NPDR dataset required collecting the PubMed identifiers (PMIDs) associated with the negative relations described in a gold-standard database made available by the Human Phenotype Ontology (HPO). From these PMIDs it was possible to retrieve full articles, which were subsequently analysed manually.
This analysis consisted of the description of the entities that participate in the negative relation, which comprises the analysis of the phenotypes, diseases and their associated genes; the description of the sentences that suggest the negative relation, which covers the characterization of the negation token used in the sentence and the co-occurrence of the entities; and the description of the location of the entities and sentences in the article. The NPDR dataset contains a total of 347 relations annotated at the document level, of which 222 were obtained from the HPO gold-standard database and 125 are new relations. To evaluate the impact of the NPDR dataset on the automatic annotation and extraction of entities and their relations, a pipeline that performs NER and RE and extracts negation sentences was implemented using the articles gathered for the creation of the dataset. NER recognizes human phenotypes and diseases, and RE extracts and classifies the relation between the entities. Two methods were employed to obtain the articles in a machine-readable format. The first consisted of gathering the PMIDs from the NPDR dataset and converting them into their corresponding PubMed Central identifiers (PMCIDs), in order to retrieve the full articles using the PubMed API. The second consisted of converting the articles gathered for the construction of the NPDR dataset from PDF to text format, using the PDFMiner text extraction tool. The NER step was performed using the Minimal Named-Entity Recognizer (MER) tool to extract mentions of phenotypes, diseases and genes from the articles. Finally, using a distant supervision approach, the HPO gold-standard database was used to obtain the relations given by the occurrence of phenotypes in the sentences that suggest the negative relation and the occurrence of related diseases and genes present in the article. Relations were marked as Known if the relation was described in the database, or Unknown otherwise.
For phenotype annotation, two lexica were used: one of official HPO terms, and another of synonyms obtained from the NPDR dataset. For the annotation of diseases and genes, the main lexicon was obtained from the Online Mendelian Inheritance in Man (OMIM) database, and the remaining lexica were built from disease synonyms and abbreviations present in the NPDR dataset. Adding the lexica derived from the NPDR dataset made it possible to annotate, on average, 20.77% more entities than annotation with the HPO and OMIM lexica alone. This larger number of entities was also reflected in a 15.11% increase in annotated relations. The RE task performed poorly, with an accuracy of 8.84% for the detected negative relations.
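The final labelling step described in this abstract, where candidate relations are marked Known or Unknown against a gold-standard knowledge base, can be sketched in Python as follows. This is an illustrative sketch only, not the thesis pipeline: the function names `candidate_pairs` and `label_pairs` and the placeholder identifiers (`PHENO_A`, `DISEASE_X`, ...) are hypothetical, and the sketch assumes the NER step has already produced the phenotypes found in negation sentences and the diseases found in the article.

```python
from itertools import product
from typing import Dict, Iterable, List, Set, Tuple

Pair = Tuple[str, str]  # (phenotype identifier, disease identifier)


def candidate_pairs(sentence_phenotypes: Iterable[str],
                    article_diseases: Iterable[str]) -> List[Pair]:
    """Pair every phenotype seen in a negation sentence with every
    disease mentioned in the article (co-occurrence candidates)."""
    return sorted(set(product(sentence_phenotypes, article_diseases)))


def label_pairs(pairs: Iterable[Pair], known_negative: Set[Pair]) -> Dict[Pair, str]:
    """Mark a candidate as 'Known' if the knowledge base already lists it
    as a negative relation, and as 'Unknown' otherwise."""
    return {pair: "Known" if pair in known_negative else "Unknown"
            for pair in pairs}


# Hypothetical stand-in for the gold-standard set of negative relations.
known_negative = {("PHENO_A", "DISEASE_X")}

pairs = candidate_pairs({"PHENO_A", "PHENO_B"}, {"DISEASE_X"})
labels = label_pairs(pairs, known_negative)
```

In this toy run, `("PHENO_A", "DISEASE_X")` is labelled Known and `("PHENO_B", "DISEASE_X")` Unknown; candidates of the second kind are the ones a curator would need to inspect as possible new negative relations.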
A review of sentiment analysis research in Arabic language
Sentiment analysis is a task of natural language processing which has
recently attracted increasing attention. However, sentiment analysis research
has mainly been carried out for the English language. Although Arabic is
ramping up as one of the most used languages on the Internet, only a few
studies have focused on Arabic sentiment analysis so far. In this paper, we
carry out an in-depth qualitative study of the most important research works in
this context by presenting limits and strengths of existing approaches. In
particular, we survey both approaches that leverage machine translation or
transfer learning to adapt English resources to Arabic and approaches that stem
directly from the Arabic language.
SARE: A sentiment analysis research environment
Sentiment analysis is an important learning problem with a broad scope of applications. The meteoric rise of online social media and the increasing significance of public opinion expressed therein have opened doors to many challenges as well as opportunities for this research. The challenges have been articulated in the literature through a growing list of sentiment analysis problems and tasks, while the opportunities are constantly being seized through the introduction of new algorithms and techniques for solving them. However, these approaches often remain out of the direct reach of other researchers, who have to either rely on benchmark datasets, which are not always available, or be inventive with their comparisons. This thesis presents Sentiment Analysis Research Environment (SARE), an extendable and publicly-accessible system designed with the goal of integrating baseline and state-of-the-art approaches to solving sentiment analysis problems. Since covering the entire breadth of the field is beyond the scope of this work, the usefulness of this environment is demonstrated by integrating solutions for certain facets of the aspect-based sentiment analysis problem. Currently, the system provides a semi-automatic method to support building gold-standard lexica, an automatic baseline method for extracting aspect expressions, and a pre-existing baseline sentiment analysis engine. Users are assisted in creating gold-standard lexica by applying our proposed set cover approximation algorithm, which finds a significantly reduced set of documents needed to create a lexicon. We also suggest a baseline semi-supervised aspect expression extraction algorithm based on a Support Vector Machine (SVM) classifier to automatically extract aspect expressions.
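The idea of reducing the documents needed to build a lexicon via set cover can be illustrated with the textbook greedy approximation. The abstract does not specify the authors' algorithm, so the sketch below is a generic stand-in rather than the SARE method; the `greedy_cover` function, the document identifiers and the term sets are all hypothetical.

```python
from typing import Dict, List, Set


def greedy_cover(doc_terms: Dict[str, Set[str]]) -> List[str]:
    """Standard greedy set cover: repeatedly pick the document that
    covers the most still-uncovered terms until every term is covered."""
    uncovered: Set[str] = set().union(*doc_terms.values())
    chosen: List[str] = []
    while uncovered:
        # max() returns the first best document in insertion order on ties.
        best = max(doc_terms, key=lambda d: len(doc_terms[d] & uncovered))
        gained = doc_terms[best] & uncovered
        if not gained:  # defensive guard; cannot trigger for this input
            break
        chosen.append(best)
        uncovered -= gained
    return chosen


# Invented toy corpus: each document maps to the candidate terms it contains.
docs = {
    "d1": {"battery", "screen"},
    "d2": {"screen"},
    "d3": {"price", "battery", "camera"},
    "d4": {"camera"},
}

selected = greedy_cover(docs)
```

On this toy corpus the greedy pass selects `d3` first (three new terms) and then `d1` (the remaining term), so an annotator would read two documents instead of four while still seeing every candidate term.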
User Interfaces to the Web of Data based on Natural Language Generation
We explore how Virtual Research Environments based on Semantic Web technologies support research interactions with RDF data in various stages of corpus-based analysis, analyze the Web of Data in terms of human readability, derive labels from variables in SPARQL queries, apply Natural Language Generation to improve user interfaces to the Web of Data by verbalizing SPARQL queries and RDF graphs, and present a method to automatically induce RDF graph verbalization templates via distant supervision.
Sentiment Analysis for micro-blogging platforms in Arabic
Sentiment Analysis (SA) concerns the automatic extraction and classification of
sentiments conveyed in a given text, i.e. labelling a text instance as positive, negative
or neutral. SA research has attracted increasing interest in the past few years due
to its numerous real-world applications. The recent interest in SA is also fuelled
by the growing popularity of social media platforms (e.g. Twitter), as they provide
large amounts of freely available and highly subjective content that can be readily
crawled.
Most previous SA work has focused on English with considerable success. In
this work, we focus on studying SA in Arabic, as a less-resourced language. This
work reports on a wide set of investigations for SA in Arabic tweets, systematically
comparing three existing approaches that have been shown to be successful in English.
Specifically, we report experiments evaluating fully-supervised-based (SL), distant-supervision-based
(DS), and machine-translation-based (MT) approaches for SA.
The investigations cover training SA models on manually-labelled (i.e. in SL methods)
and automatically-labelled (i.e. in DS methods) data-sets. In addition, we
explored an MT-based approach that utilises existing off-the-shelf SA systems for
English with no need for training data, assessing the impact of translation errors on
the performance of SA models, which has not been previously addressed for Arabic
tweets. Unlike previous work, we benchmark the trained models against an independent
test-set of >3.5k instances collected at different points in time to account
for topic-shift issues in the Twitter stream. Despite the challenging noisy medium
of Twitter and the mixed use of Dialectal and Standard forms of Arabic, we show
that our SA systems are able to attain performance scores on Arabic tweets that
are comparable to the state-of-the-art SA systems for English tweets.
The thesis also investigates the role of a wide set of features, including syntactic,
semantic, morphological, language-style and Twitter-specific features. We introduce
a set of affective-cues/social-signals features that capture information about the
presence of contextual cues (e.g. prayers, laughter, etc.) to correlate them with the
sentiment conveyed in an instance. Our investigations reveal a generally positive
impact for utilising these features for SA in Arabic. Specifically, we show that a rich
set of morphological features, which has not been previously used, extracted using
a publicly-available morphological analyser for Arabic can significantly improve the
performance of SA classifiers. We also demonstrate the usefulness of language-independent
features (e.g. Twitter-specific) for SA. Our feature-sets outperform
results reported in previous work on a previously built data-set.