A Hybrid Framework for Text Analysis
2015 - 2016. In Computational Linguistics there is an essential dichotomy between linguists and computer scientists. The first, with a strong knowledge of language structures, lack engineering skills; the second, conversely, expert in computer science and mathematics, attach little value to the basic mechanisms and structures of language. Moreover, this discrepancy has widened in recent decades with the growth of computational resources and the progressive computerization of the world: Machine Learning technologies for Artificial Intelligence problem solving, which allow machines, for example, to learn from manually generated examples, have been used more and more often in Computational Linguistics to overcome the obstacle represented by language structures and their formal representation.
The dichotomy has resulted in the birth of two main approaches to Computational Linguistics, which respectively prefer:
rule-based methods, which try to imitate the way humans use and understand language, reproducing the syntactic structures on which the understanding process is based and building lexical resources such as electronic dictionaries, taxonomies or ontologies;
statistics-based methods, which conversely treat language as a set of elements, quantifying words mathematically and trying to extract information without identifying syntactic structures or, in some algorithms, trying to give the machine the ability to learn these structures.
One of the main problems is the lack of communication between these two different approaches, due to the substantial differences characterizing them: on the one hand there is a strong focus on how language works and on its characteristics, with a tendency toward analytical and manual work; on the other hand, the engineering perspective sees language as an obstacle and regards algorithms as the fastest way to overcome it.
However, the lack of communication is not pure incompatibility: following Harris, the best way to approach natural language could result from taking the best of both.
At the moment there is a large number of open-source tools that perform text analysis and Natural Language Processing. A great part of these tools are based on statistical models and consist of separate modules that can be combined to create a text-processing pipeline. Many of these resources are code packages without a GUI (Graphical User Interface) and are therefore impossible to use for people without programming skills. Furthermore, the vast majority of these open-source tools support only English and, when Italian is included, their performance decreases significantly. Open-source tools built for Italian, on the other hand, are very few.
In this work we aim to fill this gap by presenting a new hybrid framework for the analysis of Italian texts. It is not intended as a commercial tool; it was built to help linguists and other scholars perform rapid text analysis and produce linguistic data. The framework, which performs both statistical and rule-based analysis, is called LG-Starship.
The idea is to build a modular piece of software that includes, from the start, the basic algorithms to perform different kinds of analysis. The modules perform the following tasks:
Preprocessing Module: loads a text, normalizes it and removes stop-words. As output, the module presents the list of tokens and letters that compose the text, with their occurrence counts, and the processed text.
Mr. Ling Module: performs POS tagging and lemmatization. The module also returns the table of lemmas with occurrence counts and the table quantifying grammatical tags.
Statistic Module: calculates Term Frequency and TF-IDF of tokens or lemmas, extracts bi-gram and tri-gram units and exports results as tables.
Semantic Module: uses the Hyperspace Analogue to Language algorithm to calculate semantic similarity between words. The module returns word-by-word similarity matrices that can be exported and analyzed.
Syntactic Module: analyzes the syntactic structure of a selected sentence and tags the verb and its arguments with semantic labels.
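As an illustration of the Statistic Module's core computation, here is a minimal TF-IDF sketch; the function name and the toy documents are our own, not the framework's actual API:

```python
import math
from collections import Counter

def tf_idf(documents):
    """Compute TF-IDF scores for a list of tokenized documents."""
    n_docs = len(documents)
    # document frequency: in how many documents each token appears
    df = Counter(token for doc in documents for token in set(doc))
    scores = []
    for doc in documents:
        tf = Counter(doc)
        scores.append({
            token: (count / len(doc)) * math.log(n_docs / df[token])
            for token, count in tf.items()
        })
    return scores

docs = [["il", "cane", "corre"], ["il", "gatto", "dorme"]]
scores = tf_idf(docs)
# "il" appears in every document, so its IDF (and score) is 0
```

A token shared by all documents, such as "il" above, receives a zero weight, which is exactly the property that makes TF-IDF useful for separating content words from function words.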
The objective of the framework is to build an all-in-one platform for NLP that allows any kind of user to perform basic and advanced text analysis. To make the framework accessible to users without specific computer science and programming skills, the modules have been provided with an intuitive GUI.
The framework can be considered hybrid in a double sense: as explained above, it uses both statistical and rule-based methods, relying on standard statistical algorithms and techniques and, at the same time, on Lexicon-Grammar syntactic theory. In addition, it has been written in both the Java and Python programming languages. The LG-Starship framework has a simple Graphical User Interface but will also be released as separate modules that can be included independently in any NLP pipeline.
There are many resources of this kind, but the large majority work for English. Free resources for Italian are very few, and this work tries to meet that need by proposing a tool that can be used both by linguists and other scientists interested in language and text analysis who have no programming background, and by computer scientists, who can use the free modules in their own code or in combination with different NLP algorithms.
The framework starts from a text or corpus written directly by the user or loaded from an external resource. The LG-Starship workflow is described in the flowchart shown in fig. 1. The pipeline shows that the Pre-Processing Module is applied to the original imported or generated text in order to produce a clean, normalized preprocessed text. This module includes a function for text splitting, a stop-word list and a tokenization method. To the preprocessed text, either the Statistic Module or the Mr. Ling Module can be applied. The first, which includes basic statistical algorithms such as Term Frequency, TF-IDF and n-gram extraction, produces as output databases of lexical and numerical data that can be used to produce charts or perform further external analysis. The second is divided into two main tasks: a POS tagger, based on the Averaged Perceptron Tagger [?] and trained on the Paisà Corpus [Lyding et al., 2014], performs Part-Of-Speech tagging and produces an annotated text; a lemmatization method, which relies on a set of electronic dictionaries developed at the University of Salerno [Elia, 1995; Elia et al., 2010], takes the POS-tagged text as input and produces a new, lemmatized version of the original text with information about syntactic and semantic properties.
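The preprocess / POS-tag / lemmatize chain described above can be pictured with a toy sketch; the tagger and the dictionary here are invented stand-ins for the Averaged Perceptron tagger and the Salerno electronic dictionaries, not the framework's actual components:

```python
# Toy stand-in for an electronic dictionary: lemma lookup keyed on
# (surface form, POS tag). All entries are invented for illustration.
TOY_DICTIONARY = {
    ("corre", "VERB"): "correre",
    ("cane", "NOUN"): "cane",
}

def tokenize(text):
    # minimal normalization: lowercase and split off final punctuation
    return text.lower().replace(".", " .").split()

def pos_tag(tokens):
    # stand-in tagger with a fixed lookup; a real tagger is statistical
    tags = {"il": "DET", "cane": "NOUN", "corre": "VERB", ".": "PUNCT"}
    return [(tok, tags.get(tok, "X")) for tok in tokens]

def lemmatize(tagged):
    # fall back to the surface form when the dictionary has no entry
    return [(tok, tag, TOY_DICTIONARY.get((tok, tag), tok))
            for tok, tag in tagged]

result = lemmatize(pos_tag(tokenize("Il cane corre.")))
# e.g. ("corre", "VERB", "correre")
```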
This lemmatized text, which can also be processed with the Statistic Module, serves as input for two deeper levels of text analysis, carried out by the Syntactic Module and the Semantic Module.
The first rests on Lexicon-Grammar theory [Gross, 1971, 1975] and uses a database of predicate structures under development at the Department of Political, Social and Communication Science. Its objective is to produce a dependency graph of the sentences that compose the text.
The Semantic Module uses the Hyperspace Analogue to Language distributional semantics algorithm [Lund and Burgess, 1996], trained on the Paisà Corpus, to produce a semantic network of the words of the text.
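The core of the Hyperspace Analogue to Language idea, sliding-window co-occurrence counts weighted by distance and compared with cosine similarity, can be sketched as follows; this is a minimal illustration, not the framework's implementation:

```python
import math
from collections import defaultdict

def hal_vectors(tokens, window=5):
    """HAL sketch: words co-occurring within a sliding window get
    co-occurrence weights that decrease with distance."""
    vectors = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(tokens):
        for d in range(1, window + 1):
            if i + d < len(tokens):
                # closer neighbours receive a higher weight
                weight = window - d + 1
                vectors[word][tokens[i + d]] += weight
                vectors[tokens[i + d]][word] += weight
    return vectors

def cosine(u, v):
    keys = set(u) | set(v)
    dot = sum(u[k] * v[k] for k in keys)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

vecs = hal_vectors("il cane corre il gatto corre".split())
sim = cosine(vecs["cane"], vecs["gatto"])  # shared contexts -> positive similarity
```

Collecting such similarities for every word pair gives exactly the word-by-word similarity matrix the Semantic Module exports.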
This workflow has been used in two different experiments involving two user-generated corpora.
The first experiment is a statistical study of the language of rap music in Italy, through the analysis of a large corpus of rap song lyrics downloaded from online databases of user-generated lyrics.
The second experiment is a feature-based Sentiment Analysis project performed on user product reviews. For this project we integrated a large domain database of linguistic resources for Sentiment Analysis, developed in past years by the Department of Political, Social and Communication Science of the University of Salerno, which consists of polarized dictionaries of verbs, adjectives, adverbs and nouns.
These two experiments show how the linguistic framework can be applied to different levels of analysis and can produce both qualitative and quantitative data.
As for the obtained results, the framework, which is only at a beta version, achieves fair results both in terms of processing time and in terms of precision. Nevertheless, the work is far from complete. More algorithms will be added to the Statistic Module and the Syntactic Module will be completed. The GUI will be improved and made more attractive and modern and, in addition, an open-source online version of the modules will be published. [edited by author] XV n.s.
Using Natural Language Processing to Mine Multiple Perspectives from Social Media and Scientific Literature.
This thesis studies how Natural Language Processing techniques can be used to mine perspectives from textual data. The first part of the thesis focuses on analyzing the text exchanged by people who participate in discussions on social media sites. We particularly focus on threaded discussions of ideological and political topics. The goal is to identify the different viewpoints that the discussants have with respect to the discussion topic. We use subjectivity and sentiment analysis techniques to identify the attitudes that the participants carry toward one another and toward the different aspects of the discussion topic. This involves identifying opinion expressions and their polarities, and identifying the targets of opinion. We use this information to represent discussions in one of two representations: discussant attitude vectors or signed attitude networks. We use data mining and network analysis techniques to analyze these representations to detect rifts in discussion groups and study how the discussants split into subgroups with contrasting opinions.
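The signed attitude network representation can be pictured with a toy sketch; the discussants and edge signs below are invented, and the splitting rule assumes a structurally balanced network (positive edges within camps, negative edges across them), which is a simplification of the network analysis techniques the thesis actually uses:

```python
from collections import deque

# Signed attitude network: +1 for a positive attitude between two
# discussants, -1 for a negative one. All data here is invented.
edges = {
    ("ann", "bob"): +1, ("ann", "carl"): -1,
    ("bob", "dana"): -1, ("carl", "dana"): +1,
}

def split_subgroups(edges):
    """Split a structurally balanced signed network into two camps
    by propagating signs breadth-first from each unvisited node."""
    neighbours = {}
    for (u, v), sign in edges.items():
        neighbours.setdefault(u, []).append((v, sign))
        neighbours.setdefault(v, []).append((u, sign))
    side = {}
    for start in neighbours:
        if start in side:
            continue
        side[start] = +1
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v, sign in neighbours[u]:
                if v not in side:
                    # positive edge: same camp; negative edge: opposite camp
                    side[v] = side[u] * sign
                    queue.append(v)
    return side

camps = split_subgroups(edges)
# ann and bob end up in one camp, carl and dana in the opposing one
```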
In the second part of the thesis, we use linguistic analysis to mine scholars' perspectives from scientific literature through the lens of citations. We analyze the text adjacent to reference anchors in scientific articles as a means to identify researchers' viewpoints toward previously published work. We propose methods for identifying, extracting, and cleaning citation text. We analyze this text to identify the purpose (author's intention) and polarity (author's sentiment) of citation. Finally, we present several applications that can benefit from this analysis, such as generating multi-perspective summaries of scientific articles and predicting the future prominence of publications.
PhD, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/99934/1/amjbara_1.pd
Predicate Matrix: an interoperable lexical knowledge base for predicates
183 p. The Predicate Matrix is a new lexical-semantic resource resulting from the integration of multiple knowledge sources, among them FrameNet, VerbNet, PropBank and WordNet. The Predicate Matrix provides an extensive and robust lexicon that improves the interoperability between the aforementioned semantic resources. Its creation is based on the integration of Semlink and on new mappings obtained with automatic methods that link semantic knowledge at the lexical and role levels. We have also extended the Predicate Matrix to cover nominal predicates (English, Spanish) and predicates in other languages (Spanish, Catalan and Basque). As a result, the Predicate Matrix provides a multilingual lexicon that enables interoperable semantic analysis in multiple languages.
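The kind of cross-resource record such a matrix provides can be pictured with a toy sketch; the identifiers below are illustrative and not taken from the actual Predicate Matrix data:

```python
# One toy "row" linking the same predicate sense across FrameNet,
# VerbNet, PropBank and WordNet. Identifiers are illustrative only.
row = {
    "lemma": "buy",
    "verbnet": {"class": "get-13.5.1", "role": "Agent"},
    "framenet": {"frame": "Commerce_buy", "fe": "Buyer"},
    "propbank": {"roleset": "buy.01", "arg": "ARG0"},
    "wordnet": {"sense": "buy%2:40:00::"},
}

def map_role(row, source, target):
    """Map a role label from one resource to another via the shared row.
    Only the three role-bearing resources are supported in this sketch."""
    keys = {"verbnet": "role", "framenet": "fe", "propbank": "arg"}
    return row[source][keys[source]], row[target][keys[target]]

# e.g. VerbNet "Agent" corresponds to FrameNet "Buyer" for this sense
```

This is the interoperability the abstract describes: once the senses are linked in one row, a role label produced by a tool trained on one resource can be read off in the vocabulary of another.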
Named Entities Recognition for Machine Translation: A Case Study on the Importance of Named Entities for Customer Support
The last two decades have brought significant change to the international panorama at all levels. The onset of the internet and content availability has propelled us into a new era: the Information Age.
The staggering growth of new digital content, whether in the form of ebooks, on-demand TV shows, blogs or even e-commerce websites, has led to an increase in the need for translated material, influenced by people's demand for quick access to this shared knowledge in their native languages and dialects. Fortunately, machine translation (MT) technologies, which in many cases provide human-like translations, are now more widely available, enabling quicker translations for multiple languages at more affordable prices.
This work describes the Natural Language Processing (NLP) sub-task known as Named Entity Recognition (NER) as performed by Unbabel, a Portuguese machine-translation start-up that combines MT with human post-editing and focuses strictly on customer service content, in order to improve translation quality. The main objective of this study is to contribute to furthering MT quality and good practices by exposing the importance of having a continuously developed, robust Named Entity Recognition system for generic and client-specific content in an MT pipeline and for General Data Protection Regulation (GDPR) compliance; moreover, with future applications in mind, we have tested strategies that support the creation of multilingual Named Entity Recognition systems.
In the following work, we first define the meaning of Named Entity, highlighting its importance in a machine translation scenario, followed by a brief historical overview of the subject. We also provide a concise description of the most recent data-driven machine translation technologies.
Concerning the main topic of this work, we describe three experiments carried out jointly with Unbabel's NLP team. The first experiment focuses on assisting the NLP team in the creation of a domain-specific Named Entity Recognition (NER) system. The second and third experiments explore the possibility of creating multilingual NER gold standards in a semi-automatic fashion, by resorting to aligners
able to project Named Entities across a parallel corpus.
The last two decades have brought great change at all levels. The advent of the internet and of widely available content has propelled us into a new era: the Information Age.
The impressive growth of new digital content, whether in the form of ebooks, on-demand television programmes, blogs or even e-commerce websites, has led to an increase in translated material, driven largely by people's demand for quick access to this shared knowledge in their native languages or dialects. Fortunately, the new machine translation (MT) technologies, which in many cases rival human translations in quality, are now widely available, enabling translations into a wide range of languages in record time and at better prices than those charged by human translators.
This work describes the Natural Language Processing (NLP) sub-task known as Named Entity Recognition (NER) as used by Unbabel, a Portuguese start-up that combines machine translation with human post-editing to improve the quality of machine translations and focuses mainly on customer support content. The main objective of this work is to contribute to a growing improvement in machine translation quality and to foster good practices in the field, by showing the importance of keeping a robust, constantly evolving Named Entity Recognition system in the translation cycle, capable of handling different types of content, from the most generic to the most specific, and of complying with the data protection requirements of the General Data Protection Regulation (GDPR); additionally, with possible future applications in mind, we tested novel strategies that enable and encourage the creation of a multilingual Named Entity Recognition system.
In this document we first define the notion of Named Entity, explaining its importance in a machine translation context. We then give a historical overview of the subject. We also provide historical background on machine translation systems themselves, with a special focus on the most recent technologies developed on the basis of data and Artificial Intelligence systems.
Concerning the main topic of this work, we describe the three experiments carried out during the internship at Unbabel. All the experiments were based on real client data from a wide range of domains, with each corpus selected according to the final objectives of the experiment.
The first experiment, whose objective was to help Unbabel's Artificial Intelligence team develop and test an automatic Named Entity Recognition system for the food delivery domain, anticipated the future possibility of adapting this type of system to any specific domain or client. With this experiment, the first steps were taken at Unbabel towards the creation of domain-specific Named Entity Recognition systems.
We began by presenting and testing a methodology for identifying the types of Named Entities common to the above-mentioned domain. To this end, an extensive corpus in the area was compiled and analyzed, making it possible to identify four types, i.e. categories, of Named Entities relevant to the domain: Restaurant Names, Restaurant Chains, Dish Names and Beverages. Annotation guidelines were then created for each new category, and these were added to Unbabel's existing Named Entity annotation typology, which includes 27 more generic NE categories such as Locations, Currencies, Measures, Addresses, Products and Services, and Cities. Next, an annotation task was carried out on a new corpus from the same domain, composed of 14,426 sentences, with a view to building gold standards to be used for training automatic Named Entity Recognition systems and for testing their results. For this task we used the new guidelines, which also allowed us to test them.
Two models were trained, one with only the domain-specific gold standard, the other with the domain-specific gold standard plus all the available Named Entity annotations, making it possible to determine which of the two obtained better results.
Regarding the results, it was determined that the domain-specific gold standard did not contain enough examples for training and testing the new Named Entity Recognition system. Even so, it was possible to obtain results for the Dish Names category, which showed that, of the two models, the one trained with the domain-specific gold standard achieved better results, correctly identifying more Dish Names in the test corpus.
The second experiment focused on testing a strategy for the automatic creation of multilingual Named Entity gold standards for training automatic systems, resorting to systems that align Named Entities in bitexts (bilingual parallel texts). For this experiment we used an English (EN) corpus translated into German (DE) in the tourism domain, with 2,500 sentences, and four state-of-the-art word alignment systems.
We began by submitting the translated (DE) corpus to a manual Named Entity annotation process, using Unbabel's Named Entity annotation guidelines; for this experiment the new entities from the first experiment were not considered. Once the translated corpus had been annotated, it could be sent for Named Entity alignment with its English (EN) counterpart, which had previously been annotated by another annotator. The entity alignment results for the bitext made it possible to evaluate the Named Entity inter-annotator agreement regarding the selection and categorization of the different entities, in order to understand which entities are hardest to annotate. Additionally, the alignment results made it possible to determine the best-performing alignment system among the four analyzed (SimAlign, FastAlign, AwesomeAlign and eflomal).
The annotation results showed a high inter-annotator agreement, reaching 87.97% for some categories. Additionally, the alignment results established SimAlign as the most effective and precise alignment system, surpassing the system used by Unbabel, FastAlign.
The third experiment replicated the process described above, this time using a bitext (EN and PT-BR) composed of 360 sentences in the technology domain. With this new experiment we intended to verify whether the alignment results obtained for the EN/DE tourism corpus are replicable when the domain and the language pair change. This experiment, like the previous one, involved a Named Entity annotation task on the corpus in question (EN and PT-BR), using the same annotation guidelines as the previous experiment. The annotated bitext was then sent for alignment, using the same alignment systems as the second experiment.
Based on the results of the experiment, it was possible to determine, for each Named Entity category, which alignment systems obtained the best results. This analysis led to the conclusion that the automatic alignment system AwesomeAlign performed best, followed by SimAlign, which showed lower alignment performance for the Organizations Named Entity category.
In conclusion, with this work we intend to show the complexity and importance of Named Entities in a machine translation pipeline, as well as the importance of robust and adaptable Named Entity Recognition systems. Named Entity Recognition systems trained with a focus on particular domains are expected to achieve better results than those trained on more generic data.
We likewise highlight the possibility and applicability of using different Natural Language Processing resources, such as alignment systems, to support Named Entity Recognition, as in the cases described above.
From a more linguistic perspective, we addressed questions related to ambiguous Named Entities. Here we established which entities show the greatest annotation variability between annotators, that is, those on which annotators disagreed most about their classification, and tried to find explanations and solutions for this problem.
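The alignment-based projection of Named Entities across a bitext, used in the second and third experiments, can be sketched as follows; the sentences, labels and alignment below are invented for illustration, and a real alignment would come from a system such as SimAlign or FastAlign rather than be written by hand:

```python
# Project Named Entity annotations from a source sentence onto its
# translation through a word alignment. All data here is invented.
src = ["Unbabel", "was", "founded", "in", "Lisbon"]
tgt = ["A", "Unbabel", "foi", "fundada", "em", "Lisboa"]
src_entities = {0: "ORG", 4: "LOC"}          # source token index -> label
alignment = {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}   # source index -> target index

def project_entities(src_entities, alignment):
    """Carry each source entity label over to the aligned target token;
    unaligned source entities are simply dropped."""
    return {alignment[i]: label
            for i, label in src_entities.items() if i in alignment}

tgt_entities = project_entities(src_entities, alignment)
# {1: "ORG", 5: "LOC"}: "Unbabel" and "Lisboa" inherit the labels
```

Running this over a whole annotated bitext yields a silver-standard annotation of the target side, which is the semi-automatic gold-standard creation strategy the experiments evaluate.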
Detecting subjectivity through lexicon-grammar: strategies, databases, rules and apps for the Italian language
2014 - 2015. The present research handles the detection of linguistic phenomena connected to subjectivity, emotions and opinions from a computational point of view.
The necessity of quickly monitoring huge quantities of semi-structured and unstructured data from the web poses several challenges to Natural Language Processing, which must provide strategies and tools to analyze their structures from lexical, syntactic and semantic points of view.
The general aim of Sentiment Analysis, shared with the broader fields of NLP, Data Mining, Information Extraction, etc., is the automatic extraction of value from chaos; its specific focus, instead, is on opinions rather than on factual information. This is the aspect that differentiates it from other computational linguistics subfields.
The majority of sentiment lexicons have been manually or automatically created for the English language; therefore, existing Italian lexicons are mostly built through the translation and adaptation of English lexical databases, e.g. SentiWordNet and WordNet-Affect.
Unlike many other Italian and English sentiment lexicons, our database SentIta, built on the interaction of electronic dictionaries and lexicon-dependent local grammars, is able to manage simple words and multiword structures, which can take the shape of distributionally free structures, distributionally restricted structures and frozen structures.
Moreover, differently from other lexicon-based Sentiment Analysis methods, our approach is grounded in the solidity of Lexicon-Grammar resources and classifications, which provide fine-grained semantic but also syntactic descriptions of the lexical entries.
In accordance with the major contributions in the Sentiment Analysis literature, we did not consider polar words in isolation. We computed their elementary sentence contexts, with the allowed transformations, and, then, their interaction with contextual valence shifters, the linguistic devices that can modify the prior polarity of words from SentIta when occurring with them in the same sentences. To do so, we took advantage of the computational power of finite-state technology. We formalized a set of rules that model intensification, downtoning and negation, detect modality and analyze comparative forms.
With regard to the applicative part of the research, we conducted, with satisfactory results, three experiments on three Sentiment Analysis subtasks: the sentiment classification of documents and sentences, feature-based Sentiment Analysis, and Semantic Role Labeling based on sentiments. [edited by author] XIV n.s.
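The interaction between prior polarities and contextual valence shifters can be pictured with a toy sketch; the lexicon entries and the fixed two-token window are invented stand-ins for SentIta's dictionaries and finite-state local grammars:

```python
# Toy lexicon-based polarity scorer with contextual valence shifters:
# negation flips polarity, intensifiers scale it. Entries are invented.
LEXICON = {"ottimo": +2, "pessimo": -2, "buono": +1}
NEGATIONS = {"non"}
INTENSIFIERS = {"molto": 2.0, "davvero": 1.5}

def score_sentence(tokens):
    total = 0.0
    for i, tok in enumerate(tokens):
        if tok not in LEXICON:
            continue
        score = float(LEXICON[tok])
        # look two tokens back for shifters; a real system would use
        # local grammars instead of a fixed window
        for w in tokens[max(0, i - 2):i]:
            if w in INTENSIFIERS:
                score *= INTENSIFIERS[w]
            if w in NEGATIONS:
                score = -score
        total += score
    return total

score_sentence("film molto buono".split())  # 2.0: "molto" doubles +1
score_sentence("film non buono".split())    # -1.0: "non" flips the polarity
```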
Semantic Systems. The Power of AI and Knowledge Graphs
This open access book constitutes the refereed proceedings of the 15th International Conference on Semantic Systems, SEMANTiCS 2019, held in Karlsruhe, Germany, in September 2019. The 20 full papers and 8 short papers presented in this volume were carefully reviewed and selected from 88 submissions. They cover topics such as: web semantics and linked (open) data; machine learning and deep learning techniques; semantic information management and knowledge integration; terminology, thesaurus and ontology management; data mining and knowledge discovery; semantics in blockchain and distributed ledger technologies.
Compositional language processing for multilingual sentiment analysis
Programa Oficial de Doutoramento en Computación. 5009V01. [Abstract] This dissertation presents new approaches in the field of sentiment analysis and polarity classification, oriented towards obtaining the sentiment of a phrase, sentence or document from a natural language processing point of view. It places special emphasis on methods to handle semantic compositionality, i.e. the ability to compose the sentiment of multiword phrases, where the global sentiment might be different from, or even opposite to, the one coming from each of their individual components, and on the application of these methods to multilingual scenarios.
On the one hand, we introduce knowledge-based approaches to calculate the semantic orientation at the sentence level, which can handle different phenomena relevant to the task at hand (e.g. negation, intensification or adversative subordinate clauses).
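One of the phenomena mentioned here, the adversative construction, can be pictured with a toy compositional rule in which the clause after "but" dominates; the lexicon and the weighting are invented for illustration, not the dissertation's actual method:

```python
# Tiny invented polarity lexicon; unknown words contribute 0.
LEXICON = {"great": +2, "slow": -1, "broken": -2, "cheap": +1}

def clause_orientation(tokens):
    return sum(LEXICON.get(t, 0) for t in tokens)

def sentence_orientation(tokens):
    """Compose clause orientations: in "X but Y", the adversative
    clause Y outweighs the preceding clause X."""
    if "but" in tokens:
        i = tokens.index("but")
        main, adversative = tokens[:i], tokens[i + 1:]
        # down-weight the first clause so the second one dominates
        return clause_orientation(adversative) + 0.5 * clause_orientation(main)
    return clause_orientation(tokens)

sentence_orientation("the screen is great but the battery is broken".split())
# -1.0: the negative adversative clause dominates the positive one
```

The point of the sketch is that a plain sum of word polarities would yield 0 for this sentence, while the compositional rule recovers the intuitively negative reading.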
On the other hand, we describe how to build machine learning models to perform polarity classification from a different perspective, combining linguistic (lexical, syntactic and semantic) knowledge, with an emphasis on noisy micro-texts.
Experiments on standard corpora and international evaluation campaigns
show the competitiveness of the methods here proposed, in
monolingual, multilingual and code-switching scenarios.
The contributions presented in the thesis have potential applications in the era of Web 2.0 and social media, such as determining society's view of products, celebrities or events, identifying their strengths and weaknesses, or monitoring how these opinions evolve over time. We also show how some of the proposed
models can be useful for other data analysis tasks.[Resumen] Esta tesis presenta nuevas técnicas en el ámbito del análisis del sentimiento
y la clasificación de polaridad, centradas en obtener el sentimiento
de una frase, oración o documento siguiendo enfoques basados en
procesamiento del lenguaje natural. En concreto, nos centramos en
desarrollar métodos capaces de manejar la semántica composicional,
es decir, con la capacidad de componer el sentimiento de oraciones
donde la polaridad global puede ser distinta, o incluso opuesta, de la
que se obtendría individualmente para cada uno de sus términos; y
cómo dichos métodos pueden ser aplicados en entornos multilingües.
En la primera parte de este trabajo, introducimos aproximaciones
basadas en conocimiento para calcular la orientación semántica a nivel
de oración, teniendo en cuenta construcciones lingüísticas relevantes
en el ámbito que nos ocupa (por ejemplo, la negación, intensificación,
or adversative subordinate clauses).
In the second part, we describe how to build machine-learning polarity
classifiers that combine lexical, syntactic, and semantic information,
focusing on their application to short texts of poor grammatical quality.
Experiments carried out on standard collections and international
evaluation campaigns show the effectiveness of the methods proposed
here in monolingual, multilingual, and code-switching settings.
The contributions presented in this thesis have a variety of applications
in the era of Web 2.0 and social networks, such as determining the
opinion that society holds about a product, celebrity, or event;
identifying its strengths and weaknesses; or monitoring how these opinions
evolve over time. Finally, we also show how some of the proposed
models can be useful for other data analysis tasks.
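The knowledge-based composition mentioned above (negation, intensification) can be sketched as follows. This is a minimal illustration with a hypothetical lexicon and hypothetical weights, and it uses linear word order as a stand-in for the syntax-driven composition the thesis works with:

```python
# Hypothetical prior-polarity lexicon: word -> score in [-1, 1].
LEXICON = {"good": 0.7, "bad": -0.7, "great": 0.9, "terrible": -0.9}
NEGATORS = {"not", "never"}
INTENSIFIERS = {"very": 1.5, "slightly": 0.5}

def sentence_polarity(tokens):
    """Sum word polarities, flipping the sign after a negator and
    scaling after an intensifier (a linear-order simplification)."""
    score, negate, scale = 0.0, False, 1.0
    for tok in tokens:
        t = tok.lower()
        if t in NEGATORS:
            negate = True
        elif t in INTENSIFIERS:
            scale = INTENSIFIERS[t]
        elif t in LEXICON:
            s = LEXICON[t] * scale
            score += -s if negate else s
            negate, scale = False, 1.0
    return score

print(sentence_polarity("this is not very good".split()))  # negative
print(sentence_polarity("a great phone".split()))          # positive
```

Note how "not very good" comes out negative even though "good" alone is positive: this is the compositional behaviour the abstract refers to.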
Text mining and natural language processing for the early stages of space mission design
Final thesis submitted December 2021; degree awarded in 2022.
A considerable amount of data related to space mission design has been accumulated
since artificial satellites started to venture into space in the 1950s. This data has today
become an overwhelming volume of information, triggering a significant knowledge
reuse bottleneck at the early stages of space mission design. Meanwhile, virtual assistants,
text mining and Natural Language Processing techniques have become pervasive
to our daily life.
The work presented in this thesis is one of the first attempts to bridge the gap
between the worlds of space systems engineering and text mining. Several novel models
are thus developed and implemented here, targeting the structuring of accumulated
data through an ontology, but also tasks commonly performed by systems engineers
such as requirement management and heritage analysis. A first collection of documents
related to space systems is gathered for the training of these methods. Eventually, this
work aims to pave the way towards the development of a Design Engineering Assistant
(DEA) for the early stages of space mission design. It is also hoped that this work will
actively contribute to the integration of text mining and Natural Language Processing
methods in the field of space mission design, enhancing current design processes.
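Heritage analysis of the kind mentioned above can be framed as document retrieval: given a new mission description, rank past-mission documents by textual similarity. A minimal sketch with an invented corpus and plain TF-IDF cosine similarity (illustrative only, not the models from the thesis):

```python
import math
from collections import Counter

# Invented toy corpus of past-mission descriptions.
docs = {
    "mission_a": "earth observation satellite with optical payload",
    "mission_b": "deep space probe with ion propulsion",
    "mission_c": "cubesat earth observation optical camera",
}

def tfidf_vectors(texts):
    """Term frequency times inverse document frequency, per document."""
    tokenized = {name: text.split() for name, text in texts.items()}
    df = Counter(w for toks in tokenized.values() for w in set(toks))
    n = len(texts)
    return {
        name: {w: (c / len(toks)) * math.log(n / df[w])
               for w, c in Counter(toks).items()}
        for name, toks in tokenized.items()
    }

def cosine(u, v):
    dot = sum(weight * v.get(w, 0.0) for w, weight in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

vecs = tfidf_vectors(docs)
# Rank past missions by similarity to the new "mission_c" description.
ranked = sorted(
    (name for name in docs if name != "mission_c"),
    key=lambda name: cosine(vecs["mission_c"], vecs[name]),
    reverse=True,
)
print(ranked)  # mission_a first: shared earth-observation/optical vocabulary
```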
Knowledge Modelling and Learning through Cognitive Networks
One of the most promising developments in modelling knowledge is cognitive network science, which aims to investigate cognitive phenomena driven by the networked, associative organization of knowledge. For example, investigating the structure of semantic memory via semantic networks has illuminated how memory recall patterns influence phenomena such as creativity, memory search, learning and, more generally, knowledge acquisition, exploration, and exploitation. In parallel, neural network models for artificial intelligence (AI) are also becoming more widespread as inferential models for understanding which features drive language-related phenomena such as meaning reconstruction, stance detection, and emotional profiling. Whereas cognitive networks map explicitly which entities engage in associative relationships, neural networks perform an implicit mapping of correlations in cognitive data as weights, obtained after training over labelled data, whose interpretation is not immediately evident to the experimenter.
This book aims to bring together quantitative, innovative research that focuses on modelling knowledge through cognitive and neural networks to gain insight into the mechanisms driving cognitive processes related to knowledge structuring, exploration, and learning. The book comprises a variety of publication types, including reviews and theoretical papers, empirical research, computational modelling, and big data analysis. All papers here share a commonality: they demonstrate how the application of network science and AI can extend and broaden cognitive science in ways that traditional approaches cannot.
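As a toy illustration of the explicit, associative structure that cognitive networks expose, the sketch below builds a small semantic network from invented free-association edges (not real norms data) and measures the associative distance between concepts with breadth-first search:

```python
from collections import deque

# Invented free-association edges between concepts.
edges = [("dog", "cat"), ("cat", "pet"), ("pet", "home"),
         ("dog", "bone"), ("home", "family"), ("idea", "mind")]

# Build an undirected adjacency map.
graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

def distance(start, goal):
    """Associative distance: shortest path length between two concepts,
    or None if they sit in disconnected components."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, d = queue.popleft()
        if node == goal:
            return d
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return None

print(distance("dog", "family"))  # 4: dog - cat - pet - home - family
print(distance("dog", "idea"))   # None: no associative path
```

Unlike the weights of a trained neural network, every step of such a path is directly inspectable, which is the interpretability contrast the text draws.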
Learning to represent, categorise and rank in community question answering
The task of Question Answering (QA) is arguably one of the oldest tasks in Natural Language Processing, attracting high levels of interest from both industry and academia. However, most research has focused on factoid questions, e.g. "Who is the president of Ireland?" In contrast, research on answering non-factoid questions, such as manner, reason, difference and opinion questions, has been rather piecemeal.
This was largely due to the absence of available labelled data for the task. This is changing, however, with the growing popularity of Community Question Answering (CQA) websites, such as Quora, Yahoo! Answers and the Stack Exchange family of forums. These websites provide natural labelled data allowing us to apply machine learning techniques.
Most previous state-of-the-art approaches to the tasks of CQA-based question answering involved handcrafted features in combination with linear models. In this thesis we hypothesise that the use of handcrafted features can be avoided and the tasks can be approached with representation learning techniques, specifically deep learning.
In the first part of this thesis we give an overview of deep learning in natural language processing and empirically evaluate our hypothesis on the task of detecting semantically equivalent questions, i.e. predicting if two questions can be answered by the same answer.
In the second part of the thesis we address the task of answer ranking, i.e. determining how suitable an answer is for a given question. In order to determine the suitability of representation learning for the task of answer ranking, we provide a rigorous experimental evaluation of various neural architectures, based on feedforward, recurrent and convolutional neural networks, as well as their combinations.
This thesis shows that deep learning is a very suitable approach to CQA-based QA, achieving state-of-the-art results on the two tasks we addressed.
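The answer-ranking setup described above can be sketched as follows. A simple bag-of-words cosine score stands in here for the neural scorers evaluated in the thesis, and the question and candidate answers are invented:

```python
import math
from collections import Counter

def cosine(a, b):
    """Bag-of-words cosine similarity between two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_answers(question, answers):
    """Order candidate answers from most to least suitable."""
    return sorted(answers, key=lambda a: cosine(question, a), reverse=True)

question = "how do I install python on linux"
answers = [
    "use your linux package manager to install python",
    "restart the router and try again",
    "python can be compiled from source",
]
best = rank_answers(question, answers)[0]
print(best)  # the package-manager answer overlaps most with the question
```

In the thesis, the scoring function is learned (feedforward, recurrent, and convolutional architectures and their combinations) rather than hand-defined as here; only the rank-by-score structure is the same.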