RAPPORT: a fact-based question answering system for Portuguese
Question answering is one of the longest-standing problems in natural language processing. Although natural language interfaces to computer systems have become more common, the same cannot yet be said of access to specific textual information. Any full-text search engine can easily retrieve documents containing user-specified or closely related terms; however, it is typically unable to answer user questions with small passages or short answers.
The problem with question answering is that text is hard to process, due to its syntactic structure and, to an even higher degree, its semantic content. At the sentence level, although the syntactic aspects of natural language follow well-known rules, the size and complexity of a sentence may make its structure difficult to analyze. Semantic aspects are harder still, with textual ambiguity being among the most difficult problems to handle. The question itself must also be correctly processed in order to determine its target, and the candidate answers found in a text must then be selected and processed. Additionally, the selected text that may yield the answer to a given question must be further processed so that just a passage, rather than the full text, is presented. These issues also take longer to address in languages other than English, such as Portuguese, on which far fewer people are working.
This work focuses on question answering for Portuguese. In other words, our interest lies in presenting short answers, passages, and possibly full sentences, but not whole documents, in response to questions formulated in natural language. For that purpose, we have developed a system, RAPPORT, built upon open information extraction techniques for extracting triples, so-called facts, characterizing the information in text files, and then storing and using them to answer user queries posed in natural language. These facts, in the form of subject, predicate and object, alongside other metadata, constitute the basis of the answers presented by the system. Facts work both by storing short and direct information found in a text, typically entity-related information, and by containing in themselves the answers to questions, already in the form of small passages. As for the results, although there is room for improvement, they are tangible proof of the adequacy of our approach, and of its different modules for storing information and retrieving answers in question answering systems.
In the process, in addition to contributing a new approach to question answering for Portuguese and validating the application of open information extraction to question answering, we have developed a set of tools that has been used in other natural language processing work, such as a lemmatizer, LEMPORT, which was built from scratch and has high accuracy. Many of these tools result from improving those found in the Apache OpenNLP toolkit, by pre-processing their input, post-processing their output, or both, and by training models for use in those tools or in others, such as MaltParser. Other tools include interfaces to resources containing, for example, synonyms, hypernyms and hyponyms, and the creation of lists of, for instance, relations between verbs and agents, using rules.
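The core idea of answering questions from stored subject–predicate–object triples can be sketched as follows. This is a minimal illustration of the general technique, not RAPPORT's actual implementation; the `Fact` structure, the matching logic and the example triples are all assumptions made for the sketch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:
    """One extracted triple, plus the document it came from."""
    subject: str
    predicate: str
    obj: str
    source: str

class FactStore:
    def __init__(self):
        self.facts = []

    def add(self, fact: Fact):
        self.facts.append(fact)

    def answer(self, question_focus: str, predicate_hint: str) -> list[str]:
        """Return the objects of facts whose subject matches the question
        focus and whose predicate matches the hinted relation."""
        return [
            f.obj for f in self.facts
            if question_focus.lower() in f.subject.lower()
            and predicate_hint.lower() in f.predicate.lower()
        ]

store = FactStore()
store.add(Fact("Lisboa", "é capital de", "Portugal", "doc1.txt"))
store.add(Fact("Camões", "escreveu", "Os Lusíadas", "doc2.txt"))

# "Quem escreveu Os Lusíadas?" reduces to (focus="Camões", relation="escreveu")
print(store.answer("camões", "escreveu"))  # ['Os Lusíadas']
```

A real pipeline would put an open information extraction step in front of `add` and a question analysis step in front of `answer`; the point here is only that once facts are stored as triples, answering becomes a lookup that returns a short answer rather than a document.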
BLUEX: A benchmark based on Brazilian Leading Universities Entrance eXams
One common trend in recent studies of language models (LMs) is the use of
standardized tests for evaluation. However, despite being the fifth most spoken
language worldwide, few such evaluations have been conducted in Portuguese.
This is mainly due to the lack of high-quality datasets available to the
community for carrying out evaluations in Portuguese. To address this gap, we
introduce the Brazilian Leading Universities Entrance eXams (BLUEX), a dataset
of entrance exams from the two leading universities in Brazil: UNICAMP and USP.
The dataset includes annotated metadata for evaluating the performance of NLP
models on a variety of subjects. Furthermore, BLUEX includes a collection of
recently administered exams that are unlikely to be included in the training
data of many popular LMs as of 2023. The dataset is also annotated to indicate
the position of images in each question, providing a valuable resource for
advancing the state-of-the-art in multimodal language understanding and
reasoning. We describe the creation and characteristics of BLUEX and establish
a benchmark through experiments with state-of-the-art LMs, demonstrating its
potential for advancing the state-of-the-art in natural language understanding
and reasoning in Portuguese. The data and relevant code can be found at
https://github.com/Portuguese-Benchmark-Datasets/BLUE
Assisting Forensic Identification through Unsupervised Information Extraction of Free Text Autopsy Reports: The Disappearances Cases during the Brazilian Military Dictatorship
Anthropological, archaeological, and forensic studies situate enforced disappearance as a strategy associated with the Brazilian military dictatorship (1964–1985), leaving hundreds of persons with neither their identity nor their cause of death established. Their forensic reports are the only existing clue for identifying these people and detecting possible crimes associated with them. The exchange of information among institutions about the identities of disappeared people was not a common practice. Thus, their analysis requires unsupervised techniques, mainly because their contextual annotation is extremely time-consuming, difficult to obtain, and highly dependent on the annotator. These techniques allow researchers to assist identification and analysis in four areas: common causes of death, relevant body locations, terminology of personal belongings, and correlations between actors such as the doctors and police officers involved in the disappearances. This paper analyzes almost 3000 textual reports of missing persons in the city of São Paulo during the Brazilian dictatorship through unsupervised information extraction algorithms for Portuguese, identifying named entities and relevant terminology associated with these four criteria. The analysis allowed us to observe terminological patterns relevant for identification (e.g., the presence of rings or similar personal belongings) and to automate the study of correlations between actors.
The proposed system acts as a first classificatory and indexing middleware for the reports and represents a feasible system that can assist researchers searching for patterns among autopsy reports. This research was partially funded by the Spanish Ministry of Economy, Industry and Competitiveness under its Competitive Juan de la Cierva Postdoctoral Research Programme, grant FJCI-2016-28032, and by the European Union through the Marie Skłodowska-Curie Innovative Training Network 'CHEurope: Critical Heritage Studies and the Future of Europe', H2020 Marie Skłodowska-Curie Actions, grant 722416.
Semi-automatic approaches for exploiting shifter patterns in domain-specific sentiment analysis
This paper describes two different approaches to sentiment analysis. The first is a symbolic approach that exploits a sentiment lexicon together with a set of shifter patterns and rules. The sentiment lexicon includes single words (unigrams) and is developed automatically by exploiting labeled examples. The shifter patterns include intensification, attenuation/downtoning and inversion/reversal, and are developed manually. The second approach exploits a deep neural network, which uses a pre-trained language model. Both approaches were applied to texts in the economics and finance domains from newspapers in European Portuguese. We show that the symbolic approach achieves virtually the same performance as the deep neural network. In addition, the symbolic approach provides understandable explanations, and the acquired knowledge can be communicated to others. We release the shifter patterns to motivate future research in this direction.
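The interplay of a sentiment lexicon with intensifying, attenuating and inverting shifters can be sketched as below. The lexicon entries, shifter words and weights here are invented toy examples, not the paper's released resources, and the one-word scope rule is a simplifying assumption.

```python
# Toy lexicon and shifters (illustrative values, not the paper's resources).
SENTIMENT_LEXICON = {"lucro": 1.0, "crescimento": 1.0, "queda": -1.0, "crise": -1.0}
INTENSIFIERS = {"muito": 1.5, "fortemente": 1.5}   # amplify the next polar word
ATTENUATORS = {"ligeiramente": 0.5}                # downtone the next polar word
INVERTERS = {"não", "sem"}                         # reverse its polarity

def score(tokens: list[str]) -> float:
    """Sum lexicon polarities, letting a shifter modify the next polar word."""
    total = 0.0
    modifier, inverted = 1.0, False
    for tok in tokens:
        t = tok.lower()
        if t in INTENSIFIERS:
            modifier *= INTENSIFIERS[t]
        elif t in ATTENUATORS:
            modifier *= ATTENUATORS[t]
        elif t in INVERTERS:
            inverted = not inverted
        elif t in SENTIMENT_LEXICON:
            polarity = SENTIMENT_LEXICON[t] * modifier
            total += -polarity if inverted else polarity
            modifier, inverted = 1.0, False  # a shifter applies to one word only
    return total

# "sem crise" inverts a negative word, so the phrase scores positive.
print(score("forte crescimento sem crise".split()))  # 2.0
```

One appeal of this style of approach, as the abstract notes, is that the score is fully traceable: each contribution can be pointed back to a lexicon entry and the shifters that touched it.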
Modelling semantic relations with distributional semantics and deep learning: question answering, entailment recognition and paraphrase detection
This dissertation presents an approach to the task of modelling semantic relations between
two texts, which is based on distributional semantic models and deep learning.
The present work takes advantage of various disciplines of cognitive science, mainly
computation, linguistics and artificial intelligence, with strong influences from neuroscience
and cognitive psychology.
Distributional semantic models (also known as word embeddings) are used to
represent the meaning of words. Word semantic representations can be further combined
towards obtaining the meaning of a larger chunk of a text using a deep learning
approach, namely with the support of convolutional neural networks.
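The combination step described above, sliding convolutional filters over a sequence of word vectors and max-pooling the result into a fixed-size representation, can be sketched in pure Python. The tiny embedding dimension, filter weights and window width are arbitrary illustrations, not the dissertation's actual architecture.

```python
import math

def convolve_and_pool(word_vectors, filters, width=2):
    """Slide each filter over windows of `width` consecutive word vectors,
    apply tanh, then max-pool over positions to get one feature per filter."""
    pooled = []
    for f in filters:
        activations = []
        for i in range(len(word_vectors) - width + 1):
            # Concatenate the window's word vectors into one flat vector.
            window = [x for vec in word_vectors[i:i + width] for x in vec]
            activations.append(math.tanh(sum(w * x for w, x in zip(f, window))))
        pooled.append(max(activations))
    return pooled  # fixed length (= number of filters), whatever the sentence length

# 4 tokens with 2-dimensional embeddings; 3 filters, each spanning 2 tokens.
words = [[0.1, 0.3], [0.5, -0.2], [0.0, 0.4], [-0.3, 0.1]]
filters = [[1, 0, 0, 1], [0, 1, 1, 0], [1, 1, -1, -1]]
print(len(convolve_and_pool(words, filters)))  # 3
```

The max-pooling is what makes the representation length-independent, which is what allows two texts of different sizes to be compared in a shared vector space.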
These approaches are used to replicate the experiment carried out by Bogdanova et al. (2015) for the task of detecting questions that can be answered by exactly the same answer in online user forums. The performance results obtained in my experiments are comparable to, or better than, the ones reported in that referenced work.
I also present a study on the impact of appropriate text preprocessing on the results that can be obtained by the approaches adopted in that referenced work. Removing certain clues that can unduly help the system to detect equivalent questions leads to a significant decrease in the performance of the system developed in that referenced work.
I also present a study of the impact that pre-trained word embeddings have on the task of detecting semantically equivalent questions. Replacing pre-trained word embeddings with randomly initialised ones improves the performance of the system.
Additionally, the model was applied to the task of entailment recognition for Portuguese and showed an accuracy on a par with the baseline.
This dissertation also reports on the results of an experimental study applying the adopted approach to the shared task of sentence paraphrase detection in Russian. The final setup contained two improvements: it uses several convolutional filters, and it uses character embeddings instead of word embeddings. It was tested in the standard run of Task 2 of that shared task and showed competitive results.
NILC-Metrix : assessing the complexity of written and spoken language in Brazilian Portuguese
This paper presents and makes publicly available NILC-Metrix, a computational system comprising 200 metrics proposed in studies on discourse, psycholinguistics, and cognitive and computational linguistics, to assess textual complexity in Brazilian Portuguese (BP). These metrics are relevant for descriptive analysis and the creation of computational models, and can be used to extract information from various linguistic levels of written and spoken language. The metrics in NILC-Metrix were developed during the last 13 years, starting in 2008 with Coh-Metrix-Port, a tool developed within the scope of the PorSimples project. Coh-Metrix-Port adapted to BP some metrics from the Coh-Metrix tool, which computes metrics related to the cohesion and coherence of texts in English. After the end of PorSimples in 2010, new metrics were added to the initial 48 metrics of Coh-Metrix-Port. Given the large number of metrics, we present them following an organisation similar to that of Coh-Metrix v3.0, to facilitate comparisons between metrics in Portuguese and English. In this paper, we illustrate the potential of NILC-Metrix by presenting three applications: (i) a descriptive analysis of the differences between children's film subtitles and texts written for Elementary School I and II (Final Years); (ii) a new predictor of textual complexity for the corpus of original and simplified texts of the PorSimples project; (iii) a complexity prediction model for school grades, using transcripts of children's story narratives told by teenagers. For each application, we evaluate which groups of metrics are more discriminative, showing their contribution to each task.