399 research outputs found

    Enriching the 1758 Portuguese Parish Memories (Alentejo) with Named Entities

    This work presents an enriched version of the Parish Memories (1758–1761), an essential Portuguese historical source that has been manually transcribed, now augmented with annotations of named entities of the types PERSON, LOCATION, and ORGANIZATION. The annotation was done automatically for the whole collection, while two researchers manually annotated a portion of it for evaluation purposes. In this dataset, we provide the tagged texts, the lists of extracted entities, and frequency counts. The corpus is useful for historians, allowing, for instance, comparative analyses between parishes and regions, or the calculation of a locality's area of influence. The paper describes the creation and evaluation of the corpus and discusses its applications and limitations. This first release may be improved by other researchers interested in the historical source itself or in the technology employed in its annotation. FCT CEECIND/01997/2017, UIDB/00057/202
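    The frequency counts distributed with the corpus can be derived directly from the tagged texts. The sketch below illustrates the idea with an assumed inline-tag markup; the actual annotation format used in the released dataset may differ.

```python
import re
from collections import Counter

# A minimal sketch of deriving frequency counts from entity-tagged text.
# The inline-tag format (<PERSON>...</PERSON> etc.) is an assumption for
# illustration, not necessarily the markup used in the released corpus.
TAG_RE = re.compile(r"<(PERSON|LOCATION|ORGANIZATION)>(.*?)</\1>")

def entity_counts(tagged_text: str) -> Counter:
    """Count (type, surface form) pairs found in a tagged parish record."""
    return Counter(TAG_RE.findall(tagged_text))

sample = ("A freguesia de <LOCATION>Evora</LOCATION> pertence ao "
          "<ORGANIZATION>Arcebispado</ORGANIZATION>, relata "
          "<PERSON>Manuel Pereira</PERSON> sobre <LOCATION>Evora</LOCATION>.")

counts = entity_counts(sample)
print(counts[("LOCATION", "Evora")])  # 2
```

    Aggregating such counts per parish is what enables the comparative analyses between parishes and regions mentioned above.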

    Embeddings for Named Entity Recognition in Geoscience Portuguese Literature

    This work focuses on Portuguese Named Entity Recognition (NER) in the Geology domain. The only domain-specific dataset in the Portuguese language annotated for Named Entity Recognition is the GeoCorpus. Our approach relies on Bidirectional Long Short-Term Memory - Conditional Random Fields neural networks (BiLSTM-CRF), a widely used type of network in this area of research, that use vector and tensor embedding representations. We used three types of embedding models (Word Embeddings, Flair Embeddings, and Stacked Embeddings) in two versions (domain-specific and generalized). We originally trained the domain-specific Flair Embeddings model with a generalized context in mind, but we fine-tuned it with domain-specific Oil and Gas corpora, as there were simply not enough domain corpora to properly train such a model. We evaluated each of these embeddings separately, as well as stacked with another embedding. Finally, we achieved state-of-the-art results for this domain with one of our embeddings, and we performed an error analysis on the language model that achieved the best results. Furthermore, we investigated the effects of domain-specific versus generalized embeddings. UIDB/00057/2020, CEECIND/01997/201
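    The "Stacked Embeddings" mentioned above combine several embedding models by concatenating their per-token vectors, so the stacked dimensionality is the sum of the parts. A toy sketch (the vectors here are hand-made; real word and Flair vectors have hundreds of dimensions):

```python
# A minimal sketch of embedding stacking: per-token vectors from each
# underlying model are concatenated. Toy values for illustration only.
def stack_embeddings(*vectors: list) -> list:
    """Concatenate per-token vectors from several embedding models."""
    stacked = []
    for v in vectors:
        stacked.extend(v)
    return stacked

word_vec = [0.1, 0.2, 0.3]   # e.g. a classic word embedding for one token
flair_vec = [0.4, 0.5]       # e.g. a contextual character-LM embedding
combined = stack_embeddings(word_vec, flair_vec)
print(combined)  # [0.1, 0.2, 0.3, 0.4, 0.5]
```

    The BiLSTM-CRF then consumes the concatenated vector for each token, which is why stacking a contextual embedding with a classic one often helps: the tagger sees both context-sensitive and distributional signals.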

    Word Embedding Evaluation in Downstream Tasks and Semantic Analogies

    Language Models have long been a prolific area of study in the field of Natural Language Processing (NLP). One of the newer and most widely used kinds of language models are Word Embeddings (WE). WE are vector-space representations of a vocabulary, learned by an unsupervised neural network from the contexts in which words appear. WE have been widely used in downstream tasks in many areas of NLP, which usually use these vector models as features in the processing of textual data. This paper presents the evaluation of newly released WE models for the Portuguese language, trained on a corpus of 4.9 billion tokens. The first evaluation was an intrinsic task in which the WEs had to correctly build semantic and syntactic relations. The second evaluation was an extrinsic one in which the WE models were used in two downstream tasks: Named Entity Recognition and Semantic Similarity between Sentences. Our results show that a diverse and comprehensive corpus can often outperform a larger, less textually diverse corpus, and that passing the text in parts to the WE-generating algorithm may cause loss of quality.
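    The intrinsic analogy evaluation mentioned above works by vector arithmetic: for an analogy a : b :: c : ?, the candidate vector is b - a + c and the predicted answer is the nearest vocabulary word by cosine similarity. A toy sketch with hand-made 3-d vectors (not real trained embeddings):

```python
import math

# A toy sketch of the semantic-analogy evaluation for word embeddings.
# Vectors are hand-crafted for illustration; trained WEs are much larger.
def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) *
                  math.sqrt(sum(x * x for x in v)))

def solve_analogy(vocab, a, b, c):
    """Answer a : b :: c : ? by nearest neighbour of b - a + c."""
    target = [vb - va + vc for va, vb, vc in zip(vocab[a], vocab[b], vocab[c])]
    candidates = {w: v for w, v in vocab.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(candidates[w], target))

vocab = {
    "rei":    [1.0, 1.0, 0.0],   # king
    "homem":  [1.0, 0.0, 0.0],   # man
    "mulher": [0.0, 0.0, 1.0],   # woman
    "rainha": [0.0, 1.0, 1.0],   # queen
    "banana": [5.0, -3.0, 2.0],
}
print(solve_analogy(vocab, "homem", "rei", "mulher"))  # rainha
```

    A model's intrinsic score is simply the fraction of such analogies it answers correctly over a benchmark set.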

    Recognizing Emotions in Short Texts

    Master's thesis, Cognitive Science, Universidade de Lisboa, Faculdade de Ciências, 2022. Automatic emotion recognition from text is a task that mobilizes the areas of natural language processing and affective computing, with special contributions from Cognitive Science disciplines such as Artificial Intelligence and Computer Science, Linguistics, and Psychology. It aims at the detection and interpretation by computational systems of human emotions expressed in written form. 
    The interaction of affective and cognitive processes, the essential role that emotions play in interpersonal interactions, and the currently increasing use of written communication online make automatic emotion recognition progressively more important, namely in areas such as mental healthcare, human-computer interaction, political science, and marketing. The English language has been the main target of studies in emotion recognition in text, and the work developed for the Portuguese language is still scarce; thus, there is a need to expand the work developed for English to Portuguese. The goal of this dissertation is to present and compare two distinct deep learning methods resulting from advances in Artificial Intelligence to automatically detect and classify discrete emotional states in texts written in Portuguese. For this, the classification approach of Polignano et al. (2019), based on deep learning networks such as bidirectional Long Short-Term Memory and convolutional networks mediated by a self-attention level, will be replicated for English and reproduced for Portuguese. For English, the SemEval-2018 Task 1 dataset (Mohammad et al., 2018) will be used, as in the original experiment; it considers four discrete emotions: anger, fear, joy, and sadness. For Portuguese, given the lack of available emotionally annotated datasets, data will be collected from the social network Twitter using hashtags associated with specific emotional content to determine the underlying emotion of the text, from the same four emotions present in the English dataset. According to experiments carried out by Mohammad & Kiritchenko (2015), this method of data collection is consistent with the annotation of trained human judges. 
    Considering the fast and continuous evolution of deep learning methods for natural language processing and the state-of-the-art results achieved by recent methods such as the pre-trained language model BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2019), this approach will also be applied to the task of emotion recognition for both languages, using the same datasets as the previous experiments. These experiments are expected to support conclusions about the adequacy of the two approaches to emotion recognition and to contribute to the state of the art in this task for the Portuguese language. While the approach of Polignano et al. performed better in the experiments with English data, with a difference in F1 score of 0.02, for Portuguese the best result was obtained with BERT, with a maximum F1 score of 0.6124.
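    The hashtag-based data collection described above amounts to distant labeling: a tweet is labeled by the emotion-bearing hashtag it carries, and the hashtag is then removed so a model cannot trivially read the label from the text. A minimal sketch (the hashtag-to-emotion mapping is illustrative, not the thesis's exact list):

```python
# A minimal sketch of distant labeling of tweets via emotion hashtags.
# The hashtag list is an illustrative assumption, not the exact mapping
# used in the dissertation.
EMOTION_HASHTAGS = {
    "#raiva": "anger", "#medo": "fear",
    "#alegria": "joy", "#tristeza": "sadness",
}

def label_tweet(text: str):
    """Return (cleaned_text, label), or None if no emotion hashtag is found."""
    for tag, emotion in EMOTION_HASHTAGS.items():
        if tag in text.lower():
            cleaned = " ".join(w for w in text.split() if w.lower() != tag)
            return cleaned, emotion
    return None

print(label_tweet("Ganhei o jogo hoje! #alegria"))
# ('Ganhei o jogo hoje!', 'joy')
```

    Mohammad & Kiritchenko (2015) found labels obtained this way to be broadly consistent with annotations by trained human judges, which is what justifies the approach when no annotated Portuguese dataset exists.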

    BERTimbau: pre-trained BERT models for Brazilian Portuguese

    Advisors: Roberto de Alencar Lotufo, Rodrigo Frassetto Nogueira. Master's dissertation, Universidade Estadual de Campinas, Faculdade de Engenharia Elétrica e de Computação. Recent advances in language representation using neural networks and deep learning have made it viable to transfer the learned internal states of large pretrained language models (LMs) to downstream natural language processing (NLP) tasks. This transfer learning approach improves the overall performance on many tasks and is highly beneficial when labeled data is scarce, making pretrained LMs valuable resources, especially for languages with few annotated training examples. 
    In this work, we train BERT (Bidirectional Encoder Representations from Transformers) models for Brazilian Portuguese, which we nickname BERTimbau. We evaluate our models on three downstream NLP tasks: sentence textual similarity, recognizing textual entailment, and named entity recognition. Our models improve the state of the art in all of these tasks, outperforming Multilingual BERT and confirming the effectiveness of large pretrained LMs for Portuguese. We release our models to the community, hoping to provide strong baselines for future NLP research.

    Digital Humanities and Portuguese Processing: a research pathway

    This paper reflects on the whole path of work in digital humanities, in light of the projects related to text processing under development at CIDEHUS. These projects deal with a rich heritage related to Portuguese culture, history, and language. The paper discusses the many challenges to be faced and how NLP techniques may broaden the capabilities of organising and sharing knowledge related to these resources.

    Fake news classification in European Portuguese language

    All over the world, many initiatives have been taken to fight fake news. Governments (e.g., France, Germany, the United Kingdom, and Spain), in their own ways, have started to take action regarding legal accountability for those who manufacture or propagate fake news. Different media outlets have also taken plenty of initiatives to deal with this phenomenon, such as increasing the discipline, accuracy, and transparency of publications made internally. Some structural changes have been made in those companies and in other entities in order to evaluate news in general. Many teams were built entirely to fight fake news, the so-called "fact-checkers". Those teams have been adopting different types of techniques for these tasks: from the typical use of journalists to find out the truth behind a controversial statement, to data scientists applying forefront techniques such as text mining and machine learning to support journalists' decisions. Many of those entities, aiming to maintain or raise their reputation, started to focus on high standards of quality and reliable information, which led to the creation of official, dedicated fact-checking departments. In the first part of this work, we contextualize the European Portuguese language regarding fake news detection and classification against the current state of the art. Then, we present an end-to-end solution to easily extract and store previously classified European Portuguese news. We used the extracted data to apply some of the most used text mining and machine learning techniques presented in the current state of the art, in order to understand and evaluate possible limitations of those techniques in this specific context.
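    One of the classic machine learning techniques from the pipeline described above is a multinomial Naive Bayes classifier over bag-of-words counts. A toy sketch with a hand-made "dataset" (illustrative only; the actual experiments use the extracted, previously fact-checked Portuguese news):

```python
import math
from collections import Counter, defaultdict

# A toy multinomial Naive Bayes text classifier with Laplace smoothing.
# The tiny token lists below are illustrative, not real news data.
def train(samples):
    """samples: list of (tokens, label). Returns (priors, word_counts, vocab)."""
    label_totals, word_counts, vocab = Counter(), defaultdict(Counter), set()
    for tokens, label in samples:
        label_totals[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_totals, word_counts, vocab

def predict(tokens, label_totals, word_counts, vocab):
    n = sum(label_totals.values())
    best, best_lp = None, -math.inf
    for label, count in label_totals.items():
        lp = math.log(count / n)  # class prior
        total = sum(word_counts[label].values()) + len(vocab)
        for tok in tokens:
            lp += math.log((word_counts[label][tok] + 1) / total)  # smoothed likelihood
        if lp > best_lp:
            best, best_lp = label, lp
    return best

data = [
    (["cura", "milagrosa", "segredo"], "fake"),
    (["milagrosa", "conspiracao"], "fake"),
    (["governo", "aprova", "orcamento"], "real"),
    (["parlamento", "aprova", "lei"], "real"),
]
model = train(data)
print(predict(["segredo", "milagrosa"], *model))  # fake
```

    Real pipelines replace the raw counts with TF-IDF weighting and stronger classifiers, but the train/predict structure is the same.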

    Translation Alignment Applied to Historical Languages: methods, evaluation, applications, and visualization

    Translation alignment is an essential task in Digital Humanities and Natural Language Processing; it aims to link words and phrases in a source text with their equivalents in the translation. In addition to its importance in teaching and learning historical languages, translation alignment builds bridges between ancient and modern languages through which various linguistic annotations can be transferred. This thesis focuses on word-level translation alignment applied to historical languages in general, and to Ancient Greek and Latin in particular. As the title indicates, the thesis addresses four interdisciplinary aspects of translation alignment. The starting point was developing Ugarit, an interactive annotation tool for manual alignment, aiming to gather training data for an automatic alignment model. This effort resulted in more than 190k accurate translation pairs that I later used for supervised training. Ugarit has been used by many researchers and scholars, including in the classroom at several institutions for teaching and learning ancient languages, which resulted in a large, diverse, crowd-sourced aligned parallel corpus, allowing us to conduct experiments and qualitative analyses to detect recurring patterns in annotators' alignment practice and in the generated translation pairs. Further, I employed recent advances in NLP and language modeling to develop an automatic alignment model for historical low-resource languages, experimenting with various training objectives and proposing a training strategy for historical languages that combines supervised and unsupervised training with mono- and multilingual texts. Then, I integrated this alignment model into other development workflows to project cross-lingual annotations and induce bilingual dictionaries from parallel corpora. Evaluation is essential to assess the quality of any model. 
    To ensure best practice, I reviewed the current evaluation procedure, identified its limitations, and proposed two new evaluation metrics. Moreover, I introduced a visual analytics framework to explore and inspect alignment gold-standard datasets and to support quantitative and qualitative evaluation of translation alignment models. In addition, I designed and implemented visual analytics tools and reading environments for parallel texts, and proposed various visualization approaches to support different alignment-related tasks, employing the latest advances in information visualization. Overall, this thesis presents a comprehensive study that includes manual and automatic alignment techniques, evaluation methods, and visual analytics tools, all aiming to advance the field of translation alignment for historical languages.
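    At its core, word-level alignment scores every source/target word pair and then selects a consistent set of high-scoring pairs. The sketch below uses a hand-made bilingual lexicon as the scoring function and a greedy one-to-one selection; the actual model described above instead scores pairs with multilingual language-model embeddings.

```python
# A toy sketch of word-level translation alignment: score all word pairs,
# then greedily pick the best non-conflicting ones. The lexicon-based
# scorer is an illustrative stand-in for embedding similarity.
LEXICON = {("amor", "love"), ("vincit", "conquers"), ("omnia", "all")}

def score(src: str, tgt: str) -> float:
    return 1.0 if (src, tgt) in LEXICON else 0.0

def greedy_align(source, target):
    """Return one-to-one (source_word, target_word) pairs, best score first."""
    pairs = sorted(((score(s, t), i, j) for i, s in enumerate(source)
                    for j, t in enumerate(target)), reverse=True)
    used_s, used_t, alignment = set(), set(), []
    for sc, i, j in pairs:
        if sc > 0 and i not in used_s and j not in used_t:
            alignment.append((source[i], target[j]))
            used_s.add(i)
            used_t.add(j)
    return alignment

print(greedy_align(["omnia", "vincit", "amor"], ["love", "conquers", "all"]))
# [('amor', 'love'), ('vincit', 'conquers'), ('omnia', 'all')]
```

    Gold-standard evaluation then compares such predicted pairs against manually annotated ones, which is exactly where the alignment error metrics discussed in the thesis come in.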