
    Dicionário-Aberto: a source of resources for the Portuguese language processing

    In this paper we describe how Dicionário-Aberto, an online dictionary for the Portuguese language, is being used as the base for constructing diverse resources that are relevant to the processing of the Portuguese language. We briefly present its history, explaining how we got here. Then we describe the resources already available for download and use, followed by a discussion of the resources currently under development.

    Retreading dictionaries for the 21st Century

    Even in the 21st century, paper dictionaries are still compiled and developed using standard word processors. Many publishing companies are nowadays working on converting their dictionaries into computer-readable documents, so that they can be used to prepare new features, such as making them available online. Luckily, most of these publishers can pay review teams to fix and even enhance these dictionaries. Unfortunately, research institutions cannot hire workers on that scale. In this article we present the process of retreading a Galician dictionary that was first developed and compiled using Microsoft Word. This dictionary was converted, through automatic rewriting, into a Text Encoding Initiative (TEI) schema subset. This process is detailed, and the problems found are discussed. Given a recent norm that changed the Galician orthography, the dictionary has also undergone a semi-automatic modernization process. Finally, two applications for the obtained dictionaries are shown.
    This work was partially supported by Grant TIN2012-38584-C06-04 of the Ministry of Economy and Competitiveness of the Spanish Government, “Adquisición de escenarios de conocimiento a través de la lectura de textos: Desarrollo y aplicación de recursos para el procesamiento lingüístico del gallego (SKATeR-UVIGO)”; and by the Xunta de Galicia through the “Rede de Lexicografía (Relex)” (Grant CN 2012/290) and the “Rede de Tecnoloxías e análise dos datos lingüísticos” (Grant CN 2012/179).
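    The retreading described above is essentially a rewriting task: word-processor entries are parsed and re-emitted as TEI XML. The following is a minimal sketch of that idea, assuming a heavily simplified input layout of "headword|pos|definition" per line (the real Word-exported dictionary carries richer formatting cues); the element names are from the TEI dictionary module, but the mapping is illustrative, not the paper's actual schema subset.

```python
import xml.etree.ElementTree as ET

# Hypothetical simplified input: one entry per line, "headword|pos|definition".
# The real Word-exported dictionary uses richer cues (bold, italics, numbering).
def entry_to_tei(line):
    headword, pos, definition = (field.strip() for field in line.split("|", 2))
    entry = ET.Element("entry")
    form = ET.SubElement(entry, "form", type="lemma")
    ET.SubElement(form, "orth").text = headword
    gram_grp = ET.SubElement(entry, "gramGrp")
    ET.SubElement(gram_grp, "pos").text = pos
    sense = ET.SubElement(entry, "sense")
    ET.SubElement(sense, "def").text = definition
    return ET.tostring(entry, encoding="unicode")

if __name__ == "__main__":
    print(entry_to_tei("cadeira|s.f.|peza de mobiliario que serve para sentar"))
```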

    Dictionary alignment by rewrite-based entry translation

    Series: OASIcs - Open Access Series in Informatics, ISSN 2190-6807, vol. 29. In this document we describe the process of aligning two standard monolingual dictionaries: a Portuguese language dictionary and a Galician synonym dictionary. The main goal of the project is to provide an online dictionary that can show, in parallel, definitions and synonyms in Portuguese and Galician for a specific word, written in Portuguese or Galician. These two languages are very close to each other, and that is the main reason we expect this idea to be viable. The main drawback is the lack of a good and free translation dictionary between these two languages, namely one that covers a lexicon of more than one hundred thousand different words. To solve this issue we defined a translation function, based on substitutions, that achieves an F1 score of 0.88 on a manually verified dictionary of nine thousand words. Using this same translation function to align a Portuguese–Galician dictionary, we were able to align almost 50% of the dictionary lexicon (more than eighty thousand words).
    This work was partially supported by Grant TIN2012-38584-C06-04 of the Ministry of Economy and Competitiveness of the Spanish Government, “Adquisición de escenarios de conocimiento a través de la lectura de textos: Desarrollo y aplicación de recursos para el procesamiento lingüístico del gallego (SKATeR-UVIGO)”; by the Xunta de Galicia through the “Rede de Lexicografía (Relex)” (Grant CN 2012/290) and the “Rede de Tecnoloxías e análise dos datos lingüísticos” (Grant CN 2012/179); and by the Per-Fide project (grant reference PTDC/CLEL-LI/108948/2008, from the Portuguese Foundation for Science and Technology, co-funded by the European Regional Development Fund).
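    The abstract does not list the actual rewrite rules, so the sketch below only illustrates the idea of a substitution-based translation function: a cascade of orthographic replacements maps a Portuguese word form to a candidate Galician form, which is then looked up in the target lexicon to align entries. The specific rules shown (e.g. -ção → -ción, lh → ll, nh → ñ) are common correspondences used as assumptions, not the published rule set.

```python
import re

# Illustrative Portuguese -> Galician orthographic rewrite rules (assumptions,
# not the rule set from the paper). Applied in order as regex substitutions.
PT2GL_RULES = [
    (r"ção\b", "ción"),
    (r"ções\b", "cións"),
    (r"lh", "ll"),
    (r"nh", "ñ"),
    (r"ã\b", "á"),
]

def pt_to_gl(word):
    """Rewrite a Portuguese word form into a candidate Galician form."""
    candidate = word.lower()
    for pattern, replacement in PT2GL_RULES:
        candidate = re.sub(pattern, replacement, candidate)
    return candidate

def align(pt_lexicon, gl_lexicon):
    """Align PT headwords with GL headwords via the rewrite-based translation."""
    gl_set = set(gl_lexicon)
    return {pt: pt_to_gl(pt) for pt in pt_lexicon if pt_to_gl(pt) in gl_set}

if __name__ == "__main__":
    # Expected: {'nação': 'nación', 'milho': 'millo', 'vinho': 'viño'}
    print(align(["nação", "milho", "vinho"], ["nación", "millo", "viño"]))
```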

    As Wordnets do Português

    Series: "Oslo Studies in Language". ISSN 1890-9639. 7(1), 2015.Not many years ago it was usual to comment on the lack of an open lexical- semantic knowledge base, following the lines of Princeton WordNet, but for Portuguese. Today, the landscape has changed significantly, and re- searchers that need access to this specific kind of resource have not one, but several alternatives to choose from. The present article describes the wordnet-like resources currently available for Portuguese. It provides some context on their origin, creation approach, size and license for utilization. Apart from being an obvious starting point for those looking for a computational resource with information on the meaning of Portuguese words, this article describes the resources available, compares them and lists some plans for future work, sketching ideas for potential collaboration between the projects described.CLUPFundação para a Ciência e a Tecnologia (FCT

    LeMe-PT: A Medical Package Leaflet Corpus for Portuguese

    The current trend in natural language processing is the use of machine learning. This is happening in every field, from summarization to machine translation. For these techniques to be applied, resources are needed, namely quality corpora. While there are large quantities of corpora for the Portuguese language, there is a lack of technical, domain-focused corpora. Therefore, in this article we present a new corpus, built from drug package leaflets. We describe its structure and contents, and discuss possible exploration directions.

    Acquiring Domain-Specific Knowledge for WordNet from a Terminological Database

    In this research we explore a terminological database (Termoteca) in order to expand the Portuguese and Galician wordnets (PULO and Galnet) with new synset variants (word forms for a concept), usage examples for the variants, and synset glosses or definitions. The methodology applied in this experiment is based on the alignment between WordNet concepts (synsets) and concepts described in Termoteca (terminological records), taking into account the lexical forms in both resources, their morphological category and their knowledge domains. The information provided by the WordNet Domains Hierarchy and the Termoteca field domains is used to reduce the incidence of polysemy and homography in the results of the experiment. The results obtained confirm our hypothesis that the combined use of the semantic domain information included in both resources makes it possible to minimise the problem of lexical ambiguity and to obtain a very acceptable precision in terminological information extraction tasks, reaching a precision above 89% when two or more different languages share at least one lexical form between the Galnet synset and the Termoteca record.
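    A minimal sketch of the alignment idea follows: candidate pairs are kept only when they share a lexical form, a morphological category and a knowledge domain. The record layout, identifiers and domain labels are illustrative assumptions; the actual Galnet/PULO synsets and Termoteca records are richer, and the mapping between WordNet Domains and Termoteca field domains is not reproduced here.

```python
from dataclasses import dataclass

# Simplified, illustrative views of a wordnet synset and a Termoteca record.
@dataclass
class Synset:
    synset_id: str
    pos: str          # morphological category, e.g. "n"
    variants: set     # lexical forms (variants) already in the synset
    domains: set      # WordNet Domains labels

@dataclass
class TermRecord:
    lemma: str
    pos: str
    domain: str       # Termoteca field domain, assumed mapped to WN Domains
    definition: str
    example: str

def align(synsets, records):
    """Pair records with synsets sharing a lexical form, POS and domain."""
    matches = []
    for rec in records:
        for syn in synsets:
            if (rec.pos == syn.pos
                    and rec.lemma in syn.variants
                    and rec.domain in syn.domains):
                matches.append((syn.synset_id, rec))
    return matches

if __name__ == "__main__":
    synsets = [Synset("glg-00123n", "n", {"célula"}, {"biology"})]
    records = [TermRecord("célula", "n", "biology",
                          "unidade estrutural e funcional dos seres vivos",
                          "a célula divide-se por mitose")]
    for synset_id, rec in align(synsets, records):
        # A matched record can contribute a gloss and a usage example to the synset.
        print(synset_id, "<-", rec.definition)
```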

    Assessing Lexical-Semantic Regularities in Portuguese Word Embeddings

    Models of word embeddings are often assessed by solving syntactic and semantic analogies. Among the latter, we are interested in relations that one would find in lexical-semantic knowledge bases like WordNet, also covered by some analogy test sets for English. Briefly, this paper aims to study how well pretrained Portuguese word embeddings capture such relations. For this purpose, we created a new test, dubbed TALES, with an exclusive focus on Portuguese lexical-semantic relations acquired from lexical resources. With TALES, we analyse the performance of methods previously used for solving analogies on different models of Portuguese word embeddings. Accuracies were clearly below the state of the art in analogies of other kinds, which shows that TALES is a challenging test, mainly due to the nature of lexical-semantic relations: there are many instances sharing the same argument, thus allowing for several correct answers, sometimes too many to all be included in the dataset. We further inspect the results of the best-performing combination of method and model and find that some acceptable answers had been considered incorrect. This was mainly due to the lack of coverage by the source lexical resources, and it suggests that word embeddings may be a useful source of information for enriching those resources, something we also discuss.
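    The analogy-solving methods themselves are not spelled out in the abstract. The sketch below shows one standard approach, the vector-offset (3CosAdd) method, applied to a hypothetical TALES-style item with gensim and a pretrained Portuguese embedding model; the model path and the example relation pair are assumptions for illustration.

```python
from gensim.models import KeyedVectors

# Path to a pretrained Portuguese embedding model in word2vec text format (assumed).
MODEL_PATH = "portuguese_embeddings.vec"

def solve_analogy(kv, a, b, c, topn=10):
    """3CosAdd: return candidate words d such that a : b :: c : d."""
    return kv.most_similar(positive=[b, c], negative=[a], topn=topn)

if __name__ == "__main__":
    kv = KeyedVectors.load_word2vec_format(MODEL_PATH)
    # Hypothetical hypernymy item: "cão" is a kind of "animal"; what is "rosa" a kind of?
    for candidate, score in solve_analogy(kv, "cão", "animal", "rosa"):
        print(f"{candidate}\t{score:.3f}")
    # Lexical-semantic relations admit several correct answers (e.g. "flor", "planta"),
    # so evaluation should accept any of them, as discussed in the paper.
```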

    Planning non-existent dictionaries

    In 2013, a conference entitled Planning non-existent dictionaries was held at the University of Lisbon. Scholars and lexicographers were invited to present and submit for discussion their research and practices, focusing on aspects that are traditionally perceived as shortcomings by dictionary makers and dictionary users. This book contains a collection of papers divided into three sections. The first section is devoted to heritage dictionaries, referring to lexicographic projects that aim to register all the documented words in a language, particularly those that can be described as early linguistic evidence. The second section is devoted to dictionaries for special purposes and gathers papers that describe innovative lexicographic projects. The last section in this volume provides an overview of contemporary e-lexicography projects.

    Extração automática de documentos médicos da web para análise textual

    Master's dissertation in Biomedical Engineering (specialization in Medical Informatics). The scientific literature in biomedicine is a fundamental element in the process of obtaining knowledge, since it is the largest and most reliable source of information. With technological advances and increasing professional competition, the volume and diversity of scientific medical documents have increased considerably, preventing researchers from keeping up with the growth of the bibliography. To circumvent this situation and reduce the time spent by professionals on data extraction and literature review, the concepts of web crawling, web scraping and natural language processing have emerged; these allow, respectively, the automatic search, extraction and processing of large amounts of text, covering a wider range of scientific documents than those normally analysed manually.
    The work developed for this dissertation focused on the crawling and collection of complete scientific documents from the field of biomedicine. As most web repositories do not provide the full document for free, but only the abstract of the publication, it was important to select an appropriate source. For this reason, the crawled web pages were restricted to the domain of the BioMed Central repositories, which make thousands of full scientific documents in the field of biomedicine freely available. The architecture of the developed system is divided into two main parts: an online phase and an offline phase. The first includes searching for and extracting the URLs of the candidate pages, collecting the desired text fields and storing them in a database. The second phase consists of the handling and cleaning of the collected documents, leaving them in a structured, valid format ready to be used as input to any text analysis system. For the first part, the Scrapy framework was used as the basis for building the scraper, and the MongoDB document database for storing the collected scientific publications. In the second step of the process, the application of data cleaning and standardization techniques, several of the libraries and functionalities offered by the Python language were used. To demonstrate the operation of the document extraction and processing system, the practical case of collecting scientific publications related to Obsessive-Compulsive Disorders was studied. As a result of the entire procedure, a database with four document collections at different levels of processing was obtained.
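    A minimal sketch of the online phase described above follows, assuming a Scrapy spider that crawls article pages and an item pipeline that stores the extracted fields in MongoDB (the pipeline would be enabled through ITEM_PIPELINES in the project settings). The start URL, CSS selectors, database and collection names are illustrative assumptions, not the dissertation's actual configuration.

```python
import scrapy
from pymongo import MongoClient

class ArticleSpider(scrapy.Spider):
    """Crawls BioMed Central-style article pages and yields title/abstract/body."""
    name = "biomed_articles"
    # Illustrative start URL; the real crawl is restricted to BioMed Central domains.
    start_urls = ["https://example.biomedcentral.com/articles"]

    def parse(self, response):
        # Follow links to individual article pages (selector is an assumption).
        for href in response.css("a.article-link::attr(href)").getall():
            yield response.follow(href, callback=self.parse_article)

    def parse_article(self, response):
        yield {
            "url": response.url,
            "title": response.css("h1::text").get(),
            "abstract": " ".join(response.css("section#Abs1 ::text").getall()),
            "body": " ".join(response.css("article p::text").getall()),
        }

class MongoPipeline:
    """Item pipeline storing each scraped article in a MongoDB collection."""
    def open_spider(self, spider):
        self.client = MongoClient("mongodb://localhost:27017")
        self.collection = self.client["biomed"]["raw_articles"]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.collection.insert_one(dict(item))
        return item
```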