23 research outputs found

REPENTINO - A Wide-Scope Gazetteer for Entity Recognition in Portuguese

Author: Cabral Luís
Pinto Ana Sofia
Sarmento Luís
Publication venue: Springer Verlag
Publication date: 13/01/2009

Repositório Comum

BACO - A large database of text and co-occurrences

Author: Sarmento Luís
Publication date: 06/11/2008

Repositório Comum

Web 2.0, language resources and standards to automatically build a multilingual named entity lexicon

Author: Ferrández Sergio
Monachini Monica
Muñoz Rafael
Toral Antonio
Publication venue: Springer Science and Business Media LLC
Publication date: 17/06/2011

This paper proposes to advance the current state of the art in automatic Language Resource (LR) building by taking into consideration three elements: (i) the knowledge available in existing LRs, (ii) the vast amount of information available from the collaborative paradigm that has emerged from Web 2.0 and (iii) the use of standards to improve interoperability. We present a case study in which a set of LRs for different languages (WordNet for English and Spanish and Parole-Simple-Clips for Italian) are extended with Named Entities (NE) by exploiting Wikipedia and the aforementioned LRs. The practical result is a multilingual NE lexicon connected to these LRs and to two ontologies: SUMO and SIMPLE. Furthermore, the paper addresses interoperability, an important problem that currently affects the Computational Linguistics area, by making use of the ISO LMF standard to encode this lexicon. The different steps of the procedure (mapping, disambiguation, extraction, NE identification and postprocessing) are comprehensively explained and evaluated. The resulting resource contains 974,567, 137,583 and 125,806 NEs for English, Spanish and Italian respectively. Finally, in order to check the usefulness of the constructed resource, we apply it to a state-of-the-art Question Answering system and evaluate its impact; the NE lexicon improves the system's accuracy by 28.1%. Compared to previous approaches to building NE repositories, the current proposal represents a step forward in terms of automation, language independence, number of NEs acquired and richness of the information represented.

DCU Online Research Access Service

Portuguese corpus-based learning using ETL

Author: Cícero Nogueira dos Santos
Julio Cesar Duarte
Ruy Luiz Milidiú
Publication venue: FapUNIFESP (SciELO)

Crossref

Esfinge - Resposta a perguntas usando a Rede

Author: Costa Luís
Publication venue: IADIS Press
Publication date: 28/11/2008

Repositório Comum

Resumo da actividade da Linguateca de 15 de Maio de 2003 a 15 de Dezembro de 2006

Author: Santos Diana
Publication date: 15/10/2009

Repositório Comum

Web 2.0, Language Resources and standards to automatically build a multilingual Named Entity Lexicon

Author: Antonio Toral
Monica Monachini
Rafael Muñoz
Sergio Ferrández
Publication venue: Springer Science and Business Media LLC

Crossref

Uma revisão para o Reconhecimento de Entidades Nomeadas aplicado à língua portuguesa

Author: Andressa Vieira e Silva
Publication venue: Universidade do Minho & Universidade de Vigo
Publication date: 01/12/2023

Named Entity Recognition (NER) is the task of automatically identifying and classifying entities in a text, such as the names of people, places and organizations. It is an important task in Natural Language Processing, serving as the basis for a range of applications such as machine translation and question-answering systems. Since its emergence in the 1990s, the task has gone through several phases in terms of computational approach, from systems based on hand-written rules to neural network models. This article reviews the NER task with a focus on applications to Portuguese-language texts. It presents a general overview of the task, tracing the history of the main initiatives to promote it, the linguistic and computational resources available, and the approaches already evaluated for NER in Portuguese. Finally, it discusses the overall state of the task and offers concluding remarks.

Directory of Open Access Journals

Estudando os autores: Trabalho referente à colaboração com o RCAAP

Author: Ribeiro Fernando
Santos Diana
Publication venue: Linguateca
Publication date: 01/01/2010

Repositório Comum

Discovery of sensitive data with natural language processing

Author: Dias Mariana Rebelo
Publication date: 18/12/2019

The process of protecting sensitive data is continually growing and becoming increasingly important, especially as a result of the directives and laws imposed by the European Union. The effort to create automatic systems is continuous, but in most cases, the processes behind them are still manual or semi-automatic. In this work, we have developed a component that can extract and classify sensitive data from unstructured text in European Portuguese. The objective was to create a system that allows organizations to understand their data and comply with legal and security requirements. We studied a hybrid approach to the problem of Named Entity Recognition for the Portuguese language. This approach combines several techniques: rule-based and lexicon-based models, machine learning algorithms and neural networks. The rule-based and lexicon-based approaches were used only for a set of specific classes. For the remaining classes of entities, the SpaCy and Stanford NLP tools were tested, two statistical models (Conditional Random Fields and Random Forest) were implemented and, finally, a Bidirectional LSTM approach was tested. Among the tools, the best results were achieved with the Stanford NER model (86.41%) from the Stanford NLP tool. Regarding the statistical models, Conditional Random Fields obtained the best results, with an F1-score of 65.50%. With the Bi-LSTM approach, we achieved an F1-score of approximately 83.01%. The corpora used for training and testing were the HAREM Golden Collection, the SIGARRA News Corpus and the DataSense NER Corpus.

Repositório Institucional do ISCTE-IUL