Improving NLTK for Processing Portuguese
Python has a growing community of users, especially in the AI and ML fields. Yet computational processing of Portuguese in this programming language remains limited, in both available tools and results. This paper describes NLPyPort, an NLP pipeline in Python, primarily based on NLTK and focused on Portuguese. It is mostly assembled from pre-existing resources or their adaptations, but improves over the performance of existing alternatives in Python, namely in the tasks of tokenization, PoS tagging, lemmatization, and NER.
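To make the tokenization task concrete, the sketch below shows a minimal rule-based tokenizer that splits off punctuation and expands common Portuguese contractions. It is an illustration of the task only, not NLPyPort's actual implementation, which builds on NLTK; the contraction list is an assumption for the example.

```python
import re

def tokenize(text):
    """Minimal rule-based tokenizer: splits off punctuation and expands
    common Portuguese preposition-article contractions (e.g. 'no' -> 'em' + 'o').
    Illustrative sketch only; NLPyPort itself builds on NLTK."""
    # Small, non-exhaustive contraction table (an assumption for this sketch).
    contractions = {"do": ("de", "o"), "da": ("de", "a"),
                    "no": ("em", "o"), "na": ("em", "a")}
    tokens = []
    for raw in re.findall(r"\w+|[^\w\s]", text):
        if raw.lower() in contractions:
            tokens.extend(contractions[raw.lower()])
        else:
            tokens.append(raw)
    return tokens

print(tokenize("O gato dorme no sofá."))
# ['O', 'gato', 'dorme', 'em', 'o', 'sofá', '.']
```

A real pipeline would handle many more contractions and clitics, which is one reason the paper adapts existing resources instead of writing rules from scratch.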
Information Extraction for Event Ranking
Search engines are evolving towards richer and stronger semantic approaches, focusing on entity-oriented tasks where knowledge bases have become fundamental. In order to support semantic search, search engines are increasingly reliant on robust information extraction systems. In fact, most modern search engines already depend heavily on a well-curated knowledge base. Nevertheless, they still lack the ability to effectively and automatically take advantage of multiple heterogeneous data sources. Central tasks include harnessing the information locked within textual content by linking mentioned entities to a knowledge base, or integrating multiple knowledge bases to answer natural language questions. Combining text and knowledge bases is frequently used to improve search results, but it can also be used for the query-independent ranking of entities such as events. In this work, we present a complete information extraction pipeline for the Portuguese language, covering all stages from data acquisition to knowledge base population. We also describe a practical application of the automatically extracted information: supporting the ranking of upcoming events displayed on the landing page of an institutional search engine, where space is limited to only three relevant events. We manually annotate a dataset of news covering event announcements from multiple faculties and organic units of the institution. We then use it to train and evaluate the named entity recognition module of the pipeline. We rank events by taking advantage of identified entities, as well as partOf relations, in order to compute an entity popularity score, together with an entity click score based on implicit feedback from clicks in the institutional search engine. We then combine these two scores with the number of days until the event, obtaining a final ranking for the three most relevant upcoming events.
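The ranking step described above can be sketched as follows. The abstract does not give the exact combination formula, so the weights, the proximity term, and the field names are assumptions for illustration.

```python
def event_score(popularity, clicks, days_to_event, w_pop=0.5, w_click=0.5):
    """Hypothetical combination of the three signals described above:
    entity popularity, implicit click feedback, and temporal proximity.
    The weights and formula are illustrative, not the paper's."""
    proximity = 1.0 / (1 + days_to_event)  # sooner events rank higher
    return (w_pop * popularity + w_click * clicks) * proximity

def top_events(events, k=3):
    """Rank candidate events and keep the k most relevant,
    mirroring the three-slot landing page described above."""
    ranked = sorted(events,
                    key=lambda e: event_score(e["popularity"], e["clicks"], e["days"]),
                    reverse=True)
    return [e["name"] for e in ranked[:k]]

events = [
    {"name": "Open Day",      "popularity": 0.9, "clicks": 0.2, "days": 10},
    {"name": "PhD Defense",   "popularity": 0.4, "clicks": 0.8, "days": 1},
    {"name": "Workshop",      "popularity": 0.3, "clicks": 0.3, "days": 2},
    {"name": "Alumni Meetup", "popularity": 0.1, "clicks": 0.1, "days": 30},
]
print(top_events(events))
# ['PhD Defense', 'Workshop', 'Open Day']
```

Note how the proximity term lets a less popular but imminent event overtake a popular distant one, which matches the abstract's goal of ranking *upcoming* events rather than popular ones in general.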
Discovery of sensitive data with natural language processing
The process of protecting sensitive data is continually growing and becoming increasingly important,
especially as a result of the directives and laws imposed by the European Union. The effort
to create automatic systems is continuous, but in most cases, the processes behind them are
still manual or semi-automatic. In this work, we developed a component that extracts
and classifies sensitive data from unstructured text in European Portuguese. The
objective was to create a system that allows organizations to understand their data and comply
with legal and security requirements. We studied a hybrid approach to the problem of Named
Entity Recognition for the Portuguese language. This approach combines several techniques,
such as rule-based/lexicon-based models, machine learning algorithms, and neural networks. The
rule-based and lexicon-based approaches were used only for a set of specific classes. For the remaining
classes of entities, the SpaCy and Stanford NLP tools were tested; two statistical models,
Conditional Random Fields and Random Forest, were implemented; and, finally, a Bidirectional
LSTM approach was experimented with. The best results were achieved with the Stanford NER model
(86.41%) from the Stanford NLP tool. Among the statistical models, we found that Conditional
Random Fields obtains the best results, with an F1-score of 65.50%. With
the Bi-LSTM approach, we achieved an F1-score of 83.01%. The corpora used for training and
testing were the HAREM Golden Collection, the SIGARRA News Corpus, and the DataSense NER Corpus.
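The rule-based/lexicon-based stage for specific classes can be sketched with simple patterns. The class names and patterns below are assumptions for illustration; the actual DataSense classes and rules are not specified in the abstract.

```python
import re

# Illustrative rule-based patterns for a few sensitive-data classes.
# These classes and regexes are assumptions for the sketch, not the
# actual DataSense rule set.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\b9\d{8}\b"),            # PT mobile number shape
    "POSTAL_CODE": re.compile(r"\b\d{4}-\d{3}\b"), # PT postal code shape
}

def tag_sensitive(text):
    """Return (class, match) pairs found by the rule-based stage.
    In the hybrid approach, ML models cover the remaining classes."""
    hits = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append((label, m.group()))
    return sorted(hits)

print(tag_sensitive("Contacto: ana@example.pt, 912345678, 4200-465 Porto"))
# [('EMAIL', 'ana@example.pt'), ('PHONE', '912345678'), ('POSTAL_CODE', '4200-465')]
```

High-precision patterns like these suit narrowly defined classes, while the statistical and neural models handle entity classes (e.g. person names) that regexes cannot capture reliably.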
MultiLegalSBD: A Multilingual Legal Sentence Boundary Detection Dataset
Sentence Boundary Detection (SBD) is one of the foundational building blocks of Natural Language Processing (NLP), with incorrectly split sentences heavily influencing the output quality of downstream tasks. It is a challenging task for algorithms, especially in the legal domain, considering the complex and varied sentence structures used. In this work, we curated a diverse multilingual legal dataset consisting of over 130,000 annotated sentences in 6 languages. Our experimental results indicate that the performance of existing SBD models is subpar on multilingual legal data. We trained and tested monolingual and multilingual models based on CRF, BiLSTM-CRF, and transformers, demonstrating state-of-the-art performance. We also show that our multilingual models outperform all baselines in the zero-shot setting on a Portuguese test set. To encourage further research and development by the community, we have made our dataset, models, and code publicly available.
Comment: Accepted at ICAIL 202
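A minimal sketch illustrates why legal text makes SBD hard: naive punctuation splitting breaks on abbreviations such as "Art.". The abbreviation list and rule below are illustrative assumptions, not the CRF, BiLSTM-CRF, or transformer models the paper actually trains.

```python
import re

# Tiny abbreviation list for the sketch; real legal SBD systems (and the
# models in the paper) learn such exceptions rather than hard-coding them.
ABBREVIATIONS = {"art.", "no.", "v.", "cf.", "p."}

def split_sentences(text):
    """Naive SBD: split on sentence-final punctuation followed by
    whitespace, unless the preceding token is a known abbreviation."""
    sentences, start = [], 0
    for m in re.finditer(r"[.!?]\s+", text):
        prev = text[start:m.end()].split()[-1].lower()
        if prev in ABBREVIATIONS:
            continue  # e.g. "Art. 5" is not a sentence boundary
        sentences.append(text[start:m.end()].strip())
        start = m.end()
    rest = text[start:].strip()
    if rest:
        sentences.append(rest)
    return sentences

print(split_sentences("See Art. 5 of the Code. The court agreed."))
# ['See Art. 5 of the Code.', 'The court agreed.']
```

Without the abbreviation check, "Art." would open a spurious boundary; multiplied across six languages and many citation conventions, this is the failure mode that motivates the learned models above.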
Characterizing the personality of Twitter users based on their timeline information
Personality is a set of characteristics that differentiate a person from others. It can be identified
from the words that people use in conversations or in the posts they publish on social
networks. Most existing work focuses on personality prediction by analyzing English texts. In this
study, we analyzed posts by Portuguese users of the social network Twitter. Taking
into account the difficulties in sentiment classification caused by the 140-character
limit imposed on tweets, we decided to use additional features and methods, such as the number
of followers and friends, locations, publication times, etc., to get a more precise picture of a personality.
In this paper, we present methods by which the personality of a user can be predicted
without any effort from the Twitter users themselves. The personality can be accurately predicted through
the publicly available information on Twitter profiles.
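The profile-based signals mentioned above (followers, friends, posting times, locations) can be arranged into a simple feature vector, as in the sketch below. The specific features and field names are assumptions for illustration; the study's actual feature set and classifier are not detailed in the abstract.

```python
def profile_features(user):
    """Turn publicly available profile signals (followers, friends,
    posting hours, locations) into a numeric feature vector.
    The feature choices here are illustrative assumptions."""
    hours = user["post_hours"]  # hour of day for each tweet
    night_ratio = sum(1 for h in hours if h >= 22 or h < 6) / len(hours)
    # Ratio of followers to friends as a crude popularity signal.
    follow_ratio = user["followers"] / max(1, user["friends"])
    return [follow_ratio, night_ratio, len(set(user["locations"]))]

user = {"followers": 300, "friends": 150,
        "post_hours": [23, 1, 9, 14],
        "locations": ["Porto", "Lisboa", "Porto"]}
print(profile_features(user))
# [2.0, 0.5, 2]
```

A vector like this could feed any standard classifier; the point of the abstract is that such signals require no extra effort from the user, since they are already public on the profile.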