Comparison of Word2Vec architectures in the analysis of short texts
With advances in the production and storage of text data, interest in Natural Language Processing (NLP) has grown considerably, leading to the development of increasingly complex methods for a wide range of tasks. Among these methods is Word2Vec, an algorithm that uses neural networks to learn word representations. It has two network architectures: CBoW, which aims to predict the central word of a sentence from the surrounding words, the so-called context, and Skip-gram, which does the opposite and seeks to predict the context from the central word. The present work applies both Word2Vec architectures to obtain word embeddings for the words contained in product descriptions of electronic invoices. This data is unstructured, with a maximum length of 120 characters, and poses several challenges associated with the analysis of short texts, in addition to the very specific vocabulary of the descriptions. Several models were fitted to datasets linked to two products: milk and meat. The fits were compared with and without repeated documents, for different minimum numbers of times a word must appear in the corpus, and for different context window sizes.
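The difference between the two architectures described above can be illustrated by the training examples each one extracts from a short text. The following is a minimal sketch, not the authors' code: the function names, the sample description, and the window size are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def context_window(tokens: List[str], i: int, window: int) -> List[str]:
    """Return up to `window` words on each side of position i."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


def cbow_examples(tokens: List[str], window: int) -> List[Tuple[List[str], str]]:
    """CBoW: the context words are the input, the central word is the target."""
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]


def skipgram_examples(tokens: List[str], window: int) -> List[Tuple[str, str]]:
    """Skip-gram: the central word is the input, each context word is a target."""
    return [(tokens[i], ctx)
            for i in range(len(tokens))
            for ctx in context_window(tokens, i, window)]


# Hypothetical short product description, already tokenized (made-up data).
description = ["leite", "integral", "uht", "1l"]

for context, center in cbow_examples(description, window=1):
    print("CBoW     ", context, "->", center)
for center, ctx in skipgram_examples(description, window=1):
    print("Skip-gram", center, "->", ctx)
```

With short texts, a larger window quickly covers the whole description, which is one reason the window size is a natural setting to compare.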
Zero-inflated-censored Weibull and gamma regression models to estimate wild boar population dispersal distance
The dynamics of the wild boar population has become a pressing issue not only for ecological purposes, but also for agricultural and livestock production. Data on wild boar dispersal distance can have a complex structure, including an excess of zeros and right-censored observations, which makes modeling challenging. We therefore propose two zero-inflated right-censored regression models, assuming Weibull and gamma distributions. First, we present the construction of the likelihood function; we then apply both models to simulated datasets, demonstrating that both regression models behave well. The simulation results point to the consistency and asymptotic unbiasedness of the developed methods. Afterwards, we fitted both models to a simulated dataset of wild boar dispersal, including an excess of zeros, right-censored observations, and two covariates: age and sex. We showed that the models are useful for drawing inferences about wild boar dispersal, correctly describing data that mimic a situation in which males disperse more than females and age has a positive effect on dispersal. These results help overcome some limitations regarding inference in zero-inflated right-censored datasets, especially concerning wild boar populations. Users will be provided with an R function to run the proposed models.
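As an illustration of how a likelihood for zero-inflated right-censored data can be built, the sketch below uses one common formulation for the Weibull case. It is an assumption, not the paper's exact parameterization: a zero contributes the probability `p_zero`, an observed positive distance contributes `(1 - p_zero)` times the Weibull density, and a right-censored distance contributes `(1 - p_zero)` times the Weibull survival function. Covariates such as age and sex are omitted, and the function name and arguments are hypothetical.

```python
import math
from typing import Sequence


def zi_censored_weibull_loglik(
    y: Sequence[float],
    censored: Sequence[bool],
    p_zero: float,
    shape: float,
    scale: float,
) -> float:
    """Log-likelihood of a zero-inflated right-censored Weibull model.

    Assumed formulation (no covariates):
      y == 0               -> log(p_zero)
      y > 0, observed      -> log(1 - p_zero) + log f(y)
      y > 0, right-censored-> log(1 - p_zero) + log S(y)
    where f and S are the Weibull density and survival function.
    """
    total = 0.0
    for yi, is_censored in zip(y, censored):
        if yi == 0:
            total += math.log(p_zero)
            continue
        z = (yi / scale) ** shape          # (y / scale)^shape
        total += math.log1p(-p_zero)       # log(1 - p_zero)
        if is_censored:
            total += -z                    # log S(y) = -(y / scale)^shape
        else:
            total += (math.log(shape) - math.log(scale)
                      + (shape - 1.0) * math.log(yi / scale) - z)  # log f(y)
    return total


# Made-up distances: one zero, one observed value, one censored value.
distances = [0.0, 2.0, 5.0]
is_censored = [False, False, True]
print(zi_censored_weibull_loglik(distances, is_censored,
                                 p_zero=0.3, shape=1.5, scale=2.0))
```

In a regression setting, `p_zero` and `scale` would typically depend on covariates through link functions, and the function would be maximized numerically; the gamma version would replace the density and survival terms accordingly.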