19 research outputs found
Text documents clustering using modified multi-verse optimizer
In this study, a multi-verse optimizer (MVO) is utilised for the text document clustering (TDC) problem. TDC is treated as a discrete optimization problem, and an objective function based on the Euclidean distance is applied as a similarity measure. TDC is tackled by dividing the documents into clusters: documents belonging to the same cluster are similar, whereas those belonging to different clusters are dissimilar. MVO, a recent metaheuristic optimization algorithm established for continuous optimization problems, can intelligently navigate different areas of the search space and search deeply within each area using a particular learning mechanism. The proposed algorithm is called MVOTDC, and it adapts the convergence behaviour of the MVO operators to deal with discrete, rather than continuous, optimization problems.
For evaluating MVOTDC, a comprehensive comparative study is conducted on six text document datasets with various numbers of documents and clusters. The quality of the final results is assessed using precision, recall, F-measure, entropy, accuracy, and purity measures. Experimental results reveal that the proposed method performs competitively in comparison with state-of-the-art algorithms. Statistical analysis is also conducted and shows that MVOTDC produces significant results in comparison with three well-established methods.
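The objective described above — Euclidean distance between documents and their cluster centroids — can be sketched in a few lines. The function and variable names are illustrative, and a real MVOTDC run would evaluate this objective over high-dimensional TF-IDF vectors for many candidate assignments, not the toy 2-D points used here:

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length document vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def clustering_objective(docs, assignment, k):
    """Sum of distances from each document vector to its cluster centroid.

    Lower is better; this is the kind of discrete objective a metaheuristic
    such as MVO can minimize over candidate cluster assignments.
    """
    dim = len(docs[0])
    total = 0.0
    for c in range(k):
        members = [docs[i] for i, a in enumerate(assignment) if a == c]
        if not members:
            continue  # empty cluster contributes nothing
        centroid = [sum(v[j] for v in members) / len(members) for j in range(dim)]
        total += sum(euclidean(v, centroid) for v in members)
    return total

# Toy 2-D "document" vectors and two candidate assignments
docs = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.1]]
good = clustering_objective(docs, [0, 0, 1, 1], k=2)
bad = clustering_objective(docs, [0, 1, 0, 1], k=2)
assert good < bad  # the natural grouping scores lower (better)
```

A metaheuristic would treat `assignment` as the candidate solution and search the discrete space of assignments for the minimum of this objective.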
Hybrid Multi Attribute Relation Method for Document Clustering for Information Mining
Text clustering has been widely utilized with the aim of partitioning a specific collection of documents into different subsets using homogeneity/heterogeneity criteria. It has also become a very complicated area of research, touching pattern recognition, information retrieval, and text mining. In enterprise applications, information mining faces challenges due to the complex distribution of data across an enormous number of different sources. Most of these information sources come from different domains, which creates difficulties in identifying the relationships among the information. In this case, a single clustering method limits related information while increasing computational overheads and processing times. Hence, identifying suitable clustering models for unsupervised learning is a challenge, specifically in the case of multiple attributes in data distributions. Recent works have given significant importance to attribute-relation-based solutions for document clustering. To enhance these further, this paper presents Hybrid Multi Attribute Relation Methods (HMARs) for attribute selection and relation analysis in the co-clustering of datasets. The proposed HMARs allow analysis of distributed attributes in documents in the form of probabilistic attribute relations using modified Bayesian mechanisms. They also provide solutions for accurately identifying the most related attribute model for multiple-attribute document clustering. An experimental evaluation of clustering purity and normalization of information, using the UCI Data repository, shows a 25% improvement compared with previous techniques.
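The attribute-relation idea can be illustrated with a plain conditional-probability estimate over per-document attribute co-occurrence. The paper's modified Bayesian mechanism is not specified in the abstract, so this is only a hedged sketch with illustrative names:

```python
from collections import Counter
from itertools import combinations

def attribute_relations(docs):
    """Estimate P(b | a) for attribute pairs from per-document co-occurrence.

    `docs` is a list of attribute sets, one per document. This is a plain
    conditional-probability sketch, not the paper's modified Bayesian method.
    """
    single = Counter()
    pair = Counter()
    for attrs in docs:
        for a in attrs:
            single[a] += 1
        for a, b in combinations(sorted(attrs), 2):
            pair[(a, b)] += 1
            pair[(b, a)] += 1
    # P(b | a) = count(a and b together) / count(a)
    return {(a, b): pair[(a, b)] / single[a] for (a, b) in pair}

docs = [{"price", "battery"}, {"price", "screen"}, {"price", "battery"}]
rel = attribute_relations(docs)
assert rel[("price", "battery")] == 2 / 3   # battery co-occurs in 2 of 3 price docs
assert rel[("battery", "price")] == 1.0     # price appears in every battery doc
```

Scores like these can rank which attributes are most related before co-clustering.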
Research on Medical Question Answering System Based on Knowledge Graph
To meet the high-efficiency question answering needs of patients and doctors, this system integrates medical professional knowledge, knowledge graphs, and a question answering system that conducts man-machine dialogue in natural language. The system targets the medical field, uses crawler technology with vertical medical websites as data sources, and takes diseases as the core entity to construct a knowledge graph containing 44,000 knowledge entities of 7 types and 300,000 entities of 11 kinds. The graph is stored in the Neo4j graph database, and rule-based matching methods and string-matching algorithms are used to construct a domain lexicon to classify and query questions. This system has practical value for medical-field knowledge graphs and question answering systems.
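The rule-based matching step — a domain lexicon plus string matching to classify questions — might look like the following sketch. The lexicon entries, rule patterns, and intent names are hypothetical, and the Neo4j query that would follow classification is omitted:

```python
# Hypothetical domain lexicon mapping surface terms to entity types
LEXICON = {
    "diabetes": "disease",
    "aspirin": "drug",
    "headache": "symptom",
}

# Rule patterns: (cue words, required entity type, intent label).
# If a question contains a cue word and an entity of the required type,
# it is classified with that intent. All names here are illustrative.
RULES = [
    ({"symptom", "symptoms"}, "disease", "disease_symptoms"),
    ({"treat", "cure"}, "disease", "disease_treatment"),
]

def classify(question):
    """Return (intent, entity) for a question via lexicon string matching."""
    lowered = question.lower()
    words = set(lowered.split())
    entities = [(term, etype) for term, etype in LEXICON.items()
                if term in lowered]
    for cues, etype, intent in RULES:
        if cues & words:
            for term, t in entities:
                if t == etype:
                    return intent, term
    return "unknown", None

assert classify("What are the symptoms of diabetes?") == ("disease_symptoms", "diabetes")
```

In a full system, the returned intent and entity would be translated into a Cypher query against the Neo4j graph.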
Boolean logic algebra driven similarity measure for text based applications
In Information Retrieval (IR), Data Mining (DM), and Machine Learning (ML), similarity measures have been widely used for text clustering and classification. The similarity measure is the cornerstone upon which the performance of most DM and ML algorithms depends. Yet the search in the literature for an effective and efficient similarity measure remains immature. Some recently proposed similarity measures are effective, but have a complex design and suffer from inefficiencies. This work therefore develops an effective and efficient similarity measure, of simplistic design, for text-based applications. The measure developed in this work is driven by Boolean logic algebra basics (BLAB-SM), and aims at reaching the desired accuracy at the fastest run time compared with recently developed state-of-the-art measures. Using the term frequency–inverse document frequency (TF-IDF) schema, the K-nearest neighbor (KNN) classifier, and the K-means clustering algorithm, a comprehensive evaluation is presented. The evaluation compares BLAB-SM against seven similarity measures on two popular datasets, Reuters-21 and Web-KB. The experimental results illustrate that BLAB-SM is not only more efficient but also significantly more effective than state-of-the-art similarity measures on both classification and clustering tasks.
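The TF-IDF schema used in the evaluation can be sketched as follows. BLAB-SM itself is not reproduced here, since its Boolean-logic design is only summarized above; names are illustrative:

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute TF-IDF weight vectors (as dicts) for tokenized documents.

    tf = term count / document length; idf = ln(N / document frequency).
    """
    n = len(docs)
    df = Counter()
    for doc in docs:
        for term in set(doc):
            df[term] += 1
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return vectors

docs = [["boolean", "logic"], ["boolean", "algebra"], ["fuzzy", "logic"]]
vecs = tfidf(docs)
assert vecs[0]["boolean"] == vecs[0]["logic"]   # same tf and same df
assert vecs[1]["algebra"] > vecs[1]["boolean"]  # rarer term weighs more
```

Any similarity measure — cosine, Euclidean, or a Boolean-logic-driven one — can then be applied over these weight vectors for KNN or K-means.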
Comparing sentiment analysis tools on GitHub project discussions
Double-degree master's thesis with UTFPR - Universidade Tecnológica Federal do Paraná. The context of this work is situated in the rapidly evolving sphere of Natural Language Processing (NLP) within the scope of software engineering, focusing on sentiment analysis in software repositories. Sentiment analysis, a subfield of NLP, provides a potent method to parse, understand, and categorize the sentiments expressed in text. By applying sentiment analysis to software repositories, we can decode developers' opinions and sentiments, providing key insights into team dynamics, project health, and potential areas of conflict or collaboration. However, the application of sentiment analysis in software engineering comes with its unique set of challenges. Technical jargon, code-specific ambiguities, and the brevity of software-related communications demand tailored NLP tools for effective analysis.
The study unfolds in two primary phases. In the initial phase, we embarked on a meticulous investigation into the impacts of expanding the training sets of two prominent sentiment analysis tools, namely SentiCR and SentiSW. The objective was to delineate the correlation between training-set size and the resulting tool performance, thereby revealing any potential enhancements. The subsequent phase of the research encapsulates a practical application of the enhanced tools. We employed these tools to categorize discussions drawn from issue tickets within a varied array of open-source projects. These projects span an extensive range, from relatively small repositories to large, well-established ones, thus providing a rich and diverse sampling ground.
PGLDA: enhancing the precision of topic modelling using Poisson Gamma (PG) and Latent Dirichlet Allocation (LDA) for text information retrieval
The Poisson document length distribution has been used extensively in the past for modeling topics, with the expectation that its effect will dissipate at the end of the model definition. This procedure often leads to downplaying the correlation of words with topics and reduces the precision or accuracy of retrieved documents. Existing document models, such as the Latent Dirichlet Allocation (LDA) model, do not accommodate the semantic representation of words. Therefore, in this thesis, the Poisson-Gamma Latent Dirichlet Allocation (PGLDA) model for modeling word dependencies in topic modeling is introduced. The PGLDA model relaxes the word-independence assumption of the existing LDA model by introducing a Gamma distribution that captures the correlation between adjacent words in documents. PGLDA is hybridized with distributed representations of documents (Doc2Vec) and topics (Topic2Vec) to form a new model named PGLDA2Vec. The hybridization is achieved by averaging the Doc2Vec and Topic2Vec vectors to form new word representation vectors, combined with the topics of largest estimated probability under PGLDA. Model estimation for PGLDA and PGLDA2Vec is achieved by combining the Laplacian approximation of the PGLDA log-likelihood with the Feed-Forward Neural Network (FFN) approaches of Doc2Vec and Topic2Vec. The proposed PGLDA and hybrid PGLDA2Vec models were assessed using precision, micro F1 scores, perplexity, and coherence scores. Empirical analysis on three real-world datasets (20 Newsgroups, AG News, and Reuters) showed that the hybrid PGLDA2Vec model, with an average precision of 86.6% and an average F1 score of 96.3% across the three datasets, outperforms the other competing models reviewed.
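The hybridization step — averaging the Doc2Vec vector with the vector of the most probable topic — can be sketched as follows. The dimensions and values are toy examples, and the function name is illustrative:

```python
def hybrid_representation(doc_vec, topic_vecs, topic_probs):
    """Average a document's Doc2Vec vector with the Topic2Vec vector of its
    most probable topic, mirroring the hybridization step described above.
    """
    # Pick the topic with the largest estimated probability (as under PGLDA)
    best = max(range(len(topic_probs)), key=lambda k: topic_probs[k])
    t = topic_vecs[best]
    # Element-wise average of the two equal-length vectors
    return [(d + x) / 2 for d, x in zip(doc_vec, t)]

vec = hybrid_representation(
    doc_vec=[1.0, 3.0],
    topic_vecs=[[3.0, 1.0], [9.0, 9.0]],
    topic_probs=[0.7, 0.3],
)
assert vec == [2.0, 2.0]  # averaged with the first (most probable) topic
```

In the actual model, the vectors would come from trained Doc2Vec and Topic2Vec networks rather than hand-written lists.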
Detection of Hate Tweets using Machine Learning and Deep Learning
Cyberbullying has become a highly problematic occurrence due to its potential for anonymity and the ease with which others can join in the harassment of victims. The distancing effect of technological devices leads cyberbullies to say and do harsher things than is typical in a traditional face-to-face bullying situation. Given the great importance of the problem, detection is becoming a key area of cyberbullying research, and a framework that automatically and accurately detects new cyberbullying instances is highly necessary. To review the machine learning and deep learning approaches, two datasets were used. The first dataset was provided by the University of Maryland and consists of over 30,000 tweets, whereas the second dataset is based on the article `Automated Hate Speech Detection and the Problem of Offensive Language' by Davidson et al., containing roughly 25,000 tweets. The paper explores machine learning approaches using document embeddings such as DBOW (Distributed Bag of Words) and DMM (Distributed Memory Mean), and the performance of Word2vec Convolutional Neural Networks (CNNs), to classify online hate.
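The embedding-based representations mentioned above can be illustrated with a minimal sketch: tweets represented as the mean of their word vectors and compared to class centroids by cosine similarity. The embeddings here are tiny hand-made stand-ins, not trained DBOW/DMM or Word2vec vectors, and the classifier is a nearest-centroid baseline rather than a CNN:

```python
import math

# Hypothetical 2-D word embeddings; real systems would use trained
# Word2vec or Doc2Vec (DBOW/DMM) vectors of a few hundred dimensions.
EMB = {
    "hate": [1.0, 0.0], "stupid": [0.9, 0.1],
    "love": [0.0, 1.0], "nice": [0.1, 0.9],
}

def tweet_vector(tokens):
    """Mean of the word embeddings — a common dense tweet representation."""
    vecs = [EMB[t] for t in tokens if t in EMB]
    return [sum(c) / len(vecs) for c in zip(*vecs)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Class centroids built from labeled examples, then nearest-centroid labeling
hate_centroid = tweet_vector(["hate", "stupid"])
ok_centroid = tweet_vector(["love", "nice"])
tweet = tweet_vector(["stupid", "hate", "hate"])
label = "hate" if cosine(tweet, hate_centroid) > cosine(tweet, ok_centroid) else "ok"
assert label == "hate"
```

A CNN replaces the centroid comparison with learned convolutional filters over the embedding sequence, but the input representation is the same idea.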
Pattern extraction from cell-phone reviews using topic modelling and sentiment analysis
In the digital era, social networks have changed the way we communicate: they have become a fundamental source of information and exchange. The content generated on them needs to be analyzed through various natural language processing techniques in order to find trends or patterns in people's opinions and behavior. Such analysis allows the different areas of an organization to focus their efforts on developing strategies aimed at consumer satisfaction, as well as at positioning their proposals and products.
This study focuses on identifying the key dimensions related to the purchase of mobile phones over the internet. Specifically, we rely on information collected from Mercado Libre, an e-commerce platform containing a large volume of data. First, we extracted the review data from the "Celulares y Teléfonos" (cell phones and telephones) category and preprocessed it, including stop-word removal, normalization, and tokenization of the data. Then, to begin understanding the reasons on which consumers base their choices, we applied unsupervised learning methods, including the extraction of the five main topics, using a bag-of-words transformation of the text and Latent Dirichlet Allocation (LDA). We complemented this with sentiment analysis techniques, which focus on understanding the various words and expressions humans use to express their degree of acceptance of a topic or product, so that emotions can be converted into objective information.
In addition to the above, we applied supervised learning methods to take advantage of the information contained in the labels, that is, in the review scores. For this we used a combination of two feature-extraction approaches: the previously mentioned bag of words and TF-IDF (term frequency – inverse document frequency). We then trained and evaluated classification algorithms capable of predicting the scores, so that they can provide a social assessment that is as accurate as possible. We focused on four classification models: Random Forest, Support Vector Machine, Naive Bayes, and Logistic Regression. The results of the study have practical implications for cell-phone development, since they make it possible to focus on the key topics and aspects on which consumers base their choices.
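The supervised step — predicting review scores from bag-of-words features — can be sketched with one of the four classifiers mentioned above, a multinomial Naive Bayes with Laplace smoothing. The tokens and scores below are toy data, not from the Mercado Libre corpus:

```python
import math
from collections import Counter, defaultdict

def train_nb(reviews):
    """Train a multinomial Naive Bayes on (tokens, score) pairs."""
    class_docs = Counter()               # documents per score class
    word_counts = defaultdict(Counter)   # word counts per score class
    vocab = set()
    for tokens, score in reviews:
        class_docs[score] += 1
        word_counts[score].update(tokens)
        vocab.update(tokens)
    n = sum(class_docs.values())
    return class_docs, word_counts, vocab, n

def predict(model, tokens):
    """Return the score class with the highest smoothed log-probability."""
    class_docs, word_counts, vocab, n = model
    best, best_lp = None, -math.inf
    for score, docs in class_docs.items():
        total = sum(word_counts[score].values())
        lp = math.log(docs / n)  # class prior
        for w in tokens:
            # Laplace (add-one) smoothing over the vocabulary
            lp += math.log((word_counts[score][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = score, lp
    return best

reviews = [
    (["great", "battery", "fast"], 5),
    (["excellent", "screen", "great"], 5),
    (["poor", "battery", "slow"], 1),
    (["bad", "screen", "slow"], 1),
]
model = train_nb(reviews)
assert predict(model, ["great", "screen"]) == 5
assert predict(model, ["slow", "battery", "bad"]) == 1
```

The other three classifiers (Random Forest, SVM, Logistic Regression) would consume the same bag-of-words or TF-IDF feature vectors; only the decision function changes.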