
25 research outputs found

Method for Determining the Similarity of Text Documents for the Kazakh language, Taking Into Account Synonyms: Extension to TF-IDF

Author: Bakiyev Bakhyt
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 22/11/2022

The task of determining the similarity of text documents has received considerable attention in many areas, such as Information Retrieval, Text Mining, Natural Language Processing (NLP), and Computational Linguistics. Converting text into numeric vectors is a complex task that relies on steps such as tokenization, stopword filtering, stemming, and term weighting. Term frequency-inverse document frequency (TF-IDF) is the most widely used term weighting method for finding relevant documents. Many extensions of TF-IDF have been proposed to improve term weighting. This paper proposes another extension of TF-IDF that takes synonyms into account. The effectiveness of the method is confirmed by experiments with the Cosine, Dice, and Jaccard functions for measuring the similarity of text documents in the Kazakh language. Comment: 2022 International Conference on Smart Information Systems and Technologies (SIST

arXiv.org e-Print Archive
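
The abstract above does not spell out how synonyms enter the weighting, so the following is only a minimal sketch under a stated assumption: each synonym is mapped to one canonical term before ordinary TF-IDF weights are computed, and the resulting vectors are compared with the Cosine, Dice, and Jaccard functions. The `SYNONYMS` table, the function names, and the smoothed IDF formula are hypothetical choices for illustration, not the paper's method.

```python
import math
from collections import Counter

# Hypothetical synonym table: each word maps to one canonical representative.
SYNONYMS = {"automobile": "car", "vehicle": "car"}

def normalize(tokens):
    """Replace every token by its canonical synonym (assumed strategy)."""
    return [SYNONYMS.get(t, t) for t in tokens]

def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict) per tokenized document."""
    docs = [normalize(d) for d in docs]
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    vectors = []
    for d in docs:
        tf = Counter(d)
        # Relative term frequency times a smoothed IDF (log(n/df) + 1).
        vectors.append({t: (c / len(d)) * (math.log(n / df[t]) + 1.0)
                        for t, c in tf.items()})
    return vectors

def _dot(a, b):
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def _sq(a):
    return sum(w * w for w in a.values())

def cosine(a, b):
    denom = math.sqrt(_sq(a)) * math.sqrt(_sq(b))
    return _dot(a, b) / denom if denom else 0.0

def dice(a, b):
    denom = _sq(a) + _sq(b)
    return 2.0 * _dot(a, b) / denom if denom else 0.0

def jaccard(a, b):
    dot = _dot(a, b)
    denom = _sq(a) + _sq(b) - dot
    return dot / denom if denom else 0.0

docs = [["fast", "car"], ["fast", "automobile"], ["slow", "train"]]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), dice(vecs[0], vecs[1]), jaccard(vecs[0], vecs[1]))
```

In this toy input the first two documents differ only by a synonym, so after normalization their vectors coincide and all three measures treat them as maximally similar, while the third document shares no terms with the first.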

MyBotS Prototype on Social Media Discord with NLP

Author: Al Maksur Imam
Muhajir Muhammad
Publication venue: College of Science for Women - University of Baghdad
Publication date: 30/03/2021

The continuous growth in technology and technological devices has led to the development of machines that ease various human activities. For instance, despite the importance of information on the Steam platform, buyers and players still get little information related to the application. This is discouraging given the importance of information in the current era of globalization. It is therefore necessary to develop an attractive and interactive application that allows users to ask questions and get answers, such as a chatbot, which can be implemented on the Discord social media platform. Artificial Intelligence is a technique that allows machines to think and make their own decisions. This research showed that the Discord chatbot prototype provides various services, based on the results of classification testing using the SVM method with three kernels, namely Linear, Polynomial, and RBF. On the test data, the Linear kernel SVM achieved the highest accuracy, with accuracy and prediction error values of 94% and 6%, respectively.

Baghdad Science Journal

Document representations for classification of short web-page descriptions

Author: Ivanović Mirjana
Radovanović Miloš
Publication venue: University of Belgrade
Publication date: 01/01/2008

Motivated by applying Text Categorization to classification of Web search results, this paper describes an extensive experimental study of the impact of bag-of-words document representations on the performance of five major classifiers: Naïve Bayes, SVM, Voted Perceptron, kNN and C4.5. The texts, representing short Web-page descriptions sorted into a large hierarchy of topics, are taken from the dmoz Open Directory Web-page ontology, and classifiers are trained to automatically determine the topics which may be relevant to a previously unseen Web-page. Different transformations of input data (stemming, normalization, logtf and idf), together with dimensionality reduction, are found to have a statistically significant improving or degrading effect on classification performance measured by classical metrics: accuracy, precision, recall, F1 and F2. The emphasis of the study is not on determining the best document representation which corresponds to each classifier, but rather on describing the effects of every individual transformation on classification, together with their mutual relationships.

Directory of Open Access Journals

Enhanced ontology-based text classification algorithm for structurally organized documents

Author: Oleiwi Suha Sahib
Publication date: 01/01/2015

Text classification (TC) is an important foundation of information retrieval and text mining. The main task of TC is to predict a text's class according to the types of tags given in advance. Most TC algorithms represent documents by their terms and do not consider the relations among the terms. These algorithms represent documents in a space where every word is assumed to be a dimension. As a result, such representations have high dimensionality, which has a negative effect on classification performance. The objectives of this thesis are to formulate algorithms for classifying text by creating suitable feature vectors and reducing the dimension of the data, which will enhance classification accuracy. This research combines ontology and text representation for classification by developing five algorithms. The first and second algorithms, namely Concept Feature Vector (CFV) and Structure Feature Vector (SFV), create feature vectors to represent the document. The third algorithm, Ontology Based Text Classification (OBTC), is designed to reduce the dimensionality of training sets. The fourth and fifth algorithms, Concept Feature Vector_Text Classification (CFV_TC) and Structure Feature Vector_Text Classification (SFV_TC), classify the document into its related set of classes. The proposed algorithms were tested on five different scientific paper datasets downloaded from different digital libraries and repositories. Experimental results obtained from the proposed algorithms, CFV_TC and SFV_TC, showed better average results in terms of precision, recall, F-measure, and accuracy than the SVM and RSS approaches. This study contributes to finding related documents in information retrieval and text mining research by using ontology in TC.

Universiti Utara Malaysia: UUM eTheses

Inclusion de sens dans la représentation de documents textuels : état de l'art [Including meaning in the representation of text documents: state of the art]

Author: Chappelier Jean-Cédric
Eckard Emmanuel
Publication date: 22/01/2008

This document gives an overview of the state of the art in the representation of meaning in text documents.

Infoscience - École polytechnique fédérale de Lausanne

Um modelo para a seleção de n-gramas significativos e não redundantes em tarefas de mineração de textos. [A model for selecting significant, non-redundant n-grams in text mining tasks.]

Author: CONRADO M. da S.
MOURA M. F.
NOGUEIRA B. M.
REZENDE S. O.
SANTOS F. F. dos
Publication venue: Campinas: Embrapa Informática Agropecuária, 2010.
Publication date: 12/04/2011

This work presents a complete proposal for solving the problem of automatically selecting non-redundant n-gram attributes. The use of n-grams is generally a requirement for improving the subjective interpretation of results in text mining tasks; in such cases, they are statistically generated and selected. After selection, redundancies are usually present, for example the term "informática agropecuária" ("agricultural informatics") and its components "informática" and "agropecuária". A model is therefore proposed that involves removing statistically identified stopwords, an efficient statistical selection of n-gram attributes, and removing the redundancies that remain after selection. The experimental results presented, on the original attributes and on the attributes without redundancies, show that, as expected, eliminating the redundancies causes no loss of representativeness. In addition, the reduction in the number of attributes is substantial, which may mean gains in performance in pattern extraction tasks as well as in the subjective interpretability of the results. It should be stressed that the proposed method is useful for any machine learning algorithm applied to a text mining task and appears to be equally applicable to texts in any language. bitstream/item/32458/1/BolPesq23.pd

Infoteca-e
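
The redundancy-removal step described in the last abstract can be illustrated with a small sketch. It assumes, purely for illustration, that a selected term is redundant when its tokens appear as a contiguous part of a longer selected n-gram. The abstract does not state the paper's actual criterion, and the function names below are hypothetical.

```python
def is_contiguous_part(short, long):
    """True if token tuple `short` occurs contiguously inside a longer tuple `long`."""
    n, m = len(short), len(long)
    return n < m and any(long[i:i + n] == short for i in range(m - n + 1))

def remove_redundant(ngrams):
    """Drop every selected term that is a component of a longer selected n-gram.

    Assumed criterion (not taken from the paper): whitespace tokenization and
    contiguous containment.
    """
    tokenized = [tuple(g.split()) for g in ngrams]
    kept = []
    for gram, tokens in zip(ngrams, tokenized):
        if not any(is_contiguous_part(tokens, other) for other in tokenized):
            kept.append(gram)
    return kept

selected = ["informática agropecuária", "informática", "agropecuária", "solo"]
print(remove_redundant(selected))
```

With the abstract's own example, the two single-word components are dropped because the bigram that contains them was also selected, while an unrelated term such as "solo" is kept.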