25 research outputs found

    Method for Determining the Similarity of Text Documents for the Kazakh language, Taking Into Account Synonyms: Extension to TF-IDF

    The task of determining the similarity of text documents has received considerable attention in many areas such as Information Retrieval, Text Mining, Natural Language Processing (NLP) and Computational Linguistics. Converting text to numeric vectors is a complex task that involves steps such as tokenization, stopword filtering, stemming, and term weighting. Term frequency-inverse document frequency (TF-IDF) is the most widely used term weighting method for facilitating the search for relevant documents. To improve the weighting of terms, a large number of TF-IDF extensions have been proposed. In this paper, another extension of the TF-IDF method is proposed in which synonyms are taken into account. The effectiveness of the method is confirmed by experiments using similarity functions such as Cosine, Dice and Jaccard to measure the similarity of text documents for the Kazakh language. Comment: 2022 International Conference on Smart Information Systems and Technologies (SIST
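One way to picture a synonym-aware TF-IDF is sketched below. This is not the paper's method: it is a minimal illustration that assumes synonyms are simply mapped to one canonical term before weighting. The `SYNONYMS` map, the function names, the smoothed IDF formula, and the set-based forms of Dice and Jaccard are all assumptions made for the example.

```python
import math
from collections import Counter

# Hypothetical synonym map (an assumption): each synonym points to a canonical term.
SYNONYMS = {"automobile": "car", "auto": "car"}

def normalize(tokens, synonyms):
    """Replace every token that has a canonical synonym with that synonym."""
    return [synonyms.get(t, t) for t in tokens]

def tfidf_vectors(docs, synonyms=SYNONYMS):
    """Return one {term: weight} dict per tokenized document."""
    docs = [normalize(d, synonyms) for d in docs]
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for d in docs for t in set(d))
    vectors = []
    for d in docs:
        tf = Counter(d)
        # Relative term frequency times a smoothed IDF (assumed variant).
        vectors.append({
            t: (c / len(d)) * (math.log((1 + n) / (1 + df[t])) + 1)
            for t, c in tf.items()
        })
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def dice(a, b):
    """Set-based Dice coefficient over the terms of two vectors."""
    sa, sb = set(a), set(b)
    return 2 * len(sa & sb) / (len(sa) + len(sb)) if sa or sb else 0.0

def jaccard(a, b):
    """Set-based Jaccard coefficient over the terms of two vectors."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

docs = [
    ["the", "car", "is", "fast"],
    ["the", "automobile", "is", "fast"],
    ["a", "slow", "train"],
]
vectors = tfidf_vectors(docs)
plain = tfidf_vectors(docs, synonyms={})
print(cosine(vectors[0], vectors[1]), cosine(plain[0], plain[1]))
```

With the synonym map applied, the first two toy documents become identical, so all three measures treat them as fully similar; with an empty map, "car" and "automobile" count as different terms and the scores drop.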

    MyBotS Prototype on Social Media Discord with NLP

    The continuous growth in technology and technological devices has led to the development of machines that help ease various human-related activities. For instance, despite the importance of information on the Steam platform, buyers and players still get little information about the application. This is discouraging given the importance of information in the current era of globalization. Therefore, it is necessary to develop an attractive and interactive application that allows users to ask questions and get answers, such as a chatbot, which can be implemented on the Discord social media platform. Artificial Intelligence is a technique that allows machines to think and make their own decisions. This research showed that the Discord chatbot prototype provides various services, based on the results of classification testing using the SVM method with three kernels, namely Linear, Polynomial, and RBF. On the test data, the Linear kernel SVM gave the highest accuracy, with accuracy and prediction error values of 94% and 6%.

    Document representations for classification of short web-page descriptions

    Motivated by applying Text Categorization to classification of Web search results, this paper describes an extensive experimental study of the impact of bag-of-words document representations on the performance of five major classifiers: Naïve Bayes, SVM, Voted Perceptron, kNN and C4.5. The texts, representing short Web-page descriptions sorted into a large hierarchy of topics, are taken from the dmoz Open Directory Web-page ontology, and classifiers are trained to automatically determine the topics which may be relevant to a previously unseen Web-page. Different transformations of input data (stemming, normalization, logtf and idf), together with dimensionality reduction, are found to have a statistically significant improving or degrading effect on classification performance measured by classical metrics: accuracy, precision, recall, F1 and F2. The emphasis of the study is not on determining the best document representation which corresponds to each classifier, but rather on describing the effects of every individual transformation on classification, together with their mutual relationships.

    Enhanced ontology-based text classification algorithm for structurally organized documents

    Text classification (TC) is an important foundation of information retrieval and text mining. The main task of TC is to predict a text's class according to the type of tag given in advance. Most TC algorithms represent a document by its terms, without considering the relations among those terms. These algorithms represent documents in a space where every word is assumed to be a dimension. As a result, such representations have high dimensionality, which has a negative effect on classification performance. The objectives of this thesis are to formulate algorithms for classifying text by creating a suitable feature vector and reducing the dimension of the data, which will enhance classification accuracy. This research combines ontology and text representation for classification by developing five algorithms. The first and second algorithms, namely Concept Feature Vector (CFV) and Structure Feature Vector (SFV), create a feature vector to represent the document. The third algorithm is the Ontology Based Text Classification (OBTC) and is designed to reduce the dimensionality of training sets. The fourth and fifth algorithms, Concept Feature Vector_Text Classification (CFV_TC) and Structure Feature Vector_Text Classification (SFV_TC), classify the document into its related set of classes. The proposed algorithms were tested on five different scientific paper datasets downloaded from different digital libraries and repositories. Experimental results obtained from the proposed algorithms, CFV_TC and SFV_TC, showed better average results in terms of precision, recall, F-measure and accuracy compared with the SVM and RSS approaches. The work in this study contributes to exploring related documents in information retrieval and text mining research by using ontology in TC.

    Inclusion de sens dans la représentation de documents textuels : état de l'art

    This document gives an overview of the state of the art in the field of representing meaning in text documents.

    Um modelo para a seleção de n-gramas significativos e não redundantes em tarefas de mineração de textos.

    This work presents a complete proposal for solving the problem of automatically selecting non-redundant n-gram attributes. The use of n-grams is usually a requirement for improving the subjective interpretation of results in text mining tasks; in such cases, the n-grams are statistically generated and selected. After selection, redundancies are generally present, for example the term "informática agropecuária" (agricultural informatics) and its components "informática" and "agropecuária". A model is therefore proposed that involves removing statistically identified stopwords, an efficient statistical selection of n-gram attributes, and removing the redundancies present after selection. The experimental results presented, on the original attributes and on the attributes without redundancies, show that, as expected, eliminating the redundancies causes no loss of representativeness. In addition, the reduction in the number of attributes is substantial, which may mean gains in performance in pattern extraction tasks as well as in the subjective interpretability of the results. It should be noted that the proposed method is useful for any machine learning algorithm applied to a text mining task and appears to be equally applicable to texts in any language. bitstream/item/32458/1/BolPesq23.pd
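The redundancy-removal step can be illustrated with a small sketch. The abstract does not give the actual criterion, so the rule used here is an assumption: a component is dropped when it appears too rarely outside a selected longer n-gram. The function name `remove_redundant`, the `min_independent` threshold, and the counts are hypothetical.

```python
def remove_redundant(ngram_counts, min_independent=1):
    """Drop components that rarely occur outside a selected longer n-gram.

    ngram_counts maps a tuple of tokens to its frequency. A component is
    removed when its frequency minus the frequency of the containing
    n-gram is below min_independent (an assumed, simplified criterion
    that checks one containing n-gram at a time).
    """
    kept = dict(ngram_counts)
    # Visit longer n-grams first so their components are checked against them.
    for ngram, count in sorted(ngram_counts.items(), key=lambda kv: -len(kv[0])):
        if len(ngram) < 2 or ngram not in kept:
            continue
        for size in range(1, len(ngram)):
            for start in range(len(ngram) - size + 1):
                part = ngram[start:start + size]
                if part in kept and kept[part] - count < min_independent:
                    del kept[part]
    return kept

# Hypothetical counts using the example term from the abstract.
counts = {
    ("informática", "agropecuária"): 5,
    ("informática",): 5,    # never occurs outside the bigram
    ("agropecuária",): 9,   # occurs 4 times on its own
}
selected = remove_redundant(counts)
print(sorted(selected))
```

In this toy input, "informática" only ever occurs inside the bigram and is dropped, while "agropecuária" also occurs independently and is kept.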