Sistem Rekomendasi Produk Pena Eksklusif Menggunakan Metode Content-Based Filtering dan TF-IDF
Recommendation systems are currently trending, as people increasingly rely on online transactions for various personal reasons. A recommendation system offers an easier, faster way for users to find the items they want without spending too much time searching. Competition among businesses has also changed, forcing them to adjust their approach to reach prospective customers, so a system that supports this is needed. In this study, the authors build a product recommendation system using Content-Based Filtering and Term Frequency-Inverse Document Frequency (TF-IDF) from the Information Retrieval (IR) model, in order to obtain efficient results that meet the needs of a solution for improving Customer Relationship Management (CRM). The recommendation system is built and deployed as a solution to increase customers' brand awareness and to minimize failed transactions caused by a lack of information that can be conveyed directly (offline). The data consist of 258 product codes, each with eight categories and 33 constituent keywords matching the company's product knowledge. The TF-IDF calculation yields a weight of 13.854 when displaying the first best product recommendation, with an accuracy of 96.5% in recommending pens
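The pipeline described above, TF-IDF weighting over product keywords followed by ranking against a query, can be sketched in plain Python. The catalogue, product codes, and keywords below are illustrative stand-ins, not the paper's actual data:

```python
import math
from collections import Counter

# Hypothetical toy catalogue: each product is described by its keywords
# (stand-ins for the paper's 258 product codes and 33 keywords).
products = {
    "PEN-001": "exclusive fountain pen gold nib leather case",
    "PEN-002": "ballpoint pen plastic everyday office",
    "PEN-003": "exclusive rollerball pen gold trim gift box",
}

def tf_idf_vectors(docs):
    """Compute TF-IDF weights per document: tf(t, d) * log(N / df(t))."""
    n = len(docs)
    tokenized = {k: v.split() for k, v in docs.items()}
    df = Counter()
    for terms in tokenized.values():
        df.update(set(terms))
    vectors = {}
    for key, terms in tokenized.items():
        tf = Counter(terms)
        vectors[key] = {t: tf[t] * math.log(n / df[t]) for t in tf}
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def recommend(query, docs, top_n=2):
    """Rank products by similarity between the query and product keywords.
    For simplicity the query is added to the corpus before computing IDF."""
    vectors = tf_idf_vectors({**docs, "_query": query})
    q = vectors.pop("_query")
    ranked = sorted(docs, key=lambda k: cosine(q, vectors[k]), reverse=True)
    return ranked[:top_n]
```

A query such as `recommend("exclusive gold pen", products)` then surfaces the products sharing the most distinctive keywords with the request.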
Product Codefication Accuracy With Cosine Similarity And Weighted Term Frequency And Inverse Document Frequency (TF-IDF)
In the SiPaGa application, the codification search process is still inaccurate, so OPD often make mistakes when choosing goods codes. Cosine Similarity and TF-IDF methods are therefore needed to improve search accuracy. Cosine Similarity is a method for calculating similarity using keywords from the goods code. Term Frequency-Inverse Document Frequency (TF-IDF) is a way to assign a weight to each term. The purpose of this research is to improve the accuracy of the search for goods codification. The goods codification data processed in this study comprised 14,417 records sourced from the Goods and Price Planning Information System (SiPaGa) application database. Search keywords were processed using the Cosine Similarity method to measure similarity and TF-IDF to calculate the weighting. This research produces the cosine similarity calculation and TF-IDF weighting and is expected to be applied to the SiPaGa application so that its search process becomes more accurate than before, allowing OPD to choose the product code as desired
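Using a library rather than hand-rolled weights, the same kind of keyword search over goods descriptions can be sketched with scikit-learn's TfidfVectorizer and cosine similarity. The goods descriptions below are hypothetical stand-ins for the 14,417 SiPaGa records:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical goods descriptions standing in for the SiPaGa codification data.
goods = [
    "laptop computer 14 inch office use",
    "ergonomic office chair adjustable",
    "leather laptop bag black",
]

# Fit TF-IDF weights over the goods descriptions once.
vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(goods)

def search(keywords, top_n=1):
    """Return indices of the goods most similar to the search keywords."""
    query_vec = vectorizer.transform([keywords])
    scores = cosine_similarity(query_vec, doc_matrix)[0]
    return scores.argsort()[::-1][:top_n].tolist()
```

For example, `search("office chair")` ranks the chair record above the laptop record, even though both descriptions contain "office", because TF-IDF down-weights the shared term.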
Text documents clustering using data mining techniques
Increasing progress in numerous research fields and information technologies has led to an increase in the publication of research papers. Researchers therefore spend a lot of time finding interesting papers close to their field of specialization. Consequently, in this paper we propose a document classification approach that can cluster the text documents of research papers into meaningful categories, each containing a similar scientific field. The presented approach is based on the essential focus and scope of the target categories, where each category includes many topics. Accordingly, we extract word tokens from the topics that relate to each specific category separately. The frequency of word tokens in a document affects the document's weight, which is calculated using the numerical statistic term frequency-inverse document frequency (TF-IDF). The proposed approach uses the title, abstract, and keywords of the paper, in addition to the category topics, to perform classification. Subsequently, documents are classified and clustered into the primary categories based on the highest cosine similarity between the category weight and the document weights
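The classification step, assigning a paper to the category whose topic-term profile it is most similar to, can be sketched as follows. For brevity this sketch uses raw term frequencies rather than full TF-IDF weights, and the category profiles are invented examples:

```python
import math
from collections import Counter

# Hypothetical category profiles built from the topic terms of each field.
category_terms = {
    "machine_learning": "neural network training classifier model",
    "databases": "query index transaction storage sql",
}

def to_vector(text):
    """Simple term-frequency vector (TF-IDF omitted for brevity)."""
    return Counter(text.split())

def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def classify(title, abstract, keywords, categories):
    """Assign the document (title + abstract + keywords, as in the paper)
    to the category whose term profile it is most similar to."""
    doc = to_vector(" ".join([title, abstract, keywords]))
    profiles = {c: to_vector(t) for c, t in categories.items()}
    return max(profiles, key=lambda c: cosine(doc, profiles[c]))
```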
Web document classification using topic modeling based document ranking
In this paper, we propose a web document ranking method using topic modeling for effective information collection and classification. The proposed method applies a document ranking technique to avoid duplicate crawling when crawling at high speed. Through this ranking technique, it is feasible to remove redundant documents, classify documents efficiently, and confirm that the crawler service is running. The proposed method enables rapid collection of many web documents, and users can efficiently search web pages whose data is constantly updated. In addition, the efficiency of data retrieval can be improved because new information is automatically classified and transmitted. By expanding the scope of the method to big data based web pages and adapting it to various websites, more effective information retrieval is expected to become possible
Analisis Sentimen terhadap ChatGPT Menggunakan Metode Random Forest Classifier
ChatGPT (Generative Pre-trained Transformer) is a chatbot developed by OpenAI and released on 30 November 2022. ChatGPT has attracted a lot of attention because it gives detailed, well-articulated answers across many fields of knowledge. Responses to ChatGPT vary widely, both positive and negative. Sentiment analysis is the process of collecting and analyzing people's opinions about a particular topic. The data used in this study were taken from the social media platform Twitter. In this study, sentiment analysis is performed using a Random Forest Classifier. The research began with a literature study, followed by collecting data from Twitter and building the model, after which testing and evaluation were carried out. Several experimental scenarios were run. Based on the experiments, the highest accuracy obtained was 74.35%. The highest precision, recall, and f1-score were 73.27%, 73.87%, and 72.87%, respectively. This best performance was obtained from the scenario using CountVectorizer term weighting on imbalanced data
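A minimal sketch of the best-performing setup described above (CountVectorizer features fed to a Random Forest) using scikit-learn; the tweets and labels are invented stand-ins for the Twitter data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

# Hypothetical miniature stand-in for the labelled tweet data.
tweets = [
    "chatgpt gives detailed and helpful answers",
    "love how chatgpt explains things clearly",
    "chatgpt answer was wrong and misleading",
    "disappointed chatgpt made up facts again",
]
labels = ["positive", "positive", "negative", "negative"]

# Bag-of-words features, matching the paper's best-performing scenario.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(tweets)

# Random Forest: an ensemble of decision trees voting on the sentiment.
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X, labels)

def predict_sentiment(text):
    """Classify a new tweet using the trained vectorizer + forest."""
    return clf.predict(vectorizer.transform([text]))[0]
```

In practice the real pipeline would also include the preprocessing steps implied by the study (cleaning, tokenization, handling class imbalance) before fitting.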
Machine learning and natural language processing in domain classification of scientific knowledge objects: a review
The domain classification of scientific knowledge objects has been continuously improved over the years. Systems that can automatically classify a scientific knowledge object, through the use of artificial intelligence, machine learning algorithms, natural language processing, and other techniques, have been adopted in most scientific knowledge databases to maintain internal classification consistency and to simplify information arrangement. However, the amount of available data has grown exponentially in recent years, and it can now be found on multiple platforms under different classifications due to the implementation of different classification systems. Thus, searching for and selecting relevant data in research studies and projects has become more complex, and the time needed to find the right information has grown continuously as well. Therefore, machine learning and natural language processing play an important role in the development of automatic and standardized classification systems that will aid researchers in their work. This work has been supported by IViSSEM: POCI-01-0145-FEDER-28284
Combination of Term Weighting with Class Distribution and Centroid-based Approach for Document Classification
A text retrieval system requires a method that can return documents with high relevance to user requests. One of the important stages in the text representation process is weighting. Term Frequency (TF) considers the number of word occurrences in each document, while Inverse Document Frequency (IDF) considers how widely words are distributed throughout the document collection. However, TF-IDF weighting cannot represent the distribution of words over documents with many classes or categories: the more unequal the distribution of a word across categories, the more important that word feature should be. This study developed a new term weighting method in which weighting is carried out based on the frequency of occurrence of terms in each class, integrated with a centroid-based term distribution that can minimize intra-cluster similarity and maximize inter-cluster variance. The ICF.TDCB term weighting method provided the best results when applied to SVM modeling on a dataset of 931 online news documents. The results show that the SVM model achieved an accuracy of 0.723, outperforming other term weightings such as TF.IDF, ICF & TDCB
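The abstract does not give ICF.TDCB's exact formula, but an inverse-class-frequency weighting of the general kind it describes, where terms concentrated in few classes receive higher weight, is commonly written as tf(t, d) · log(C / cf(t)), with cf(t) the number of classes in which term t occurs. The sketch below uses that illustrative form on an invented corpus; it is an assumption, not the paper's method:

```python
import math
from collections import Counter, defaultdict

# Hypothetical labelled corpus: (text, class) pairs.
docs = [
    ("news stocks market rally", "economy"),
    ("news election vote parliament", "politics"),
    ("news team match goal", "sport"),
]

def icf_weights(docs):
    """Weight each term as tf * log(C / cf): terms confined to few classes
    score high, terms spread over every class score zero. (Illustrative
    inverse-class-frequency form; the paper's exact formula is not given.)"""
    classes = {c for _, c in docs}
    term_classes = defaultdict(set)
    for text, c in docs:
        for t in text.split():
            term_classes[t].add(c)
    weighted = []
    for text, c in docs:
        tf = Counter(text.split())
        w = {t: tf[t] * math.log(len(classes) / len(term_classes[t])) for t in tf}
        weighted.append((w, c))
    return weighted
```

Here "news" appears in every class and gets weight 0, while "market" appears only in the economy class and gets a positive weight, capturing the intuition that unevenly distributed words are more discriminative.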
Enhancing Scholarly Understanding: A Comparison of Knowledge Injection Strategies in Large Language Models
The use of transformer-based models like BERT for natural language processing has achieved remarkable performance across multiple domains. However, these models face challenges when dealing with very specialized domains, such as scientific literature. In this paper, we conduct a comprehensive analysis of knowledge injection strategies for transformers in the scientific domain, evaluating four distinct methods for injecting external knowledge into transformers. We assess these strategies in a single-label multi-class classification task involving scientific papers. For this, we develop a public benchmark based on 12k scientific papers from the AIDA knowledge graph, categorized into three fields. We utilize the Computer Science Ontology as our external knowledge source. Our findings indicate that most proposed knowledge injection techniques outperform the BERT baseline
Comparative Topic Modeling for Determinants of Divergent Report Results Applied to Macular Degeneration Studies
Topic modeling and text mining are subsets of Natural Language Processing with relevance for conducting meta-analysis (MA) and systematic review (SR). For evidence synthesis, these NLP methods are conventionally used for topic-specific literature searches or for extracting values from reports to automate essential phases of SR and MA. Instead, this work proposes a comparative topic modeling approach to analyze reports of contradictory results on the same general research question. Specifically, the objective is to find topics exhibiting distinct associations with significant results for an outcome of interest by ranking them according to their proportional occurrence and consistency of distribution across reports of significant results. The proposed method was tested on broad-scope studies addressing whether supplemental nutritional compounds significantly benefit macular degeneration (MD). Eight compounds were identified as having a particular association with reports of significant results for benefitting MD. Six of these were further supported in terms of effectiveness by a follow-up literature search for validation (omega-3 fatty acids, copper, zeaxanthin, lutein, zinc, and nitrates). The two not supported by the follow-up search (niacin and molybdenum) also had the lowest scores under the proposed method's ranking system, suggesting that the method's score for a given topic is a viable proxy for its degree of association with the outcome of interest. These results underpin the proposed method's potential to add specificity in understanding effects from broad-scope reports, elucidate topics of interest for future research, and guide evidence synthesis in a systematic and scalable way
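The ranking idea above, scoring topics by how large and how consistent their proportional occurrence is across reports of significant results, can be sketched with an illustrative scoring rule. The mean-minus-standard-deviation rule and the proportions below are invented stand-ins; the paper's actual formula and data are not given in the abstract:

```python
from statistics import mean, stdev

# Hypothetical topic proportions per report of significant results:
# topic -> its proportional occurrence in each such report.
topic_proportions = {
    "omega-3": [0.30, 0.28, 0.31, 0.29],   # frequent and consistent
    "niacin":  [0.40, 0.01, 0.02, 0.01],   # occasional spike, inconsistent
    "copper":  [0.15, 0.14, 0.16, 0.15],
}

def topic_score(values):
    """Reward high average occurrence, penalise inconsistency across
    reports. (Illustrative stand-in for the paper's scoring rule.)"""
    return mean(values) - stdev(values)

# Topics ranked from strongest to weakest association with the outcome.
ranking = sorted(topic_proportions,
                 key=lambda t: topic_score(topic_proportions[t]),
                 reverse=True)
```

Under this rule the consistently frequent topic ranks first, while the topic driven by a single spike ranks last, mirroring how the low-scoring compounds (niacin, molybdenum) were also the ones not supported by the follow-up search.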
Topic Detection on Twitter Using Deep Learning Method with Feature Expansion GloVe
Twitter is a medium for communication, information transmission, and the exchange of opinions on topics with an extensive reach. A tweet is limited to a text message of 280 characters. Because messages must be brief, tweets often use slang and may not follow structured grammar. The diverse vocabulary in tweets leads to word discrepancies, making tweets difficult to understand, and this is a common obstacle to accurately classifying their topics. Therefore, the authors used GloVe feature expansion to reduce vocabulary discrepancies by building a corpus from Twitter and IndoNews. Topic classification of tweets has previously been studied extensively with various Machine Learning and Deep Learning methods using feature expansion; however, to the best of our knowledge, Hybrid Deep Learning has not previously been used for topic classification on Twitter. The study therefore conducted experiments to analyze the impact of Hybrid Deep Learning and GloVe feature expansion on topic classification. The data used in this study comprised 55,411 Indonesian-language text samples. The methods used are Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), and Hybrid CNN-RNN. The results show that with GloVe feature expansion the CNN method achieved an accuracy of 92.80%, an increase of 0.40% over the baseline, and the RNN followed with an accuracy of 93.72%, a 0.23% improvement. The Hybrid CNN-RNN model achieved the highest accuracy of 94.56%, a significant increase of 2.30%, and the RNN-CNN model also reached a high accuracy of 94.39%, a 0.95% increase. Based on these results, the Hybrid Deep Learning models with feature expansion significantly improved the system's performance
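The GloVe feature expansion described above, mapping out-of-vocabulary slang onto similar known words via embedding similarity, can be sketched as follows. The toy embedding table stands in for real GloVe vectors trained on a Twitter + IndoNews corpus:

```python
import math

# Toy embedding table standing in for trained GloVe vectors
# (the words and values here are invented for illustration).
embeddings = {
    "good":  [0.9, 0.1, 0.0],
    "nice":  [0.8, 0.2, 0.1],
    "bad":   [-0.9, 0.1, 0.0],
    "awful": [-0.8, 0.2, 0.1],
}

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def expand(tokens, vocab, embeddings):
    """Replace out-of-vocabulary tokens with their nearest in-vocabulary
    neighbour in embedding space, so unseen slang maps onto known
    classifier features. Tokens without an embedding are kept as-is."""
    out = []
    for t in tokens:
        if t in vocab or t not in embeddings:
            out.append(t)
        else:
            # Vectors here are 3-dimensional; real GloVe uses 50-300 dims.
            nearest = max(vocab, key=lambda v: cosine(
                embeddings[t], embeddings.get(v, [0.0] * 3)))
            out.append(nearest)
    return out
```

For a classifier whose vocabulary contains "good" but not "nice", `expand(["nice", "movie"], {"good", "bad"}, embeddings)` substitutes "good" for "nice", which is the mismatch-reduction effect the feature expansion aims for.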