294 research outputs found
Combining Keyword Extraction and Query Expansion for Ethical Issue Detection in Health Research Summaries
Health research must undergo an ethical review process aimed at anticipating potential physical, social, economic, and psychological risks. A health study is ethically acceptable when it can be justified by valid scientific methods and passes an ethics review before the research is conducted. To verify that a health research summary addresses the required ethical aspects, keywords are needed that can represent the content of the summary. One common approach is to count the frequency of word occurrences in the document. Other approaches, such as YAKE and KeyBERT, consider not only word frequency but also word context. In addition to keyword extraction, query expansion is performed to broaden the set of terms that can represent each ethical aspect; one approach used for query expansion is the word2vec model. This study proposes a method that develops query expansion together with keyword extraction methods such as TF-IDF, YAKE, and KeyBERT, and combines them using fuzzy logic. Experimental results show that the most precise methods were YAKE and the combination of TF-IDF + YAKE + KeyBERT, with a top precision of 46%; YAKE achieved the highest recall at 72%; and YAKE also achieved the best F1-score, with a top value of 54%.
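As a rough illustration of the extraction-plus-expansion pipeline this abstract describes, the following minimal Python sketch gathers TF-IDF, YAKE, and KeyBERT keywords and expands them with word2vec neighbours. The corpus, parameter values, and plain score handling are assumptions for illustration only; the paper's fuzzy combination step is not reproduced here.

```python
# Minimal sketch: TF-IDF, YAKE, and KeyBERT keyword extraction plus
# word2vec-based query expansion. Placeholder corpus; the fuzzy
# combination from the paper is omitted.
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import Word2Vec
from keybert import KeyBERT
import yake

docs = ["health research ethics review of physical and social risks"]

# TF-IDF: rank the terms of the first document by weight
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)
terms = vectorizer.get_feature_names_out()
tfidf_kws = sorted(zip(terms, tfidf.toarray()[0]), key=lambda t: -t[1])[:10]

# YAKE: lower scores mean better keywords
yake_kws = yake.KeywordExtractor(lan="en", n=1, top=10).extract_keywords(docs[0])

# KeyBERT: embedding-based, higher scores mean better keywords
keybert_kws = KeyBERT().extract_keywords(docs[0], top_n=10)

# Query expansion: add the nearest word2vec neighbours of a keyword
w2v = Word2Vec([d.split() for d in docs], vector_size=50, min_count=1)

def expand(keyword: str, topn: int = 3) -> list[str]:
    """Return word2vec neighbours of `keyword`, or [] if out of vocabulary."""
    if keyword in w2v.wv:
        return [w for w, _ in w2v.wv.most_similar(keyword, topn=topn)]
    return []
```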
Academic assistance chatbot: a comprehensive NLP and deep learning-based approach
The rapid growth of digital technologies and natural language processing (NLP) has revolutionized the field of education, creating new demand for automated academic assistance systems. In this paper, we present an NLP-based academic assistance chatbot designed to provide comprehensive support to students and researchers using deep learning techniques. The chatbot incorporates a range of intelligent features to assist with university recommendations, article writing, automatic question answering (QA), and job search. By leveraging sentiment analysis and sarcasm detection models, the proposed chatbot can offer accurate and insightful university recommendations. Additionally, the chatbot incorporates spell and grammar checking, summarization, paraphrasing, and topic modeling capabilities to help users improve their writing skills. The QA module enables users to obtain quick and precise answers to factoid questions. Moreover, the chatbot helps with internships and job search. According to the literature, this work presents the first assistance chatbot that encapsulates all the features a university student may need to facilitate and improve his or her learning process. The results presented in the paper demonstrate the success of the proposed academic assistant across all of its features and modules in helping university students and graduates.
Are Your Keywords Like My Queries? A Corpus-Wide Evaluation of Keyword Extractors with Real Searches
Keyword Extraction (KE) is essential in Natural Language Processing (NLP) for identifying key terms that represent the main themes of a text, and it is vital for applications such as information retrieval, text summarisation, and document classification. Despite the development of various KE methods, including statistical approaches and advanced deep learning models, evaluating their effectiveness remains challenging. Current evaluation metrics focus on keyword quality, balance, and overlap with annotations from authors and professional indexers, but neglect real-world information retrieval needs. This paper introduces a novel evaluation method that overcomes this limitation by using real query data from Google Trends and that can be applied to both supervised and unsupervised KE approaches. We applied this method to three popular KE approaches (YAKE, RAKE and KeyBERT) and found that KeyBERT was the most effective in capturing users' top queries, with RAKE also showing surprisingly good performance. The code is open-access and publicly available.
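A minimal Python sketch of the evaluation idea follows: extract keywords with YAKE, RAKE, and KeyBERT, then measure how many real user queries each set hits. The document text, the query set, and the exact-match hit-rate metric are illustrative assumptions; the paper draws its queries from Google Trends and its matching protocol may differ.

```python
# Sketch: compare three keyword extractors against a set of real queries.
import nltk
import yake
from keybert import KeyBERT
from rake_nltk import Rake

nltk.download("stopwords", quiet=True)
nltk.download("punkt", quiet=True)

text = "Keyword extraction identifies key terms for information retrieval."
top_queries = {"keyword extraction", "information retrieval"}  # placeholder

yake_kws = {kw for kw, _ in yake.KeywordExtractor(n=2, top=10).extract_keywords(text)}

rake = Rake()
rake.extract_keywords_from_text(text)
rake_kws = set(rake.get_ranked_phrases()[:10])

keybert_kws = {kw for kw, _ in KeyBERT().extract_keywords(
    text, keyphrase_ngram_range=(1, 2), top_n=10)}

def hit_rate(keywords: set[str], queries: set[str]) -> float:
    """Fraction of queries matched exactly by an extracted keyword."""
    keywords = {k.lower() for k in keywords}
    return sum(q in keywords for q in queries) / len(queries)

for name, kws in [("YAKE", yake_kws), ("RAKE", rake_kws), ("KeyBERT", keybert_kws)]:
    print(name, hit_rate(kws, top_queries))
```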
Analysis of social trends based on artificial intelligence techniques
In order to analyze and extract information about social trends in the Spanish and Portuguese environment from an objective point of view, the implementation of this project was requested. The project consists of extracting information from a variety of internet sources using the web scraping technique, and then applying artificial intelligence techniques to process the texts obtained and extract keywords. Finally, two different ways of presenting the results have been created in order to draw as many insights from them as possible.
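A minimal sketch of the described pipeline is shown below: scrape a page, strip the HTML, and extract keywords. The URL and the choice of KeyBERT as the extractor are assumptions; the report does not name its sources or its exact keyword-extraction technique.

```python
# Sketch: web scraping followed by keyword extraction.
import requests
from bs4 import BeautifulSoup
from keybert import KeyBERT

url = "https://example.com/news"  # hypothetical source page
html = requests.get(url, timeout=10).text
text = BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)

keywords = KeyBERT().extract_keywords(text, keyphrase_ngram_range=(1, 2), top_n=10)
print(keywords)
```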
Exploring acceptance of autonomous vehicle policies using KeyBERT and SNA: Targeting engineering students
This study aims to explore user acceptance of Autonomous Vehicle (AV) policies with improved text-mining methods. Recently, South Korean policymakers have viewed the Autonomous Driving Car (ADC) and the Autonomous Driving Robot (ADR) as next-generation means of transportation that will reduce the cost of transporting passengers and goods. They support the construction of V2I and V2V communication infrastructures for ADCs and treat ADRs as equivalent to pedestrians in order to promote their deployment on sidewalks. To fill the gap where end-user acceptance of these policies is not well considered, this study applied two text-mining methods to the comments of graduate students in the fields of Industrial, Mechanical, and Electronics-Electrical-Computer engineering. One is Co-occurrence Network Analysis (CNA) based on TF-IWF and the Dice coefficient, and the other is Contextual Semantic Network Analysis (C-SNA) based on both KeyBERT, which extracts keywords that contextually represent the comments, and double cosine similarity. The reason for comparing these approaches is to balance interest not only in the implications for AV policies but also in the need to apply high-quality text mining to this research domain. Significantly, the limitation of frequency-based text mining, which does not reflect textual context, and the trade-off of adjusting thresholds in Semantic Network Analysis (SNA) were considered. When the two approaches were compared, C-SNA provided the information necessary to understand users' voices using fewer nodes and features than CNA. The users, who pre-emptively understood the AV policies based on their engineering literacy and the given texts, revealed potential risks of the AV accident policies. This study adds suggestions for managing these risks to support the successful deployment of AVs on public roads.
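A minimal Python sketch of the C-SNA idea follows: extract keywords with KeyBERT and connect keyword pairs whose embeddings are similar. The comments, the 0.5 threshold, and the single cosine-similarity step are assumptions; the paper's "double cosine similarity" procedure is not detailed in the abstract and is simplified here.

```python
# Sketch: KeyBERT keywords plus cosine similarity as semantic-network edges.
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

comments = [
    "Autonomous delivery robots on sidewalks may endanger pedestrians.",
    "V2I infrastructure costs for autonomous cars should be shared.",
]  # placeholder student comments

keywords = sorted({kw for doc in comments
                   for kw, _ in KeyBERT().extract_keywords(doc, top_n=5)})

encoder = SentenceTransformer("all-MiniLM-L6-v2")
sim = cosine_similarity(encoder.encode(keywords))

# Keep an edge between two keywords when similarity clears the threshold
edges = [(keywords[i], keywords[j])
         for i in range(len(keywords))
         for j in range(i + 1, len(keywords))
         if sim[i, j] > 0.5]
print(edges)
```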
A statistical significance testing approach for measuring term burstiness with applications to domain-specific terminology extraction
A term in a corpus is said to be "bursty" (or overdispersed) when its occurrences are concentrated in few out of many documents. In this paper, we propose Residual Inverse Collection Frequency (RICF), a heuristic for quantifying term burstiness inspired by statistical significance testing. The chi-squared test is, to our knowledge, the sole test of statistical significance among existing term burstiness measures. Chi-squared test term burstiness scores are computed from the collection frequency statistic (i.e., the proportion that a specified term constitutes in relation to all terms within a corpus). However, the document frequency of a term (i.e., the proportion of documents within a corpus in which a specific term occurs) is exploited by certain other widely used term burstiness measures. RICF addresses this shortcoming of the chi-squared test by systematically incorporating both the collection frequency and document frequency statistics into its term burstiness scores. We evaluate the RICF measure on a domain-specific technical terminology extraction task using the GENIA Term corpus benchmark, which comprises 2,000 annotated biomedical article abstracts. RICF generally outperformed the chi-squared test in terms of precision-at-k score, with percent improvements of 0.00% (P@10), 6.38% (P@50), 6.38% (P@100), 2.27% (P@500), 2.61% (P@1000), and 1.90% (P@5000). Furthermore, RICF performance was competitive with the performances of other well-established measures of term burstiness. Based on these findings, we consider our contributions in this paper a promising starting point for future exploration in leveraging statistical significance testing in text analysis.
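A worked Python sketch of the two statistics contrasted in this abstract follows. The chi-squared burstiness score below uses scipy's goodness-of-fit test of per-document counts against length-proportional expected counts; the abstract does not give the RICF formula, so RICF itself is not reproduced.

```python
# Sketch: collection frequency, document frequency, and a chi-squared
# burstiness score for one term over a tiny placeholder corpus.
from collections import Counter
from scipy.stats import chisquare

docs = [["gene", "expression", "gene"],
        ["protein", "binding"],
        ["gene", "protein", "gene", "gene"]]  # placeholder tokenized corpus

term = "gene"
counts = [Counter(d)[term] for d in docs]
total_tokens = sum(len(d) for d in docs)

# Collection frequency: proportion of all tokens that are `term`
cf = sum(counts) / total_tokens
# Document frequency: proportion of documents that contain `term`
df = sum(c > 0 for c in counts) / len(docs)

# Observed per-document counts vs. counts expected if the term were
# spread across documents in proportion to document length
expected = [cf * len(d) for d in docs]
stat, p = chisquare(counts, f_exp=expected)
print(f"cf={cf:.3f}  df={df:.3f}  chi2={stat:.3f}  p={p:.3f}")
```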
Applying Transformer-based Text Summarization for Keyphrase Generation
Keyphrases are crucial for searching and systematizing scholarly documents. Most current methods for keyphrase extraction aim to extract the most significant words in the text. In practice, however, the list of keyphrases often includes words that do not appear in the text explicitly; in this case, the list of keyphrases represents an abstractive summary of the source text. In this paper, we experiment with popular transformer-based models for abstractive text summarization using four benchmark datasets for keyphrase extraction. We compare the results obtained with the results of common unsupervised and supervised methods for keyphrase extraction. Our evaluation shows that summarization models are quite effective at generating keyphrases in terms of full-match F1-score and BERTScore. However, they produce many words that are absent from the author's list of keyphrases, which makes summarization models ineffective in terms of ROUGE-1. We also investigate several ordering strategies for concatenating target keyphrases. The results show that the choice of strategy affects the performance of keyphrase generation.
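A minimal Python sketch of keyphrase generation framed as abstractive summarization follows, using an off-the-shelf BART summarizer from Hugging Face. The model choice and the comma-splitting of the output are assumptions; in the paper, summarization models are fine-tuned so that the target "summary" is the concatenated list of keyphrases.

```python
# Sketch: generate keyphrases by summarizing and splitting the output.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

text = ("Keyphrases are crucial for searching and systematizing scholarly "
        "documents, yet they often do not appear verbatim in the text.")
summary = summarizer(text, max_length=24, min_length=4, do_sample=False)
keyphrases = [p.strip() for p in summary[0]["summary_text"].split(",")]
print(keyphrases)
```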
Do not Mask Randomly: Effective Domain-adaptive Pre-training by Masking In-domain Keywords
We propose a novel task-agnostic in-domain pre-training method that sits between generic pre-training and fine-tuning. Our approach selectively masks in-domain keywords, i.e., words that provide a compact representation of the target domain. We identify such keywords using KeyBERT (Grootendorst, 2020). We evaluate our approach using six different settings: three datasets combined with two distinct pre-trained language models (PLMs). Our results reveal that the fine-tuned PLMs adapted using our in-domain pre-training strategy outperform PLMs that used in-domain pre-training with random masking as well as those that followed the common pre-train-then-fine-tune paradigm. Furthermore, the overhead of identifying in-domain keywords is reasonable, e.g., 7-15% of the pre-training time (for two epochs) for BERT Large (Devlin et al., 2019).
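A minimal Python sketch of the keyword-masking idea follows: find in-domain keywords with KeyBERT, then mask their occurrences in the text. The corpus and the string-level masking are simplifications; the paper masks at the token level inside the masked-language-model pre-training loop.

```python
# Sketch: mask in-domain keywords (found with KeyBERT) instead of
# random tokens before MLM pre-training.
import re
from keybert import KeyBERT
from transformers import AutoTokenizer

corpus = ["The patient showed elevated troponin after myocardial infarction."]
keywords = {kw for doc in corpus
            for kw, _ in KeyBERT().extract_keywords(doc, top_n=5)}

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def mask_keywords(text: str) -> str:
    """Replace occurrences of in-domain keywords with the mask token."""
    for kw in keywords:
        text = re.sub(rf"\b{re.escape(kw)}\b", tokenizer.mask_token,
                      text, flags=re.IGNORECASE)
    return text

print(mask_keywords(corpus[0]))  # input for masked-language-model training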
Development of an innovative and transparent symptom checker with a focus on natural language processing
This paper presents the creation of a medical symptom checker built with state-of-the-art machine learning and deep learning technologies. It examines the use and development of a natural language processing model trained on medical datasets. The model is discussed, highlighting its advantages and disadvantages for the study. Moreover, the paper introduces a web application with a user-friendly interface that allows users to interact with the model and view its results. Finally, the paper offers an outlook on future use cases of the application and how it may improve healthcare outcomes.
