294 research outputs found
Combining Keyword Extraction and Query Expansion for Ethical Issue Detection in Health Research Summaries
Health research must undergo an ethical review process aimed at anticipating potential physical, social, economic, and psychological risks. A health study is ethically acceptable when it can be justified by valid scientific methods and passes an ethics review before the research is conducted. To verify that a health research summary addresses the required ethical aspects, keywords are needed that can represent the content of the summary. One common approach is to count the frequency of word occurrences in the document. Other approaches, such as YAKE and KeyBERT, consider not only word frequency but also word context. In addition to keyword extraction, query expansion is performed to broaden the set of terms that can represent each ethical aspect; one approach used for query expansion is the word2vec model. This study proposes a method that develops query expansion together with keyword extraction methods such as TF-IDF, YAKE, and KeyBERT, and combines them using fuzzy logic. Experimental results show that the most precise methods were YAKE and the combination of TF-IDF + YAKE + KeyBERT, with a top precision of 46%; YAKE achieved the highest recall at 72%; and YAKE also achieved the best F1-score, with a top value of 54%.
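As a rough illustration of the extraction-plus-expansion pipeline this abstract describes, the following minimal Python sketch gathers TF-IDF, YAKE, and KeyBERT keywords and expands them with word2vec neighbours. The corpus, parameter values, and plain score handling are assumptions for illustration only; the paper's fuzzy combination step is not reproduced here.

```python
# Minimal sketch: TF-IDF, YAKE, and KeyBERT keyword extraction plus
# word2vec-based query expansion. Placeholder corpus; the fuzzy
# combination from the paper is omitted.
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import Word2Vec
from keybert import KeyBERT
import yake

docs = ["health research ethics review of physical and social risks"]

# TF-IDF: rank the terms of the first document by weight
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)
terms = vectorizer.get_feature_names_out()
tfidf_kws = sorted(zip(terms, tfidf.toarray()[0]), key=lambda t: -t[1])[:10]

# YAKE: lower scores mean better keywords
yake_kws = yake.KeywordExtractor(lan="en", n=1, top=10).extract_keywords(docs[0])

# KeyBERT: embedding-based, higher scores mean better keywords
keybert_kws = KeyBERT().extract_keywords(docs[0], top_n=10)

# Query expansion: add the nearest word2vec neighbours of a keyword
w2v = Word2Vec([d.split() for d in docs], vector_size=50, min_count=1)

def expand(keyword: str, topn: int = 3) -> list[str]:
    """Return word2vec neighbours of `keyword`, or [] if out of vocabulary."""
    if keyword in w2v.wv:
        return [w for w, _ in w2v.wv.most_similar(keyword, topn=topn)]
    return []
```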
Academic assistance chatbot: a comprehensive NLP and deep learning-based approach
The rapid growth of digital technologies and natural language processing (NLP) has revolutionized the field of education, creating new demand for automated academic assistance systems. In this paper, we present an NLP-based academic assistance chatbot designed to provide comprehensive support to students and researchers using deep learning techniques. The chatbot incorporates a range of intelligent features to assist with university recommendations, article writing, automatic question answering (QA), and job search. By leveraging sentiment analysis and sarcasm detection models, the proposed chatbot can offer accurate and insightful university recommendations. Additionally, the chatbot incorporates spell and grammar checking, summarization, paraphrasing, and topic modeling capabilities to help users improve their writing skills. The QA module enables users to obtain quick and precise answers to factoid questions. Moreover, the chatbot helps with internships and job search. According to the literature, this work presents the first assistance chatbot that encapsulates all the features a university student may need to facilitate and improve his or her learning process. The results presented in the paper demonstrate the success of the proposed academic assistant across all of its features and modules in helping university students and graduates.
Are Your Keywords Like My Queries? A Corpus-Wide Evaluation of Keyword Extractors with Real Searches
Keyword Extraction (KE) is essential in Natural Language Processing (NLP) for identifying key terms that represent the main themes of a text, and it is vital for applications such as information retrieval, text summarisation, and document classification. Despite the development of various KE methods, including statistical approaches and advanced deep learning models, evaluating their effectiveness remains challenging. Current evaluation metrics focus on keyword quality, balance, and overlap with annotations from authors and professional indexers, but neglect real-world information retrieval needs. This paper introduces a novel evaluation method that overcomes this limitation by using real query data from Google Trends and that can be applied to both supervised and unsupervised KE approaches. We applied this method to three popular KE approaches (YAKE, RAKE and KeyBERT) and found that KeyBERT was the most effective in capturing users' top queries, with RAKE also showing surprisingly good performance. The code is open-access and publicly available.
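A minimal Python sketch of the evaluation idea follows: extract keywords with YAKE, RAKE, and KeyBERT, then measure how many real user queries each set hits. The document text, the query set, and the exact-match hit-rate metric are illustrative assumptions; the paper draws its queries from Google Trends and its matching protocol may differ.

```python
# Sketch: compare three keyword extractors against a set of real queries.
import nltk
import yake
from keybert import KeyBERT
from rake_nltk import Rake

nltk.download("stopwords", quiet=True)
nltk.download("punkt", quiet=True)

text = "Keyword extraction identifies key terms for information retrieval."
top_queries = {"keyword extraction", "information retrieval"}  # placeholder

yake_kws = {kw for kw, _ in yake.KeywordExtractor(n=2, top=10).extract_keywords(text)}

rake = Rake()
rake.extract_keywords_from_text(text)
rake_kws = set(rake.get_ranked_phrases()[:10])

keybert_kws = {kw for kw, _ in KeyBERT().extract_keywords(
    text, keyphrase_ngram_range=(1, 2), top_n=10)}

def hit_rate(keywords: set[str], queries: set[str]) -> float:
    """Fraction of queries matched exactly by an extracted keyword."""
    keywords = {k.lower() for k in keywords}
    return sum(q in keywords for q in queries) / len(queries)

for name, kws in [("YAKE", yake_kws), ("RAKE", rake_kws), ("KeyBERT", keybert_kws)]:
    print(name, hit_rate(kws, top_queries))
```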
Analysis of social trends based on artificial intelligence techniques
In order to analyze and extract information about social trends in the Spanish and Portuguese environment from an objective point of view, the implementation of this project was requested. The project consists of extracting information from a variety of internet sources using the web scraping technique, and then applying artificial intelligence techniques to process the texts obtained and extract keywords. Finally, two different ways of presenting the results have been created in order to draw as many insights from them as possible.
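A minimal sketch of the described pipeline is shown below: scrape a page, strip the HTML, and extract keywords. The URL and the choice of KeyBERT as the extractor are assumptions; the report does not name its sources or its exact keyword-extraction technique.

```python
# Sketch: web scraping followed by keyword extraction.
import requests
from bs4 import BeautifulSoup
from keybert import KeyBERT

url = "https://example.com/news"  # hypothetical source page
html = requests.get(url, timeout=10).text
text = BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)

keywords = KeyBERT().extract_keywords(text, keyphrase_ngram_range=(1, 2), top_n=10)
print(keywords)
```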
Exploring acceptance of autonomous vehicle policies using KeyBERT and SNA: Targeting engineering students
This study aims to explore user acceptance of Autonomous Vehicle (AV) policies with improved text-mining methods. Recently, South Korean policymakers have viewed the Autonomous Driving Car (ADC) and the Autonomous Driving Robot (ADR) as next-generation means of transportation that will reduce the cost of transporting passengers and goods. They support the construction of V2I and V2V communication infrastructures for ADCs and treat ADRs as equivalent to pedestrians in order to promote their deployment on sidewalks. To fill the gap where end-user acceptance of these policies is not well considered, this study applied two text-mining methods to the comments of graduate students in the fields of Industrial, Mechanical, and Electronics-Electrical-Computer engineering. One is Co-occurrence Network Analysis (CNA) based on TF-IWF and the Dice coefficient, and the other is Contextual Semantic Network Analysis (C-SNA) based on both KeyBERT, which extracts keywords that contextually represent the comments, and double cosine similarity. The reason for comparing these approaches is to balance interest not only in the implications for AV policies but also in the need to apply high-quality text mining to this research domain. Significantly, the limitation of frequency-based text mining, which does not reflect textual context, and the trade-off of adjusting thresholds in Semantic Network Analysis (SNA) were considered. When the two approaches were compared, C-SNA provided the information necessary to understand users' voices using fewer nodes and features than CNA. The users, who pre-emptively understood the AV policies based on their engineering literacy and the given texts, revealed potential risks of the AV accident policies. This study adds suggestions for managing these risks to support the successful deployment of AVs on public roads.
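A minimal Python sketch of the C-SNA idea follows: extract keywords with KeyBERT and connect keyword pairs whose embeddings are similar. The comments, the 0.5 threshold, and the single cosine-similarity step are assumptions; the paper's "double cosine similarity" procedure is not detailed in the abstract and is simplified here.

```python
# Sketch: KeyBERT keywords plus cosine similarity as semantic-network edges.
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

comments = [
    "Autonomous delivery robots on sidewalks may endanger pedestrians.",
    "V2I infrastructure costs for autonomous cars should be shared.",
]  # placeholder student comments

keywords = sorted({kw for doc in comments
                   for kw, _ in KeyBERT().extract_keywords(doc, top_n=5)})

encoder = SentenceTransformer("all-MiniLM-L6-v2")
sim = cosine_similarity(encoder.encode(keywords))

# Keep an edge between two keywords when similarity clears the threshold
edges = [(keywords[i], keywords[j])
         for i in range(len(keywords))
         for j in range(i + 1, len(keywords))
         if sim[i, j] > 0.5]
print(edges)
```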
A statistical significance testing approach for measuring term burstiness with applications to domain-specific terminology extraction
A term in a corpus is said to be "bursty" (or overdispersed) when its occurrences are concentrated in few out of many documents. In this paper, we propose Residual Inverse Collection Frequency (RICF), a heuristic for quantifying term burstiness inspired by statistical significance testing. The chi-squared test is, to our knowledge, the sole test of statistical significance among existing term burstiness measures. Chi-squared test term burstiness scores are computed from the collection frequency statistic (i.e., the proportion that a specified term constitutes in relation to all terms within a corpus). However, the document frequency of a term (i.e., the proportion of documents within a corpus in which a specific term occurs) is exploited by certain other widely used term burstiness measures. RICF addresses this shortcoming of the chi-squared test by systematically incorporating both the collection frequency and document frequency statistics into its term burstiness scores. We evaluate the RICF measure on a domain-specific technical terminology extraction task using the GENIA Term corpus benchmark, which comprises 2,000 annotated biomedical article abstracts. RICF generally outperformed the chi-squared test in terms of precision-at-k score, with percent improvements of 0.00% (P@10), 6.38% (P@50), 6.38% (P@100), 2.27% (P@500), 2.61% (P@1000), and 1.90% (P@5000). Furthermore, RICF performance was competitive with the performances of other well-established measures of term burstiness. Based on these findings, we consider our contributions in this paper a promising starting point for future exploration in leveraging statistical significance testing in text analysis.
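A worked Python sketch of the two statistics contrasted in this abstract follows. The chi-squared burstiness score below uses scipy's goodness-of-fit test of per-document counts against length-proportional expected counts; the abstract does not give the RICF formula, so RICF itself is not reproduced.

```python
# Sketch: collection frequency, document frequency, and a chi-squared
# burstiness score for one term over a tiny placeholder corpus.
from collections import Counter
from scipy.stats import chisquare

docs = [["gene", "expression", "gene"],
        ["protein", "binding"],
        ["gene", "protein", "gene", "gene"]]  # placeholder tokenized corpus

term = "gene"
counts = [Counter(d)[term] for d in docs]
total_tokens = sum(len(d) for d in docs)

# Collection frequency: proportion of all tokens that are `term`
cf = sum(counts) / total_tokens
# Document frequency: proportion of documents that contain `term`
df = sum(c > 0 for c in counts) / len(docs)

# Observed per-document counts vs. counts expected if the term were
# spread across documents in proportion to document length
expected = [cf * len(d) for d in docs]
stat, p = chisquare(counts, f_exp=expected)
print(f"cf={cf:.3f}  df={df:.3f}  chi2={stat:.3f}  p={p:.3f}")
```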
Applying Transformer-based Text Summarization for Keyphrase Generation
Keyphrases are crucial for searching and systematizing scholarly documents. Most current methods for keyphrase extraction aim to extract the most significant words in the text. In practice, however, the list of keyphrases often includes words that do not appear in the text explicitly; in this case, the list of keyphrases represents an abstractive summary of the source text. In this paper, we experiment with popular transformer-based models for abstractive text summarization using four benchmark datasets for keyphrase extraction. We compare the results obtained with the results of common unsupervised and supervised methods for keyphrase extraction. Our evaluation shows that summarization models are quite effective at generating keyphrases in terms of full-match F1-score and BERTScore. However, they produce many words that are absent from the author's list of keyphrases, which makes summarization models ineffective in terms of ROUGE-1. We also investigate several ordering strategies for concatenating target keyphrases. The results show that the choice of strategy affects the performance of keyphrase generation.
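A minimal Python sketch of keyphrase generation framed as abstractive summarization follows, using an off-the-shelf BART summarizer from Hugging Face. The model choice and the comma-splitting of the output are assumptions; in the paper, summarization models are fine-tuned so that the target "summary" is the concatenated list of keyphrases.

```python
# Sketch: generate keyphrases by summarizing and splitting the output.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

text = ("Keyphrases are crucial for searching and systematizing scholarly "
        "documents, yet they often do not appear verbatim in the text.")
summary = summarizer(text, max_length=24, min_length=4, do_sample=False)
keyphrases = [p.strip() for p in summary[0]["summary_text"].split(",")]
print(keyphrases)
```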
Do not Mask Randomly: Effective Domain-adaptive Pre-training by Masking In-domain Keywords
We propose a novel task-agnostic in-domain pre-training method that sits between generic pre-training and fine-tuning. Our approach selectively masks in-domain keywords, i.e., words that provide a compact representation of the target domain. We identify such keywords using KeyBERT (Grootendorst, 2020). We evaluate our approach using six different settings: three datasets combined with two distinct pre-trained language models (PLMs). Our results reveal that the fine-tuned PLMs adapted using our in-domain pre-training strategy outperform PLMs that used in-domain pre-training with random masking as well as those that followed the common pre-train-then-fine-tune paradigm. Furthermore, the overhead of identifying in-domain keywords is reasonable, e.g., 7-15% of the pre-training time (for two epochs) for BERT Large (Devlin et al., 2019).
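A minimal Python sketch of the keyword-masking idea follows: find in-domain keywords with KeyBERT, then mask their occurrences in the text. The corpus and the string-level masking are simplifications; the paper masks at the token level inside the masked-language-model pre-training loop.

```python
# Sketch: mask in-domain keywords (found with KeyBERT) instead of
# random tokens before MLM pre-training.
import re
from keybert import KeyBERT
from transformers import AutoTokenizer

corpus = ["The patient showed elevated troponin after myocardial infarction."]
keywords = {kw for doc in corpus
            for kw, _ in KeyBERT().extract_keywords(doc, top_n=5)}

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def mask_keywords(text: str) -> str:
    """Replace occurrences of in-domain keywords with the mask token."""
    for kw in keywords:
        text = re.sub(rf"\b{re.escape(kw)}\b", tokenizer.mask_token,
                      text, flags=re.IGNORECASE)
    return text

print(mask_keywords(corpus[0]))  # input for masked-language-model training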
Development of an innovative and transparent symptom checker with a focus on natural language processing
This paper presents the creation of a medical symptom checker built with state-of-the-art machine learning and deep learning technologies. It examines the use and development of a natural language processing model trained on medical datasets. The model is discussed, highlighting its advantages and disadvantages for the study. Moreover, the paper introduces a web application with a user-friendly interface that allows users to interact with the model and view its results. Finally, the paper offers an outlook on future use cases of the application and how it may improve healthcare outcomes.
