    A set theory based similarity measure for text clustering and classification

    © 2020, The Author(s). Similarity measures have long been used in information retrieval and machine learning for multiple purposes, including text retrieval, text clustering, text summarization, plagiarism detection, and several other text-processing applications. However, until recently, no single measure had been reported to be both highly effective and efficient. The quest for an efficient and effective similarity measure therefore remains an open challenge. This study introduces a new, highly effective and time-efficient similarity measure for text clustering and classification. It also aims to provide a comprehensive examination of seven of the most widely used similarity measures, mainly concerning their effectiveness and efficiency. All similarity measures are examined in detail using the K-nearest neighbor (KNN) algorithm for classification, the K-means algorithm for clustering, and the bag-of-words (BoW) model for feature selection. The experimental evaluation was carried out on two of the most popular datasets, namely Reuters-21 and Web-KB. The results confirm that the proposed set theory-based similarity measure (STB-SM) significantly outperforms all state-of-the-art measures with regard to both effectiveness and efficiency.

    Medical Interview Text Prediction Based on Similarity Values Using Cosine Similarity (Prediksi Teks Wawancara Medis Berdasarkan Nilai Kemiripan Menggunakan Cosine Similarity)

    Online health consultations are increasingly used, especially during the COVID-19 pandemic. The question-and-answer exchange that forms part of the doctor-patient interview before a diagnosis is made may not be reciprocal on such online media. Members of the public generally write more than one question to a doctor about one broad disease category. The wording of these texts is often unstructured and mixes questions with statements. The large number of askers, whose questions may resemble one another, makes the health consultation data appear disorganized. This Final Project applies a text segmentation approach that can help users or visitors of online health consultation media find previously asked questions.

    Pairwise document similarity measure based on present term set

    Measuring pairwise document similarity is an essential operation in various text mining tasks. Most similarity measures judge the similarity between two documents based on the term weights and the information content that the two documents share. However, they are insufficient when several documents have an identical degree of similarity to a particular document. This paper introduces a novel text document similarity measure based on the term weights and the number of terms that appear in at least one of the two documents. The effectiveness of our measure is evaluated on two real-world document collections for a variety of text mining tasks, such as text document classification, clustering, and near-duplicate detection. The performance of our measure is compared with that of some popular measures. The experimental results show that our proposed similarity measure yields more accurate results.

    Computational innovation studies: understanding innovation studies through novel scientometric approaches

    Scientometrics is an important research field dedicated to the quantitative study of science, and it is expanding at an unprecedented rate. It emerged as an evaluation paradigm and is expected to assist in the resolution of complex societal problems. For years, the impact of research has been at the top of the agenda for policymakers; however, little is known about the gatekeeping processes and the broader editorial governance mechanisms that help steer scientific efforts. In this project, we pursue an under-explored perspective (we take editorial boards and journals as an institutional vehicle) and apply it to a specific field of academic research (Innovation Studies). We address different aspects in three steps. First, we provide a comprehensive portrait of the editorship phenomenon by probing the heterogeneous structural features of boards, which are dominated by male and Anglo-American editors and display a concentration of 85% of editorial positions in 20% of the countries. Second, we compare journals' advertising materials (blurbs) using a cosine similarity measure, identifying six journals with more than 80% semantic similarity with Research Policy (the leading journal), and find that the journals can be classified into four groups. Third, we match journal blurbs with the abstracts of papers actually published, revealing that the contents of five journals would have been of greater interest to other outlets. Finally, an interactive tool was developed so that researchers can compare the similarity of journals' contents in the future.
These research strategies add to the portfolio of methodologies that science policy analysts can use to systematically understand journal agendas, in order to reflect on what has been accomplished and what remains to be done.
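
As an illustration of the cosine similarity measure mentioned in the abstract above, the following minimal sketch compares two short texts represented as bag-of-words term counts. It is not taken from any of the listed works: the function names `bag_of_words` and `cosine_similarity`, the simple regular-expression tokenizer, and the use of raw term counts (rather than weighted terms) are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count word tokens (an assumed, deliberately simple tokenizer)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term-count vectors; returns 0.0 if either is empty."""
    # Dot product over the terms of `a`; Counter returns 0 for terms missing from `b`.
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Hypothetical blurbs, invented purely for this example.
    blurb_1 = bag_of_words("research policy and innovation studies")
    blurb_2 = bag_of_words("innovation policy and technology management")
    print(round(cosine_similarity(blurb_1, blurb_2), 3))
```

With raw counts the score ranges from 0 (no shared terms) to 1 (identical term proportions). A study comparing real journal blurbs would likely also apply stop-word removal and term weighting, which this sketch omits.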