14 research outputs found

    Dental CLAIRES: Contrastive LAnguage Image REtrieval Search for Dental Research

    Full text link
    Learning about diagnostic features and related clinical information from dental radiographs is important for dental research, but the lack of expert-annotated data and convenient search tools poses challenges. Our primary objective is to design a search tool that answers a user's free-text query for oral-health research. The proposed framework, Dental CLAIRES (Contrastive LAnguage Image REtrieval Search for dental research), uses periapical radiographs and associated clinical details, such as periodontal diagnosis and demographic information, to retrieve the images that best match the text query. We applied a contrastive representation learning method to find the images described by the user's text, maximizing the similarity score of positive pairs (true pairs) and minimizing the score of negative pairs (random pairs). Our model achieved a hit@3 ratio of 96% and a Mean Reciprocal Rank (MRR) of 0.82. We also designed a graphical user interface that lets researchers interactively verify the model's performance.
    Comment: 10 pages, 7 figures, 4 tables
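The two reported retrieval metrics are standard and easy to reproduce; a minimal sketch (the rank data below is illustrative, not from the paper):

```python
def hit_at_k(ranks, k):
    """Fraction of queries whose correct image appears in the top-k results."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

def mean_reciprocal_rank(ranks):
    """Mean of 1/rank of the first correct result, averaged over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# ranks[i] = 1-based rank of the correct image for query i (hypothetical data)
ranks = [1, 2, 1, 3, 1, 5, 1, 2]
print(hit_at_k(ranks, 3))           # 7 of 8 queries rank within the top 3
print(mean_reciprocal_rank(ranks))
```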

    Clustering and Bootstrapping Based Framework for News Knowledge Base Completion

    Get PDF
    Extracting facts, namely entities and relations, from unstructured sources is an essential step in any knowledge base construction, and the completeness of the knowledge base must be maintained by incrementally extracting new facts from various sources. To date, knowledge base completion has been studied as a problem of knowledge refinement, in which missing facts are inferred by reasoning over the information already present in the knowledge base; facts missed while extracting information from multilingual sources are ignored. Hence, this work proposes a generic framework for knowledge base completion that enriches a knowledge base of crime-related facts extracted from English-language online news articles with facts extracted from news articles in Hindi, a low-resourced Indian language. Using the framework, information can be extracted from news articles in any low-resourced language without language-specific tools such as POS taggers, given an appropriate machine translation tool. To achieve this, a clustering algorithm is proposed that exploits the redundancy in the bilingual collection of news articles by representing clusters with knowledge base facts rather than the conventional bag-of-words representation. Within each cluster, the facts extracted from English-language articles are bootstrapped to extract facts from comparable Hindi-language articles. Bootstrapping within a cluster helps identify the sentences in the low-resourced language that carry new information related to facts already extracted from a high-resourced language such as English. Empirical results show that the proposed clustering algorithm produces more accurate and higher-quality clusters for both monolingual and cross-lingual facts. Experiments also show that the proposed framework achieves a high recall rate in extracting new facts from Hindi news articles.
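The fact-based cluster representation can be pictured as matching an article's extracted triples against each cluster's triple set; a toy sketch (function names and data are hypothetical, not the paper's algorithm):

```python
def best_cluster(article_facts, clusters):
    """Assign an article to the cluster whose fact set overlaps most.

    clusters: dict mapping cluster id -> set of (subject, relation, object)
    triples already extracted for that cluster (hypothetical representation).
    """
    def overlap(facts):
        # Jaccard-style overlap between the article's facts and the cluster's
        return len(article_facts & facts) / max(len(article_facts | facts), 1)
    return max(clusters, key=lambda cid: overlap(clusters[cid]))

clusters = {
    "c1": {("police", "arrested", "suspect"), ("court", "charged", "suspect")},
    "c2": {("company", "launched", "product")},
}
facts = {("police", "arrested", "suspect")}
print(best_cluster(facts, clusters))  # "c1"
```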

    Human-like summaries from heterogeneous and time-windowed software development artefacts

    Get PDF
    First Online: 02 September 2020
    Automatic text summarisation has drawn considerable interest in the area of software engineering. It is challenging to summarise the activities related to a software project (1) because of the volume and heterogeneity of the software artefacts involved, and (2) because it is unclear what information a developer seeks in such a multi-document summary. We present the first framework for summarising multi-document software artefacts containing heterogeneous data within a given time frame. To produce human-like summaries, we employ a range of iterative heuristics to minimise the cosine similarity between texts and high-dimensional feature vectors. A first study shows that users find the automatically generated summaries most useful when they are generated using word similarity and based on the eight most relevant software artefacts.
    Mahfouth Alghamdi, Christoph Treude, Markus Wagner
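Word-similarity ranking of artefacts against a summary seed can be sketched with plain cosine similarity over bags of words (the function names, and the eight-artefact cut-off applied via `k`, are illustrative, not the paper's implementation):

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_artefacts(artefacts, reference, k=8):
    """Rank artefact texts by word similarity to a reference seed, keep top-k."""
    ref = Counter(reference.lower().split())
    scored = sorted(artefacts,
                    key=lambda t: cosine(Counter(t.lower().split()), ref),
                    reverse=True)
    return scored[:k]

arts = ["fix login bug", "update readme", "login crash fixed"]
print(top_artefacts(arts, "bug in login", k=2))
```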

    LoNe Sampler: Graph node embeddings by coordinated local neighborhood sampling

    Full text link
    Local graph neighborhood sampling is a fundamental computational problem at the heart of algorithms for node representation learning. Several works have presented algorithms for learning discrete node embeddings, in which graph nodes are represented by discrete features such as attributes of neighborhood nodes. Discrete embeddings offer several advantages over continuous word2vec-like node embeddings: ease of computation, scalability, and interpretability. We present LoNe Sampler, a suite of algorithms for generating discrete node embeddings by Local Neighborhood Sampling, and address two shortcomings of previous work. First, our algorithms have rigorously understood theoretical properties. Second, we show how to generate approximate explicit vector maps that avoid the expensive computation of a Gram matrix for training a kernel model. Experiments on benchmark datasets confirm the theoretical findings and demonstrate the advantages of the proposed methods.
    Comment: Accepted to AAAI 2023. arXiv admin note: substantial text overlap with arXiv:2102.0477
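The core idea of coordinated local neighborhood sampling can be illustrated with classic min-hashing, where every node uses the same hash functions so that samples are coordinated across nodes; this is only a textbook sketch, not LoNe Sampler's actual algorithms:

```python
import hashlib

def h(seed, item):
    """Deterministic hash in [0, 1), shared across all nodes (coordination)."""
    digest = hashlib.md5(f"{seed}:{item}".encode()).hexdigest()
    return int(digest, 16) / 16**32

def minwise_embedding(neighborhood, k=4):
    """Discrete embedding: for each of k seeds, keep the neighbor with the
    smallest shared hash value. Because the seeds are shared, embeddings of
    two nodes agree in a coordinate with probability equal to the Jaccard
    similarity of their neighborhoods (the standard min-hash property)."""
    return tuple(min(neighborhood, key=lambda v: h(seed, v)) for seed in range(k))

print(minwise_embedding({"a", "b", "c"}))
print(minwise_embedding({"a", "b", "c", "d"}))
```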

    Boolean logic algebra driven similarity measure for text based applications

    Get PDF
    In Information Retrieval (IR), Data Mining (DM), and Machine Learning (ML), similarity measures are widely used for text clustering and classification. The similarity measure is the cornerstone on which the performance of most DM and ML algorithms depends, yet the search in the literature for a similarity measure that is both effective and efficient remains open. Some recently proposed similarity measures are effective but have complex designs and suffer from inefficiencies. This work therefore develops an effective and efficient similarity measure, with a simple design, for text-based applications. The measure developed in this work is driven by the basics of Boolean logic algebra (BLAB-SM) and aims to reach the desired accuracy at the fastest run time compared with recently developed state-of-the-art measures. Using the term frequency–inverse document frequency (TF-IDF) schema, the K-nearest neighbor (KNN) classifier, and the K-means clustering algorithm, a comprehensive evaluation is presented. BLAB-SM was experimentally evaluated against seven similarity measures on two of the most popular datasets, Reuters-21 and Web-KB. The results illustrate that BLAB-SM is not only more efficient but also significantly more effective than state-of-the-art similarity measures on both classification and clustering tasks.
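The abstract does not give BLAB-SM's formula, but the TF-IDF schema it builds on is standard; a minimal unsmoothed version:

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF weights for a list of tokenised documents.

    tf = raw term count in the document; idf = log(N / df). A minimal form of
    the TF-IDF schema; production implementations usually add smoothing."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))  # document frequency
    return [
        {t: c * math.log(n / df[t]) for t, c in Counter(doc).items()}
        for doc in docs
    ]

docs = [["apple", "banana"], ["apple", "cherry"], ["banana", "banana"]]
weights = tfidf(docs)
# "apple" appears in 2 of 3 docs, so it is weighted lower than "cherry"
print(weights[1])
```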

    A set theory based similarity measure for text clustering and classification

    Get PDF
    © 2020, The Author(s). Similarity measures have long been utilized in information retrieval and machine learning for many purposes, including text retrieval, text clustering, text summarization, plagiarism detection, and several other text-processing applications. The problem with these measures is that, until recently, no single measure had been recorded to be both highly effective and efficient at the same time. Thus, the quest for an efficient and effective similarity measure remains an open-ended challenge. This study consequently introduces a new highly effective and time-efficient similarity measure for text clustering and classification. Furthermore, the study provides a comprehensive scrutiny of seven of the most widely used similarity measures, mainly concerning their effectiveness and efficiency. Using the K-nearest neighbor algorithm (KNN) for classification, the K-means algorithm for clustering, and the bag-of-words (BoW) model for feature selection, all similarity measures are carefully examined in detail. The experimental evaluation was made on two of the most popular datasets, namely Reuters-21 and Web-KB. The results confirm that the proposed set theory-based similarity measure (STB-SM), as a pre-eminent measure, significantly outperforms all state-of-the-art measures with regard to both effectiveness and efficiency.
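The abstract does not define STB-SM itself; the classic example of a set-theoretic text similarity is the Jaccard coefficient, shown here for orientation only:

```python
def jaccard(a, b):
    """Classic set-theoretic similarity: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

doc1 = "the cat sat on the mat".split()
doc2 = "the cat lay on the rug".split()
print(jaccard(doc1, doc2))  # 3 shared terms of 7 distinct ones
```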

    Similarity-based comparison of teaching module documents for the senior high school (SMA) Informatics subject

    Get PDF
    The government released the Merdeka (Independent) Curriculum policy in 2022, replacing the previous 2013 curriculum. In the implementation of the Merdeka curriculum at the senior high school (SMA) level, in Phase E, the Informatics subject comprises eight basic content elements that each school may develop independently. In practice, however, there is no measurable evaluation from the government of whether the teaching modules developed by teachers align with the existing standard curriculum. This research proposes a similarity analysis of Informatics teaching module documents against the Computer Science Curricula 2013 reference document. The Cosine Similarity method yields an average similarity of 0.29606, whereas the Word2Vec method yields an average similarity of 0.98449. For Knowledge Area fulfilment, the Cosine Similarity scores give an average fulfilment of 29.6% across all Knowledge Areas tested, versus 98.44% with Word2Vec. In the accuracy measurement stage, Cosine Similarity achieves an average accuracy of 0.6525 across all Knowledge Areas tested, and Word2Vec an average accuracy of 0.5895.
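The gap between the two average scores (0.296 vs 0.984) is typical of the two approaches: term-overlap cosine is sparse, while averaged Word2Vec document vectors crowd together. A toy sketch with hand-made embeddings (the vectors are hypothetical; a real study would load trained ones):

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv) if nu and nv else 0.0

def doc_vector(tokens, embeddings):
    """Mean of word vectors — the usual way a Word2Vec model scores whole
    documents. Averaging pulls every document toward the corpus centroid,
    one reason embedding-based scores run far higher than term-overlap ones."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    dim = len(next(iter(embeddings.values())))
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

# toy 2-d embeddings (hypothetical)
emb = {"network": [0.9, 0.1], "graph": [0.8, 0.2], "poem": [0.1, 0.9]}
a = doc_vector(["network", "graph"], emb)
b = doc_vector(["graph", "poem"], emb)
print(cosine(a, b))  # high despite only one shared term
```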

    Cosine-based explainable matrix factorization for collaborative filtering recommendation.

    Get PDF
    Recent years have seen explosive growth in the amount of digital information and in the number of users who interact with this information through platforms ranging from web services to mobile applications and smart devices. This increase in information and users has naturally led to information overload, which limits users' capacity to discover and find what they need among a staggering array of options, most of which they may never become aware of. Online services have handled this information overload with algorithmic filtering tools that suggest relevant and personalized information to users. These filtering methods, known as Recommender Systems (RS), have become essential for recommending relevant options in diverse domains, from friends, courses, music, and restaurants to movies, books, and travel. Most research on recommender systems has focused on developing and evaluating models that make predictions efficiently and accurately, without taking into account other desiderata, such as fairness and transparency, that are becoming increasingly important for establishing trust with human users. For this reason, researchers have recently been pressed to develop recommender systems with an increased ability to explain why a recommendation is given, and hence help users make more informed decisions. Nowadays, state-of-the-art Machine Learning (ML) techniques achieve unprecedented levels of accuracy in recommender systems, but most models are notorious black boxes that cannot explain their output predictions. One such example is Matrix Factorization (MF), a technique widely used in Collaborative Filtering algorithms; like all black-box machine learning models, MF is unable to explain its outputs.
    This dissertation proposes a new Cosine-based Explainable Matrix Factorization model (CEMF) that incorporates a user-neighborhood explanation matrix (NSE) and a cosine-based penalty in the objective function to encourage predictions that are explainable. Our evaluation experiments demonstrate that CEMF recommends items that are more explainable and diverse than its competitive baselines, and that it achieves this superior performance without sacrificing prediction accuracy.
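One way to picture a user-neighborhood explanation matrix is as the share of a user's neighbours who liked each item; this is a sketch of the general idea, not the dissertation's exact construction or its cosine-based penalty:

```python
def explainability_matrix(ratings, neighbors, threshold=3):
    """User-neighbourhood explainability weights: the share of a user's
    neighbours who rated an item at or above `threshold`. Items with a high
    weight can be justified to the user ("people similar to you liked this").
    Function name, data layout, and threshold are illustrative assumptions."""
    expl = {}
    for u, nbrs in neighbors.items():
        items = {i for v in nbrs for i in ratings.get(v, {})}
        for i in items:
            liked = sum(1 for v in nbrs if ratings.get(v, {}).get(i, 0) >= threshold)
            expl[(u, i)] = liked / len(nbrs)
    return expl

ratings = {"v1": {"m1": 5, "m2": 2}, "v2": {"m1": 4}}
neighbors = {"u": ["v1", "v2"]}
print(explainability_matrix(ratings, neighbors))  # m1 fully explainable for u
```

A penalty built on such weights can then push the factor model toward predictions it can justify, which is the intuition behind adding an explainability term to the MF objective.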