14 research outputs found
Dental CLAIRES: Contrastive LAnguage Image REtrieval Search for Dental Research
Learning diagnostic features and related clinical information from dental radiographs is important for dental research. However, the lack of expert-annotated data and convenient search tools poses challenges. Our primary objective is to design a search tool that retrieves relevant images from a user's text query for oral-health research. The proposed framework, Contrastive LAnguage Image REtrieval Search for dental research (Dental CLAIRES), uses periapical radiographs and associated clinical details, such as periodontal diagnosis and demographic information, to retrieve the images that best match a text query. We applied contrastive representation learning to find the images described by the user's text, maximizing the similarity score of positive pairs (true image-text pairs) and minimizing the score of negative pairs (random pairs). Our model achieved a hit@3 ratio of 96% and a Mean Reciprocal Rank (MRR) of 0.82. We also designed a graphical user interface that lets researchers interactively verify the model's performance.
Comment: 10 pages, 7 figures, 4 tables
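The training objective and retrieval metrics above can be sketched as follows: a symmetric contrastive (InfoNCE-style) loss that pushes matched image-text pairs together and random pairings apart, plus hit@k and MRR. This is a minimal NumPy illustration under assumptions (batch construction, the temperature value, and all function names are not from the paper):

```python
import numpy as np

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss: matched (image, text) pairs on the
    diagonal are positives; every other pairing in the batch is a negative."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # pairwise similarity scores
    labels = np.arange(len(logits))

    def xent(l):
        # cross-entropy of the softmax over each row against the diagonal
        l = l - l.max(axis=1, keepdims=True)
        p = np.exp(l) / np.exp(l).sum(axis=1, keepdims=True)
        return -np.log(p[labels, labels]).mean()

    # average the image->text and text->image directions
    return (xent(logits) + xent(logits.T)) / 2

def hit_at_k(ranks, k=3):
    """Fraction of queries whose true item is ranked within the top k."""
    return float(np.mean([r <= k for r in ranks]))

def mrr(ranks):
    """Mean Reciprocal Rank over 1-based ranks of the true items."""
    return float(np.mean([1.0 / r for r in ranks]))
```

With perfectly aligned embeddings the loss approaches zero, and the two metrics are computed directly from the rank of the true image for each text query.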
Clustering and Bootstrapping Based Framework for News Knowledge Base Completion
Extracting facts, namely entities and relations, from unstructured sources is an essential step in constructing any knowledge base. At the same time, it is necessary to ensure the completeness of the knowledge base by incrementally extracting new facts from various sources. To date, knowledge base completion has been studied as a problem of knowledge refinement, where missing facts are inferred by reasoning over the information already present in the knowledge base; facts missed while extracting information from multilingual sources are ignored. Hence, this work proposes a generic framework for knowledge base completion that enriches a knowledge base of crime-related facts, extracted from online news articles in English, with facts extracted from news articles in Hindi, a low-resource Indian language. Using the framework, information can be extracted from news articles in any low-resource language without language-specific tools such as POS taggers, relying instead on an appropriate machine translation tool. To achieve this, a clustering algorithm is proposed that exploits the redundancy in a bilingual collection of news articles by representing clusters with knowledge base facts rather than the conventional bag-of-words representation. Within each cluster, the facts extracted from English-language articles are bootstrapped to extract facts from comparable Hindi-language articles. Bootstrapping within a cluster helps identify sentences in the low-resource language that carry new information related to the facts already extracted from a high-resource language such as English. Empirical results show that the proposed clustering algorithm produces more accurate, higher-quality clusters for both monolingual and cross-lingual facts. Experiments also show that the proposed framework achieves a high recall rate in extracting new facts from Hindi news articles.
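The cluster representation described above, clusters summarized by their extracted knowledge-base facts rather than bag-of-words vectors, might be sketched as follows. The greedy threshold clustering, the Jaccard overlap, and all names are illustrative assumptions, not the paper's actual algorithm:

```python
def fact_similarity(facts_a, facts_b):
    """Jaccard overlap between two sets of (entity, relation, entity) facts."""
    if not facts_a and not facts_b:
        return 0.0
    return len(facts_a & facts_b) / len(facts_a | facts_b)

def cluster_by_facts(articles, threshold=0.3):
    """Greedy single-pass clustering: an article joins the first cluster whose
    representative fact set is sufficiently similar, otherwise it starts a new
    cluster. `articles` maps article id -> set of facts (assumed already
    translated into a common language)."""
    clusters = []  # list of (representative fact set, [article ids])
    for aid, facts in articles.items():
        for rep, members in clusters:
            if fact_similarity(rep, facts) >= threshold:
                members.append(aid)
                rep |= facts  # enrich the representative with any new facts
                break
        else:
            clusters.append((set(facts), [aid]))
    return clusters
```

An English article and a comparable Hindi article that share extracted facts land in the same cluster, which is the precondition for the bootstrapping step.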
Human-like summaries from heterogeneous and time-windowed software development artefacts
First Online: 02 September 2020
Automatic text summarisation has drawn considerable interest in the area of software engineering. It is challenging to summarise the activities related to a software project, (1) because of the volume and heterogeneity of the software artefacts involved, and (2) because it is unclear what information a developer seeks in such a multi-document summary. We present the first framework for summarising multi-document software artefacts containing heterogeneous data within a given time frame. To produce human-like summaries, we employ a range of iterative heuristics to minimise the cosine-similarity between texts and high-dimensional feature vectors. A first study shows that users find the automatically generated summaries most useful when they are generated using word similarity and based on the eight most relevant software artefacts.
Mahfouth Alghamdi, Christoph Treude, Markus Wagner
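The cosine-similarity ranking that underlies selecting the most relevant artefacts can be sketched as below. This is a minimal illustration; the feature extraction, the query vector, and the function names are assumptions, and the study's iterative heuristics are not reproduced:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def top_k_artefacts(artefact_vecs, query_vec, k=8):
    """Rank artefact feature vectors by cosine similarity to a query/topic
    vector and keep the k most relevant (the study found k = 8 most useful).
    Returns the indices of the selected artefacts, best first."""
    scores = [cosine(v, query_vec) for v in artefact_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]
```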
LoNe Sampler: Graph node embeddings by coordinated local neighborhood sampling
Local graph neighborhood sampling is a fundamental computational problem that
is at the heart of algorithms for node representation learning. Several works
have presented algorithms for learning discrete node embeddings where graph
nodes are represented by discrete features such as attributes of neighborhood
nodes. Discrete embeddings offer several advantages compared to continuous
word2vec-like node embeddings: ease of computation, scalability, and
interpretability. We present LoNe Sampler, a suite of algorithms for generating
discrete node embeddings by Local Neighborhood Sampling, and address two
shortcomings of previous work. First, our algorithms have theoretical properties
that are rigorously understood. Second, we show how to generate approximate explicit
vector maps that avoid the expensive computation of a Gram matrix for the
training of a kernel model. Experiments on benchmark datasets confirm the
theoretical findings and demonstrate the advantages of the proposed methods.
Comment: Accepted to AAAI 2023. arXiv admin note: substantial text overlap with arXiv:2102.0477
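A coordinated local neighborhood sample might look like the following sketch: each embedding dimension picks the neighborhood attribute with the minimum value under a shared hash function, so nodes with identical neighborhoods receive identical discrete features. This is a generic minwise-sampling illustration, not LoNe Sampler itself:

```python
import hashlib

def _hash(seed, item):
    """Deterministic hash shared across all nodes -- the 'coordination' that
    makes equal neighborhood attribute sets yield equal samples."""
    return int(hashlib.md5(f"{seed}:{item}".encode()).hexdigest(), 16)

def discrete_embedding(graph, attrs, dims=4):
    """One discrete feature per dimension: the neighborhood attribute with the
    minimum hash value under that dimension's hash function.
    graph: node -> list of neighbors; attrs: node -> attribute string."""
    emb = {}
    for node, nbrs in graph.items():
        neighborhood = [node] + list(nbrs)
        emb[node] = tuple(
            min((attrs[n] for n in neighborhood), key=lambda a: _hash(d, a))
            for d in range(dims)
        )
    return emb
```

Because the hash functions are shared, two nodes whose neighborhoods carry the same attributes get the same embedding, which is what makes such discrete features comparable across the graph.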
Boolean logic algebra driven similarity measure for text based applications
In Information Retrieval (IR), Data Mining (DM), and Machine Learning (ML), similarity measures are widely used for text clustering and classification. The similarity measure is the cornerstone on which the performance of most DM and ML algorithms depends, yet the search in the literature for a similarity measure that is both effective and efficient remains open. Some recently proposed similarity measures are effective but have complex designs and suffer from inefficiencies. This work therefore develops an effective and efficient similarity measure with a simple design for text-based applications. The measure is driven by the basics of Boolean logic algebra (BLAB-SM) and aims to reach the desired accuracy at a faster run time than recently developed state-of-the-art measures. Using the term frequency-inverse document frequency (TF-IDF) schema, the K-nearest neighbor (KNN) algorithm, and the K-means clustering algorithm, a comprehensive evaluation is presented. BLAB-SM is evaluated experimentally against seven similarity measures on two popular datasets, Reuters-21 and Web-KB. The experimental results show that BLAB-SM is not only more efficient but also significantly more effective than state-of-the-art similarity measures on both classification and clustering tasks.
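A Boolean-logic-flavoured similarity over TF-IDF-weighted term vectors might be sketched as follows. The exact BLAB-SM formula is not given in the abstract, so the AND/OR weighting below is an illustrative stand-in:

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF weights (with +1 IDF smoothing) for a list of tokenized docs."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    return [
        {t: (c / len(d)) * (math.log(n / df[t]) + 1) for t, c in Counter(d).items()}
        for d in docs
    ]

def boolean_similarity(vec_a, vec_b):
    """Boolean-logic similarity sketch: terms present in both documents
    (logical AND) count toward similarity, while the combined vocabulary
    (logical OR) forms the denominator, each term weighted by its TF-IDF
    value. Illustrative only -- not the exact BLAB-SM formula."""
    shared = vec_a.keys() & vec_b.keys()
    union = vec_a.keys() | vec_b.keys()
    denom = sum(max(vec_a.get(t, 0.0), vec_b.get(t, 0.0)) for t in union)
    if denom == 0.0:
        return 0.0
    return sum(min(vec_a[t], vec_b[t]) for t in shared) / denom
```

Identical documents score 1.0 and documents with no shared vocabulary score 0.0, which is the behaviour a similarity measure plugged into KNN or K-means needs.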
A set theory based similarity measure for text clustering and classification
© 2020, The Author(s). Similarity measures have long been used in information retrieval and machine learning for purposes including text retrieval, text clustering, text summarization, plagiarism detection, and several other text-processing applications. The problem with these measures is that, until recently, no single measure had been recorded as both highly effective and highly efficient. The quest for a similarity measure that is both efficient and effective thus remains an open challenge. This study consequently introduces a new highly effective and time-efficient similarity measure for text clustering and classification. Furthermore, the study provides a comprehensive examination of seven of the most widely used similarity measures, focusing on their effectiveness and efficiency. Using the K-nearest neighbor algorithm (KNN) for classification, the K-means algorithm for clustering, and the bag-of-words (BoW) model for feature selection, all similarity measures are examined in detail. The experimental evaluation is carried out on two of the most popular datasets, namely Reuters-21 and Web-KB. The obtained results confirm that the proposed set theory-based similarity measure (STB-SM), as a pre-eminent measure, significantly outperforms all state-of-the-art measures with regard to both effectiveness and efficiency.
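A set-theoretic similarity plugged into KNN classification can be sketched as below; the overlap coefficient and all names are illustrative assumptions, not the paper's STB-SM definition:

```python
from collections import Counter

def set_similarity(doc_a, doc_b):
    """Set-theoretic similarity sketch (overlap coefficient): shared vocabulary
    relative to the smaller document's vocabulary. Illustrative only -- the
    paper's STB-SM formula differs in its exact weighting."""
    a, b = set(doc_a), set(doc_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def knn_predict(train, query_doc, k=3):
    """K-nearest-neighbour majority vote under the similarity above.
    train: list of (tokenized_doc, label) pairs."""
    ranked = sorted(train, key=lambda dl: set_similarity(dl[0], query_doc),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```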
Similarity-based comparison of teaching module documents for the senior high school (SMA) Informatics subject
ABSTRACT
The government released the Merdeka (Independent) Curriculum policy in 2022, replacing the previous curriculum, the 2013 curriculum. In implementing the Merdeka curriculum at the senior high school (SMA) level, in Phase E, the Informatics subject comprises eight basic material elements that each school may develop independently. In the practice of teaching module development by teachers in schools, there has been no measurable evaluation from the government of whether the developed material content conforms to the existing standard curriculum. This research proposes a similarity analysis of Informatics teaching module documents against the Computer Science Curricula 2013 document. The proposed Cosine Similarity method yields an average similarity of 0.29606, while the Word2Vec method yields an average similarity of 0.98449. In the Knowledge Area fulfillment stage, the similarity scores obtained with the Cosine Similarity method across all tested Knowledge Areas give an average fulfillment of 29.6%, while the Word2Vec method gives an average fulfillment of 98.44%. In the accuracy measurement stage, the Cosine Similarity method attains an average accuracy across all tested Knowledge Areas of 0.6525, and the Word2Vec method an average accuracy of 0.5895.
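The two document-similarity approaches compared above can be sketched as follows: cosine similarity over sparse term-frequency vectors versus the cosine of averaged word vectors (Word2Vec-style). The embedding table here is a hypothetical stand-in for a trained Word2Vec model:

```python
import math
from collections import Counter

def tf_cosine(doc_a, doc_b):
    """Cosine similarity over sparse term-frequency vectors."""
    ca, cb = Counter(doc_a), Counter(doc_b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0

def avg_vector_cosine(doc_a, doc_b, word_vecs):
    """Word2Vec-style similarity: average each document's word vectors, then
    take the cosine of the two centroids. `word_vecs` stands in for a trained
    embedding table (hypothetical here)."""
    dim = len(next(iter(word_vecs.values())))

    def centroid(doc):
        vecs = [word_vecs[w] for w in doc if w in word_vecs]
        if not vecs:
            return [0.0] * dim
        return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

    a, b = centroid(doc_a), centroid(doc_b)
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0
```

Term-frequency cosine rewards exact word overlap only, while averaged dense vectors score semantically related documents highly even with little shared vocabulary, which is one way to read the large gap between the two average scores reported above.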
Cosine-based explainable matrix factorization for collaborative filtering recommendation
Recent years have seen explosive growth in the amount of digital information and in the number of users who interact with this information through various platforms, ranging from web services to mobile applications and smart devices. This increase in information and users has naturally led to information overload, which inherently limits users' capacity to discover and find what they need among the staggering array of options available at any given time, most of which they may never become aware of. Online services have handled this information overload with algorithmic filtering tools that suggest relevant, personalized information to users. These filtering methods, known as Recommender Systems (RS), have become essential for recommending relevant options in diverse domains, ranging from friends, courses, music, and restaurants to movies, books, and travel recommendations. Most research on recommender systems has focused on developing and evaluating models that make predictions efficiently and accurately, without taking into account other desiderata, such as fairness and transparency, that are becoming increasingly important for establishing trust with human users. For this reason, researchers have recently been pressed to develop recommendation systems endowed with a greater ability to explain why a recommendation is given, and hence to help users make more informed decisions. Nowadays, state-of-the-art Machine Learning (ML) techniques are used to achieve unprecedented levels of accuracy in recommender systems. Unfortunately, most of these models are notorious black boxes that cannot explain their output predictions. One such example is Matrix Factorization (MF), a technique widely used in Collaborative Filtering algorithms; like all black-box machine learning models, MF is unable to explain its outputs.
This dissertation proposes a new Cosine-based Explainable Matrix Factorization model (CEMF) that incorporates a user-neighborhood explanation matrix (NSE) and a cosine-based penalty in the objective function to encourage predictions that are explainable. Our evaluation experiments demonstrate that CEMF recommends items that are more explainable and diverse than its competitive baselines, and that it achieves this superior performance without sacrificing the accuracy of its predictions.
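An explainability-regularized matrix factorization might be sketched as below; the penalty term, hyperparameters, and the form of the explainability matrix are illustrative assumptions rather than CEMF's exact cosine-based formulation:

```python
import numpy as np

def train_explainable_mf(R, E, k=8, lr=0.05, reg=0.05, lam=0.1,
                         epochs=500, seed=0):
    """Matrix-factorization sketch with an explainability penalty.
    R: user-item rating matrix (0 = unobserved).
    E: user-item explainability matrix (e.g. a neighbourhood-based expected
       rating); the penalty pulls the user and item factors together where
       E[u, i] is high, so explainable items get favoured in predictions.
    Illustrative only -- CEMF's exact cosine-based term differs."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    P = rng.normal(scale=0.1, size=(n_users, k))  # user factors
    Q = rng.normal(scale=0.1, size=(n_items, k))  # item factors
    users, items = np.nonzero(R)
    for _ in range(epochs):
        for u, i in zip(users, items):
            err = R[u, i] - P[u] @ Q[i]
            # squared-error gradient + L2 regularization + explainability pull
            grad_p = -err * Q[i] + reg * P[u] + lam * E[u, i] * (P[u] - Q[i])
            grad_q = -err * P[u] + reg * Q[i] - lam * E[u, i] * (P[u] - Q[i])
            P[u] -= lr * grad_p
            Q[i] -= lr * grad_q
    return P, Q
```

With `E` all zeros this reduces to plain regularized MF; a nonzero `E` nudges the factorization toward predictions it can justify from the user's neighbourhood.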