Similarity Detection for Hadith of Fiqh of Women using Cosine Similarity and Boyer Moore Method
Information about fiqh and hadith, a source of Islamic law, is now easy
to obtain. The problem is that many articles on jurisprudence base their
understanding of laws or rules on hadith whose validity cannot be
ascertained. This study aims to determine the degree of similarity
between the hadith quoted in articles and the hadith found in reliable
sources such as books. One output of the study is an application that
measures hadith similarity using Boyer Moore and Cosine Similarity.
Boyer Moore matches strings by comparing characters from the rightmost
position of the pattern to the leftmost position. Cosine similarity
scores two texts by the cosine of the angle x between their vectors A
and B. In the testing phase, the proposed model ran as planned. In one
test scenario, 9 keyword cases were compared against the categories in
the database, with an accuracy of 80%. With the weighted cosine
similarity method, the similarity percentage was proportional to the
sample of words entered, at 36%.
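The cosine similarity step described in this abstract can be illustrated with a minimal Python sketch. It assumes plain whitespace tokenization and raw term-frequency vectors; the abstract does not specify the weighting scheme the authors used, and the function name `cosine_similarity` is chosen here for illustration only.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between the term-frequency vectors of two texts.

    Assumption: whitespace tokenization and raw counts, not the paper's weighting.
    """
    vec_a = Counter(text_a.lower().split())
    vec_b = Counter(text_b.lower().split())

    # Dot product over the terms that both texts share.
    dot = sum(vec_a[term] * vec_b[term] for term in vec_a.keys() & vec_b.keys())

    # Euclidean norms of the two vectors.
    norm_a = math.sqrt(sum(count * count for count in vec_a.values()))
    norm_b = math.sqrt(sum(count * count for count in vec_b.values()))

    # An empty text has no direction, so report zero similarity.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Identical texts score 1.0 and texts with no shared terms score 0.0, so the value can be read directly as a similarity percentage.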
Genomic Repeats Detection Using the Boyer-Moore Algorithm with Apache Spark Streaming
Over the past decade, scientists have had to spend three years of laboratory
research to analyze DNA. One case of DNA analysis that demands time and effort
on this scale is the study of diseases caused by repetitive genomic patterns,
known as genomic repeats. Genomic repeats are analyzed with string matching,
or pattern matching, which searches for a pattern in a large text. The
Boyer-Moore algorithm preprocesses the pattern and creates two tables, known
as the Boyer-Moore bad-character (bmBc) table and the Boyer-Moore good-suffix
(bmGs) table. For each character in the alphabet, the bad-character table
stores a shift value based on where that character occurs in the pattern. This
algorithm forms the basis for several pattern matching algorithms. For this
reason, this research builds a computational model that finds genomic repeats
quickly and effectively by modifying the Boyer-Moore algorithm and
implementing it on a Big Data platform, namely Apache Spark Streaming. The
results show a speedup from using the Big Data platform in two designed
scenarios. The first scenario uses a cluster with 4 cores and a varying number
of worker nodes; the second uses a cluster with 2 worker nodes and a varying
number of cores. The study also shows that the computational model is faster
than previous research that used a standalone setup.
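The bad-character table described in this abstract can be sketched in Python as follows. This is a simplified illustration that uses only the bmBc table and omits the good-suffix (bmGs) table, so it is not the full Boyer-Moore algorithm and not the authors' Spark implementation. The function names are hypothetical.

```python
from typing import Dict, List


def build_bmbc(pattern: str) -> Dict[str, int]:
    """Bad-character shifts: distance from a character's last occurrence
    (excluding the final position) to the end of the pattern."""
    m = len(pattern)
    return {pattern[i]: m - 1 - i for i in range(m - 1)}


def search_bad_character_only(text: str, pattern: str) -> List[int]:
    """Return every start index of pattern in text, comparing right to left.

    Assumption: only the bad-character rule is applied; characters absent
    from the table get the default shift m.
    """
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    bmbc = build_bmbc(pattern)
    matches = []
    start = 0
    while start <= n - m:
        j = m - 1
        # Compare from the rightmost pattern character towards the left.
        while j >= 0 and pattern[j] == text[start + j]:
            j -= 1
        if j < 0:
            matches.append(start)
            start += 1  # conservative shift so overlapping repeats are found
        else:
            bad = text[start + j]
            # Align the mismatching text character with its last occurrence.
            start += max(1, bmbc.get(bad, m) - m + 1 + j)
    return matches
```

For example, `search_bad_character_only("CAGCAGCAGT", "CAG")` reports the three consecutive copies of the repeat unit at positions 0, 3, and 6. A full implementation would take the larger of the bad-character and good-suffix shifts at each mismatch.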
Pattern matching through Chaos Game Representation: bridging numerical and discrete data structures for biological sequence analysis
BACKGROUND: Chaos Game Representation (CGR) is an iterated function that bijectively maps discrete sequences into a continuous domain. As a result, discrete sequences can be the object of statistical and topological analyses otherwise reserved for numerical systems. Characteristically, CGR coordinates of substrings sharing an L-long suffix will be located within a distance of 2^-L of each other. In the two decades since its original proposal, CGR has been generalized beyond its original focus on genomic sequences and has been successfully applied to a wide range of problems in bioinformatics. This report explores the possibility that it can be further extended to approach algorithms that rely on discrete, graph-based representations.
RESULTS: The exploratory analysis described here consisted of selecting foundational string problems and refactoring them using CGR-based algorithms. We found that CGR can take the role of suffix trees and emulate sophisticated string algorithms, efficiently solving exact and approximate string matching problems such as finding all palindromes and tandem repeats, and matching with mismatches. The common feature of these problems is that they use longest common extension (LCE) queries as subtasks of their procedures, which we show to have a constant time solution with CGR. Additionally, we show that CGR can be used as a rolling hash function within the Rabin-Karp algorithm.
CONCLUSIONS: The analysis of biological sequences relies on algorithmic foundations facing mounting challenges, both logistic (performance) and analytical (lack of a unifying mathematical framework). CGR is found to provide the latter and to promise the former: graph-based data structures for sequence analysis operations are entailed by numerical-based data structures produced by CGR maps, providing a unifying analytical framework for a diversity of pattern matching problems.
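The suffix property mentioned in the background can be checked with a small Python sketch. It assumes the common corner assignment for DNA (A, C, G, T on the corners of the unit square), a starting point at the centre, and a per-coordinate (Chebyshev) distance; the abstract does not state which conventions the authors use, and the function names are illustrative.

```python
from typing import Tuple

# Assumed corner assignment for the four nucleotides on the unit square.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_point(sequence: str) -> Tuple[float, float]:
    """Iterate the CGR map: move halfway towards the corner of each symbol."""
    x, y = 0.5, 0.5  # assumed starting point: centre of the square
    for base in sequence:
        cx, cy = CORNERS[base]
        x = (x + cx) / 2.0
        y = (y + cy) / 2.0
    return x, y


def chebyshev(p: Tuple[float, float], q: Tuple[float, float]) -> float:
    """Largest per-coordinate difference between two points."""
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))


def shared_suffix_length(a: str, b: str) -> int:
    """Length of the longest common suffix of a and b."""
    length = 0
    while length < min(len(a), len(b)) and a[-1 - length] == b[-1 - length]:
        length += 1
    return length


if __name__ == "__main__":
    seq_a = "ACGTACGGATTACA"
    seq_b = "TTTTTTTGATTACA"
    L = shared_suffix_length(seq_a, seq_b)
    distance = chebyshev(cgr_point(seq_a), cgr_point(seq_b))
    print(L, distance, distance <= 2.0 ** -L)
```

Each shared symbol halves the gap between the two points, so after L shared symbols the gap in each coordinate is at most 2^-L. This is the geometric fact that lets a coordinate comparison stand in for a suffix comparison.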