82 research outputs found

    Optimum Search Schemes for Approximate String Matching Using Bidirectional FM-Index

    Finding approximate occurrences of a pattern in a text using a full-text index is a central problem in bioinformatics and has been extensively researched. Bidirectional indices have opened new possibilities in this regard, allowing the search to start anywhere within the pattern and extend in both directions. In particular, the use of search schemes (partitioning the pattern and searching the pieces in certain orders with given bounds on errors) can yield significant speed-ups. However, finding optimal search schemes is a difficult combinatorial optimization problem. Here, for the first time, we propose a mixed integer program (MIP) capable of solving this optimization problem for Hamming distance with a given number of pieces. Our experiments show that the optimal search schemes found by our MIP significantly improve the performance of search in a bidirectional FM-index over previous ad-hoc solutions. For example, approximate matching of 101-bp Illumina reads (with two errors) becomes 35 times faster than standard backtracking. Moreover, despite being performed purely in the index, search using our optimal schemes (for up to two errors) runs in time comparable to the best state-of-the-art aligners, which benefit from combining search in the index with in-text verification using dynamic programming. As a result, we anticipate that a full-fledged aligner employing an intelligent combination of search in the bidirectional FM-index using our optimal search schemes and in-text verification using dynamic programming will outperform today's best aligners. The development of such an aligner, called FAMOUS (Fast Approximate string Matching using OptimUm search Schemes), is ongoing as future work.
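
    As a small illustration of what a search scheme is (not the paper's MIP model), the sketch below uses the common (pi, L, U) notation, where pi gives the order in which the pattern pieces are searched and L and U are cumulative lower and upper bounds on the number of errors after each step. The example scheme and the enumeration are illustrative assumptions; the check simply confirms that the scheme covers every distribution of up to two mismatches over three pieces.

        from itertools import product

        def covers(scheme, pieces, max_errors):
            """Check that every distribution of at most max_errors mismatches over
            the pattern pieces is admitted by at least one search in the scheme."""
            for errors in product(range(max_errors + 1), repeat=pieces):
                if sum(errors) > max_errors:
                    continue
                if not any(feasible(search, errors) for search in scheme):
                    return False
            return True

        def feasible(search, errors):
            """A search (pi, L, U) admits an error distribution if the cumulative
            error count stays within [L[i], U[i]] after each searched piece."""
            pi, L, U = search
            cum = 0
            for step, piece in enumerate(pi):
                cum += errors[piece]
                if not (L[step] <= cum <= U[step]):
                    return False
            return True

        # Example pigeonhole-style scheme: each search starts with an error-free piece.
        scheme = [
            ((0, 1, 2), (0, 0, 0), (0, 1, 2)),
            ((1, 2, 0), (0, 0, 0), (0, 1, 2)),
            ((2, 1, 0), (0, 0, 0), (0, 2, 2)),
        ]
        print(covers(scheme, pieces=3, max_errors=2))  # True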

    Mapping Event Locations in a Twitter-Based Event Detection System Using Graph Theory

    Event detection systems that take Twitter data as input have been studied extensively. A reported event can be categorized as an important event only if it contains at least the location of the event. A single tweet often contains more than one event-location entity, because an event can affect more than one location; this implies relations between event locations, so a method is needed to convert these textual relations into a graph. In addition, many event reports spell location entities inconsistently or contain typographical errors, which makes it difficult to map the event location because the entity is not found in the gazetteer. Therefore, event-location names are standardized with Fast Approximate String Matching (FASM) and the relations between locations are converted into a graph. This study shows that converting the relations between locations into a graph, after the location-name entities have been standardized, makes it much easier for the system to map and visualize event-location information in GoogleMaps.
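
    A minimal sketch of the two ideas combined here, under assumed inputs (the gazetteer, the tweet's location entities, and the similarity cutoff are invented, and difflib stands in for FASM rather than reproducing it): noisy location names are normalised against a gazetteer by approximate string matching, and locations mentioned together are stored as edges of a graph.

        import difflib
        from itertools import combinations
        from collections import defaultdict

        gazetteer = ["Jalan Ahmad Yani", "Jalan Raya Darmo", "Wonokromo"]

        def standardize(name, gazetteer, cutoff=0.8):
            """Map a possibly misspelled location name to its closest gazetteer entry."""
            match = difflib.get_close_matches(name, gazetteer, n=1, cutoff=cutoff)
            return match[0] if match else None

        # Location entities extracted from one tweet; the first one is misspelled.
        raw_locations = ["Jln Ahmad Yani", "Jalan Raya Darmo"]
        std = [standardize(name, gazetteer) for name in raw_locations]
        std = [s for s in std if s is not None]

        # Locations mentioned in the same tweet are related: add an edge per pair.
        graph = defaultdict(set)
        for a, b in combinations(std, 2):
            graph[a].add(b)
            graph[b].add(a)

        print(dict(graph))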

    Information Extraction Using a Combination of NeuroNER, Neural Relation Extraction, and FASM for Event Detection from Twitter Data Streams

    The use of Twitter for detecting natural-disaster and traffic incidents has been discussed in many existing studies. A great deal of information about important incidents, shared by Twitter users from personal accounts as well as accounts of government agencies and media outlets, is useful to the community. Using the Twitter API, users can retrieve Twitter posts for free based on the desired keywords, user IDs, and geo-locations. We propose a combination of NeuroNER, NeuralRE, and FASM as an incident-detection method that extracts information from the Twitter data stream. The proposed method consists of five main steps. The first step is data retrieval and preprocessing. The second step is location-entity recognition using the neuro named entity recognition (NeuroNER) method, since a valid incident report must contain a location entity. The third step is event-type classification into four categories (non-incident information, natural disaster, traffic, and fire) using a recurrent convolutional neural network (RCNN). The fourth step is relation extraction using NeuralRE to identify relationships between named entities. The final step is standardization of location names, geocoding, and visualization of the data on a digital map. The study evaluates the proposed combination of methods both partially and as a whole. The proposed method works well in extracting information from the data-streaming stage through to data visualization, with overall average precision, recall, and f-measure of 94.28%, 94.16%, and 94.22%, respectively.
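
    A schematic sketch of how the five steps could be wired together; every helper below is a toy stand-in for the real component named in its comment (NeuroNER, the RCNN classifier, NeuralRE, FASM plus geocoding), and the keyword table and gazetteer are invented, so this only illustrates the data flow, not the authors' implementation.

        from dataclasses import dataclass, field

        # Toy stand-ins for the trained components.
        KEYWORDS = {"fire": "fire", "flood": "natural disaster", "crash": "traffic"}
        GAZETTEER = {"wonokromo": ("Wonokromo", (-7.30, 112.74))}

        def recognize_locations(text):            # stands in for NeuroNER
            return [w for w in text.split() if w in GAZETTEER]

        def classify_event(text):                 # stands in for the RCNN classifier
            for keyword, label in KEYWORDS.items():
                if keyword in text:
                    return label
            return "non-incident"

        def extract_relations(locations):         # stands in for NeuralRE
            return [(a, b) for a in locations for b in locations if a < b]

        def standardize_and_geocode(loc):         # stands in for FASM + geocoding
            return GAZETTEER[loc]

        @dataclass
        class Incident:
            text: str
            category: str
            locations: list = field(default_factory=list)
            relations: list = field(default_factory=list)
            coordinates: list = field(default_factory=list)

        def process_stream(tweets):
            incidents = []
            for text in tweets:                        # step 1: retrieval and preprocessing
                text = text.strip().lower()
                locations = recognize_locations(text)  # step 2: location entity recognition
                if not locations:                      # a valid incident needs a location
                    continue
                category = classify_event(text)        # step 3: event type classification
                if category == "non-incident":
                    continue
                relations = extract_relations(locations)              # step 4
                geo = [standardize_and_geocode(l) for l in locations]  # step 5
                incidents.append(Incident(text, category, locations, relations, geo))
            return incidents

        print(process_stream(["Fire reported near Wonokromo tonight", "nice weather today"]))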

    An improved Levenshtein algorithm for spelling correction word candidate list generation

    Candidate list generation in spelling correction is the process of finding words in a lexicon that are close to an incorrectly spelled word. The most widely used algorithm for generating a candidate list is based on Levenshtein distance. However, this algorithm takes too much time when there is a large number of spelling errors, because computing the Levenshtein distance involves creating an array and filling its cells by comparing the characters of the incorrect word with the characters of a word from the lexicon. Since most lexicons contain millions of words, these operations are repeated millions of times for each incorrect word to generate its candidate list. This dissertation improves the Levenshtein algorithm by designing an operational technique that is incorporated into the algorithm. The proposed technique reduces the algorithm's processing time without affecting its accuracy: it reduces the operations required to compute the cell values in the first, second, and third rows and columns of the Levenshtein array. The improved Levenshtein algorithm was evaluated against the original algorithm. Experimental results show that the proposed algorithm outperforms the original in processing time by 36.45%, while the accuracy of both algorithms remains the same.
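
    For reference, a plain Python implementation of the standard Levenshtein distance that this work starts from, used here to filter a toy lexicon into a candidate list; the dissertation's operational technique (which avoids part of the work in the first three rows and columns of this array) is not reproduced.

        def levenshtein(a: str, b: str) -> int:
            m, n = len(a), len(b)
            # dp[i][j] = edit distance between a[:i] and b[:j]
            dp = [[0] * (n + 1) for _ in range(m + 1)]
            for i in range(m + 1):
                dp[i][0] = i                          # delete all of a[:i]
            for j in range(n + 1):
                dp[0][j] = j                          # insert all of b[:j]
            for i in range(1, m + 1):
                for j in range(1, n + 1):
                    cost = 0 if a[i - 1] == b[j - 1] else 1
                    dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                                   dp[i][j - 1] + 1,        # insertion
                                   dp[i - 1][j - 1] + cost) # substitution
            return dp[m][n]

        # Candidate list generation: keep lexicon words close to the misspelled word.
        lexicon = ["receive", "recipe", "believe", "orange"]
        print([w for w in lexicon if levenshtein("recieve", w) <= 2])
        # ['receive', 'recipe', 'believe']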

    Design and Development of a Web-Based Application for Real-Time Dissemination of Event Information Using the Laravel Framework

    The rapid development of information technology makes it easy for people to exchange information quickly through social media. One of the most frequently used social media platforms is Twitter; examples of influential accounts that use Twitter to share event information include BMKG, Dishub Surabaya, and Radio Suara Surabaya. Important events such as accidents, fires, traffic jams, and other incidents that can have a negative impact on the public need to be known by the general public as early as possible so that they can avoid the consequences. This system was built to help the public and the authorities disseminate information about important events online.

    Performance of the Jaro-Winkler Algorithm Depending on the Position of Typographical Errors

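    The behaviour this title refers to largely comes from the Winkler prefix bonus: the standard Jaro-Winkler similarity rewards a shared prefix, so the same typographical error costs more at the start of a word than near its end. A minimal textbook implementation (assumed parameters: prefix weight 0.1, prefix length capped at 4; not code from this work) illustrates the effect.

        def jaro(s1, s2):
            if s1 == s2:
                return 1.0
            len1, len2 = len(s1), len(s2)
            window = max(len1, len2) // 2 - 1
            match1, match2 = [False] * len1, [False] * len2
            matches = 0
            for i, c in enumerate(s1):
                lo, hi = max(0, i - window), min(len2, i + window + 1)
                for j in range(lo, hi):
                    if not match2[j] and s2[j] == c:
                        match1[i] = match2[j] = True
                        matches += 1
                        break
            if matches == 0:
                return 0.0
            # transpositions: matched characters that appear in a different order
            t, k = 0, 0
            for i in range(len1):
                if match1[i]:
                    while not match2[k]:
                        k += 1
                    if s1[i] != s2[k]:
                        t += 1
                    k += 1
            t //= 2
            return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

        def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
            j = jaro(s1, s2)
            prefix = 0
            for a, b in zip(s1, s2):
                if a != b or prefix == max_prefix:
                    break
                prefix += 1
            return j + prefix * p * (1 - j)

        # Same two swapped letters, different positions, different scores:
        print(round(jaro_winkler("marhta", "martha"), 3))  # 0.961 (typo near the end)
        print(round(jaro_winkler("amrtha", "martha"), 3))  # 0.944 (typo at the start)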

    An Ontology based Enhanced Framework for Instant Messages Filtering for Detection of Cyber Crimes

    Instant messaging is a very appealing and relatively new class of social interaction. Instant Messengers (IMs) and Social Networking Sites (SNS) may carry harmful messages that go untraced, obstructing network communication and undermining cyber security. Users' lack of awareness when using communication services such as instant messengers, email, websites, and social networks creates favourable conditions for cyber threat activity. Technical awareness needs to be built among users by providing a suspicious-message detection application that generates alerts, so that suspicious messages are not ignored. Very limited research has been published on detecting suspicious cyber threat activity in IMs. We propose a context-based, dynamic, and intelligent methodology for detecting suspicious activity in instant messages: it analyses and detects cyber threat activity with respect to a domain ontology (OBIE) and uses association rule mining to generate rules and alert potential victims. The results show an improvement over existing methods, with higher precision and recall. DOI: 10.17762/ijritcc2321-8169.15056
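
    A minimal sketch of the two ingredients the abstract combines, mapping message terms onto domain-ontology concepts and firing association rules over those concepts to raise an alert; the term-to-concept table, the rules, and the confidence threshold below are invented examples, not the OBIE ontology or the rules mined in the paper.

        ONTOLOGY = {                     # surface term -> ontology concept
            "bomb": "Weapon", "explosive": "Weapon",
            "airport": "PublicPlace", "station": "PublicPlace",
            "tonight": "Time", "tomorrow": "Time",
        }

        RULES = [                        # (antecedent concept set, rule confidence)
            ({"Weapon", "PublicPlace"}, 0.9),
            ({"Weapon", "Time"}, 0.7),
        ]

        def concepts(message):
            """Concepts from the ontology that appear in the message."""
            return {ONTOLOGY[w] for w in message.lower().split() if w in ONTOLOGY}

        def suspicious(message, threshold=0.8):
            """Fire an alert if a sufficiently confident rule's antecedent is present."""
            found = concepts(message)
            return any(ante <= found and conf >= threshold for ante, conf in RULES)

        print(suspicious("meet me at the airport with the explosive"))  # True
        print(suspicious("see you at the station tomorrow"))            # False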

    Compressed Bit-sliced Signature Files: An Index Structure for Large Lexicons

    We use the signature file method to search for partially specified terms in large lexicons. To optimize efficiency, we use the concepts of the partially evaluated bit-sliced signature file method and memory resident data structures. Our system employs signature partitioning, compression, and term blocking. We derive equations to obtain system design parameters, and measure indexing efficiency in terms of time and space. The resulting approach provides good response time and is storage-efficient. In the experiments we use four different lexicons, and show that the signature file approach outperforms the inverted file approach in certain efficiency aspects. KEYWORDS: Lexicon search, n-grams, signature files
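
    A minimal sketch of a bit-sliced n-gram signature file for lexicon lookup, under simplifying assumptions (one hash bit per n-gram, a 64-bit signature, no compression or partitioning, and an invented toy lexicon); a query ANDs only the bit slices selected by the pattern's n-grams and then verifies the surviving candidates.

        import hashlib

        F, N = 64, 3                          # signature width in bits, n-gram length

        def ngrams(s):
            return {s[i:i + N] for i in range(len(s) - N + 1)}

        def signature_bits(grams):
            """Bit positions set in a signature, one hashed bit per n-gram."""
            return {int(hashlib.md5(g.encode()).hexdigest(), 16) % F for g in grams}

        def build_slices(lexicon):
            """slices[b] is a bitmask over terms whose signature has bit b set."""
            slices = [0] * F
            for t, term in enumerate(lexicon):
                for b in signature_bits(ngrams(f"#{term}#")):  # pad to mark boundaries
                    slices[b] |= 1 << t
            return slices

        def query(pattern, lexicon, slices):
            """Find terms containing the (partially specified) pattern."""
            mask = (1 << len(lexicon)) - 1
            for b in signature_bits(ngrams(pattern)):  # interior n-grams of the pattern
                mask &= slices[b]                      # AND only the needed bit slices
            # verify candidates to discard hash-collision false positives
            return [term for t, term in enumerate(lexicon)
                    if mask >> t & 1 and pattern in term]

        lexicon = ["signature", "nature", "natural", "lexicon"]
        slices = build_slices(lexicon)
        print(query("lexi", lexicon, slices))  # ['lexicon']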