5,633 research outputs found

    Novelty Detection by Latent Semantic Indexing

    As a new topic in text mining, novelty detection is a natural extension of information retrieval systems, or search engines. By filtering old news out of raw search results and keeping only the novel messages, it spares readers from information overload. One difficulty in novelty detection is the inherent ambiguity of language, the carrier of information, and synonymy is a notable source of that ambiguity. To address it, previous studies mainly employed WordNet, a lexical database that can be regarded as a thesaurus. Rather than borrowing a dictionary, we propose a statistical approach that employs Latent Semantic Indexing (LSI) to learn semantic relationships automatically with the help of language resources. LSI involves matrix factorization, so an immediate problem is that the dataset in novelty detection is dynamic and constantly changing. To imitate a real-world scenario, texts are ordered chronologically and examined one by one. Each text is compared only with those that appeared earlier, while later ones remain unknown. As a result, the data matrix starts as a one-row vector representing the first report and gains a new row at the bottom every time a new document is read. Such a changing dataset makes it hard to apply matrix methods directly. Although LSI has long been acknowledged as an effective text mining method that accounts for semantic structure, it has never been used in novelty detection, nor have other statistical treatments. We tried to change this by introducing an external text source to build the latent semantic space, onto which the incoming news vectors were projected. We used the Reuters-21578 dataset and the TREC data as sources of latent semantic information. Topics were divided by year and type to account for the differences between them. Results showed that LSI, though very effective in traditional information retrieval tasks, yielded only a slight performance improvement for some data types. The extent of improvement depended on the similarity between the news data and the external information. An examination of the co-occurrence matrix attributed this limited performance to the unique features of microblogs: their short sentences and restricted vocabulary made it very hard to recover and exploit latent semantic information via traditional data structures.
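
    The projection step described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the toy vocabulary, the matrices, the function names, the use of cosine similarity, and the 0.8 threshold are all assumptions made for the example. A latent basis is learned once from an external corpus, and each incoming document is projected onto it and compared only with documents seen earlier.

```python
import numpy as np

def build_latent_basis(external, k):
    """Return a terms-by-k basis learned from an external docs-by-terms matrix."""
    _, _, vt = np.linalg.svd(np.asarray(external, dtype=float), full_matrices=False)
    return vt[:k].T  # top-k right singular vectors as columns

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def flag_novel(stream, basis, threshold=0.8):
    """Process documents chronologically; compare each only with earlier ones."""
    seen, flags = [], []
    for doc in stream:
        z = np.asarray(doc, dtype=float) @ basis  # project into the latent space
        max_sim = max((cosine(z, s) for s in seen), default=0.0)
        flags.append(max_sim < threshold)  # novel if unlike everything seen so far
        seen.append(z)
    return flags

# Hypothetical toy vocabulary: car, automobile, engine, flower
external = [[1, 0, 1, 0],   # "car engine"
            [0, 1, 1, 0],   # "automobile engine"
            [0, 0, 0, 2]]   # "flower flower"
basis = build_latent_basis(external, k=2)

# Incoming stream: "car", then "automobile", then "flower"
stream = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1]]
flags = flag_novel(stream, basis)
```

    In raw term space the "car" and "automobile" documents share no terms, so their cosine similarity is zero and both would count as novel. In the latent space built from the external corpus they land on the same direction, so the second is treated as old news.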

    Technology classification with latent semantic indexing

    Many national and international governments establish organizations to fund applied science research. To this end, several organizations have defined procedures for identifying relevant projects based on prioritized technologies. For applied science research projects that combine several technologies, however, it is difficult to identify all the corresponding technologies of all research-funding organizations. In this paper, we present an approach that supports both researchers and research-funding planners by classifying applied science research projects according to the corresponding technologies of research-funding organizations. In contrast to related work, we solve this problem by drawing on results from the literature on application-based technological relationships and by creating a new approach that uses latent semantic indexing (LSI) as a semantic text classification algorithm. Technologies that occur together in the process of creating an application are grouped into classes, semantic textual patterns are identified as representative of each class, and projects are assigned to one of these classes. This enables the assignment of each project to all technologies semantically grouped by LSI. The approach is evaluated using the example of defense and security technology research, because the growing importance of this application field leads to an increasing number of research projects and to the appearance of many new technologies.

    Analisis Latent Semantic Indexing Menggunakan QR Decomposition dengan Transformasi Householder Untuk Mencari Informasi

    Information retrieval has developed many methods aimed at producing better relevance. Achieving high relevance requires a method that produces good, well-tested rankings. This final project analyzes Latent Semantic Indexing using QR decomposition with the Householder transformation. Cosine similarity measures the similarity between documents and the query, and recall and precision serve as the accuracy measures, in order to show that latent semantic indexing can find desired or relevant documents even when they contain none of the query terms. Document search times are also compared. The test results show that latent semantic indexing using QR decomposition with the Householder transformation can find relevant documents even when they do not contain terms from the query, achieves good recall and precision, and retrieves relevant documents quickly. Keywords: Latent Semantic Indexing (LSI), QR Decomposition, Householder Transformation, Recall, Precision
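
    As a rough sketch of the technique this abstract names, the code below factors a term-by-document matrix with Householder reflections and scores documents against a query by cosine similarity in the reduced basis. The abstract does not give the exact procedure, so several choices here are assumptions: the function names, the toy matrix, and in particular the rank reduction by simply keeping the first k columns of Q and rows of R (without column pivoting).

```python
import numpy as np

def householder_qr(a):
    """QR factorization via Householder reflections; returns full Q (m x m) and R (m x n)."""
    r = np.array(a, dtype=float)
    m, n = r.shape
    q = np.eye(m)
    for j in range(min(m - 1, n)):
        x = r[j:, j]
        norm_x = np.linalg.norm(x)
        if norm_x == 0.0:
            continue  # column is already zero below the diagonal
        v = x.copy()
        v[0] += np.copysign(norm_x, x[0])  # sign choice avoids cancellation
        v /= np.linalg.norm(v)
        # Apply the reflector H = I - 2 v v^T to R from the left and to Q from the right.
        r[j:, :] -= 2.0 * np.outer(v, v @ r[j:, :])
        q[:, j:] -= 2.0 * np.outer(q[:, j:] @ v, v)
    return q, r

def rank_documents(term_doc, query, k):
    """Rank documents by cosine similarity to the query in a k-dimensional QR basis."""
    q_mat, r = householder_qr(term_doc)
    q_k, r_k = q_mat[:, :k], r[:k, :]          # assumed truncation, no pivoting
    proj = q_k.T @ np.asarray(query, dtype=float)
    denom = np.linalg.norm(r_k, axis=0) * np.linalg.norm(proj)
    scores = np.divide(r_k.T @ proj, denom, out=np.zeros_like(denom), where=denom > 0)
    return np.argsort(-scores), scores

# Hypothetical 4-term by 3-document matrix
term_doc = np.array([[1, 0, 0],
                     [1, 1, 0],
                     [0, 1, 1],
                     [0, 0, 1]], dtype=float)
q_mat, r_mat = householder_qr(term_doc)
query = term_doc[:, 0]                         # query identical to document 0
order, scores = rank_documents(term_doc, query, k=3)
```

    With k equal to the number of documents, the scores reduce to ordinary cosine similarity within the column space of the matrix. Smaller values of k give the reduced-rank comparison that LSI relies on, subject to the truncation assumption stated above.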

    Clustering and Latent Semantic Indexing Aspects of the Nonnegative Matrix Factorization

    This paper provides theoretical support for the clustering aspect of nonnegative matrix factorization (NMF). Using the Karush-Kuhn-Tucker optimality conditions, we show that the NMF objective is equivalent to a graph clustering objective, so the clustering aspect of NMF has a solid justification. Unlike previous approaches, which usually discard the nonnegativity constraints, our approach guarantees that the stationary point used in deriving the equivalence lies in the feasible region of the nonnegative orthant. Additionally, since the clustering capability of a matrix decomposition technique can sometimes imply a latent semantic indexing (LSI) aspect, we also evaluate the LSI aspect of NMF by showing its capability to solve the synonymy and polysemy problems in synthetic datasets. A more extensive evaluation compares the LSI performance of NMF with that of the singular value decomposition (SVD), the standard LSI method, on some standard datasets. Comment: 28 pages, 5 figures
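
    To make the clustering reading of NMF concrete, here is a small sketch. The abstract does not say which NMF algorithm the paper uses, so the multiplicative update rules below (a common choice for the Frobenius-norm objective) are a swapped-in assumption, as are the function name, the toy matrix, and the rule of assigning each document to its largest coefficient.

```python
import numpy as np

def nmf(a, k, iters=1000, seed=0, eps=1e-9):
    """Approximate a nonnegative matrix a (m x n) as w @ h with w, h >= 0."""
    a = np.asarray(a, dtype=float)
    rng = np.random.default_rng(seed)
    m, n = a.shape
    w = rng.random((m, k)) + eps
    h = rng.random((k, n)) + eps
    for _ in range(iters):
        # Multiplicative updates keep every entry nonnegative.
        h *= (w.T @ a) / (w.T @ w @ h + eps)
        w *= (a @ h.T) / (w @ h @ h.T + eps)
    return w, h

# Hypothetical term-by-document matrix with two topics:
# terms 0-1 occur only in documents 0-1, terms 2-3 only in documents 2-3.
term_doc = np.array([[2, 1, 0, 0],
                     [1, 2, 0, 0],
                     [0, 0, 1, 2],
                     [0, 0, 2, 1]], dtype=float)
w, h = nmf(term_doc, k=2)

# Assign each document to the factor with the largest coefficient.
labels = h.argmax(axis=0)
```

    Each column of h gives a document's nonnegative coordinates over the k factors, so taking the largest coordinate acts as a cluster assignment. Which factor ends up representing which topic depends on the random initialization.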

    Perangkingan Dokumen Berbahasa Arab Menggunakan Latent Semantic Indexing

    Various document ranking methods for information retrieval applications have been developed and implemented. One very popular method ranks documents using a vector space model based on TF.IDF term weighting. That method weights terms only by their frequency of occurrence in documents, without regard to the semantic relationships between terms. In practice, semantic relationships between terms play an important role in improving the relevance of document search results. This study extends the TF.IDF.ICF.IBF method by adding Latent Semantic Indexing to discover semantic relationships between terms when ranking Arabic-language documents. The dataset is drawn from the document collection in the Maktabah Syamilah software. The test results show that the proposed method achieves better evaluation scores than the TF.IDF.ICF.IBF method. The f-measure of the TF.IDF.ICF.IBF.LSI method at cosine similarity thresholds of 0.3, 0.4, and 0.5 is 45%, 51%, and 60%, respectively. However, the proposed method's average computation time is 2 minutes 8 seconds higher than that of the TF.IDF.ICF.IBF method.
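
    The evaluation at cosine similarity thresholds can be illustrated with a short sketch. The scores and relevance judgments below are invented for the example and are not the paper's data; the function name and the choice to retrieve documents whose score is at or above the threshold are also assumptions.

```python
def f_measure_at_threshold(scores, relevant, threshold):
    """Return (precision, recall, f_measure) for documents scoring >= threshold."""
    retrieved = {doc for doc, score in scores.items() if score >= threshold}
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical cosine similarities between one query and four documents
scores = {"d1": 0.9, "d2": 0.45, "d3": 0.35, "d4": 0.1}
relevant = {"d1", "d3"}  # hypothetical relevance judgments

results = {t: f_measure_at_threshold(scores, relevant, t) for t in (0.3, 0.4, 0.5)}
```

    Raising the threshold shrinks the retrieved set, which tends to trade recall for precision, so the f-measure can move in either direction depending on the collection.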

    Pembobotan Kata Berbasis Preferensi Dan Hubungan Semantik Pada Dokumen Fiqih Berbahasa Arab

    When searching for documents, users often want results that match their preferences. Obtaining such results requires a term weighting method based on those preferences, and the method needs to consider the semantic relationships between words to improve the relevance of the search results. This paper proposes a preference-based term weighting method that takes into account the semantic relationships between words in Arabic fiqh documents. Latent Semantic Indexing is an indexing method in information retrieval systems that considers the semantic relationships between words. The preference-based term weights form a matrix for the Latent Semantic Indexing computation, which produces vectors; similarity is then computed between the query vector and the document vectors. Preference-based term weighting that considers the semantic relationships between words can be used for preference-based ranking of Arabic fiqh documents, as shown by the maximum precision, recall, and F-measure values, which increase to 88.75%, 89.72%, and 87.91%. Keywords: Arabic, Latent Semantic Indexing, Term Weighting, Preference

    Analisis Kinerja Information Retrieval dengan Menggunakan Kombinasi Latent Semantic Indexing dan Relevance Feedback

    A good information retrieval system has a level of relevance acceptable to its users. One way to achieve high relevance is to apply a good, well-tested ranking method. The question then is how to determine the performance of a ranking method. That performance is determined by relevance, measured with the precision and recall parameters. In this final project, Latent Semantic Indexing is combined with relevance feedback; measuring its performance requires implementing it in software and then testing its parameters. Testing also requires another method as a baseline for comparison, so the Vector Space Model was chosen. The test results show that Latent Semantic Indexing has better precision and recall than the Vector Space Model. Relevance feedback applied to the Vector Space Model proved able to increase relevance, whereas an anomaly occurred with LSI, where relevance decreased instead. Keywords: Information Retrieval, LSI, SVD, relevance feedback, VSM