Integrating French Language Analyses into Information Retrieval (Intégration des Analyses du Français dans la Recherche d'Information)
This article describes approaches that we implemented as part of a research collaboration between our two groups. These approaches aim to create a more precise representation of documents and queries in an information retrieval system (IRS). They are based on extracting compound terms rather than the single terms used in traditional approaches. Two approaches are used: a syntactic-statistical analysis and the use of a manually built terminology base. We describe both approaches, along with the preliminary results obtained.
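Measuring Term Similarity Based on Co-Occurrence and Inverse Class Frequency in the Development of an Arabic Thesaurus (Pengukuran Kemiripan Term Berbasis Co-Occurrence dan Inverse Class Frequency Pada Pengembangan Thesaurus Bahasa Arab)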
A thesaurus is a useful tool for query expansion in document search. It is a dictionary built from the similarity between terms. One way to measure term similarity when building a thesaurus automatically is a statistical approach over the terms in the corpus documents, and several Arabic thesauri have been built this way. One such statistical approach is the co-occurrence technique, which considers how frequently terms appear together. Term similarity for thesaurus construction depends not only on how informative a term is with respect to a document, but also on how informative it is with respect to a cluster. The corpus documents are collected and preprocessed to obtain a list of terms. TF-IDF values are computed for these terms and used as features for clustering the documents. The clustered documents then serve as the basis for computing the Inverse Class Frequency (ICF). The TF-ICF value is used to compute the cluster weight in the co-occurrence technique, a computation that takes into account the joint occurrence of the two terms. The resulting cluster weight, which incorporates TF-ICF, serves as the term similarity value for building the thesaurus. In tests of the thesaurus built with the proposed method, the highest precision was 76.7%, the highest recall was 81.8%, and the f-measure was 54.1%.
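The TF-ICF weighting mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give its formulas, so this assumes the common definition ICF(t) = log(C / cf(t)), where C is the number of clusters and cf(t) is the number of clusters containing term t. The function name `tf_icf` and the toy data are hypothetical, and the co-occurrence cluster weight itself is not reproduced here.

```python
import math
from collections import Counter


def tf_icf(clusters):
    """Compute TF-ICF weights per cluster (assumed formula, not the paper's).

    clusters: dict mapping a cluster label to a list of tokenized documents.
    Returns a dict mapping each label to {term: tf * icf}.
    """
    n_clusters = len(clusters)
    # Term frequency: occurrences of each term over all documents of a cluster.
    tf = {
        label: Counter(term for doc in docs for term in doc)
        for label, docs in clusters.items()
    }
    # Class frequency: number of clusters in which a term occurs at least once.
    cf = Counter(term for counts in tf.values() for term in counts)
    return {
        label: {
            term: freq * math.log(n_clusters / cf[term])
            for term, freq in counts.items()
        }
        for label, counts in tf.items()
    }


# Toy example with transliterated terms (illustrative data only).
toy_clusters = {
    "a": [["kitab", "qalam"], ["kitab"]],
    "b": [["qalam", "bayt"]],
}
weights = tf_icf(toy_clusters)
```

In this toy data, a term that occurs in every cluster (`qalam`) gets weight zero, while a term confined to one cluster (`kitab`) keeps a positive weight proportional to its frequency there.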
Knowledge Based Information Retrieval: A Semiotic Approach
The overall objective of this study is to analyze the document retrieval process and the main information retrieval (IR) concepts from the point of view of semiotics, and to design retrieval mechanisms based on the findings of the semiotic analysis of the retrieval situation. Semiotics is a discipline which studies 'sign systems' and how signs are exchanged in communication. The semiotic view of IR interaction presented in this dissertation views document retrieval as a kind of human communication process taking place in a social and cultural realm.
The most important result of the semiotic model developed is the explication of the distinction between the knowledge production and transfer functions of document retrieval. The consequence of this finding is the conceptualization of the retrieval process as a dynamic and complex interplay between knowledge production and transfer tasks. It is hypothesised that, in the case of knowledge production, users of retrieval systems are interested in exploring new areas of the document collection which are not a priori known.
Two knowledge based systems are developed based on the Okapi probabilistic retrieval system. The purpose of the retrieval systems designed is, in general terms, to suggest new search areas of potential interest to users. This is achieved by treating the Inspec thesaurus as a semantic network, and applying a heuristic spreading activation technique to generate clusters of terms linked in the Inspec thesaurus. Each cluster or batch of terms is conceived as representing a part of the general search area defined by the initial user search terms. The main design objective here is to enable the user to identify new search areas from the term information contained in the batches.
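Spreading activation over a thesaurus treated as a semantic network can be sketched roughly as follows. This is a generic illustration and not the dissertation's heuristic: the decay factor, threshold, hop limit, the function name `spread_activation`, and the toy term links are all assumptions made for the example.

```python
from collections import defaultdict


def spread_activation(links, seeds, decay=0.5, threshold=0.2, max_hops=2):
    """Propagate activation from seed terms through thesaurus links.

    links: dict mapping a term to the terms it is linked to.
    seeds: initial search terms, each starting with activation 1.0.
    Returns a dict mapping every activated term to its activation level.
    """
    activation = {term: 1.0 for term in seeds}
    frontier = dict(activation)
    for _ in range(max_hops):
        # Sum the decayed activation each neighbour receives in this hop.
        incoming = defaultdict(float)
        for term, level in frontier.items():
            for neighbour in links.get(term, ()):
                incoming[neighbour] += level * decay
        # Keep only terms that pass the threshold and improve on prior levels.
        frontier = {}
        for term, level in incoming.items():
            if level >= threshold and level > activation.get(term, 0.0):
                activation[term] = level
                frontier[term] = level
        if not frontier:
            break
    return activation


# Toy network (hypothetical links, not taken from the Inspec thesaurus).
toy_links = {
    "information retrieval": ["indexing", "query languages"],
    "indexing": ["thesauri"],
    "thesauri": ["vocabulary"],
}
batch = spread_activation(toy_links, ["information retrieval"])
```

With these settings the seed activates its direct neighbours at 0.5 and a second-hop term at 0.25, and terms further away are not reached. The activated set plays the role of one batch of related terms around the initial query.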
Two evaluation experiments were carried out with real users who had real information needs to test whether the batches were actually effective in defining search areas related to the original user queries and whether they were useful in pointing new areas which were potentially relevant to the users. A number of hypotheses related to the retrieval effectiveness of the knowledge based systems designed were also tested in the experiments. The main findings of the experiments indicate that:
• the batches were useful in representing search domains relevant to the users' queries
• in many cases the batches represented new ideas or new search domains to the users
• the knowledge based systems had similar retrieval effectiveness in terms of precision as the Okapi syste