16 research outputs found

    Efficiency and effectiveness of query processing in cluster-based retrieval

    Our research shows that for large databases, without considerable additional storage overhead, cluster-based retrieval (CBR) can compete with the time efficiency and effectiveness of the inverted index-based full search (FS). The proposed CBR method employs a storage structure that blends the cluster membership information into the inverted file posting lists. This approach significantly reduces the cost of similarity calculations for document ranking during query processing and improves efficiency. For example, in terms of in-memory computations, our new approach can reduce query processing time to 39% of that of FS. The experiments confirm that the approach is scalable and that system performance improves with increasing database size. In the experiments, we use the cover coefficient-based clustering methodology (C3M) and the Financial Times database of TREC, which contains 210,158 documents (564 MB) defined by 229,748 terms, with a total of 29,545,234 inverted index elements. This study provides CBR efficiency and effectiveness experiments using the largest corpus in an environment that employs no user interaction or user behavior assumption for clustering. © 2003 Elsevier Ltd. All rights reserved.
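
    The idea of blending cluster membership into posting lists can be illustrated with a minimal sketch. It assumes that each term's postings are grouped by cluster so that groups belonging to non-selected clusters can be stepped over as a unit. The function names (`build_index`, `search`), the toy documents, and the raw term-frequency scoring are illustrative assumptions, not the paper's implementation.

```python
from collections import defaultdict


def build_index(docs, cluster_of):
    """Build term -> list of (cluster_id, [(doc_id, tf), ...]) groups.

    docs: dict doc_id -> list of tokens
    cluster_of: dict doc_id -> cluster_id
    Postings for each term are grouped by cluster, so a query processor
    can step over an entire group when its cluster was not selected.
    """
    grouped = defaultdict(lambda: defaultdict(dict))
    for doc_id, tokens in docs.items():
        for tok in tokens:
            bucket = grouped[tok][cluster_of[doc_id]]
            bucket[doc_id] = bucket.get(doc_id, 0) + 1
    index = {}
    for term, clusters in grouped.items():
        index[term] = [
            (cid, sorted(clusters[cid].items())) for cid in sorted(clusters)
        ]
    return index


def search(index, query_terms, best_clusters):
    """Score only documents in best_clusters; return (ranking, postings_touched)."""
    scores = defaultdict(int)
    touched = 0
    for term in query_terms:
        for cid, postings in index.get(term, []):
            if cid not in best_clusters:
                continue  # skip the whole group without reading its postings
            for doc_id, tf in postings:
                scores[doc_id] += tf  # toy similarity: sum of term frequencies
                touched += 1
    ranking = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranking, touched


# Hypothetical toy collection: four documents in two clusters.
docs = {
    1: ["index", "cluster", "index"],
    2: ["cluster", "query"],
    3: ["index", "query"],
    4: ["index"],
}
cluster_of = {1: "A", 2: "A", 3: "B", 4: "B"}
index = build_index(docs, cluster_of)
ranking, touched = search(index, ["index", "query"], best_clusters={"A"})
print(ranking, touched)  # → [(1, 2), (2, 1)] 2
```

    In this toy run, restricting the search to cluster A touches 2 postings instead of the 5 a full search would read, which is the kind of saving in similarity calculations the abstract describes.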

    Large-scale cluster-based retrieval experiments on Turkish texts

    We present cluster-based retrieval (CBR) experiments on the largest available Turkish document collection. Our experiments evaluate retrieval effectiveness and efficiency on both an automatically generated clustering structure and a manual classification of documents. In particular, we compare CBR effectiveness with full-text search (FS) and evaluate several implementation alternatives for CBR. Our findings reveal that CBR yields effectiveness figures comparable to those of FS. Furthermore, by using a specifically tailored cluster-skipping inverted index, we significantly improve the in-memory query processing efficiency of CBR in comparison to other traditional CBR techniques and even FS.

    Site-based dynamic pruning for query processing in search engines

    Web search engines typically index and retrieve at the page level. In this study, we investigate a dynamic pruning strategy that allows the query processor to first determine the most promising websites and then perform the similarity computations only for the pages within those sites.

    Efficient processing of category-restricted queries for web directories

    We show that a cluster-skipping inverted index (CS-IIS) is a practical and efficient file structure to support category-restricted queries for searching Web directories. The query processing strategy with CS-IIS improves CPU time efficiency without imposing any limitations on the directory size. © 2008 Springer-Verlag Berlin Heidelberg

    Noisy Text Clustering

    This work presents document clustering experiments performed over noisy texts (i.e. texts that have been extracted through an automatic process like speech or character recognition). The effect of recognition errors on different clustering techniques is measured by comparing the results obtained with clean (manually typed texts) and noisy (automatic speech transcripts affected by a 30% Word Error Rate) versions of the TDT2 corpus (approximately 600 hours of spoken data from broadcast news). The results suggest that clustering can be performed over noisy data with an acceptable performance degradation.
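
    For readers unfamiliar with the metric, Word Error Rate is conventionally the word-level edit distance between a reference transcript and a recognizer's output, divided by the number of reference words. The sketch below implements that standard definition. The function name and the two sentences are invented for illustration and are not taken from the TDT2 corpus.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


# Invented example: one substitution, one deletion, one insertion.
clean = "the news at ten reported heavy rain in the north"
noisy = "the news at tent reported heavy rain the north today"
wer = word_error_rate(clean, noisy)
print(wer)  # → 0.3
```

    Three word-level errors against a ten-word reference give a WER of 30%, the same error level the abstract reports for its noisy transcripts.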

    Exploiting index pruning methods for clustering XML collections

    In this paper, we first employ the well-known Cover-Coefficient Based Clustering Methodology (C3M) for clustering XML documents. Next, we apply index pruning techniques from the literature to reduce the size of the document vectors. Our experiments show that for certain cases, it is possible to prune up to 70% of the collection (or, more specifically, the underlying document vectors) and still generate a clustering structure that yields the same quality as that of the original collection, in terms of a set of evaluation metrics. © 2010 Springer-Verlag Berlin Heidelberg
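
    The abstract does not name the pruning techniques it uses, so the following is only a sketch of one simple variant: dropping the lowest-weighted terms of each document vector before clustering. The function name `prune_vector`, the `prune_percent` parameter, and the term weights are hypothetical.

```python
def prune_vector(vector, prune_percent):
    """Drop the lowest-weighted prune_percent of terms from one document vector.

    vector: dict term -> weight
    prune_percent: integer in [0, 100), share of terms to remove
    """
    if not 0 <= prune_percent < 100:
        raise ValueError("prune_percent must be in [0, 100)")
    drop = len(vector) * prune_percent // 100
    keep = len(vector) - drop
    # Sort by descending weight; break ties alphabetically for determinism.
    ranked = sorted(vector.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:keep])


# Hypothetical weighted vector for one XML document.
doc = {
    "xml": 0.9, "cluster": 0.8, "index": 0.7, "the": 0.1, "of": 0.1,
    "and": 0.1, "a": 0.05, "in": 0.05, "to": 0.05, "is": 0.02,
}
pruned = prune_vector(doc, prune_percent=70)
print(pruned)  # → {'xml': 0.9, 'cluster': 0.8, 'index': 0.7}
```

    The pruned vectors would then be passed to the clustering step in place of the full ones. Whether clustering quality survives depends on the weighting scheme and the collection, which is what the paper's experiments evaluate.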

    Analysis and Implementation of the Overlapping Cover Coefficient-Based Clustering Method (OC3M) Algorithm on Indonesian-Language Text Documents

    With the rapid growth of information and documents circulating on the internet, a single document may belong to two or more categories at once. A method is therefore needed to group such documents into two or more categories simultaneously. The Overlapping Cover Coefficient Clustering Method (OC3M) is a document clustering method that uses a probabilistic model, term similarity, and seed documents as the initialization for cluster formation. The method applies the overlap property, that is, a document may occupy more than one cluster. The tests carried out in this final project on clustering documents with the OC3M algorithm analyze the resulting clusters based on their Silhouette Coefficient values and examine the factors that affect the quality of the clusters formed. Cluster quality is affected by the number of documents used, the document type, the similarity of documents to the cluster center, and the overlapping coefficient, a parameter that determines the extent to which similar documents can be grouped into different clusters. In the experiments, the clusters formed with the OC3M algorithm are of fairly good quality, as indicated by positive silhouette coefficient values.
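
    As background for the evaluation measure, the standard silhouette coefficient of a point is s = (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster. The sketch below implements that standard definition for hard (non-overlapping) assignments with Euclidean distance. How the thesis adapts the measure to overlapping clusters is not stated in the abstract, and the points and labels are invented.

```python
import math


def silhouette(points, labels):
    """Mean silhouette coefficient for hard cluster assignments.

    For each point: a = mean distance to the other points in its own
    cluster, b = smallest mean distance to the points of any other cluster,
    s = (b - a) / max(a, b). Points in singleton clusters contribute 0.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    def mean_dist(i, members):
        return sum(math.dist(points[i], points[j]) for j in members) / len(members)

    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            continue  # singleton cluster contributes 0
        a = mean_dist(i, own)
        b = min(mean_dist(i, m) for other, m in clusters.items() if other != lab)
        denom = max(a, b)
        if denom > 0:  # guard against all-identical points
            total += (b - a) / denom
    return total / len(points)


# Invented 2-D points forming two well-separated groups.
points = [(0, 0), (0, 1), (5, 5), (5, 6)]
good = silhouette(points, [0, 0, 1, 1])  # groups match the clusters
bad = silhouette(points, [0, 1, 0, 1])   # groups split across clusters
print(good > 0, bad < 0)  # → True True
```

    A positive mean value, as reported in the abstract, indicates that documents are on average closer to their own cluster than to the nearest other cluster.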

    Algorithms for within-cluster searches using inverted files

    Information retrieval over clustered document collections has two successive stages: first identifying the best-clusters and then the best-documents in these clusters that are most similar to the user query. In this paper, we assume that an inverted file over the entire document collection is used for the latter stage. We propose and evaluate algorithms for within-cluster searches, i.e., to integrate the best-clusters with the best-documents to obtain the final output including the highest ranked documents only from the best-clusters. Our experiments on a TREC collection including 210,158 documents with several query sets show that an appropriately selected integration algorithm based on the query length and system resources can significantly improve the query evaluation efficiency. © Springer-Verlag Berlin Heidelberg 2006

    Effect of Recognition Errors on Text Clustering

    This paper presents clustering experiments performed over noisy texts (i.e. texts that have been extracted through an automatic process like character or speech recognition). The effect of recognition errors is investigated by comparing clustering results performed over both clean (manually typed data) and noisy (automatic speech transcriptions) versions of the same speech recording corpus