14 research outputs found

    Automatic Classification of Text Databases through Query Probing

    Many text databases on the web are "hidden" behind search interfaces, and their documents are accessible only through querying. Search engines typically ignore the contents of such search-only databases. Recently, Yahoo-like directories have started to manually organize these databases into categories that users can browse to find these valuable resources. We propose a novel strategy to automate the classification of search-only text databases. Our technique starts by training a rule-based document classifier, and then uses the classifier's rules to generate probing queries. The queries are sent to the text databases, which are then classified based on the number of matches that they produce for each query. We report some initial exploratory experiments showing that our approach holds promise for automatically characterizing the contents of text databases accessible on the web. Comment: 7 pages, 1 figure
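
The probing idea in the abstract above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `PROBES` table stands in for queries derived from a trained rule-based classifier, `make_toy_database` simulates a search-only interface that reports only a match count, and picking the category with the largest total is one simple decision rule, not necessarily the one the authors use.

```python
from collections import Counter
from typing import Callable, Dict, List

# Hypothetical probe queries per category; in the paper these would be
# generated from the rules of a trained rule-based classifier.
PROBES: Dict[str, List[str]] = {
    "health": ["cancer treatment", "diabetes insulin"],
    "computers": ["linux kernel", "java compiler"],
}


def classify_database(count_matches: Callable[[str], int],
                      probes: Dict[str, List[str]] = PROBES) -> str:
    """Send every probe to the database and return the category whose
    probes produced the largest total number of matches."""
    totals: Counter = Counter()
    for category, queries in probes.items():
        for query in queries:
            totals[category] += count_matches(query)
    return max(totals, key=totals.get)


def make_toy_database(documents: List[str]) -> Callable[[str], int]:
    """Simulate a search-only interface: it exposes only the number of
    documents that contain all query terms (as substrings)."""
    def count_matches(query: str) -> int:
        terms = query.lower().split()
        return sum(all(t in doc.lower() for t in terms) for doc in documents)
    return count_matches


if __name__ == "__main__":
    db = make_toy_database([
        "New cancer treatment trial results",
        "Insulin dosing for diabetes patients",
        "Linux kernel release notes",
    ])
    print(classify_database(db))
```

The classifier never sees the documents themselves, only the match counts, which is the constraint the abstract describes for "hidden" databases.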

    Approaches to collection selection and results merging for distributed information retrieval

    Méthodes pour la sélection de collections dans un environnement distribué (Methods for collection selection in a distributed environment)

    http://www.emse.fr/~mbeig/PUBLIS/2002-cide-p227-abbaci.pdf International audience. In this article we explore three approaches to collection selection in a distributed information retrieval environment. Search is carried out through a broker which, for a given query, selects the collections to query and merges the results they return. Our first selection approach ranks the collections by their relevance to the query; the top n collections are then queried. The second approach selects the collections whose score exceeds a certain threshold. Finally, the third approach defines the number of documents to retrieve from each collection. The originality of our method is that it uses data gathered at query time and does not rely on metadata stored in advance at the broker, as is the case for most methods known in the literature. To evaluate our approaches and compare them with other techniques, notably the centralized (single-index) approach and CORI [CALL95] [XU98], we conducted experiments on the WT10g test collection, and the gains are appreciable.
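
The three selection strategies named in the abstract above can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how collection scores are computed from query-time data, so the scores are taken as given, and the proportional split in `allocate_documents` is a hypothetical choice rather than the authors' formula. All function names are invented for this sketch.

```python
from typing import Dict, List


def select_top_n(scores: Dict[str, float], n: int) -> List[str]:
    """Approach 1: rank collections by score and query the top n."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n]


def select_above_threshold(scores: Dict[str, float],
                           threshold: float) -> List[str]:
    """Approach 2: query every collection whose score exceeds a threshold."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [c for c in ranked if scores[c] > threshold]


def allocate_documents(scores: Dict[str, float],
                       total_docs: int) -> Dict[str, int]:
    """Approach 3: decide how many documents to request per collection.
    Assumption: the budget is split in proportion to the scores and
    rounded down, so a few documents may remain unallocated."""
    total = sum(scores.values())
    if total <= 0:
        return {c: 0 for c in scores}
    return {c: int(total_docs * s / total) for c, s in scores.items()}


if __name__ == "__main__":
    scores = {"A": 5.0, "B": 3.0, "C": 2.0}  # hypothetical broker scores
    print(select_top_n(scores, 2))
    print(select_above_threshold(scores, 2.5))
    print(allocate_documents(scores, 10))
```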

    Merging Multiple Search Results Approach for Meta-Search Engines

    Meta-search engines are search tools developed to enhance search performance by submitting user queries to multiple search engines and combining the search results into a unified ranked list. They use a data fusion technique, which requires three major steps: database selection, results combination, and results merging. This study attempts to build a framework that can be used to merge the search results retrieved from any set of search engines. The framework is based on answering three major questions: (1) how meta-search developers can define the optimal rank order for the selected engines; (2) how meta-search developers can choose the best combination of search engines; and (3) what the optimal heuristic merging function is for aggregating the rank order of documents retrieved from incomparable search engines. The main data collection process depends on running 40 general queries on three major search engines (Google, AltaVista, and Alltheweb). Real users were involved in the relevance judgment process, using a five-point relevancy scale. The performance of the three search engines, their different combinations, and different merging algorithms were compared in order to rank the databases, choose the best combination, and define the optimal merging function. The major findings of this study are: (1) ranking the databases in the merging process should depend on their overall performance, not their popularity or size; (2) larger databases tend to perform better than smaller databases; (3) the combination of search engines should depend on ranking the databases and choosing the appropriate combination function; (4) search engines tend to retrieve more overlapping relevant documents than overlapping irrelevant documents; and (5) merging functions that take the overlapped documents into account tend to perform better than the interleave and rank similarity functions. In addition to these findings, the study developed a set of requirements for the merging process to be successful. This procedure includes database selection, combination, and merging based on heuristic solutions
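
To make the contrast between interleaving and overlap-aware merging concrete, here is a minimal sketch. The abstract does not define the study's merging functions, so both functions below are stand-ins: `interleave` is plain round-robin merging with duplicates dropped, and `overlap_aware_merge` uses a summed reciprocal-rank score as one assumed way of rewarding documents returned by several engines. The function names and the sample result lists are hypothetical.

```python
from collections import defaultdict
from itertools import zip_longest
from typing import Dict, List


def interleave(result_lists: List[List[str]]) -> List[str]:
    """Round-robin merge: take rank 1 from every engine, then rank 2,
    and so on, skipping documents that were already emitted."""
    merged: List[str] = []
    seen = set()
    for tier in zip_longest(*result_lists):
        for doc in tier:
            if doc is not None and doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


def overlap_aware_merge(result_lists: List[List[str]]) -> List[str]:
    """Score each document by the sum of 1/rank over all engines that
    returned it, so overlapping documents rise in the merged list.
    Ties are broken alphabetically to keep the output deterministic."""
    scores: Dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, doc in enumerate(results, start=1):
            scores[doc] += 1.0 / rank
    return sorted(scores, key=lambda d: (-scores[d], d))


if __name__ == "__main__":
    engines = [["a", "b", "c"], ["c", "a", "d"], ["c", "e"]]
    print(interleave(engines))
    print(overlap_aware_merge(engines))
```

In this toy input, document `c` is returned by all three engines, so the overlap-aware merge places it first, while the interleave keeps the first engine's top result at the head of the list.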