
    Evaluation of the effectiveness of information retrieval systems. Results of the Polish Task experiment carried out within the Conference and Labs of the Evaluation Forum (CLEF) 2012

    The article presents the design of the CLEF (Conference and Labs of the Evaluation Forum) evaluation labs, with special attention paid to the CHiC (Cultural Heritage in CLEF) campaign. We describe the design of the Polish Task in the CHiC lab and discuss the conclusions drawn from running it. We discuss the results achieved by the participants using different indexing and matching strategies. The effectiveness of the tf-idf, OKAPI, DFR, and data fusion methods was compared and analysed.
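The tf-idf weighting compared above can be sketched as follows; this is a minimal illustration on toy documents, not the CHiC collection or any participant's actual system.

```python
import math
from collections import Counter

def tf_idf_scores(query_terms, docs):
    """Score each document against a bag-of-words query with a basic tf-idf sum."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # document frequency: number of documents containing each term
    df = Counter(t for toks in tokenized for t in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        scores.append(sum(
            tf[t] * math.log(n / df[t])       # term frequency times inverse document frequency
            for t in query_terms if t in tf
        ))
    return scores

docs = [
    "cultural heritage collections in europe",
    "heritage retrieval evaluation at clef",
    "weather report for warsaw",
]
print(tf_idf_scores(["heritage", "clef"], docs))
```

The second document matches both query terms, so it receives the highest score; the last matches neither and scores zero.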

    DCU@FIRE-2012: rule-based stemmers for Bengali and Hindi

    For the participation of Dublin City University (DCU) in the FIRE-2012 Morpheme Extraction Task (MET), we investigated rule-based stemming approaches for Bengali and Hindi IR. The MET task itself is an attempt to obtain a fair and direct comparison between various stemming approaches, measured by comparing the retrieval effectiveness each obtains on the same dataset. Linguistic knowledge was used to manually craft the rules for removing the commonly occurring plural suffixes in Hindi and Bengali. Additionally, rules for removing classifiers and case markers in Bengali were also formulated. Our rule-based stemming approaches produced the best and the second-best retrieval effectiveness for the Hindi and Bengali datasets, respectively.
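A rule-based suffix stripper of the kind described can be sketched as a longest-match rule table; the romanized suffix lists below are illustrative placeholders, not DCU's actual rule set.

```python
# Illustrative longest-match suffix stripper. The suffix lists are romanized
# examples of plural/classifier/case endings, NOT the actual DCU rules.
BENGALI_SUFFIXES = ["guli", "gulo", "der", "ra", "ke", "te"]
HINDI_SUFFIXES = ["iyan", "iyon", "on", "en"]

def stem(word, suffixes, min_stem=3):
    """Strip the longest matching suffix, keeping at least `min_stem` characters."""
    for suf in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suf) and len(word) - len(suf) >= min_stem:
            return word[: -len(suf)]
    return word

print(stem("chhelera", BENGALI_SUFFIXES))  # -> "chhele"
print(stem("kitabon", HINDI_SUFFIXES))     # -> "kitab"
```

Trying the longest suffix first avoids, for example, stripping only "ra" from a word that ends in a longer listed suffix containing "ra".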

    DCU@FIRE2010: term conflation, blind relevance feedback, and cross-language IR with manual and automatic query translation

    For the first participation of Dublin City University (DCU) in the FIRE 2010 evaluation campaign, information retrieval (IR) experiments on English, Bengali, Hindi, and Marathi documents were performed to investigate term conflation (different stemming approaches and indexing word prefixes), blind relevance feedback, and manual and automatic query translation. The experiments are based on BM25 and on language modeling (LM) for IR. Results show that term conflation always improves mean average precision (MAP) compared to indexing unprocessed word forms, but different approaches seem to work best for different languages. For example, in monolingual Marathi experiments indexing 5-prefixes outperforms our corpus-based stemmer; in Hindi, the corpus-based stemmer achieves a higher MAP. For Bengali, the LM retrieval model achieves a much higher MAP than BM25 (0.4944 vs. 0.4526). In all experiments using BM25, blind relevance feedback yields considerably higher MAP in comparison to experiments without it. Bilingual IR experiments (English→Bengali and English→Hindi) are based on query translations obtained from native speakers and from the Google Translate web service. For the automatically translated queries, MAP is slightly (but not significantly) lower compared to experiments with manual query translations. The bilingual English→Bengali (English→Hindi) experiments achieve 81.7%-83.3% (78.0%-80.6%) of the best corresponding monolingual experiments.
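The BM25 model used in these experiments can be sketched as below; the toy documents and the default parameters (k1=1.2, b=0.75) are illustrative assumptions, not the FIRE collections or DCU's tuned settings.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Okapi BM25 score of each document for a bag-of-words query."""
    toks = [d.lower().split() for d in docs]
    n = len(toks)
    avgdl = sum(len(t) for t in toks) / n            # average document length
    df = Counter(t for doc in toks for t in set(doc))  # document frequencies
    scores = []
    for doc in toks:
        tf = Counter(doc)
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)
            # term-frequency saturation with length normalization
            norm = tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
            s += idf * tf[t] * (k1 + 1) / norm
        scores.append(s)
    return scores

docs = [
    "bengali hindi retrieval experiments",
    "marathi weather report",
    "bengali news and bengali press",
]
print(bm25_scores(["bengali"], docs))
```

The repeated term in the last document raises its score above the first, while length normalization keeps the gain sublinear in the raw term frequency.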

    Development of SearchBadger, A Framework for Evaluation of Search Results


    Evaluation of the effectiveness of information retrieval systems. From the Cranfield experiment to the TREC and CLEF labs. Genesis and methods

    We present the genesis and evolution of the methods and measures used in the evaluation of IR systems. The design of the Cranfield experiment, a long-standing model for evaluation methodology, is described, along with the criticisms raised against the organisation of the experiment itself. The evolution of the current methodology of IR systems evaluation, developed at the annual TREC (Text REtrieval Conference), is traced, and the most widely used current measures are described. The article also presents the design of the CLEF (Conference and Labs of the Evaluation Forum) evaluation labs, with special attention paid to the CHiC (Cultural Heritage in CLEF) panel and, for the Polish language, the Polish Task in CHiC; we describe the design of that task and discuss the conclusions drawn from running it.

    Sub-word indexing and blind relevance feedback for English, Bengali, Hindi, and Marathi IR

    The Forum for Information Retrieval Evaluation (FIRE) provides document collections, topics, and relevance assessments for information retrieval (IR) experiments on Indian languages. Several research questions are explored in this paper: 1. how to create a simple, language-independent corpus-based stemmer, 2. how to identify sub-words and which types of sub-words are suitable as indexing units, and 3. how to apply blind relevance feedback on sub-words and how feedback term selection is affected by the type of the indexing unit. More than 140 IR experiments are conducted using the BM25 retrieval model on the topic titles and descriptions (TD) for the FIRE 2008 English, Bengali, Hindi, and Marathi document collections. The major findings are: The corpus-based stemming approach is effective as a knowledge-light term conflation step and useful when few language-specific resources are available. For English, the corpus-based stemmer performs nearly as well as the Porter stemmer and, when combined with query expansion, significantly better than the baseline of indexing words. In combination with blind relevance feedback, it also performs significantly better than the baseline for Bengali and Marathi IR. Sub-words such as consonant-vowel sequences and word prefixes can yield similar or better performance in comparison to word indexing. There is no best-performing method for all languages: for English, indexing using the Porter stemmer performs best; for Bengali and Marathi, overlapping 3-grams obtain the best result; and for Hindi, 4-prefixes yield the highest MAP. However, in combination with blind relevance feedback using 10 documents and 20 terms, 6-prefixes for English and 4-prefixes for Bengali, Hindi, and Marathi IR yield the highest MAP. Sub-word identification is a general case of decompounding: it results in one or more index terms for a single word form and increases the number of index terms while decreasing their average length.
    The corresponding retrieval experiments show that relevance feedback on sub-words benefits from selecting a larger number of index terms in comparison with retrieval on word forms. Similarly, selecting the number of relevance feedback terms depending on the ratio of word vocabulary size to sub-word vocabulary size almost always slightly increases information retrieval effectiveness compared to using a fixed number of terms across languages.
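The two sub-word indexing units discussed (overlapping character n-grams and k-prefixes) can be sketched as follows; the sample word is an arbitrary romanized example, not taken from the FIRE collections.

```python
def char_ngrams(word, n=3):
    """Overlapping character n-grams of a word (e.g. the 3-gram indexing units)."""
    return [word[i:i + n] for i in range(len(word) - n + 1)] or [word]

def prefix(word, k=4):
    """k-prefix indexing unit: the first k characters (short words stay whole)."""
    return word[:k]

print(char_ngrams("kitaben"))  # ['kit', 'ita', 'tab', 'abe', 'ben']
print(prefix("kitaben"))       # 'kita'
```

As the abstract notes, one word form yields several n-gram index terms (more terms, each shorter), whereas a prefix yields exactly one shorter term.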

    Using text mining in the taxonomy of lawsuits

    Undergraduate final project (Trabalho de Conclusão de Curso), Universidade de Brasília, Faculdade UnB Gama, 2018. The paper presents an evaluation of supervised classification methods using lawsuits as input. The Brazilian judicial system receives millions of cases per year, which contain a variety of information of interest to researchers and other stakeholders, including the executive branch. Classifying these cases for analysis in specific research is a Herculean task, making it a great opportunity for the use of intelligent algorithms. The work used the kNN and naive Bayes algorithms to classify the judicial cases and evaluated the performance of both. Both algorithms achieved adequate results, so either can be used to classify lawsuits.
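One of the two classifiers evaluated, naive Bayes, can be sketched on toy bag-of-words data; the miniature "lawsuit" samples and class labels below are invented for illustration and bear no relation to the actual dataset.

```python
import math
from collections import Counter, defaultdict

def train_nb(samples):
    """Train a multinomial naive Bayes model over word counts."""
    class_docs = defaultdict(int)        # documents per class (for priors)
    class_words = defaultdict(Counter)   # word counts per class
    vocab = set()
    for text, label in samples:
        class_docs[label] += 1
        for w in text.lower().split():
            class_words[label][w] += 1
            vocab.add(w)
    return class_docs, class_words, vocab, len(samples)

def classify(text, model):
    """Pick the class with the highest log-posterior, using add-one smoothing."""
    class_docs, class_words, vocab, n = model
    best, best_lp = None, float("-inf")
    for label in class_docs:
        lp = math.log(class_docs[label] / n)  # log prior
        total = sum(class_words[label].values())
        for w in text.lower().split():
            lp += math.log((class_words[label][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

samples = [
    ("tax dispute over import duties", "tax"),
    ("income tax assessment appeal", "tax"),
    ("labor contract dismissal claim", "labor"),
    ("unpaid overtime labor claim", "labor"),
]
model = train_nb(samples)
print(classify("appeal against tax assessment", model))  # -> "tax"
```

Add-one smoothing keeps unseen words (here "against") from zeroing out a class's probability, which matters on short legal texts with large vocabularies.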