132 research outputs found

    Online Forum Thread Retrieval using Pseudo Cluster Selection and Voting Techniques

    Online forums facilitate knowledge seeking and sharing on the Web. However, the shared knowledge is not fully utilized due to information overload. Thread retrieval is one method to overcome information overload. In this paper, we propose a model that combines two existing approaches: Pseudo Cluster Selection and Voting Techniques. In both, a retrieval system first scores a list of messages and then ranks threads by aggregating their scored messages. They differ in what they aggregate and how. Pseudo cluster selection focuses on the input, while voting techniques focus on the aggregation method. Our combined models address both the input and the aggregation method. The results show that some combined models are statistically superior to baseline methods. Comment: The original publication is available at http://www.springerlink.com/. arXiv admin note: substantial text overlap with arXiv:1212.533
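
    The two-stage idea in this abstract (score messages, then aggregate per thread) can be sketched as follows. This is a minimal illustration, not the authors' model: the function name `rank_threads`, the `top_k` cutoff standing in for the input selection step, and the use of plain `sum` or `max` as the aggregation method are all assumptions made for the example.

```python
from collections import defaultdict


def rank_threads(scored_messages, top_k=2, aggregate=sum):
    """Rank threads from (thread_id, message_score) pairs.

    top_k     -- hypothetical input-selection step: keep only the k
                 best-scoring messages of each thread.
    aggregate -- hypothetical aggregation step: any function that maps
                 a list of scores to one number (sum, max, ...).
    """
    by_thread = defaultdict(list)
    for thread_id, score in scored_messages:
        by_thread[thread_id].append(score)

    thread_scores = {}
    for thread_id, scores in by_thread.items():
        selected = sorted(scores, reverse=True)[:top_k]
        thread_scores[thread_id] = aggregate(selected)

    # Highest aggregated score first; ties broken by thread id.
    return sorted(thread_scores, key=lambda t: (-thread_scores[t], t))


messages = [
    ("t1", 0.9), ("t1", 0.1),
    ("t2", 0.6), ("t2", 0.5), ("t2", 0.4),
    ("t3", 0.7),
]
by_sum = rank_threads(messages, top_k=2, aggregate=sum)
by_max = rank_threads(messages, top_k=2, aggregate=max)
```

    Changing `top_k` alters what is aggregated, while changing `aggregate` alters how, which is the distinction the abstract draws between the two approaches.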

    United we fall, divided we stand: A study of query segmentation and PRF for patent prior art search

    Previous research in patent search has shown that reducing queries by extracting a few key terms is ineffective, primarily because of the vocabulary mismatch between patent applications used as queries and existing patent documents. This finding has led to the use of full patent applications as queries in patent prior art search. In addition, standard information retrieval (IR) techniques such as query expansion (QE) do not work effectively with patent queries, principally because of the presence of noise terms in the massive queries. In this study, we take a new approach to QE for patent search. Text segmentation is used to decompose a patent query into self-coherent sub-topic blocks. Each of these much shorter sub-topic blocks, which represents a specific aspect or facet of the invention, is then used as a query to retrieve documents. Documents retrieved using the different resulting sub-queries or query streams are interleaved to construct a final ranked list. This technique can exploit the potential benefit of QE since the segmented queries are generally more focused and less ambiguous than the full patent query. Experiments on the CLEF-2010 IP prior-art search task show that the proposed method outperforms the retrieval effectiveness achieved when using a single full patent application text as the query, and also demonstrate the potential benefits of QE to alleviate the vocabulary mismatch problem in patent search
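
    The interleaving step mentioned in this abstract can be illustrated with a short sketch. The abstract does not say which interleaving scheme is used, so the round-robin order and duplicate handling below are assumptions, and `interleave` is a hypothetical helper name.

```python
def interleave(ranked_lists, limit=None):
    """Merge per-sub-query ranked lists round-robin (assumed scheme).

    Takes the rank-1 document of every list, then the rank-2 document
    of every list, and so on, skipping documents already emitted.
    """
    seen = set()
    merged = []
    depth = max((len(results) for results in ranked_lists), default=0)
    for rank in range(depth):
        for results in ranked_lists:
            if rank < len(results) and results[rank] not in seen:
                seen.add(results[rank])
                merged.append(results[rank])
    return merged if limit is None else merged[:limit]


# One ranked list per sub-topic block of the segmented query.
sub_query_results = [
    ["d1", "d2", "d3"],
    ["d2", "d4"],
    ["d5"],
]
final_ranking = interleave(sub_query_results)
```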

    Ten years of MIREX: reflections, challenges and opportunities

    The Music Information Retrieval Evaluation eXchange (MIREX) has been run annually since 2005, with the October 2014 plenary marking its tenth iteration. By 2013, MIREX had evaluated approximately 2000 individual music information retrieval (MIR) algorithms for a wide range of tasks over 37 different test collections. MIREX has involved researchers from over 29 different countries with a median of 109 individual participants per year. This paper summarizes the history of MIREX from its earliest planning meeting in 2001 to the present. It reflects upon the administrative, financial, and technological challenges MIREX has faced and describes how those challenges have been surmounted. We propose new funding models, a distributed evaluation framework, and more holistic user experience evaluation tasks (some evolutionary, some revolutionary) for the continued success of MIREX. We hope that this paper will inspire MIR community members to contribute their ideas so MIREX can have many more successful years to come

    Human assessments of document similarity

    Two studies are reported that examined the reliability of human assessments of document similarity and the association between human ratings and the results of n-gram automatic text analysis (ATA). Human interassessor reliability (IAR) was moderate to poor. However, correlations between average human ratings and n-gram solutions were strong. The average correlation between ATA and individual human solutions was greater than IAR. N-gram length influenced the strength of association, but optimum string length depended on the nature of the text (technical vs. nontechnical). We conclude that the methodology applied in previous studies may have led to overoptimistic views on human reliability, but that an optimal n-gram solution can provide a good approximation of the average human assessment of document similarity, a result that has important implications for future development of document visualization systems
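
    As a rough illustration of n-gram document similarity, the sketch below compares two texts by the overlap of their character n-grams. The abstract does not specify the measure used in the studies; the Jaccard coefficient, the whitespace and case normalization, and the function names here are assumptions chosen for the example.

```python
def char_ngrams(text, n):
    """Return the set of character n-grams of a normalized text."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def ngram_similarity(text_a, text_b, n=3):
    """Jaccard overlap of character n-gram sets (assumed measure)."""
    grams_a = char_ngrams(text_a, n)
    grams_b = char_ngrams(text_b, n)
    if not grams_a and not grams_b:
        return 1.0  # two texts too short to yield any n-gram
    return len(grams_a & grams_b) / len(grams_a | grams_b)


same = ngram_similarity("abcd", "abcd")
disjoint = ngram_similarity("abcd", "wxyz")
partial = ngram_similarity("abcde", "abcdf", n=3)
```

    Varying `n` changes how strict the match is, which relates to the abstract's point that the best n-gram length depends on the nature of the text.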