36 research outputs found

    User-Centered Comparison of Web Search Tools

    This study explores a user-centered approach to the comparative evaluation of the Web search tool ProThes against the popular all-purpose search engines Yandex and Google. An original research design was developed. Data were collected from 12 volunteers who performed 48 search tasks in total. Main outcomes include: (1) the search strategy supported by ProThes can be quite effective for focused Web search and (2) ProThes’ interface and system performance must be improved. The research was supported in part by the Russian Fund of Basic Research, grant # 03-07-90342.

    Automatic Geotagging of Russian Web Sites

    The poster describes a fast, simple, yet accurate method for associating large amounts of web resources stored in a search engine database with geographic locations. The method uses location-by-IP data, domain names, and content-related features: ZIP and area codes. The novelty of the approach lies in building a location-by-IP database using the continuous IP blocks method. Another contribution is domain name analysis. The method uses search engine infrastructure and makes it possible to associate large amounts of search engine data with geography on a regular basis. Experiments were run on the Yandex search engine index; the evaluation demonstrated the efficacy of the approach. ACM Special Interest Group on Hypertext, Hypermedia, and We

    ProThes: Thesaurus-based Meta-Search Engine for a Specific Application Domain

    In this poster we introduce ProThes, a pilot meta-search engine (MSE) for a specific application domain. ProThes combines three approaches: meta-search, a graphical user interface (GUI) for query specification, and thesaurus-based query techniques. ProThes attempts to employ domain-specific knowledge, which is represented by both a conceptual thesaurus and results ranking heuristics. Since the knowledge representation is separated from the MSE core, adjusting the system to a specific domain is trouble free. The thesaurus allows for manual query building and automatic query techniques. This poster outlines the overall system architecture, thesaurus representation format, and query operations. ProThes is implemented on the J2EE platform as a Web service. The project was supported in part by the Russian Fund of Basic Research, grant # 03-07-90342.

    A Large-Scale Community Questions Classification Accounting for Category Similarity: An Exploratory?

    The paper reports on a large-scale topical categorization of questions from the Russian community question answering (CQA) service Otvety@Mail.Ru. We used a data set containing all the questions (more than 11 million) asked by Otvety@Mail.Ru users in 2012. This is the first study on question categorization dealing with non-English data of this size. The study focuses on adjusting the category structure in order to get more robust classification results. We investigate several approaches to measuring similarity between categories: the share of identical questions, language models, and user activity. The results show that the proposed approach is promising. Funding: Russian Foundation for Basic Research (RFBR), 14-07-00589.

    What and How Do People Ask in Russian on Community Question Answering Services? (Что и как спрашивают в социальных вопросно-ответных сервисах по-русски?)

    In our study we surveyed different approaches to the study of questions in traditional linguistics, question answering (QA), and, recently, community question answering (CQA). We adapted a functional-semantic classification scheme for CQA data and manually labeled 2,000 questions in Russian originating from the Otvety@Mail.Ru CQA service. About half of them are purely conversational and do not aim at obtaining actual information. In the subset of meaningful questions the major classes are requests for recommendations, or how-questions, and fact-seeking questions. The data demonstrate a variety of interrogative sentences as well as a host of formally non-interrogative expressions with the meaning of questions and requests. The observations can be of interest both for linguistics and for practical applications.

    Experiment on Style-Dependent Document Ranking

    The paper reports on experiments aimed at incorporating style-dependent parameters into ranking schemata in information retrieval tasks. We use the ROMIP Web collection and the ROMIP-2003 ad hoc track results in the analysis. Factor analysis techniques were used to extract factors that reflect stylistic properties of documents. The obtained style-dependent parameters and the ranks derived from them are compared. A simple schema for rank aggregation is proposed. Evaluation of the results shows only a moderate improvement in relevance ranking.
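
The abstract does not say which aggregation schema was proposed. As a rough illustration only, the sketch below combines a relevance ranking with a style-derived ranking using a weighted sum of rank positions. This is a swapped-in, generic technique, and the function name, document IDs, and weights are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def aggregate_ranks(rankings: List[List[str]], weights: List[float]) -> List[str]:
    """Combine several rankings of the same documents into one.

    Each document's score is the weighted sum of its 1-based positions;
    a lower score means a better aggregated rank.
    Assumption: every ranking contains exactly the same documents.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for position, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight * position
    # Ties are broken by document ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (scores[doc_id], doc_id))


# Hypothetical toy input: one relevance ranking and one style-derived ranking.
by_relevance = ["d1", "d2", "d3"]
by_style = ["d3", "d1", "d2"]

relevance_heavy = aggregate_ranks([by_relevance, by_style], [0.7, 0.3])
style_heavy = aggregate_ranks([by_relevance, by_style], [0.3, 0.7])
```

With most of the weight on relevance the original order is preserved; shifting the weight toward the style ranking moves `d3` to the top.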

    Towards Automatic Evaluation of Health-Related CQA Data

    The paper reports on an evaluation of Russian community question answering (CQA) data in the health domain. About 1,500 question-answer pairs were manually evaluated by medical professionals; in addition, an automatic evaluation based on reference disease-medicine pairs was performed. Although the results of the manual and automatic evaluations do not fully match, we find the method still promising and propose several improvements. Automatic processing can be used to dynamically monitor the quality of the CQA content and to compare different data sources. Moreover, the approach can be useful for symptomatic surveillance and health education campaigns. This work is partially supported by the Russian Foundation for Basic Research, project #14-07-00589 “Data Analysis and User Modelling in Narrow-Domain Social Media”. We also thank the assessors who volunteered for the evaluation and Mail.Ru for granting us access to the data.

    Learning to Predict Closed Questions on Stack Overflow

    The paper deals with the problem of predicting whether a user’s question will be closed by a moderator on Stack Overflow, a popular question answering service devoted to software programming. The task, along with data and evaluation metrics, was offered as an open machine learning competition on the Kaggle platform. To solve this problem, we employed a wide range of classification features related to users, their interactions, and post content. Classification was carried out using several machine learning methods. According to the results of the experiment, the most important features are characteristics of the user and topical features of the question. The best results were obtained using Vowpal Wabbit, an implementation of online learning based on stochastic gradient descent. Our results are among the best in the overall ranking, although they were obtained after the official competition was over.
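
To make "online learning based on stochastic gradient descent" concrete, here is a minimal pure-Python sketch of logistic regression updated one example at a time. It is not Vowpal Wabbit and not the authors' pipeline; the feature names (`new_user`, `has_code`), the toy labels, and the hyperparameters are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Example = Tuple[Dict[str, float], int]  # (sparse features, label: 1 = closed)


def predict(weights: Dict[str, float], features: Dict[str, float]) -> float:
    """Return the estimated probability that the question gets closed."""
    z = sum(weights.get(name, 0.0) * value for name, value in features.items())
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() numerically safe
    return 1.0 / (1.0 + math.exp(-z))


def train_online(examples: List[Example], epochs: int = 50, lr: float = 0.5) -> Dict[str, float]:
    """Logistic regression trained by SGD, one example per update."""
    weights: Dict[str, float] = {}
    for _ in range(epochs):
        for features, label in examples:
            error = predict(weights, features) - label  # gradient of log loss w.r.t. z
            for name, value in features.items():
                weights[name] = weights.get(name, 0.0) - lr * error * value
    return weights


# Hypothetical toy data: in this made-up set, questions from new users get closed.
toy_data: List[Example] = [
    ({"bias": 1.0, "new_user": 1.0}, 1),
    ({"bias": 1.0, "has_code": 1.0}, 0),
    ({"bias": 1.0, "new_user": 1.0, "has_code": 1.0}, 1),
    ({"bias": 1.0}, 0),
]

model = train_online(toy_data)
p_new_user = predict(model, {"bias": 1.0, "new_user": 1.0})
p_with_code = predict(model, {"bias": 1.0, "has_code": 1.0})
```

Because each update touches only the features present in the current example, the same loop scales to sparse, high-dimensional feature sets such as user and topical features.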

    English → Russian MT evaluation campaign

    This paper presents the settings and the results of the ROMIP 2013 MT shared task for the English→Russian language direction. The quality of generated translations was assessed using automatic metrics and human evaluation. We also discuss ways to reduce human evaluation efforts using pairwise sentence comparisons by human judges to simulate sort operations.
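
The idea of letting pairwise human judgments stand in for the comparisons of a sorting algorithm can be sketched as follows. This is an assumption-laden illustration, not the campaign's actual procedure: it uses merge sort, the judge is simulated from made-up hidden quality scores, and all names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def rank_with_judge(
    items: Sequence[str], prefers: Callable[[str, str], bool]
) -> Tuple[List[str], int]:
    """Rank items best-first with merge sort.

    Every comparison is one pairwise judgment: prefers(a, b) is True
    when the judge finds a strictly better than b.
    Returns the ranking and the number of judgments requested.
    """
    judgments = 0

    def merge_sort(seq: List[str]) -> List[str]:
        nonlocal judgments
        if len(seq) <= 1:
            return seq
        mid = len(seq) // 2
        left, right = merge_sort(seq[:mid]), merge_sort(seq[mid:])
        merged: List[str] = []
        i = j = 0
        while i < len(left) and j < len(right):
            judgments += 1  # one question put to the judge
            if prefers(right[j], left[i]):
                merged.append(right[j])
                j += 1
            else:
                merged.append(left[i])
                i += 1
        merged.extend(left[i:])
        merged.extend(right[j:])
        return merged

    return merge_sort(list(items)), judgments


# Hypothetical hidden quality of eight translations, used only to simulate a judge.
hidden_quality = {
    "t1": 0.42, "t2": 0.77, "t3": 0.15, "t4": 0.93,
    "t5": 0.58, "t6": 0.31, "t7": 0.66, "t8": 0.24,
}


def simulated_judge(a: str, b: str) -> bool:
    return hidden_quality[a] > hidden_quality[b]


ranking, n_judgments = rank_with_judge(list(hidden_quality), simulated_judge)
all_pairs = len(hidden_quality) * (len(hidden_quality) - 1) // 2
```

For eight items, judging every pair would take 28 comparisons, while the sort-driven procedure needs fewer; the saving grows with the number of items. Real judges can be inconsistent, which this simulation does not model.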