9 research outputs found

    How to Evaluate your Question Answering System Every Day and Still Get Real Work Done

    In this paper, we report on Qaviar, an experimental automated evaluation system for question answering applications. The goal of our research was to find an automatically calculated measure that correlates well with human judges' assessment of answer correctness in the context of question answering tasks. Qaviar judges a response by computing recall against the stemmed content words in the human-generated answer key, and counts the answer correct if that recall exceeds a given threshold. We determined that the answer correctness predicted by Qaviar agreed with the human judges 93% to 95% of the time. 41 question-answering systems were ranked by both Qaviar and human assessors, and these rankings correlated with a Kendall's Tau measure of 0.920, compared to a correlation of 0.956 between human assessors on the same data. Comment: 6 pages, 3 figures, to appear in Proceedings of the Second International Conference on Language Resources and Evaluation (LREC 2000).
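    The scoring rule described above reduces to a recall check over stemmed content words. A minimal sketch, assuming a crude suffix-stripping stemmer, a tiny stopword list, and an illustrative threshold value (the paper's actual stemmer and threshold are not given here):

    ```python
    import re

    STOPWORDS = {"the", "a", "an", "of", "in", "on", "is", "was", "to", "and"}

    def stem(word: str) -> str:
        # Crude suffix-stripping stand-in for a real stemmer (e.g. Porter).
        for suffix in ("ing", "ed", "es", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word

    def content_stems(text: str) -> set[str]:
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        return {stem(t) for t in tokens if t not in STOPWORDS}

    def judge(response: str, answer_key: str, threshold: float = 0.5) -> bool:
        """Count the response correct if recall of the answer key's stemmed
        content words meets the threshold (threshold value is illustrative)."""
        key = content_stems(answer_key)
        if not key:
            return False
        recall = len(key & content_stems(response)) / len(key)
        return recall >= threshold

    # Example: the response covers enough of the key's content words.
    print(judge("He was born in Hodgenville, Kentucky in 1809",
                "Abraham Lincoln was born in Hodgenville, Kentucky"))
    ```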

    Satisfacción de usuarios del ámbito de la traducción en el uso de sistemas de búsqueda multilingüe de respuestas como recurso de información terminológica

    The present work focuses on multilingual question answering systems, which allow users to access terminological information that is not available in their own language, and on user-centered evaluation, which makes it possible to understand users' needs and to identify the dimensions and factors relevant to developing an information system so as to improve its acceptance.

    Analyses for elucidating current question answering technology

    In this paper, we take a detailed look at the performance of components of an idealized question answering system on two different tasks: the TREC Question Answering task and a set of reading comprehension exams. We carry out three types of analysis: inherent properties of the data, feature analysis, and performance bounds. Based on these analyses we explain some of the performance results of the current generation of Q/A systems and make predictions about future work. In particular, we present four findings: (1) Q/A system performance is correlated with answer repetition; (2) relative overlap scores are more effective than absolute overlap scores; (3) equivalence classes on scoring functions can be used to quantify performance bounds; and (4) perfect answer typing still leaves a great deal of ambiguity for a Q/A system, because sentences often contain several items of the same type.
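    Finding (2) contrasts two ways of scoring a candidate sentence by its word overlap with the question. The sketch below is one plausible reading, not the paper's exact definitions: absolute overlap counts shared words, while relative overlap normalizes by the size of the combined vocabulary so that long sentences are not automatically favored.

    ```python
    def overlap_scores(question: str, sentence: str) -> tuple[int, float]:
        """Contrast an absolute overlap score (raw count of shared words)
        with a relative, length-normalized one. Illustrative formulation
        only; the paper's exact scoring functions may differ."""
        q = set(question.lower().split())
        s = set(sentence.lower().split())
        absolute = len(q & s)
        relative = absolute / len(q | s) if (q or s) else 0.0
        return absolute, relative

    # A long sentence can win on absolute overlap while a short, focused
    # sentence wins once length is factored into the score.
    print(overlap_scores("when did the war end",
                         "the war did indeed end, historians say, in 1945"))
    print(overlap_scores("when did the war end",
                         "the war ended in 1945"))
    ```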

    Satisfacción de los usuarios en la búsqueda multilingüe de respuestas como recursos de información terminológica

    With the rapid growth of the Internet and the development of information and communication technologies in recent years, question answering (QA) systems have become an alternative to traditional information retrieval systems. The present work focuses on multilingual QA systems, which allow users to access terminological information not available in their own language, and specifically on their evaluation from the user's perspective, which makes it possible to grasp users' needs and to identify the dimensions and factors relevant to the development of such systems in order to improve their acceptance. The aim of the study is to determine how satisfied specialized users in the field of translation are with restricted-domain QA systems as a source of terminological information. To this end, a user-centered evaluation of the multilingual (English, French and Italian) QA system HONqa as a terminological information resource was carried out. The evaluation instrument was a rigorous questionnaire, already tested and validated (Ong et al., 2009), developed from an exhaustive review of models and theories related to technology acceptance and use. The analysis of the results shows that users find the system easy to use and useful for retrieving terminological information in all the languages evaluated.

    Evaluation in natural language processing

    European Summer School on Language, Logic and Information (ESSLLI 2007), Trinity College Dublin, Ireland, 6-17 August 2007.

    On enhancing the robustness of timeline summarization test collections

    Timeline generation systems are a class of algorithms that produce a sequence of time-ordered sentences or text snippets extracted in real-time from high-volume streams of digital documents (e.g. news articles), focusing on retaining relevant and informative content for a particular information need (e.g. a topic or event). These systems have a range of uses, such as producing concise overviews of events for end-users (human or artificial agents). To advance the field of automatic timeline generation, robust and reproducible evaluation methodologies are needed. To this end, several evaluation metrics and labeling methodologies have recently been developed, based on information-nugget and cluster-based ground truth representations, respectively. These methodologies rely on human assessors manually mapping timeline items (e.g. sentences) to an explicit representation of what information a 'good' summary should contain. However, while these evaluation methodologies produce reusable ground truth labels, prior works have reported cases where such evaluations fail to accurately estimate the performance of new timeline generation systems due to label incompleteness. In this paper, we first quantify the extent to which timeline summarization test collections fail to generalize to new summarization systems, and then propose, evaluate and analyze new automatic solutions to this issue. In particular, using a depooling methodology over 19 systems and across three high-volume datasets, we quantify the degree of system ranking error caused by excluding those systems when labeling. We show that when considering lower-effectiveness systems, the test collections are robust (the likelihood of systems being mis-ranked is low). However, the risk of systems being mis-ranked increases as the effectiveness of the systems held out from the pool increases. To reduce the risk of mis-ranking systems, we also propose a range of automatic ground truth label expansion techniques. Our results show that the proposed expansion techniques can be effective at increasing the robustness of the TREC-TS test collections, as they are able to generate large numbers of missing matches with high accuracy, reducing the number of mis-rankings by up to 50%.
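    The depooling analysis described above can be pictured as a leave-one-system-out loop over the pooled labels: drop the labels only the held-out system contributed, re-score every system, and compare the resulting ranking with the full-pool ranking. A minimal sketch, assuming hypothetical `run_items`, `labels_contributed_by`, and `evaluate` structures (none of which are defined in the excerpt):

    ```python
    from itertools import combinations

    def kendall_tau(scores_a: dict, scores_b: dict) -> float:
        """Kendall's tau between two score dicts over the same systems."""
        systems = list(scores_a)
        concordant = discordant = 0
        for s1, s2 in combinations(systems, 2):
            a = scores_a[s1] - scores_a[s2]
            b = scores_b[s1] - scores_b[s2]
            if a * b > 0:
                concordant += 1
            elif a * b < 0:
                discordant += 1
        pairs = len(systems) * (len(systems) - 1) / 2
        return (concordant - discordant) / pairs if pairs else 1.0

    def depool(run_items: dict, labels_contributed_by: dict, evaluate) -> dict:
        """For each held-out system, rebuild the label pool without its
        contributions, re-score all systems, and report how well the
        reduced-pool ranking agrees with the full-pool ranking."""
        full_labels = set().union(*labels_contributed_by.values())
        full_scores = {s: evaluate(run_items[s], full_labels) for s in run_items}
        agreement = {}
        for held_out in run_items:
            others = [s for s in labels_contributed_by if s != held_out]
            reduced = set().union(*(labels_contributed_by[s] for s in others))
            scores = {s: evaluate(run_items[s], reduced) for s in run_items}
            agreement[held_out] = kendall_tau(full_scores, scores)
        return agreement
    ```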

    Infra-estrutura de um serviço online de resposta-a-perguntas com base na web portuguesa

    Master's project report in Informatics Engineering, presented to the University of Lisbon through the Faculty of Sciences, 2007. The Internet promoted a new form of global communication, with a deep impact on the dissemination of information. As a consequence, new technological solutions are needed to exploit the resources thus made available. At a time when document search engines are already part of the daily life of Internet users, the next step is to allow these users to obtain brief answers to specific questions. The QueXting project was undertaken by the Natural Language and Speech Group (NLX) at the Department of Informatics of the Faculty of Sciences of the University of Lisbon, with the main goal of contributing to better access to information available in the Portuguese language. To this end, a web service supporting questions in Portuguese is to be made freely available, gathering answers from documents written in this language. The QueXting question-answering system is implemented through a general methodology and architecture that have recently matured in this scientific domain. Furthermore, it is supported by several linguistic tools, specific to Portuguese, that the NLX group has been developing. This specific linguistic processing is a key factor distinguishing the task of question answering from the remaining tasks of information retrieval and extraction, allowing the deep processing of information requests and the extraction of exact answers.
    This dissertation reports on the development of the infrastructure of the QueXting system, which will support the specific processing of several types of factoid questions. Such processing has already been applied to one specific type of question, for which some preliminary evaluations were completed.
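    The "general methodology and architecture" mentioned above is the standard factoid QA pipeline of question analysis, passage retrieval, and answer extraction. The sketch below illustrates that generic pipeline only, not QueXting's actual implementation; `retrieve_passages` and `extract_candidates` are hypothetical stand-ins for real components:

    ```python
    from collections import Counter

    def answer_factoid(question: str, expected_type: str,
                       retrieve_passages, extract_candidates) -> str | None:
        """Generic factoid QA pipeline: retrieve passages for the question,
        extract typed candidate answers, keep those matching the expected
        answer type, and return the most frequent candidate."""
        passages = retrieve_passages(question)           # e.g. a web/IR search
        candidates = Counter()
        for passage in passages:
            for answer, answer_type in extract_candidates(passage):
                if answer_type == expected_type:         # answer-type filtering
                    candidates[answer] += 1
        best = candidates.most_common(1)
        return best[0][0] if best else None
    ```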

    «Ποιος θέλει να γίνει εκατομμυριούχος;» a la Ελληνικά

    This work describes in detail the techniques used to build a virtual player for the popular TV game "Who Wants to Be a Millionaire?" and is based on the corresponding article [1], in which the virtual player was implemented for the English and Italian versions of the game. In this work an attempt was also made to apply the various techniques described in that article. The implementation of the virtual player for the Greek version of the game was made in the Java programming language and is presented in detail. The virtual player must answer a series of multiple-choice questions posed in natural language, selecting the correct answer among four different choices. If it is not sure about an answer, it can use the lifelines or quit the game.
    The architecture of the virtual player consists of 1) a Question Answering (QA) module, which leverages the Google search engine to retrieve the passages of text most relevant for identifying the correct answer to a question, 2) an Answer Scoring (AS) module, which assigns a score to each candidate answer according to different criteria based on the passages of text retrieved by the QA module, and 3) a Decision Making (DM) module, which chooses the strategy for playing the game according to specific rules as well as to the scores assigned to the candidate answers. Finally, this work evaluates both the accuracy of the virtual player in answering the questions of the game correctly and its ability to play real games in order to earn money. The experiments were conducted with questions derived from the Greek version of the board game. In general, the average accuracy of the virtual player is significantly better than the performance of human players. Regarding the ability to play real games, which involves defining a proper strategy for using the lifelines in order to decide whether to answer a question even under uncertainty or to retire from the game and take the money earned so far, the virtual player wins on average more money than the average amount earned by human players.
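    The DM module's role can be illustrated with a small rule-based sketch. The thresholds and lifeline choice below are hypothetical and only mirror the kind of rules the abstract describes, not the thesis's actual values (the thesis implementation is in Java; Python is used here for brevity):

    ```python
    def decide(scores: dict[str, float], lifelines: list[str],
               answer_threshold: float = 0.6, lifeline_threshold: float = 0.35):
        """Rule-based decision sketch: answer when confident enough, spend a
        lifeline when moderately unsure, otherwise retire with the money won
        so far. All threshold values are hypothetical."""
        best_option, best_score = max(scores.items(), key=lambda kv: kv[1])
        total = sum(scores.values()) or 1.0
        confidence = best_score / total           # normalized score as confidence
        if confidence >= answer_threshold:
            return ("answer", best_option)
        if lifelines and confidence >= lifeline_threshold:
            return ("use_lifeline", lifelines[0]) # e.g. fifty-fifty first
        return ("retire", None)

    # Example: moderately confident, so the player burns a lifeline.
    print(decide({"A": 0.30, "B": 0.45, "C": 0.15, "D": 0.10},
                 lifelines=["fifty_fifty", "phone_a_friend"]))
    ```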