4 research outputs found

    Question answering using document tagging and question classification

    viii, 139 leaves ; 29 cm. Question answering (QA) is a relatively new area of research. QA retrieves answers to questions, in contrast to information retrieval systems (search engines), which retrieve documents. This means that question answering systems may be the next generation of search engines. What remains to be done for QA to reach that point? The answer is higher accuracy, which can be achieved by investigating methods of question answering. I took the approach of designing a question answering system based on document tagging and question classification. Question classification extracts useful information from the question about how to answer it. Document tagging extracts useful information from the documents, which is then used in finding the answer to the question. We used several available systems to tag the documents. Our system classifies the questions using manually developed rules. I also investigated different ways of combining these two methods to answer questions and found that our methods had accuracy comparable to some systems that use deeper processing techniques. This thesis includes investigations into the modules of a question answering system and gives insights into how to develop a question answering system based on document tagging and question classification. I also evaluated our current system with the questions from the TREC 2004 question answering track.

    Factoid question answering for spoken documents

    In this dissertation, we present a factoid question answering system specifically tailored for Question Answering (QA) on spoken documents. This work explores, for the first time, which techniques can be robustly adapted from the usual QA on written documents to the more difficult spoken-document scenario. More specifically, we study new information retrieval (IR) techniques designed for speech, and use several levels of linguistic information for the speech-based QA task. These include named-entity detection with phonetic information, syntactic parsing applied to speech transcripts, and the use of coreference resolution. Our approach is largely based on supervised machine learning techniques, with special focus on the answer extraction step, and makes little use of handcrafted knowledge. Consequently, it should be easily adaptable to other domains and languages. In the work resulting from this thesis, we promoted and coordinated the creation of an evaluation framework for the task of QA on spoken documents. The framework, named QAst (Question Answering on Speech Transcripts), provides multilingual corpora, evaluation questions, and answer keys. These corpora were used in the QAst evaluations held at the CLEF workshop in 2007, 2008 and 2009, thus helping the development of state-of-the-art techniques for this particular topic. The presented QA system and all its modules are extensively evaluated on the English European Parliament Plenary Sessions (EPPS) corpus, composed of manual transcripts and automatic transcripts obtained by three different Automatic Speech Recognition (ASR) systems that exhibit significantly different word error rates. This data belongs to the CLEF 2009 track for QA on speech transcripts. The main results confirm that syntactic information is very useful for learning to rank answer candidates, improving results on both manual and automatic transcripts unless the ASR quality is very low. Overall, the performance of our system is comparable to or better than the state of the art on this corpus, confirming the validity of our approach.

    Processamento automático de línguas naturais : um estudo sobre a localização do IBM Watson™ para o português do Brasil

    Undergraduate final project (Trabalho de Conclusão de Curso), Universidade de Brasília, Instituto de Letras, Departamento de Línguas Estrangeiras e Tradução, Línguas Estrangeiras Aplicadas ao Multilinguismo e à Sociedade da Informação, 2015. Natural Language Processing is a promising research domain for the organization of digital data in the Information Society, as it seeks to retrieve data through the language in which they are recorded. Nevertheless, the specific features of each language require different computational treatments and, if not understood in depth, may hinder technological production. This research aims to investigate the particularities of the automatic processing of Brazilian Portuguese, compared with American English and European Portuguese, through a case study on the localization of the IBM Watson question answering system. It begins with a theoretical survey of the state of the art in Natural Language Processing, addresses the specifics of the DeepQA architecture in the system's language processing, and details the particularities of Brazilian Portuguese that must be taken into account in localizing Watson. From this investigation, possible adaptations to the system are presented, and the importance of including Brazilian Portuguese and, more generally, Brazil in the Information Society is brought into consideration.

    Retrieving questions and answers in community-based question answering services

    Ph.D DOCTOR OF PHILOSOPHY