22 research outputs found

    Factoid question answering for spoken documents

    In this dissertation, we present a factoid question answering system specifically tailored for Question Answering (QA) on spoken documents. This work explores, for the first time, which techniques can be robustly adapted from the usual QA on written documents to the more difficult spoken-document scenario. More specifically, we study new information retrieval (IR) techniques designed for speech, and use several levels of linguistic information for the speech-based QA task. These include named-entity detection with phonetic information, syntactic parsing applied to speech transcripts, and coreference resolution. Our approach is largely based on supervised machine learning techniques, with special focus on the answer extraction step, and makes little use of handcrafted knowledge. Consequently, it should be easily adaptable to other domains and languages. As part of the work behind this thesis, we promoted and coordinated the creation of an evaluation framework for the task of QA on spoken documents. The framework, named QAst (Question Answering on Speech Transcripts), provides multilingual corpora, evaluation questions, and answer keys. These corpora were used in the QAst evaluations held at the CLEF workshop in 2007, 2008 and 2009, thus helping the development of state-of-the-art techniques for this particular topic. The presented QA system and all its modules are extensively evaluated on the English European Parliament Plenary Sessions (EPPS) corpus, composed of manual transcripts and automatic transcripts obtained by three different Automatic Speech Recognition (ASR) systems that exhibit significantly different word error rates. This data belongs to the CLEF 2009 track for QA on speech transcripts. The main results confirm that syntactic information is very useful for learning to rank answer candidates, improving results on both manual and automatic transcripts unless the ASR quality is very low. Overall, the performance of our system is comparable to or better than the state of the art on this corpus, confirming the validity of our approach.

    Recuperación de pasajes multilingües para la búsqueda de respuestas (Multilingual passage retrieval for question answering)

    JAVA Information Retrieval System (JIRS) is an Information Retrieval system specifically oriented to Question Answering tasks. Traditional search engines rely on the keywords of the question to obtain the documents relevant to a query. JIRS, by contrast, tries to obtain pieces of text, that is, passages, that are more likely to contain the answer. To do so, it performs a search based on the n-grams of the question, which is posed in natural language, using three possible models. The n-gram models developed are language-independent, which makes JIRS well suited to multilingual environments. In addition, JIRS incorporates a powerful core that allows for adaptation and scalability unprecedented in modern search engines. From the outset it was designed to be a powerful tool that could be adapted without difficulty to very different functions. This makes it possible to extend or modify aspects of JIRS very easily and intuitively, without the end user having to know the code developed by others. It also allows new applications with a client/server structure, a distributed structure, etc. to be generated solely by modifying the configuration file. This work presents the state of the art in Information Retrieval, focusing on multilingual Question Answering, together with a detailed description of JIRS and its search models, and finally reports the results obtained by this system in the CLEF competitions.
    Gómez Soriano, JM. (2007). Recuperación de pasajes multilingües para la búsqueda de respuestas [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/1930
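    The idea of ranking passages by the question n-grams they contain can be illustrated with a minimal sketch. This is not one of the three JIRS models, which the abstract does not specify; the scoring rule (each shared n-gram contributes its length n) and the names `ngrams`, `passage_score`, and `rank_passages` are assumptions made for illustration only.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def passage_score(question, passage, max_n=3):
    """Hypothetical score, not a JIRS model: every distinct n-gram shared by
    question and passage adds n, so longer matches weigh more."""
    q_tokens = question.lower().split()
    p_tokens = passage.lower().split()
    score = 0
    for n in range(1, max_n + 1):
        p_grams = set(ngrams(p_tokens, n))
        for gram in set(ngrams(q_tokens, n)):
            if gram in p_grams:
                score += n
    return score


def rank_passages(question, passages, max_n=3):
    """Sort passages by descending score; ties keep their input order."""
    return sorted(
        passages,
        key=lambda p: passage_score(question, p, max_n),
        reverse=True,
    )
```

    Because the sketch only splits on whitespace and compares token sequences, it uses no language-specific resources, which is the property the abstract highlights for multilingual use.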

    Interim research assessment 2003-2005 - Computer Science

    This report primarily serves as a source of information for the 2007 Interim Research Assessment Committee for Computer Science at the three technical universities in the Netherlands. The report also provides information for others interested in our research activities.

    Knowledge mining over scientific literature and technical documentation

    This dissertation focuses on the extraction of information implicitly encoded in domain descriptions (technical terminology and related items) and its use within a restricted-domain question answering (QA) system. Since different variants of the same term can be used to refer to the same domain entity, it is necessary to recognize all possible forms of a given term and structure them, so that they can be used in the question answering process. The knowledge about domain descriptions and their mutual relations is leveraged in an extension to an existing QA system, aimed at the technical maintenance manual of a well-known commercial aircraft. The original version of the QA system did not make use of domain descriptions, which are the novelty introduced by the present work. The explicit treatment of domain descriptions provided considerable gains in terms of efficiency, in particular in the analysis of the background document collection. Similar techniques were later applied to another domain (biomedical scientific literature), focusing in particular on protein-protein interactions. This dissertation describes in particular: (1) the extraction of domain-specific lexical items which refer to entities of the domain; (2) the detection of relationships (such as synonymy and hyponymy) among such items, and their organization into a conceptual structure; (3) their use within a domain-restricted question answering system, in order to facilitate the correct identification of relevant answers to a query; (4) the adaptation of the system to another domain, and the extension of the basic hypothesis to tasks other than question answering.
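    A minimal sketch of how term variants and their relations might be structured for lookup is given below. It is an illustration under stated assumptions, not the dissertation's method: the class `TermStore`, the `normalize` rule, and the example terms in the usage note are all hypothetical.

```python
def normalize(term):
    """Lowercase and collapse hyphens and whitespace so that simple
    surface variants of a term compare equal (an assumed, very crude rule)."""
    return " ".join(term.lower().replace("-", " ").split())


class TermStore:
    """Hypothetical store mapping term variants to concepts (synonymy)
    and concepts to broader concepts (hyponymy)."""

    def __init__(self):
        self.concept_of = {}  # normalized variant -> concept id
        self.parents = {}     # concept id -> broader concept id

    def add_synonyms(self, concept, variants):
        for variant in variants:
            self.concept_of[normalize(variant)] = concept

    def add_hyponym(self, child, parent):
        self.parents[child] = parent

    def lookup(self, term):
        """Return the concept id for a term variant, or None if unknown."""
        return self.concept_of.get(normalize(term))

    def is_a(self, term, ancestor):
        """True if the term's concept equals or descends from `ancestor`."""
        concept = self.lookup(term)
        seen = set()  # guards against cycles in the hierarchy
        while concept is not None and concept not in seen:
            if concept == ancestor:
                return True
            seen.add(concept)
            concept = self.parents.get(concept)
        return False
```

    With invented terms, `store.add_synonyms("cabin_door", ["cabin door", "Cabin-Door"])` followed by `store.add_hyponym("cabin_door", "door")` lets a question mentioning any listed variant resolve to the same concept, and lets a question about a "door" match passages about the more specific term.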

    Museum, memories and digital stories : A liminal space for human computer interaction.

    Objects, material or digital, mediate memories: they act as anchors in between temporal notions and relations of past and present. Through those objects of memory, the act of remembering occurs neither completely relived in the mind, nor fully captured in the medium. Essential to personal memories, objects represent also our collective memory and capture our social history. The papers submitted for this PhD by selected publications look at the design of innovative technology that can make remembering more evocative and affective. They look at both museums, where digital and material are combined in an augmented reality, and personal/family contexts, where the home and mundane objects can be technologically enhanced to encapsulate digital memories. The museum was ideal to experiment with hybrid settings that combine material (the collection and the architectural space) and digital (the information) (papers 1 to 3). Personalization of information was used to augment the reality of rooms and exhibits: whole body interaction (i.e. physical movements in the space) was used to select and personalize the content and engage visitors with both material (the object) and digital (the information). Although the mobile technology is dated, these papers show the value of combining digital and physical to provide a holistic experience that made visitors wonder. Where the fusion occurs, however, is in the digital technology. To balance this perspective, paper 4 looks at the effect of taking the digital content out into the exhibition space. My recent research (papers 5-9) looks at objects of memory in the personal realm, in particular in the family home. Starting from observing the role and function of mementos, I conclude that a more holistic and organic approach has to be taken to make personal digital objects of memory more present in people's life. Materialization can be achieved with digital devices designed for individual and family use, so that the product fits with the mundane aspects of life, is immediate, and stimulates affect, not efficiency. Finally, papers 10 and 11 provide evidence of the innovative methodologies I have developed and successfully used in iterative user studies and evaluations across different research projects and many years of research. As a whole this submission shows that there is a huge design space to explore in looking at how technology could be used in public or private spaces to bring together the two aspects of memory: remembering in the mind and capturing through objects, in order to preserve our digital life as tangible interactive objects.

    Meaning refinement to improve cross-lingual information retrieval

    Magdeburg, Univ., Faculty of Computer Science, Diss., 2012, by Farag Ahme

    Representation and Inference for Open-Domain Question Answering: Strength and Limits of two Italian Semantic Lexicons

    The research described in this thesis was devoted to building a prototype Question Answering system for Italian. The prototype was used as an environment for evaluating the usefulness of the information encoded in two computational semantic lexicons, ItalWordNet and SIMPLE-CLIPS. The aim is to highlight the strengths and limitations of the representation of information proposed by the two lexicons.